--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
base_model: zai-org/GLM-4.7-Flash |
|
|
tags: |
|
|
- moe |
|
|
- nvfp4 |
|
|
- quantized |
|
|
- vllm |
|
|
- glm |
|
|
- 30b |
|
|
library_name: transformers |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
# Note: If you have a multi-GPU SM120 Blackwell system (RTX 50-series / RTX Pro), try my vLLM fork to resolve P2P / TP=2 issues (pending PR into upstream):
|
|
https://github.com/Gadflyii/vllm/tree/main |
|
|
|
|
|
# GLM-4.7-Flash NVFP4 (Mixed Precision) |
|
|
|
|
|
This is a **mixed precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model. |
|
|
|
|
|
## Quantization Strategy |
|
|
|
|
|
This model was produced with custom quantization and calibration scripts (128 samples, 2048 max sequence length, the neuralmagic/calibration dataset, all 64 experts) based on NVIDIA's approach for DeepSeek-V3. It uses **mixed precision** to preserve accuracy:
|
|
|
|
|
| Component | Precision | Rationale | |
|
|
|-----------|-----------|-----------| |
|
|
| MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token | |
|
|
| Dense MLP | FP4 (E2M1) | First layer dense MLP | |
|
|
| **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are sensitive to quantization |
|
|
| Norms, Gates, Embeddings | BF16 | Standard practice | |
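
The snippet below is a minimal sketch of how such a recipe can be expressed with `llmcompressor` (the library behind the compressed-tensors format). It is not the exact script used for this checkpoint; the `ignore` patterns and dataset handling are illustrative assumptions.

```python
# Sketch only: approximates the mixed-precision NVFP4 recipe described above.
# Module-name patterns and the scheme choice are assumptions, not the exact
# configuration used to produce this checkpoint.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",            # FP4 (E2M1) weights, block size 16, FP8 (E4M3) scales
    ignore=[
        "lm_head",
        "re:.*self_attn.*",    # keep MLA attention projections in BF16
        "re:.*gate$",          # keep router gates in BF16
    ],
)

oneshot(
    model="zai-org/GLM-4.7-Flash",
    dataset="neuralmagic/calibration",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
)
```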
|
|
|
|
|
## Performance |
|
|
|
|
|
| Metric | BF16 | Uniform FP4 | **This Model** | |
|
|
|--------|------|-------------|----------------| |
|
|
| MMLU-Pro | 24.83% | 16.84% | **23.55%** | |
|
|
| Size | 62.4 GB | 18.9 GB | **20.4 GB** | |
|
|
| Compression | 1x | 3.3x | **3.1x** | |
|
|
| Accuracy Loss | - | -8.0% | **-1.3%** | |
|
|
|
|
|
|
|
|
## Usage |
|
|
|
|
|
### Requirements |
|
|
|
|
|
- **vLLM**: 0.14.0+ (for compressed-tensors NVFP4 support) |
|
|
- **transformers**: 5.0.0+ (for `glm4_moe_lite` architecture) |
|
|
- **GPU**: NVIDIA GPU with FP4 support; FP4 tensor cores are native to Blackwell, while Hopper and Ada Lovelace GPUs run the model through vLLM's weight-dequantizing fallback kernels (see the environment check below)
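
A quick way to verify the environment against the requirements above (version thresholds follow the list; compute capability 10.x/12.x indicates Blackwell with native FP4):

```python
# Quick environment check for the requirements listed above.
from importlib.metadata import version
import torch

print("vllm:", version("vllm"))                  # want >= 0.14.0
print("transformers:", version("transformers"))  # want >= 5.0.0
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")    # 10.x/12.x = Blackwell (native FP4)
```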
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
pip install "vllm>=0.14.0"
# transformers 5.0.0 (glm4_moe_lite) may not be on PyPI yet; install from source:
pip install git+https://github.com/huggingface/transformers.git
```
|
|
|
|
|
### Inference with vLLM |
|
|
|
|
|
```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-NVFP4",
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    gpu_memory_utilization=0.85,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```
|
|
|
|
|
### Serving with vLLM |
|
|
|
|
|
```bash
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --trust-remote-code
```
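
The server exposes an OpenAI-compatible API. A minimal client example, assuming the default host and port (`localhost:8000`):

```python
# Minimal client for the OpenAI-compatible endpoint started above.
# Assumes the default host/port; any api_key string works for a local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-NVFP4",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```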
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) |
|
|
- **Architecture**: `Glm4MoeLiteForCausalLM` |
|
|
- **Parameters**: 30B total, 3B active per token (30B-A3B) |
|
|
- **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert |
|
|
- **Layers**: 47 |
|
|
- **Context Length**: 202,752 tokens (max) |
|
|
- **Languages**: English, Chinese |
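
These details can be sanity-checked against the checkpoint's config (field names beyond `architectures` are standard `transformers` config attributes):

```python
# Verify architecture and size-related fields from the published config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("GadflyII/GLM-4.7-Flash-NVFP4", trust_remote_code=True)
print(cfg.architectures)            # ["Glm4MoeLiteForCausalLM"]
print(cfg.num_hidden_layers)        # 47
print(cfg.max_position_embeddings)  # context length
```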
|
|
|
|
|
## Quantization Details |
|
|
|
|
|
- **Format**: compressed-tensors (NVFP4) |
|
|
- **Block Size**: 16 |
|
|
- **Scale Format**: FP8 (E4M3) |
|
|
- **Calibration**: 128 samples from neuralmagic/calibration dataset |
|
|
- **Full Expert Calibration**: All 64 experts calibrated per sample |
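
With a block size of 16, each FP4 tensor stores a 4-bit value per weight plus one 8-bit FP8 scale per 16-weight block, i.e. about 4 + 8/16 = 4.5 bits per weight. The back-of-the-envelope estimate below shows how that lines up with the reported checkpoint size; the BF16 fraction is an assumed figure, not a measured one:

```python
# Back-of-the-envelope size estimate for the mixed-precision checkpoint.
# The BF16 share (attention, norms, gates, embeddings) is an assumed estimate.
TOTAL_PARAMS = 30e9
BF16_SHARE = 0.10                      # assumed fraction kept in BF16

fp4_bits = 4 + 8 / 16                  # 4-bit weights + one FP8 scale per 16-block
fp4_bytes = TOTAL_PARAMS * (1 - BF16_SHARE) * fp4_bits / 8
bf16_bytes = TOTAL_PARAMS * BF16_SHARE * 2

print(f"~{(fp4_bytes + bf16_bytes) / 1e9:.1f} GB")  # ~21.2 GB, near the reported 20.4 GB
```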
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### MMLU-Pro Overall Results |
|
|
|
|
|
| Model | Accuracy | Correct | Total | |
|
|
|-------|----------|---------|-------| |
|
|
| **BF16 (baseline)** | **24.83%** | 2988 | 12032 | |
|
|
| **NVFP4 (this model)** | **23.55%** | 2834 | 12032 | |
|
|
| **Difference** | **-1.28%** | -154 | - | |
|
|
|
|
|
### MMLU-Pro by Category |
|
|
|
|
|
| Category | BF16 | NVFP4 | Difference | |
|
|
|----------|------|-------|------------| |
|
|
| Social Sciences | 32.70% | 31.43% | -1.27% | |
|
|
| Other | 31.57% | 30.08% | -1.49% | |
|
|
| Humanities | 23.78% | 22.56% | -1.22% | |
|
|
| STEM | 19.94% | 18.70% | -1.24% | |
|
|
|
|
|
### MMLU-Pro by Subject |
|
|
|
|
|
| Subject | BF16 | NVFP4 | Difference | |
|
|
|---------|------|-------|------------| |
|
|
| Biology | 50.35% | 47.42% | -2.93% | |
|
|
| Psychology | 44.99% | 42.48% | -2.51% | |
|
|
| Economics | 36.37% | 34.48% | -1.89% | |
|
|
| Health | 35.21% | 34.84% | -0.37% | |
|
|
| History | 33.60% | 30.71% | -2.89% | |
|
|
| Philosophy | 31.46% | 30.06% | -1.40% | |
|
|
| Other | 28.35% | 25.87% | -2.48% | |
|
|
| Computer Science | 26.10% | 21.46% | -4.64% | |
|
|
| Business | 16.35% | 16.98% | +0.63% | |
|
|
| Law | 16.89% | 16.35% | -0.54% | |
|
|
| Engineering | 16.00% | 14.04% | -1.96% | |
|
|
| Physics | 15.32% | 14.70% | -0.62% | |
|
|
| Math | 14.06% | 14.29% | +0.23% | |
|
|
| Chemistry | 14.13% | 13.34% | -0.79% | |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite the original GLM-4.7-Flash: |
|
|
|
|
|
```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
This model inherits the Apache 2.0 license from the base model. |
|
|
|