---
license: apache-2.0
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
tags:
- moe
- nvfp4
- quantized
- vllm
- glm
- 30b
- mtp
- speculative-decoding
library_name: transformers
pipeline_tag: text-generation
---
# Note: If you have a multi-GPU SM120 Blackwell system (RTX 50/Pro), try [my vLLM fork](https://github.com/Gadflyii/vllm/tree/main) to resolve P2P / TP=2 issues (a PR into upstream is pending).
# GLM-4.7-Flash-MTP-NVFP4 (Mixed Precision with MTP in BF16)
This is a **mixed precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model. This version preserves **MTP (Multi-Token Prediction) layers in BF16** for speculative decoding compatibility.
## What's Different from GLM-4.7-Flash-NVFP4?
| Feature | GLM-4.7-Flash-NVFP4 | **This Model** |
|---------|---------------------|----------------|
| MTP Layers | NVFP4 | BF16 |
| Calibration Samples | 128 | **512** |
| Calibration Seq Length | 2048 | **4096** |
| MMLU-Pro Accuracy | 23.56% | **23.91%** |
## Quantization Strategy
This model uses **mixed precision** to preserve accuracy and MTP functionality:
| Component | Precision | Rationale |
|-----------|-----------|-----------|
| MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token |
| Dense MLP | FP4 (E2M1) | Dense MLP in the first layer |
| **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are sensitive to quantization |
| **MTP Layers** | **BF16** | `eh_proj`, `shared_head.head` for speculative decoding |
| Norms, Gates, Embeddings | BF16 | Standard practice |
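
The exact tooling used for this checkpoint isn't documented here; as an illustration, the split above can be expressed as an llm-compressor recipe along the following lines. This is a minimal sketch: the `scheme="NVFP4"` preset and the `ignore` patterns are assumptions based on typical GLM MoE module names, not copied from the actual quantization script.
```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize Linear weights to NVFP4, skipping everything the table above keeps in BF16:
# MLA attention projections, the MTP layers, router gates, and the LM head.
# The module-name patterns are illustrative; adjust them to the real checkpoint.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "re:.*self_attn.*",    # MLA (low-rank Q/KV) projections stay BF16
        "re:.*eh_proj.*",      # MTP projection stays BF16
        "re:.*shared_head.*",  # MTP head stays BF16
        "re:.*mlp.gate$",      # router gates stay BF16
    ],
)
```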
## Performance
| Metric | BF16 | NVFP4 | **This Model** |
|--------|------|----------|----------------|
| MMLU-Pro | 24.83% | 23.56% | **23.91%** |
| Size | 62.4 GB | 20.4 GB | **20.9 GB** |
| Compression | 1x | 3.1x | **3.0x** |
| Accuracy vs BF16 | - | -1.27% | **-0.92%** |
### MTP Acceptance Rate
| Model | Acceptance Rate | Mean Accepted Length |
|-------|-----------------|----------------------|
| BF16 (baseline) | 60% | 1.60 |
| **This Model** | **63%** | **1.63** |
MTP quality is preserved (actually slightly improved) after quantization.
### MTP Performance Note
MTP speculative decoding currently shows overhead rather than speedup due to missing `torch.compile` support for the MTP drafter model in vLLM. For best throughput, run without MTP enabled until this is resolved upstream.
| Configuration | Tokens/sec |
|---------------|------------|
| Without MTP | 78.1 tok/s |
| With MTP (1 token) | 64.7 tok/s |
| With MTP (2 tokens) | 56.8 tok/s |
| With MTP (4 tokens) | 44.5 tok/s |
## Usage
### Requirements
- **vLLM**: 0.8.0+ (for compressed-tensors NVFP4 support)
- **transformers**: 5.0.0+ (for `glm4_moe_lite` architecture)
- **GPU**: NVIDIA GPU with FP4 support (native FP4 tensor cores on Blackwell; Hopper and Ada Lovelace run NVFP4 via emulation kernels)
### Installation
```bash
# Quote the version specifier so the shell does not treat ">=" as a redirect
pip install "vllm>=0.8.0"
# transformers from source for the glm4_moe_lite architecture
pip install git+https://github.com/huggingface/transformers.git
```
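Before pulling the ~21 GB of weights, it can be worth checking that the local GPU reports an FP4-capable architecture. A quick probe (just a convenience check, not part of the model):
```python
import torch

# Blackwell reports compute capability 10.x (data center) or 12.x (RTX 50/Pro);
# Hopper is 9.0 and Ada Lovelace is 8.9 (see the requirements list above).
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)} (SM {major}.{minor})")
print("Native FP4 tensor cores (Blackwell):", major >= 10)
```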
### Inference with vLLM (Recommended)
```python
from vllm import LLM, SamplingParams

# Load the NVFP4 checkpoint on a single GPU.
model = LLM(
    "GadflyII/GLM-4.7-Flash-MTP-NVFP4",
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```
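MTP speculative decoding can also be enabled offline by passing the same speculative config used for serving below to the `LLM` constructor. This is a sketch assuming a recent vLLM release that accepts `speculative_config` as a constructor argument; see the performance note above before enabling it.
```python
from vllm import LLM

# Experimental: enable MTP speculative decoding for offline generation.
# The `speculative_config` keyword follows recent vLLM releases and may
# differ in older versions.
model = LLM(
    "GadflyII/GLM-4.7-Flash-MTP-NVFP4",
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
    speculative_config={"method": "mtp", "num_speculative_tokens": 1},
)
```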
### Serving with vLLM
```bash
# Standard serving (recommended for performance)
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90

# With MTP speculative decoding (experimental)
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```
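The server exposes vLLM's OpenAI-compatible API (on port 8000 by default), so any OpenAI client can query it. For example, using the `openai` Python package (the prompt and sampling parameters are arbitrary):
```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on http://localhost:8000/v1 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-MTP-NVFP4",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```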
## Model Details
- **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Architecture**: `Glm4MoeLiteForCausalLM`
- **Parameters**: 30B total, 3B active per token (30B-A3B)
- **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert
- **Layers**: 47 (with 1 MTP layer)
- **Context Length**: 202,752 tokens (max)
- **Languages**: English, Chinese
## Quantization Details
- **Format**: compressed-tensors (NVFP4)
- **Block Size**: 16
- **Scale Format**: FP8 (E4M3)
- **Calibration**: 512 samples from wikitext dataset
- **Calibration Sequence Length**: 4096
- **Full Expert Calibration**: All 64 experts calibrated per sample
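
For reference, a calibration run matching these settings would look roughly like the sketch below, assuming llm-compressor's `oneshot` entry point and the `recipe` sketched in the Quantization Strategy section. The argument names follow llm-compressor's documented API, not the exact script used to produce this checkpoint.
```python
from llmcompressor import oneshot

# 512 calibration samples at a 4096-token sequence length, matching the
# settings listed above. The dataset argument is illustrative; point it at
# the wikitext split you intend to calibrate on.
oneshot(
    model="zai-org/GLM-4.7-Flash",
    dataset="wikitext",
    recipe=recipe,
    max_seq_length=4096,
    num_calibration_samples=512,
    output_dir="GLM-4.7-Flash-MTP-NVFP4",
)
```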
### Tensors by Precision
| Precision | Count | Description |
|-----------|-------|-------------|
| NVFP4 | 9,168 | MLP/FFN weights |
| BF16 | 240 | Attention weights (MLA) |
| BF16 | 2 | MTP layers (`eh_proj`, `shared_head.head`) |
## Evaluation
### MMLU-Pro Overall Results
| Model | Accuracy | Correct | Total |
|-------|----------|---------|-------|
| BF16 (baseline) | 24.83% | 2988 | 12032 |
| NVFP4-v1 | 23.56% | 2835 | 12032 |
| **This Model** | **23.91%** | **2877** | 12032 |
### MMLU-Pro by Category
| Category | BF16 | This Model | Difference |
|----------|------|------------|------------|
| Social Sciences | 32.70% | 31.26% | -1.44% |
| Other | 31.57% | 29.85% | -1.72% |
| Humanities | 23.78% | 22.82% | -0.96% |
| STEM | 19.94% | 19.48% | -0.46% |
### MMLU-Pro by Subject
| Subject | BF16 | This Model | Difference |
|---------|------|------------|------------|
| Biology | 50.35% | 48.12% | -2.23% |
| Psychology | 44.99% | 41.23% | -3.76% |
| History | 33.60% | 34.12% | +0.52% |
| Health | 35.21% | 34.11% | -1.10% |
| Economics | 36.37% | 33.06% | -3.31% |
| Philosophy | 31.46% | 29.26% | -2.20% |
| Other | 28.35% | 26.08% | -2.27% |
| Computer Science | 26.10% | 21.95% | -4.15% |
| Business | 16.35% | 19.26% | +2.91% |
| Law | 16.89% | 15.99% | -0.90% |
| Math | 14.06% | 14.73% | +0.67% |
| Physics | 15.32% | 15.24% | -0.08% |
| Engineering | 16.00% | 14.96% | -1.04% |
| Chemistry | 14.13% | 14.84% | +0.71% |
## Citation
If you use this model, please cite the original GLM-4.7-Flash:
```bibtex
@misc{glm4flash2025,
title={GLM-4.7-Flash},
author={Zhipu AI},
year={2025},
howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```
## License
This model inherits the Apache 2.0 license from the base model.