---
license: apache-2.0
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
tags:
- moe
- mxfp4
- quantized
- vllm
- glm
- 30b
library_name: transformers
pipeline_tag: text-generation
---
# Note: If you have a multi-GPU SM120 Blackwell system (RTX 50/Pro), try my vLLM fork to resolve P2P / TP=2 issues (PR into upstream pending).
# Note: If you are running this MXFP4 model on SM120 GPUs, you will also need my fork until the PR is merged upstream; be aware that MXFP4 is significantly slower than NVFP4 on these GPUs.

https://github.com/Gadflyii/vllm/tree/main

# GLM-4.7-Flash MXFP4

This is an **MXFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model.

## Quantization Strategy

This model uses the **MXFP4 (Microscaling FP4)** format with the Marlin backend for inference. Custom calibrated quantization (128 samples, max sequence length 2048) was applied to the MoE experts.

| Component | Precision | Rationale |
|-----------|-----------|-----------|
| MLP Experts (gate_up, down) | MXFP4 (E2M1) | 64 routed experts, 4 active per token |
| **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are sensitive to quantization |
| Dense MLP | BF16 | The first layer uses a dense MLP |
| Norms, Gates, Embeddings | BF16 | Standard practice |

### MXFP4 vs NVFP4

| Property | MXFP4 | NVFP4 |
|----------|-------|-------|
| Weight Format | E2M1 (4-bit) | E2M1 (4-bit) |
| Scale Format | E8M0 (power-of-2) | FP8 (E4M3) |
| Block Size | 32 | 16 |
| Backend | Marlin | FlashInfer/Cutlass |
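
The block sizes above imply slightly different storage overheads, since each block carries one 8-bit scale alongside its 4-bit weight codes. A back-of-envelope sketch (NVFP4 additionally carries a per-tensor scale, omitted here):

```python
# Effective storage cost per weight: the 4-bit code plus one 8-bit
# block scale amortized across the block (illustrative arithmetic only).
def bits_per_weight(code_bits: int, scale_bits: int, block_size: int) -> float:
    return code_bits + scale_bits / block_size

mxfp4 = bits_per_weight(4, 8, 32)  # E8M0 scale shared by 32 weights
nvfp4 = bits_per_weight(4, 8, 16)  # FP8 (E4M3) scale shared by 16 weights

print(f"MXFP4: {mxfp4} bits/weight")  # 4.25
print(f"NVFP4: {nvfp4} bits/weight")  # 4.5
```

The larger MXFP4 block halves the scale overhead relative to NVFP4, at the cost of a coarser (power-of-2) scale granularity.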
## Performance

| Metric | BF16 | **This Model** |
|--------|------|----------------|
| MMLU-Pro | 24.83% | **25.86%** |
| Size | 62.4 GB | **20.8 GB** |
| Compression | 1x | **3.0x** |
| Accuracy Δ | - | **+1.03%** |
| Throughput | 92.4 q/s | **138.7 q/s** |

## Usage

### Requirements

- **vLLM**: 0.14.0+ (for MXFP4 Marlin backend support)
- **transformers**: 5.0.0+ (for the `glm4_moe_lite` architecture)
- **GPU**: NVIDIA GPU with compute capability 8.0+ (Ampere/Hopper/Blackwell)

### Installation

```bash
pip install "vllm>=0.14.0"
pip install git+https://github.com/huggingface/transformers.git
```

### Inference with vLLM

```python
import os
os.environ["VLLM_MXFP4_USE_MARLIN"] = "1"

from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-MXFP4",
    tensor_parallel_size=1,
    max_model_len=65536,  # can go up to 202K with sufficient VRAM
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
)

# Note: do NOT use repetition_penalty > 1.05; it causes degradation at long outputs
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=2048)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```

### Serving with vLLM

```bash
VLLM_MXFP4_USE_MARLIN=1 vllm serve GadflyII/GLM-4.7-Flash-MXFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 65536 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90
```

### Chat Completions API

```python
import requests

payload = {
    "model": "GadflyII/GLM-4.7-Flash-MXFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,
    "temperature": 0.7,
    # Disable thinking mode for direct responses:
    "chat_template_kwargs": {"enable_thinking": False},
    # Or enable thinking for reasoning tasks:
    # "chat_template_kwargs": {"enable_thinking": True},
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```

## Important Usage Notes

### Sampling Parameters

| Parameter | Recommended | Avoid | Reason |
|-----------|-------------|-------|--------|
| `temperature` | 0.3-0.7 | - | Standard range |
| `top_p` | 0.9-0.95 | - | Standard range |
| `repetition_penalty` | None or ≤1.05 | >1.05 | Higher values cause incoherent output at long generation lengths |
| `max_tokens` | Up to 10,000+ | - | Model handles long generation well |
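
As an illustration, the recommendations above could be checked with a small helper before building `SamplingParams` (this function is hypothetical, not part of vLLM):

```python
# Hypothetical helper: flag sampling settings that fall outside
# the recommended ranges from the table above.
def check_sampling(temperature, top_p, repetition_penalty=None):
    warnings = []
    if not 0.3 <= temperature <= 0.7:
        warnings.append("temperature outside recommended 0.3-0.7")
    if not 0.9 <= top_p <= 0.95:
        warnings.append("top_p outside recommended 0.9-0.95")
    if repetition_penalty is not None and repetition_penalty > 1.05:
        warnings.append("repetition_penalty > 1.05 can degrade long outputs")
    return warnings

print(check_sampling(0.7, 0.9))       # [] -> settings are fine
print(check_sampling(1.0, 0.9, 1.2))  # two warnings
```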
### Thinking Mode

This model supports a "thinking" mode in which it shows its reasoning process:

- **`enable_thinking: True`** - the model outputs its reasoning before the answer (good for math, coding, complex reasoning)
- **`enable_thinking: False`** - the model outputs the answer directly (good for chat, simple Q&A)

The model thinks in English when given English prompts.

## Model Details

- **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Architecture**: `Glm4MoeLiteForCausalLM`
- **Parameters**: 30B total, 3B active per token (30B-A3B)
- **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert
- **Layers**: 47
- **Context Length**: 202,752 tokens (max)
- **Languages**: English, Chinese

## Quantization Details

- **Format**: MXFP4 (Microscaling FP4)
- **Weight Format**: E2M1 (4-bit floating point, range ±6.0)
- **Scale Format**: E8M0 (8-bit power-of-2 scales)
- **Block Size**: 32
- **Calibration**: 128 samples from the neuralmagic/calibration dataset
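
The scheme above can be illustrated with a toy NumPy roundtrip for a single 32-value block. This is a simplified sketch (nearest-value rounding on decoded floats), not the packed layout or rounding mode used by the actual Marlin kernels:

```python
import numpy as np

# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_mxfp4(block):
    """Quantize one block to an E8M0 (power-of-2) scale plus E2M1 values."""
    amax = float(np.abs(block).max())
    # Pick the power-of-2 scale that fits the largest magnitude under 6.0
    exp = int(np.ceil(np.log2(amax / 6.0))) if amax > 0 else 0
    scale = 2.0 ** exp
    scaled = block / scale
    # Round each magnitude to the nearest representable E2M1 value
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]), axis=1)
    codes = np.sign(scaled) * E2M1_GRID[idx]
    return codes, scale

def dequantize_block_mxfp4(codes, scale):
    return codes * scale

block = np.linspace(-1.0, 1.0, 32)  # toy weights for one 32-value block
codes, scale = quantize_block_mxfp4(block)
recon = dequantize_block_mxfp4(codes, scale)
print(scale)                                   # power-of-2 block scale
print(float(np.max(np.abs(recon - block))))    # worst-case rounding error
```

Because the scale is constrained to a power of two, it needs only the 8-bit E8M0 exponent, which is what keeps MXFP4's per-block overhead at a single byte.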
## Evaluation

### MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total | Throughput |
|-------|----------|---------|-------|------------|
| BF16 (baseline) | 24.83% | 2988 | 12032 | 92.4 q/s |
| **MXFP4 (this model)** | **25.86%** | **3112** | 12032 | **138.7 q/s** |
| Difference | **+1.03%** | +124 | - | **+50%** |

### MMLU-Pro by Category

| Category | BF16 | MXFP4 | Δ |
|----------|------|-------|---|
| Social Sciences | 32.70% | **34.68%** | +1.98% |
| Other | 31.57% | **32.84%** | +1.27% |
| Humanities | 23.78% | 23.78% | 0.00% |
| STEM | 19.94% | **20.86%** | +0.92% |

### MMLU-Pro by Subject (All 14 Subjects)

| Subject | BF16 | MXFP4 | Δ | Questions |
|---------|------|-------|---|-----------|
| Biology | 50.35% | **52.16%** | +1.81% | 717 |
| Psychology | 44.99% | **47.74%** | +2.75% | 798 |
| Economics | 36.37% | **38.27%** | +1.90% | 844 |
| Health | 35.21% | **36.31%** | +1.10% | 818 |
| History | **33.60%** | 32.28% | -1.32% | 381 |
| Philosophy | 31.46% | **31.86%** | +0.40% | 499 |
| Other | 28.35% | **29.76%** | +1.41% | 924 |
| Computer Science | **26.10%** | 25.85% | -0.25% | 410 |
| Business | 16.35% | **17.62%** | +1.27% | 789 |
| Law | 16.89% | **17.17%** | +0.28% | 1101 |
| Physics | 15.32% | **16.17%** | +0.85% | 1299 |
| Engineering | **16.00%** | 15.58% | -0.42% | 969 |
| Math | 14.06% | **15.54%** | +1.48% | 1351 |
| Chemistry | 14.13% | **15.46%** | +1.33% | 1132 |

## Citation

If you use this model, please cite the original GLM-4.7-Flash:

```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```

## License

This model inherits the Apache 2.0 license from the base model.