> **Note:** If you have a multi-GPU SM120 Blackwell system (RTX 50 / RTX Pro), try my vLLM fork to resolve P2P / TP=2 issues (PR into upstream pending): https://github.com/Gadflyii/vllm/tree/main
# GLM-4.7-Flash MXFP4
This is an MXFP4 quantization of zai-org/GLM-4.7-Flash, a 30B-A3B (30B total, 3B active) Mixture-of-Experts model.
## Quantization Strategy
This model uses the MXFP4 (Microscaling FP4) format with the Marlin backend for inference. Calibrated quantization (128 samples, maximum sequence length 2048) was applied to the MoE experts.
| Component | Precision | Rationale |
|---|---|---|
| MLP Experts (gate_up, down) | MXFP4 (E2M1) | 64 routed experts, 4 active per token |
| Attention (MLA) | BF16 | Low-rank compressed Q/KV projections are sensitive |
| Dense MLP | BF16 | The first layer uses a dense (non-MoE) MLP |
| Norms, Gates, Embeddings | BF16 | Standard practice |
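To make the format concrete, below is a minimal NumPy sketch of MXFP4 block quantization: 4-bit E2M1 values sharing one power-of-two (E8M0) scale per 32-weight block. This is an illustration of the numerics, not the packed layout the Marlin kernel actually uses, and the exponent-rounding choice is one of several in use:

```python
import numpy as np

# Magnitudes representable by E2M1 (sign is a separate bit); max is 6.0
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(w: np.ndarray):
    """Quantize one 32-element block to MXFP4 and return (values, scale)."""
    assert w.size == 32
    amax = np.abs(w).max()
    # E8M0 shared scale: a pure power of two; this exponent choice keeps the
    # largest element inside E2M1's range (real kernels may round differently)
    scale = 1.0 if amax == 0 else 2.0 ** np.ceil(np.log2(amax / 6.0))
    scaled = w / scale
    # round each magnitude to the nearest E2M1 grid point
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=32)
q, s = quantize_mxfp4_block(w)
print("max abs error:", np.abs(w - q * s).max())  # dequant is q * s
```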
## MXFP4 vs NVFP4
| Property | MXFP4 | NVFP4 |
|---|---|---|
| Weight Format | E2M1 (4-bit) | E2M1 (4-bit) |
| Scale Format | E8M0 (power-of-2) | FP8 (E4M3) |
| Block Size | 32 | 16 |
| Backend | Marlin | FlashInfer/Cutlass |
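The practical difference is scale granularity. A hedged sketch (the values are illustrative; the NVFP4 path omits the E4M3 cast and the second-level per-tensor FP32 scale that real kernels apply):

```python
import numpy as np

amax = 0.0137  # example per-block max magnitude

# Ideal scale maps amax onto E2M1's max representable value (6.0)
ideal = amax / 6.0

# MXFP4: the E8M0 scale must be a pure power of two, so it can land up to
# ~2x away from the ideal scale (same exponent choice as the sketch above)
mx_scale = 2.0 ** np.ceil(np.log2(ideal))

# NVFP4: the FP8 (E4M3) scale keeps 3 mantissa bits and sits much closer
# to ideal (this sketch skips the E4M3 cast itself)
nv_scale = ideal

print(f"ideal={ideal:.6f}  mxfp4={mx_scale:.6f}  nvfp4~={nv_scale:.6f}")
```

NVFP4's smaller block size (16 vs 32) further reduces the dynamic range each scale must cover.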
## Performance
| Metric | BF16 | This Model |
|---|---|---|
| MMLU-Pro | 24.83% | 25.86% |
| Size | 62.4 GB | 20.8 GB |
| Compression | 1x | 3.0x |
| Accuracy Δ | - | +1.03% |
| Throughput | 92.4 q/s | 138.7 q/s |
## Usage
### Requirements

- vLLM: 0.14.0+ (for MXFP4 Marlin backend support)
- transformers: 5.0.0+ (for the `glm4_moe_lite` architecture)
- GPU: NVIDIA GPU with compute capability 8.0+ (Ampere/Hopper/Blackwell)
### Installation

```bash
pip install "vllm>=0.14.0"
pip install git+https://github.com/huggingface/transformers.git
```
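A quick sanity check that the environment picked up compatible versions (the transformers build from `main` reports a dev version string):

```python
# Confirm the installed versions meet the requirements above
import transformers
import vllm

print("vLLM:", vllm.__version__)                  # expect >= 0.14.0
print("transformers:", transformers.__version__)  # expect >= 5.0.0 (dev build)
```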
### Inference with vLLM

```python
import os
os.environ["VLLM_MXFP4_USE_MARLIN"] = "1"

from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-MXFP4",
    tensor_parallel_size=1,
    max_model_len=65536,  # can go up to 202K with sufficient VRAM
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
)

# Note: do NOT use repetition_penalty > 1.05; it causes degradation at long outputs
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=2048)

outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```
### Serving with vLLM

```bash
VLLM_MXFP4_USE_MARLIN=1 vllm serve GadflyII/GLM-4.7-Flash-MXFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 65536 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90
```
### Chat Completions API

```python
import requests

payload = {
    "model": "GadflyII/GLM-4.7-Flash-MXFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,
    "temperature": 0.7,
    # Disable thinking mode for direct responses:
    "chat_template_kwargs": {"enable_thinking": False},
    # Or enable thinking for reasoning tasks:
    # "chat_template_kwargs": {"enable_thinking": True},
}

response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```
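Since the vLLM server is OpenAI-compatible, the official `openai` client works as well; `chat_template_kwargs` passes through `extra_body`. A sketch assuming the server above is running on localhost:8000:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-MXFP4",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024,
    temperature=0.7,
    # vLLM forwards extra_body fields to the chat template
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```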
## Important Usage Notes

### Sampling Parameters

| Parameter | Recommended | Avoid | Reason |
|---|---|---|---|
| `temperature` | 0.3-0.7 | - | Standard range |
| `top_p` | 0.9-0.95 | - | Standard range |
| `repetition_penalty` | None or ≤1.05 | >1.05 | Higher values cause word salad at long outputs |
| `max_tokens` | Up to 10,000+ | - | Model handles long generation well |
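Translated into `SamplingParams`, a configuration inside the recommended ranges looks like:

```python
from vllm import SamplingParams

# repetition_penalty is deliberately left at its default (1.0); see the table above
params = SamplingParams(temperature=0.5, top_p=0.9, max_tokens=8192)
```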
### Thinking Mode

This model supports a "thinking" mode in which it shows its reasoning process:

- `enable_thinking: True` - the model outputs its reasoning process before the answer (good for math, coding, and complex reasoning)
- `enable_thinking: False` - the model outputs the answer directly (good for chat and simple Q&A)

The model thinks in English when given English prompts.
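When thinking mode is enabled, the reasoning and the final answer arrive in one completion. Assuming the chat template wraps reasoning in `<think>...</think>` tags (as other GLM chat models do; verify against this model's template), a minimal splitter:

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split a completion into (reasoning, answer).

    Assumes reasoning is wrapped in <think>...</think> tags; check this
    model's chat template before relying on the delimiter.
    """
    if "</think>" in text:
        reasoning, _, answer = text.partition("</think>")
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()
```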
## Model Details
- Base Model: zai-org/GLM-4.7-Flash
- Architecture: `Glm4MoeLiteForCausalLM`
- Parameters: 30B total, 3B active per token (30B-A3B)
- MoE Configuration: 64 routed experts, 4 active, 1 shared expert
- Layers: 47
- Context Length: 202,752 tokens (max)
- Languages: English, Chinese
## Quantization Details
- Format: MXFP4 (Microscaling FP4)
- Weight Format: E2M1 (4-bit floating point, range ±6.0)
- Scale Format: E8M0 (8-bit power-of-2 scales)
- Block Size: 32
- Calibration: 128 samples from neuralmagic/calibration dataset
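These numbers line up with the reported 20.8 GB checkpoint. A back-of-the-envelope check (the 27B/3B split between quantized experts and BF16 components is an assumed illustration, not an exact parameter count):

```python
# MXFP4: 4-bit weights plus one 8-bit E8M0 scale per 32-weight block
mxfp4_bytes = (4 + 8 / 32) / 8        # = 0.53125 bytes per weight
bf16_bytes = 2.0

expert_params = 27e9   # assumed: routed-expert weights quantized to MXFP4
dense_params = 3e9     # assumed: attention, norms, dense MLP, embeddings in BF16

total_gb = (expert_params * mxfp4_bytes + dense_params * bf16_bytes) / 1e9
print(f"~{total_gb:.1f} GB vs 20.8 GB reported")  # ~20.3 GB
```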
## Evaluation
### MMLU-Pro Overall Results
| Model | Accuracy | Correct | Total | Throughput |
|---|---|---|---|---|
| BF16 (baseline) | 24.83% | 2988 | 12032 | 92.4 q/s |
| MXFP4 (this model) | 25.86% | 3112 | 12032 | 138.7 q/s |
| Difference | +1.03% | +124 | - | +50% |
### MMLU-Pro by Category
| Category | BF16 | MXFP4 | Δ |
|---|---|---|---|
| Social Sciences | 32.70% | 34.68% | +1.98% |
| Other | 31.57% | 32.84% | +1.27% |
| Humanities | 23.78% | 23.78% | 0.00% |
| STEM | 19.94% | 20.86% | +0.92% |
### MMLU-Pro by Subject (All 14 Subjects)
| Subject | BF16 | MXFP4 | Δ | Questions |
|---|---|---|---|---|
| Biology | 50.35% | 52.16% | +1.81% | 717 |
| Psychology | 44.99% | 47.74% | +2.75% | 798 |
| Economics | 36.37% | 38.27% | +1.90% | 844 |
| Health | 35.21% | 36.31% | +1.10% | 818 |
| History | 33.60% | 32.28% | -1.32% | 381 |
| Philosophy | 31.46% | 31.86% | +0.40% | 499 |
| Other | 28.35% | 29.76% | +1.41% | 924 |
| Computer Science | 26.10% | 25.85% | -0.25% | 410 |
| Business | 16.35% | 17.62% | +1.27% | 789 |
| Law | 16.89% | 17.17% | +0.28% | 1101 |
| Physics | 15.32% | 16.17% | +0.85% | 1299 |
| Engineering | 16.00% | 15.58% | -0.42% | 969 |
| Math | 14.06% | 15.54% | +1.48% | 1351 |
| Chemistry | 14.13% | 15.46% | +1.33% | 1132 |
## Citation

If you use this model, please cite the original GLM-4.7-Flash:

```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```
## License
This model inherits the Apache 2.0 license from the base model.