# GLM-4.7-Flash-MTP-NVFP4 (Mixed Precision with MTP in BF16)
This is a mixed-precision NVFP4 quantization of zai-org/GLM-4.7-Flash, a 30B-A3B (30B total parameters, 3B active) Mixture-of-Experts model. This version preserves the MTP (Multi-Token Prediction) layers in BF16 for speculative-decoding compatibility.
## What's Different from GLM-4.7-Flash-NVFP4?

| Feature | GLM-4.7-Flash-NVFP4 | This Model |
|---|---|---|
| MTP Layers | NVFP4 | BF16 |
| Calibration Samples | 128 | 512 |
| Calibration Seq Length | 2048 | 4096 |
| MMLU-Pro Accuracy | 23.56% | 23.91% |
## Quantization Strategy

This model uses mixed precision to preserve accuracy and MTP functionality:

| Component | Precision | Rationale |
|---|---|---|
| MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token |
| Dense MLP | FP4 (E2M1) | First-layer dense MLP |
| Attention (MLA) | BF16 | Low-rank compressed Q/KV projections are sensitive to quantization |
| MTP Layers | BF16 | `eh_proj`, `shared_head.head` kept intact for speculative decoding |
| Norms, Gates, Embeddings | BF16 | Standard practice |
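To make the FP4 (E2M1) column concrete, here is a minimal, hypothetical sketch of block-scaled E2M1 quantization, not the actual NVFP4 kernel: real NVFP4 stores an FP8 (E4M3) scale per 16-element block, while this toy version keeps the scale as a plain float.

```python
# Toy sketch of NVFP4-style quantization (illustrative, not the real kernel).
import numpy as np

# The 8 non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_block(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one 16-element block: pick a scale so the max |value| maps
    to 6.0 (the largest E2M1 magnitude), then snap each scaled value to
    the nearest grid point, keeping its sign."""
    amax = np.abs(block).max()
    scale = amax / 6.0 if amax > 0 else 1.0
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx], scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes * scale

rng = np.random.default_rng(0)
w = rng.normal(size=16)
codes, scale = quantize_nvfp4_block(w)
max_err = np.abs(dequantize(codes, scale) - w).max()
```

The widening gaps in the grid (e.g. 4.0 to 6.0) are why outlier-heavy tensors such as the MLA projections stay in BF16 here, while the expert MLP weights tolerate the coarse grid well.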
## Performance

| Metric | BF16 | NVFP4 | This Model |
|---|---|---|---|
| MMLU-Pro | 24.83% | 23.56% | 23.91% |
| Size | 62.4 GB | 20.4 GB | 20.9 GB |
| Compression | 1x | 3.1x | 3.0x |
| Accuracy Loss | – | -1.27% | -0.92% |
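The compression and accuracy-loss rows follow directly from the size and MMLU-Pro rows; a quick sanity check of the arithmetic:

```python
# Recompute the derived rows of the table above from its raw values.
bf16_gb, nvfp4_gb, this_gb = 62.4, 20.4, 20.9
ratio_nvfp4 = bf16_gb / nvfp4_gb   # compression vs. BF16
ratio_this = bf16_gb / this_gb

bf16_acc, nvfp4_acc, this_acc = 24.83, 23.56, 23.91
loss_nvfp4 = bf16_acc - nvfp4_acc  # absolute accuracy drop, percentage points
loss_this = bf16_acc - this_acc
```

So keeping MTP and attention in BF16 costs only ~0.5 GB over the fully quantized variant while recovering 0.35 points of MMLU-Pro.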
## MTP Acceptance Rate

| Model | Acceptance Rate | Mean Accepted Length |
|---|---|---|
| BF16 (baseline) | 60% | 1.60 |
| This Model | 63% | 1.63 |
MTP quality is preserved (actually slightly improved) after quantization.
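The two columns are consistent with a single-token drafter: each step always yields the target model's own token plus the drafted token when accepted, so mean accepted length is 1 + acceptance rate. A small sketch, using the common simplifying assumption of independent per-token acceptance for deeper drafts:

```python
# Expected accepted length per speculative step, assuming each of the
# k draft tokens is accepted independently with probability p:
# E[length] = 1 + p + p^2 + ... + p^k (reduces to 1 + p for k = 1).
def mean_accepted_length(p: float, draft_tokens: int = 1) -> float:
    return sum(p ** i for i in range(draft_tokens + 1))
```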
## MTP Performance Note

MTP speculative decoding currently adds overhead rather than speedup because vLLM lacks torch.compile support for the MTP drafter model. For best throughput, run without MTP enabled until this is resolved upstream.
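As a usage sketch (not part of the model card's tested setup): serving the checkpoint with vLLM and simply omitting any speculative-decoding configuration runs it without MTP. The model id and flags below are illustrative placeholders; check your vLLM version's documentation.

```shell
# Illustrative only: serve this checkpoint with vLLM, MTP disabled
# (no speculative-decoding config passed). Model id is a placeholder.
vllm serve <your-org>/GLM-4.7-Flash-MTP-NVFP4 --max-model-len 4096
```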