Model Description
GLM-5-NVFP4-MTP is an NVFP4-quantized version of zai-org/GLM-5 with Multi-Token Prediction (MTP) weights restored, enabling speculative decoding with vLLM and SGLang.
This is based on lukealonso/GLM-5-NVFP4, a 744B-parameter Mixture-of-Experts language model with 40B active parameters, 256 experts per MoE layer (8 activated per token), and DeepSeek Sparse Attention (DSA).
Quantized directly from the full BF16 checkpoint (zai-org/GLM-5), not the FP8 release, to NVFP4 (4-bit weights with a shared FP8 scale per 16-element block) using NVIDIA Model Optimizer.
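The blockwise scheme can be illustrated with a small sketch. This is not ModelOptimizer's actual kernel, and it omits the FP8 (E4M3) encoding of the scale itself; it only shows the core idea of one scale per 16-element block and values snapped to the FP4 E2M1 grid:

```python
# Illustrative sketch of NVFP4-style blockwise quantization (NOT the real
# NVIDIA Model Optimizer implementation). FP4 E2M1 magnitudes form the grid
# {0, 0.5, 1, 1.5, 2, 3, 4, 6}; each 16-element block shares one scale.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16

def quantize_block(block):
    """Return (scale, codes) for one block of up to 16 floats."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0   # 6.0 = largest FP4 magnitude
    codes = []
    for x in block:
        mag = abs(x) / scale
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        codes.append(q if x >= 0 else -q)
    return scale, codes

def dequantize_block(scale, codes):
    return [c * scale for c in codes]

weights = [0.01 * i for i in range(-8, 8)]            # one 16-element block
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
err = max(abs(a - b) for a, b in zip(weights, restored))
```

In the real format the per-block scale is itself stored as FP8, which adds a second (much smaller) rounding step not modeled here.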
MTP Layer Addition
The original lukealonso/GLM-5-NVFP4 declares `"num_nextn_predict_layers": 1` in its config but ships without the MTP layer weights (layer 78). This repo fixes that by extracting the MTP layer directly from the original BF16 model (zai-org/GLM-5).
What was done:
- Extracted all 791 tensors for `model.layers.78.*` from the BF16 model (shards 271–274 of 282) and saved them as `mtp.safetensors` (~19 GB, full BF16 precision)
- Updated `model.safetensors.index.json` to include the 791 layer-78 → `mtp.safetensors` mappings
- Added `model.layers.78.*` glob patterns to `quantization_config.ignore` in both `config.json` and `hf_quant_config.json` so the NVFP4 dequantizer skips the MTP layer
- All other weights (layers 0–77, embeddings, `lm_head`) remain unchanged from the original NVFP4 quantization
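The index and config surgery described above amounts to merging new `weight_map` entries and extending an ignore list. A minimal sketch (tensor names and shard filenames here are illustrative, not the real 791-entry set):

```python
import json

# Sketch of the index-file update: map every layer-78 tensor to the new
# BF16 shard. The real repo has 791 such entries; names are abbreviated.
index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.layers.77.input_layernorm.weight": "model-00270.safetensors",
    },
}

mtp_tensors = [
    "model.layers.78.eh_proj.weight",
    "model.layers.78.enorm.weight",
    "model.layers.78.hnorm.weight",
    "model.layers.78.shared_head.norm.weight",
]
for name in mtp_tensors:
    index["weight_map"][name] = "mtp.safetensors"

# The dequantizer must skip the BF16 MTP layer, so its glob pattern is
# appended to the ignore list mirrored in config.json / hf_quant_config.json.
quant_config = {"quantization_config": {"ignore": []}}
quant_config["quantization_config"]["ignore"].append("model.layers.78.*")

blob = json.dumps(index, indent=2)  # what gets written back to disk
```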
The MTP layer contains:
- Special MTP components: `eh_proj`, `enorm`, `hnorm`, `shared_head.norm`
- A complete transformer block: self-attention + MoE MLP (256 experts)
What's quantized
Only the MoE expert MLP layers (gate, up, and down projections) for layers 0–77 are quantized to NVFP4. Attention layers are left in BF16. The MTP layer (layer 78) is entirely in BF16. Since the expert weights constitute the vast majority of model parameters in an MoE architecture, this still yields significant memory savings.
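A back-of-envelope estimate makes the savings concrete. The expert-parameter fraction below is an assumption for illustration (the source only says experts are "the vast majority"); the byte counts follow from BF16 (2 bytes/param) and NVFP4 (4-bit value plus one FP8 scale per 16 elements):

```python
# Rough size estimate. ASSUMPTION: ~95% of the 744B parameters live in the
# quantized expert MLPs; everything else stays BF16. Not an official figure.
TOTAL_PARAMS = 744e9
expert_frac = 0.95                       # assumed expert share of parameters
bf16_bytes = 2.0                         # bytes per BF16 parameter
nvfp4_bytes = 0.5 + 1.0 / 16             # 4-bit value + one FP8 scale per 16

baseline_gb = TOTAL_PARAMS * bf16_bytes / 1e9
quantized_gb = (TOTAL_PARAMS * expert_frac * nvfp4_bytes
                + TOTAL_PARAMS * (1 - expert_frac) * bf16_bytes) / 1e9
```

Under these assumptions the weights shrink from roughly 1.5 TB to under 500 GB, which is consistent with the model fitting on the 768 GB test rig mentioned below.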
Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on a much larger number of samples than is typical, ensuring broad expert coverage through natural routing alone.
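Why more samples are needed under natural routing can be seen with a toy simulation (the router below is a pseudo-random stand-in, not GLM-5's actual gating): each token touches only 8 of 256 experts, so rarely-routed experts are only calibrated once enough tokens have flowed through.

```python
import random

# Toy coverage experiment: each token activates TOP_K of NUM_EXPERTS experts.
# With few tokens, many experts receive no calibration data at all.
NUM_EXPERTS, TOP_K = 256, 8

def route(token_id):
    """Stand-in router: deterministic pseudo-random top-8 pick per token."""
    rng = random.Random(token_id)
    return rng.sample(range(NUM_EXPERTS), TOP_K)

def coverage(num_tokens):
    """How many distinct experts saw at least one token."""
    seen = set()
    for t in range(num_tokens):
        seen.update(route(t))
    return len(seen)

few, many = coverage(20), coverage(2000)   # small vs large calibration set
```

With 20 tokens at most 160 expert slots are filled, so full coverage is impossible; with thousands of tokens every expert gets hit, which is the effect the oversized calibration set relies on.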
Calibration dataset
Three calibration passes were run:
- Coding pass – agentic coding samples (tool calling, multi-turn code generation, function calling) with English and Chinese system prompts.
- Broad pass – large-scale diverse samples drawn from WildChat and LMSYS-Chat covering real user conversations across a wide range of topics and languages.
- Deep pass – long-context samples (>8K tokens) from coding and diverse sources to exercise deep-sequence expert activation patterns.
The calibration statistics from all three passes were merged via element-wise max.
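The merge step can be sketched as follows (tensor names and per-tensor amax values are illustrative): taking the element-wise max ensures the final scales cover the largest activation magnitudes seen in any of the runs.

```python
# Sketch of merging per-tensor amax calibration stats from multiple passes.
# Element-wise max keeps the worst-case magnitude observed across all runs.
coding = {"experts.0.gate_proj": [0.9, 1.2], "experts.1.gate_proj": [0.4, 0.8]}
broad  = {"experts.0.gate_proj": [1.1, 0.7], "experts.1.gate_proj": [0.5, 0.6]}
deep   = {"experts.0.gate_proj": [0.8, 1.0], "experts.1.gate_proj": [0.3, 0.9]}

def merge_amax(*runs):
    merged = {}
    for run in runs:
        for name, amax in run.items():
            if name not in merged:
                merged[name] = list(amax)
            else:
                merged[name] = [max(a, b) for a, b in zip(merged[name], amax)]
    return merged

merged = merge_amax(coding, broad, deep)
```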
How to Run
NVFP4 requires Blackwell GPUs (RTX 5090, RTX Pro 6000, B100, B200, etc.). Even quantized, this is a huge model; it was tested on 8x RTX Pro 6000 Blackwell (96 GB each, 768 GB total).
If you experience NCCL hangs with P2P, make sure you have iommu=pt (and amd_iommu=pt on AMD platforms) in your kernel command line.
SGLang (with MTP speculative decoding)
export NCCL_IB_DISABLE=1
export NCCL_P2P_LEVEL=PHB
export NCCL_ALLOC_P2P_NET_LL_BUFFERS=1
export NCCL_MIN_NCHANNELS=8
export OMP_NUM_THREADS=8
export SAFETENSORS_FAST_GPU=1
python3 -m sglang.launch_server \
--model festr2/GLM-5-NVFP4-MTP \
--served-model-name glm-5 \
--reasoning-parser glm45 \
--tool-call-parser glm47 \
--trust-remote-code \
--tp 8 \
--mem-fraction-static 0.95 \
--max-running-requests 8 \
--kv-cache-dtype fp8_e4m3 \
--quantization modelopt_fp4 \
--attention-backend flashinfer \
--moe-runner-backend flashinfer_cutlass \
--disable-custom-all-reduce \
--enable-flashinfer-allreduce-fusion \
--speculative-algorithm mtp \
--num-speculative-tokens 1 \
--host 0.0.0.0 \
--port 8000
vLLM (with MTP speculative decoding)
vllm serve festr2/GLM-5-NVFP4-MTP \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
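Both servers expose an OpenAI-compatible endpoint. A minimal client sketch, assuming the SGLang launch above (port 8000, served model name `glm-5`; for vLLM the model name defaults to the repo id, so adjust accordingly):

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request against the local server.
# Host, port, and model name depend on how you launched the server; these
# values match the SGLang command above.
payload = {
    "model": "glm-5",
    "messages": [{"role": "user", "content": "Explain speculative decoding in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```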
Credits
- Original model: zai-org/GLM-5
- NVFP4 quantization: lukealonso/GLM-5-NVFP4
- MTP layer restoration: extracted from the original BF16 weights