Model Description

GLM-5-NVFP4-MTP is an NVFP4-quantized version of zai-org/GLM-5 with Multi-Token Prediction (MTP) weights restored, enabling speculative decoding with vLLM and SGLang.

This is based on lukealonso/GLM-5-NVFP4, a 744B-parameter Mixture-of-Experts language model with 40B active parameters, 256 experts per MoE layer (8 activated per token), and DeepSeek Sparse Attention (DSA).

Quantized directly from the full BF16 checkpoint (zai-org/GLM-5), not the FP8 release, to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using NVIDIA Model Optimizer.
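
As a rough illustration of the format's footprint, the short sketch below (an approximation, not a measured number) computes the effective storage cost per quantized weight: 4 bits for the value plus one 8-bit FP8 scale shared by each block of 16 values. The 700B figure for quantized expert parameters is an assumption for illustration, not taken from the repo.

# Back-of-the-envelope NVFP4 storage cost (illustrative only).
BITS_PER_VALUE = 4   # 4-bit quantized values
SCALE_BITS = 8       # one FP8 scale ...
BLOCK_SIZE = 16      # ... shared per 16-element block

effective_bits = BITS_PER_VALUE + SCALE_BITS / BLOCK_SIZE  # 4.5 bits/weight
print(f"effective bits per quantized weight: {effective_bits}")

# Hypothetical split: if roughly 700B of the 744B total parameters are
# quantized expert weights, they alone take about this many GB
# (ignoring BF16 attention/MTP layers, global scales, and metadata):
quantized_params = 700e9
print(f"~{quantized_params * effective_bits / 8 / 1e9:.0f} GB of quantized expert weights")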

MTP Layer Addition

The original lukealonso/GLM-5-NVFP4 declares "num_nextn_predict_layers": 1 in its config but ships without the MTP layer weights (layer 78). This repo fixes that by extracting the MTP layer directly from the original BF16 model (zai-org/GLM-5).

What was done:

  • Extracted all 791 tensors for model.layers.78.* from the BF16 model (shards 271–274 of 282) and saved them as mtp.safetensors (~19 GB, full BF16 precision); a sketch of this step follows this list
  • Updated model.safetensors.index.json to include the 791 layer 78 → mtp.safetensors mappings
  • Added model.layers.78.* glob patterns to quantization_config.ignore in both config.json and hf_quant_config.json so the NVFP4 dequantizer skips the MTP layer
  • All other weights (layers 0–77, embeddings, lm_head) remain unchanged from the original NVFP4 quantization
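
A minimal sketch of the extraction and index update, assuming the BF16 shards are downloaded locally; the local paths and shard filenames follow standard Hugging Face naming but are written from the description above rather than verified against the repo.

# Sketch: copy every model.layers.78.* tensor from the BF16 shards into a
# single mtp.safetensors file, then point the weight index at it.
import json
from safetensors import safe_open
from safetensors.torch import save_file

bf16_dir = "GLM-5-bf16"  # hypothetical local download of zai-org/GLM-5
shards = [f"model-{i:05d}-of-00282.safetensors" for i in range(271, 275)]

mtp_tensors = {}
for shard in shards:
    with safe_open(f"{bf16_dir}/{shard}", framework="pt") as f:
        for name in f.keys():
            if name.startswith("model.layers.78."):
                mtp_tensors[name] = f.get_tensor(name)

save_file(mtp_tensors, "mtp.safetensors")

# Add the layer-78 -> mtp.safetensors mappings to the weight index.
with open("model.safetensors.index.json") as f:
    index = json.load(f)
index["weight_map"].update({name: "mtp.safetensors" for name in mtp_tensors})
with open("model.safetensors.index.json", "w") as f:
    json.dump(index, f, indent=2)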

The MTP layer contains:

  • Special MTP components: eh_proj, enorm, hnorm, shared_head.norm
  • A complete transformer block: self-attention + MoE MLP (256 experts)

What's quantized

Only the MoE expert MLP layers (gate, up, and down projections) for layers 0–77 are quantized to NVFP4. Attention layers are left in BF16. The MTP layer (layer 78) is entirely in BF16. Since the expert weights constitute the vast majority of model parameters in an MoE architecture, this still yields significant memory savings.

Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on a much larger number of samples than typical to ensure broad expert coverage through natural routing alone.
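
Below is a heavily simplified sketch of what one such calibration pass looks like with Model Optimizer; mtq.quantize with a user-supplied forward loop is the documented pattern, but the NVFP4 config name can vary between ModelOpt versions, and model / calib_loader are placeholders. The point is that the forward loop runs ordinary inference, so each expert is calibrated only on tokens the router actually sends it.

# Sketch of one NVFP4 calibration pass with NVIDIA Model Optimizer.
# `model` is the loaded HF model and `calib_loader` a hypothetical iterable
# of tokenized calibration batches; the config name is version-dependent.
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Plain forward passes: experts are activated by natural top-k routing,
    # never forced, so each expert's amax reflects its real token traffic.
    for batch in calib_loader:
        model(batch["input_ids"].cuda())

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)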

Calibration dataset

Three calibration passes were run:

  1. Coding pass: Agentic coding samples (tool calling, multi-turn code generation, function calling) with English and Chinese system prompts.
  2. Broad pass: Large-scale diverse samples drawn from WildChat and LMSYS-Chat covering real user conversations across a wide range of topics and languages.
  3. Deep pass: Long-context samples (>8K tokens) from coding and diverse sources to exercise deep-sequence expert activation patterns.

The calibration statistics from the three passes were merged via element-wise max across all runs.
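
A sketch of that merge under an assumed layout where each pass dumps its calibration ranges (per-quantizer amax values) to a separate file; the file names and dictionary structure are hypothetical.

# Sketch: combine calibration ranges from the three passes with an
# element-wise max, so merged scales cover the widest range seen anywhere.
import torch

pass_files = ["amax_coding.pt", "amax_broad.pt", "amax_deep.pt"]  # hypothetical

merged = {}
for path in pass_files:
    amax = torch.load(path)  # dict: quantizer name -> amax tensor
    for name, value in amax.items():
        merged[name] = value if name not in merged else torch.maximum(merged[name], value)

torch.save(merged, "amax_merged.pt")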

How to Run

NVFP4 requires Blackwell GPUs (RTX 5090, RTX Pro 6000, B100, B200, etc.). Even quantized, this is a huge model; it was tested on 8x RTX Pro 6000 Blackwell GPUs (96 GB each, 768 GB total).

If you experience NCCL hangs with P2P, make sure you have iommu=pt (and amd_iommu=pt on AMD platforms) in your kernel command line.

SGLang (with MTP speculative decoding)

export NCCL_IB_DISABLE=1
export NCCL_P2P_LEVEL=PHB
export NCCL_ALLOC_P2P_NET_LL_BUFFERS=1
export NCCL_MIN_NCHANNELS=8
export OMP_NUM_THREADS=8
export SAFETENSORS_FAST_GPU=1

python3 -m sglang.launch_server \
  --model festr2/GLM-5-NVFP4-MTP \
  --served-model-name glm-5 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --trust-remote-code \
  --tp 8 \
  --mem-fraction-static 0.95 \
  --max-running-requests 8 \
  --kv-cache-dtype fp8_e4m3 \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --disable-custom-all-reduce \
  --enable-flashinfer-allreduce-fusion \
  --speculative-algorithm mtp \
  --num-speculative-tokens 1 \
  --host 0.0.0.0 \
  --port 8000
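
The server exposes an OpenAI-compatible API on port 8000. A quick smoke test with the openai Python client (the prompt and sampling settings are arbitrary):

# Smoke test against the SGLang server launched above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="glm-5",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Write a one-line Python function that reverses a string."}],
    max_tokens=128,
    temperature=0.6,
)
print(resp.choices[0].message.content)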

vLLM (with MTP speculative decoding)

vllm serve festr2/GLM-5-NVFP4-MTP \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Credits
