# GLM-5.1-NVFP4
Mixed-precision (NVFP4/FP8/BF16) quantized version of zai-org/GLM-5.1.
This checkpoint preserves the GLM-5.1 MoE + MLA + DSA architecture from the BF16 source, applying a layerwise mixed-precision recipe that balances compression with output quality.
## Quantization Strategy
Non-uniform mixed-precision quantization with calibration:
| Precision | Layers |
|---|---|
| FP8 W8A8 | MLA projections (`q_a_proj`, `q_b_proj`, `kv_a_proj_with_mqa`, `kv_b_proj`, `o_proj`); all `down_proj` (dense + expert + shared); DSA indexer |
| NVFP4 W4A4 | MLP `gate_proj`/`up_proj` (256 routed experts + shared expert + dense layers) |
| BF16 | `lm_head`, `embed_tokens`, MoE router gates, norms |
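The recipe in the table can be expressed as a simple name-to-scheme mapping. The sketch below is illustrative only (the function and scheme labels are made up for this example; the authoritative assignment lives in the checkpoint's compressed-tensors config):

```python
# Illustrative mapping from module name to quantization scheme, mirroring
# the table above. Scheme labels here are informal, not config keys.
MLA_PROJECTIONS = (
    "q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj",
)

def precision_for(module_name: str) -> str:
    """Return the quantization scheme for one module, per the recipe."""
    if any(module_name.endswith(p) for p in MLA_PROJECTIONS):
        return "fp8_w8a8"      # MLA attention projections
    if module_name.endswith("down_proj"):
        return "fp8_w8a8"      # all down_proj: dense, expert, shared
    if module_name.endswith(("gate_proj", "up_proj")):
        return "nvfp4_w4a4"    # MLP gate/up in experts and dense layers
    return "bf16"              # lm_head, embeddings, router gates, norms
```

Note that the MoE router gate (e.g. `mlp.gate`) falls through to BF16 because it is not a `gate_proj`.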
Architecture match with the BF16 source:
```
model_type=glm_moe_dsa
78 layers (3 dense + 75 MoE, first_k_dense_replace=3)
n_routed_experts=256, num_experts_per_tok=8, n_shared_experts=1
max_position_embeddings=202752
hidden_size=6144, moe_intermediate_size=2048
vocab_size=154880
```
## Calibration
- 512 self-calibration samples generated from GLM-5.1 via OpenRouter (top-tier provider routing)
- 8 diverse categories: math, code, logic, analysis, creative writing, general knowledge, agentic/tool-calling, Korean
- Reasoning traces included for natural distribution coverage
- Static activation scales computed per-module from calibration data
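Static per-module activation scales of the kind described above can be sketched as a running min/max tracker (a minimal illustration, not the actual pipeline code; the real pipeline hooks torch modules, including MoE expert inputs):

```python
# Minimal sketch of static activation-scale calibration: accumulate
# per-module min/max over calibration batches, then derive one symmetric
# scale per module. Class and method names are illustrative.
class ActRange:
    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, values):
        """Fold one batch of activation values into the running range."""
        for v in values:
            self.lo = min(self.lo, v)
            self.hi = max(self.hi, v)

    def scale(self, qmax: float = 448.0) -> float:
        """Symmetric scale; 448 is the max normal value of FP8 E4M3."""
        return max(abs(self.lo), abs(self.hi)) / qmax
```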
## Usage

### vLLM
```shell
vllm serve mconcat/GLM-5.1-NVFP4 \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code
```
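Once the server is up, it exposes the standard vLLM OpenAI-compatible API. A minimal request sketch (endpoint and port are the vLLM defaults; adjust for your deployment):

```python
# Build a chat-completions request body for the vLLM OpenAI-compatible
# server started above. POST it to
#   http://localhost:8000/v1/chat/completions
import json

def chat_payload(prompt: str) -> str:
    body = {
        "model": "mconcat/GLM-5.1-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return json.dumps(body)
```

For example: `curl -H "Content-Type: application/json" -d "$(python -c '...')" http://localhost:8000/v1/chat/completions`.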
## Compatibility
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.19.0 | Partial | See known issues below |
| SGLang | No | compressed-tensors NVFP4 not supported |
| transformers >= 5.4.0 | Yes | Direct loading with device_map="auto" |
## Known Issues

**vLLM fused MoE limitation:** vLLM's fused MoE kernel requires a uniform quantization scheme across all expert projections (gate/up/down). This checkpoint mixes precisions (NVFP4 for gate/up, FP8 for down), which may fail with `ValueError: All MoE projections need to have same quantization scheme`.
Workarounds:
- Use the GLM-5.1-FP8-Dynamic checkpoint, which applies uniform FP8 to all expert projections
- Wait for vLLM to add mixed-precision MoE support
- Use transformers with `device_map="auto"` for non-fused inference
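The transformers workaround can be sketched as follows (a hedged sketch, not a tested recipe; it assumes enough aggregate GPU memory to hold the sharded model, and slower expert dispatch than fused kernels):

```python
# Sketch: load the checkpoint with transformers, which dispatches experts
# without fused MoE kernels and so avoids the vLLM limitation above.
MODEL_ID = "mconcat/GLM-5.1-NVFP4"

def load_for_nonfused_inference():
    # Import kept local so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        device_map="auto",       # shard layers across all visible GPUs
        torch_dtype="bfloat16",  # non-quantized modules stay in BF16
        trust_remote_code=True,
    )
    return tok, model
```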
## Notes
- This is a 754B MoE model (~40B active per token). Requires multi-GPU setup for inference (8x 80GB+ GPUs recommended).
- GLM-5.1 does not ship MTP weights despite `num_nextn_predict_layers=1` in the config.
- Quantization was performed layer-by-layer using `compressed-tensors` for proper NVFP4 packing (`weight_packed` as uint8, FP4 E2M1 format).
- KV cache: do not use `--kv-cache-dtype fp8_e4m3`; the checkpoint lacks calibrated KV scales.
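The `weight_packed` layout mentioned above stores two 4-bit FP4 (E2M1) codes per uint8. A toy illustration of the pack/unpack step (nibble order here is an assumption for the example; `compressed-tensors` defines the canonical layout):

```python
# Toy sketch of NVFP4 packing: two 4-bit FP4 (E2M1) codes per byte of the
# weight_packed uint8 tensor. Function names and nibble order are illustrative.
def pack_fp4_pair(lo_code: int, hi_code: int) -> int:
    """Pack two 4-bit codes into one byte, low nibble first."""
    assert 0 <= lo_code < 16 and 0 <= hi_code < 16
    return (hi_code << 4) | lo_code

def unpack_fp4_pair(byte: int) -> tuple[int, int]:
    """Recover the (low, high) 4-bit codes from one packed byte."""
    return byte & 0xF, (byte >> 4) & 0xF
```

This halves the storage of a uint8-per-weight layout, which is why packing correctness matters for NVFP4 checkpoints.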
## Blackwell SM120 Patch (RTX PRO 6000 / Workstation GPUs)
If running on Blackwell workstation GPUs (SM 12.0), vLLM 0.19.0 requires patches for FlashMLA sparse attention support. See GLM-5.1-FP8-Dynamic README for patch instructions.
## Quantization Process
- Tool: custom layer-by-layer pipeline with `compressed-tensors` NVFP4 packing
- Hardware: single NVIDIA RTX PRO 6000 Blackwell (96 GB), processed one layer at a time
- Time: ~161 minutes for 78 layers
- Calibration: 256 samples, per-module activation min/max statistics with MoE expert input hooks