GLM-5.1-NVFP4

Mixed-precision (NVFP4/FP8/BF16) quantized version of zai-org/GLM-5.1.

This checkpoint preserves the GLM-5.1 MoE + MLA + DSA architecture from the BF16 source, applying a layerwise mixed-precision recipe that balances compression with output quality.

Quantization Strategy

Non-uniform mixed-precision quantization with calibration:

Precision    Layers
FP8 W8A8     MLA projections (q_a_proj, q_b_proj, kv_a_proj_with_mqa, kv_b_proj, o_proj); all down_proj (dense + expert + shared); DSA indexer
NVFP4 W4A4   MLP gate_proj/up_proj (256 routed experts + shared expert + dense layers)
BF16         lm_head, embed_tokens, MoE router gates, norms
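The precision assignment above is purely name-based. A minimal sketch of how such a layerwise recipe could be resolved per module — the patterns, scheme labels, and resolver function are illustrative and do not reflect the actual compressed-tensors config format:

```python
import re

# Hypothetical layerwise recipe mirroring the table above: first matching
# module-name pattern wins. Scheme names are labels, not real config keys.
RECIPE = [
    (r"(q_a_proj|q_b_proj|kv_a_proj_with_mqa|kv_b_proj|o_proj)$", "FP8-W8A8"),
    (r"down_proj$", "FP8-W8A8"),
    (r"indexer", "FP8-W8A8"),  # DSA indexer
    (r"(gate_proj|up_proj)$", "NVFP4-W4A4"),
]

def resolve_scheme(module_name: str) -> str:
    """Return the first matching scheme; everything else stays BF16."""
    for pattern, scheme in RECIPE:
        if re.search(pattern, module_name):
            return scheme
    return "BF16"  # lm_head, embed_tokens, router gates, norms
```

Pattern order matters: down_proj must be claimed by the FP8 rule before any broader MLP pattern could see it, which is exactly the mixed gate/up-vs-down split that trips up the vLLM fused MoE kernel noted under Known Issues.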

The architecture matches the BF16 source:

  • model_type=glm_moe_dsa

  • 78 layers (3 dense + 75 MoE, first_k_dense_replace=3)

  • n_routed_experts=256, num_experts_per_tok=8, n_shared_experts=1

  • max_position_embeddings=202752

  • hidden_size=6144, moe_intermediate_size=2048

  • vocab_size=154880
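These config values let you sanity-check the headline parameter counts. A back-of-the-envelope calculation counting only the MoE expert MLPs (attention, dense-layer MLPs, and embeddings are excluded, which is why the totals land below the quoted 754B total and ~40B active figures):

```python
# Rough parameter accounting from the config values above.
hidden_size = 6144
moe_intermediate_size = 2048
n_routed_experts = 256
num_experts_per_tok = 8
n_shared_experts = 1
moe_layers = 75  # 78 total - 3 dense

# gate_proj + up_proj + down_proj per expert MLP
params_per_expert = 3 * hidden_size * moe_intermediate_size  # ~37.7M

total_routed = n_routed_experts * moe_layers * params_per_expert
active_moe = (num_experts_per_tok + n_shared_experts) * moe_layers * params_per_expert

print(f"routed expert params: {total_routed / 1e9:.0f}B")  # ~725B
print(f"active MoE params:    {active_moe / 1e9:.1f}B")    # ~25.5B
```

The routed experts alone account for roughly 725B of the 754B total, and the 9 active experts per token for roughly 25.5B of the ~40B active; the remainder is attention, dense layers, and embeddings.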

Calibration

  • 512 self-calibration samples generated from GLM-5.1 via OpenRouter (top-tier provider routing)
  • 8 diverse categories: math, code, logic, analysis, creative writing, general knowledge, agentic/tool-calling, Korean
  • Reasoning traces included for natural distribution coverage
  • Static activation scales computed per-module from calibration data
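A minimal sketch of what per-module static scale computation looks like, assuming simple absmax statistics — the actual pipeline's observers and hook mechanics are not published, and all names here are illustrative:

```python
# Static activation-scale calibration sketch (absmax variant).
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

class ScaleObserver:
    """Accumulates the running absmax of activations seen by one module."""
    def __init__(self):
        self.absmax = 0.0

    def observe(self, activations):
        self.absmax = max(self.absmax, max(abs(x) for x in activations))

    def fp8_scale(self):
        # Static scale: dequant = fp8_value * scale, so the scale maps the
        # observed activation range onto the FP8 E4M3 representable range.
        return self.absmax / FP8_E4M3_MAX

obs = ScaleObserver()
for batch in ([0.1, -2.0, 0.7], [3.5, -1.2]):  # stand-in calibration batches
    obs.observe(batch)
print(obs.fp8_scale())  # 3.5 / 448
```

In the real pipeline one such observer would be attached per quantized module (including MoE expert inputs, per the hooks mentioned above), and the resulting scales stored in the checkpoint.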

Usage

vLLM

vllm serve mconcat/GLM-5.1-NVFP4 \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code
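Once serving, vLLM exposes an OpenAI-compatible API (default port 8000). A sketch of a chat-completions request payload — endpoint path and sampling parameters are illustrative defaults, not recommendations from the model authors:

```python
import json

# Payload for vLLM's OpenAI-compatible chat endpoint
# (default: http://localhost:8000/v1/chat/completions).
payload = {
    "model": "mconcat/GLM-5.1-NVFP4",
    "messages": [
        {"role": "user", "content": "Explain NVFP4 in one sentence."}
    ],
    "max_tokens": 256,
    "temperature": 0.6,
}
body = json.dumps(payload)
```

POST `body` with `Content-Type: application/json` using curl, `requests`, or the `openai` Python client pointed at the local base URL.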

Compatibility

Framework              Supported  Notes
vLLM >= 0.19.0         Partial    See known issues below
SGLang                 No         compressed-tensors NVFP4 not supported
transformers >= 5.4.0  Yes        Direct loading with device_map="auto"

Known Issues

vLLM fused MoE limitation: vLLM's fused MoE kernel requires uniform quantization across all expert projections (gate/up/down). This checkpoint uses mixed-precision (NVFP4 for gate/up, FP8 for down), which may cause ValueError: All MoE projections need to have same quantization scheme.

Workarounds:

  1. Use the GLM-5.1-FP8-Dynamic checkpoint which uses uniform FP8
  2. Wait for vLLM to add mixed-precision MoE support
  3. Use transformers with device_map="auto" for non-fused inference

Notes

  • This is a 754B-parameter MoE model (~40B active per token). Inference requires a multi-GPU setup (8x 80GB+ GPUs recommended).
  • GLM-5.1 does not ship MTP weights despite num_nextn_predict_layers=1 in config.
  • Quantization was performed layer-by-layer using compressed-tensors for proper NVFP4 packing (weight_packed uint8, FP4 E2M1 format).
  • KV cache: Do not use --kv-cache-dtype fp8_e4m3 — the checkpoint lacks calibrated KV scales.
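To illustrate the FP4 E2M1 packing mentioned above: E2M1 has 16 code points, a sign bit plus eight magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}. A sketch of per-block absmax scaling, nearest-value rounding, and two-codes-per-byte packing — block size, rounding mode, and plain-float scales are simplifications (NVFP4 proper uses 16-element blocks with FP8 E4M3 block scales):

```python
# FP4 E2M1 quantize-and-pack sketch.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive magnitudes

def quantize_block(values):
    """Return (scale, 4-bit codes) for one block of weights."""
    absmax = max(abs(v) for v in values) or 1.0
    scale = absmax / 6.0  # map absmax onto the largest E2M1 magnitude
    codes = []
    for v in values:
        mag = abs(v) / scale
        idx = min(range(8), key=lambda i: abs(E2M1_GRID[i] - mag))  # nearest
        codes.append(idx | (0x8 if v < 0 else 0))  # sign in the high bit
    return scale, codes

def pack_uint8(codes):
    """Pack two 4-bit codes per byte, low nibble first."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

scale, codes = quantize_block([0.3, -1.2, 0.05, 6.0])
packed = pack_uint8(codes)
```

Dequantization is the reverse lookup: sign, grid value, times the block scale — which is why storage is a uint8 weight_packed tensor plus a much smaller scale tensor.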

Blackwell SM120 Patch (RTX PRO 6000 / Workstation GPUs)

If running on Blackwell workstation GPUs (SM 12.0), vLLM 0.19.0 requires patches for FlashMLA sparse attention support. See GLM-5.1-FP8-Dynamic README for patch instructions.

Quantization Process

  • Tool: Custom layer-by-layer pipeline with compressed-tensors NVFP4 packing
  • Hardware: Single NVIDIA RTX PRO 6000 Blackwell (96 GB), processed one layer at a time
  • Time: ~161 minutes for 78 layers
  • Calibration: 256 samples, per-module activation min/max statistics with MoE expert input hooks