GLM-5.1-NVFP4

Mixed-precision (NVFP4/FP8/BF16) quantized version of zai-org/GLM-5.1.

This checkpoint preserves the GLM-5.1 MoE + MLA + DSA architecture from the BF16 source, applying a layerwise mixed-precision recipe that balances compression with output quality.

Quantization Strategy

Non-uniform mixed-precision quantization with calibration:

Precision    Layers
FP8 W8A8     MLA projections (q_a_proj, q_b_proj, kv_a_proj_with_mqa, kv_b_proj, o_proj); all down_proj (dense + expert + shared); DSA indexer
NVFP4 W4A4   MLP gate_proj/up_proj (256 routed experts + shared expert + dense layers)
BF16         lm_head, embed_tokens, MoE router gates, norms
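The precision assignment above is purely name-based. A minimal sketch of how such a layerwise recipe could be resolved per module — the patterns, scheme labels, and resolver function are illustrative and do not reflect the actual compressed-tensors config format:

```python
import re

# Hypothetical layerwise recipe mirroring the table above: first matching
# module-name pattern wins. Scheme names are labels, not real config keys.
RECIPE = [
    (r"(q_a_proj|q_b_proj|kv_a_proj_with_mqa|kv_b_proj|o_proj)$", "FP8-W8A8"),
    (r"down_proj$", "FP8-W8A8"),
    (r"indexer", "FP8-W8A8"),  # DSA indexer
    (r"(gate_proj|up_proj)$", "NVFP4-W4A4"),
]

def resolve_scheme(module_name: str) -> str:
    """Return the first matching scheme; everything else stays BF16."""
    for pattern, scheme in RECIPE:
        if re.search(pattern, module_name):
            return scheme
    return "BF16"  # lm_head, embed_tokens, router gates, norms
```

Pattern order matters: down_proj must be claimed by the FP8 rule before any broader MLP pattern could see it, which is exactly the mixed gate/up-vs-down split that trips up the vLLM fused MoE kernel noted under Known Issues.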

The architecture matches the BF16 source:

  • model_type=glm_moe_dsa

  • 78 layers (3 dense + 75 MoE, first_k_dense_replace=3)

  • n_routed_experts=256, num_experts_per_tok=8, n_shared_experts=1

  • max_position_embeddings=202752

  • hidden_size=6144, moe_intermediate_size=2048

  • vocab_size=154880
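These config values let you sanity-check the headline parameter counts. A back-of-the-envelope calculation counting only the MoE expert MLPs (attention, dense-layer MLPs, and embeddings are excluded, which is why the totals land below the quoted 754B total and ~40B active figures):

```python
# Rough parameter accounting from the config values above.
hidden_size = 6144
moe_intermediate_size = 2048
n_routed_experts = 256
num_experts_per_tok = 8
n_shared_experts = 1
moe_layers = 75  # 78 total - 3 dense

# gate_proj + up_proj + down_proj per expert MLP
params_per_expert = 3 * hidden_size * moe_intermediate_size  # ~37.7M

total_routed = n_routed_experts * moe_layers * params_per_expert
active_moe = (num_experts_per_tok + n_shared_experts) * moe_layers * params_per_expert

print(f"routed expert params: {total_routed / 1e9:.0f}B")  # ~725B
print(f"active MoE params:    {active_moe / 1e9:.1f}B")    # ~25.5B
```

The routed experts alone account for roughly 725B of the 754B total, and the 9 active experts per token for roughly 25.5B of the ~40B active; the remainder is attention, dense layers, and embeddings.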

Calibration

  • 512 self-calibration samples generated from GLM-5.1 via OpenRouter (top-tier provider routing)
  • 8 diverse categories: math, code, logic, analysis, creative writing, general knowledge, agentic/tool-calling, Korean
  • Reasoning traces included for natural distribution coverage
  • Static activation scales computed per-module from calibration data
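A minimal sketch of what per-module static scale computation looks like, assuming simple absmax statistics — the actual pipeline's observers and hook mechanics are not published, and all names here are illustrative:

```python
# Static activation-scale calibration sketch (absmax variant).
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

class ScaleObserver:
    """Accumulates the running absmax of activations seen by one module."""
    def __init__(self):
        self.absmax = 0.0

    def observe(self, activations):
        self.absmax = max(self.absmax, max(abs(x) for x in activations))

    def fp8_scale(self):
        # Static scale: dequant = fp8_value * scale, so the scale maps the
        # observed activation range onto the FP8 E4M3 representable range.
        return self.absmax / FP8_E4M3_MAX

obs = ScaleObserver()
for batch in ([0.1, -2.0, 0.7], [3.5, -1.2]):  # stand-in calibration batches
    obs.observe(batch)
print(obs.fp8_scale())  # 3.5 / 448
```

In the real pipeline one such observer would be attached per quantized module (including MoE expert inputs, per the hooks mentioned above), and the resulting scales stored in the checkpoint.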

Usage

vLLM

vllm serve mconcat/GLM-5.1-NVFP4 \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code
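Once serving, vLLM exposes an OpenAI-compatible API (default port 8000). A sketch of a chat-completions request payload — endpoint path and sampling parameters are illustrative defaults, not recommendations from the model authors:

```python
import json

# Payload for vLLM's OpenAI-compatible chat endpoint
# (default: http://localhost:8000/v1/chat/completions).
payload = {
    "model": "mconcat/GLM-5.1-NVFP4",
    "messages": [
        {"role": "user", "content": "Explain NVFP4 in one sentence."}
    ],
    "max_tokens": 256,
    "temperature": 0.6,
}
body = json.dumps(payload)
```

POST `body` with `Content-Type: application/json` using curl, `requests`, or the `openai` Python client pointed at the local base URL.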

Compatibility

Framework              Supported  Notes
vLLM >= 0.19.0         Partial    See known issues below
SGLang                 No         compressed-tensors NVFP4 not supported
transformers >= 5.4.0  Yes        Direct loading with device_map="auto"

Known Issues

vLLM fused MoE limitation: vLLM's fused MoE kernel requires uniform quantization across all expert projections (gate/up/down). This checkpoint uses mixed-precision (NVFP4 for gate/up, FP8 for down), which may cause ValueError: All MoE projections need to have same quantization scheme.

Workarounds:

  1. Use the GLM-5.1-FP8-Dynamic checkpoint which uses uniform FP8
  2. Wait for vLLM to add mixed-precision MoE support
  3. Use transformers with device_map="auto" for non-fused inference

Notes

  • This is a 754B-parameter MoE model (~40B active per token). Inference requires a multi-GPU setup (8x 80GB+ GPUs recommended).
  • GLM-5.1 does not ship MTP weights despite num_nextn_predict_layers=1 in config.
  • Quantization was performed layer-by-layer using compressed-tensors for proper NVFP4 packing (weight_packed uint8, FP4 E2M1 format).
  • KV cache: Do not use --kv-cache-dtype fp8_e4m3 — the checkpoint lacks calibrated KV scales.
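To illustrate the FP4 E2M1 packing mentioned above: E2M1 has 16 code points, a sign bit plus eight magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}. A sketch of per-block absmax scaling, nearest-value rounding, and two-codes-per-byte packing — block size, rounding mode, and plain-float scales are simplifications (NVFP4 proper uses 16-element blocks with FP8 E4M3 block scales):

```python
# FP4 E2M1 quantize-and-pack sketch.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive magnitudes

def quantize_block(values):
    """Return (scale, 4-bit codes) for one block of weights."""
    absmax = max(abs(v) for v in values) or 1.0
    scale = absmax / 6.0  # map absmax onto the largest E2M1 magnitude
    codes = []
    for v in values:
        mag = abs(v) / scale
        idx = min(range(8), key=lambda i: abs(E2M1_GRID[i] - mag))  # nearest
        codes.append(idx | (0x8 if v < 0 else 0))  # sign in the high bit
    return scale, codes

def pack_uint8(codes):
    """Pack two 4-bit codes per byte, low nibble first."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

scale, codes = quantize_block([0.3, -1.2, 0.05, 6.0])
packed = pack_uint8(codes)
```

Dequantization is the reverse lookup: sign, grid value, times the block scale — which is why storage is a uint8 weight_packed tensor plus a much smaller scale tensor.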

Blackwell SM120 Patch (RTX PRO 6000 / Workstation GPUs)

If running on Blackwell workstation GPUs (SM 12.0), vLLM 0.19.0 requires patches for FlashMLA sparse attention support. See GLM-5.1-FP8-Dynamic README for patch instructions.

Quantization Process

  • Tool: Custom layer-by-layer pipeline with compressed-tensors NVFP4 packing
  • Hardware: Single NVIDIA RTX PRO 6000 Blackwell (96 GB), processed one layer at a time
  • Time: ~161 minutes for 78 layers
  • Calibration: 256 samples, per-module activation min/max statistics with MoE expert input hooks