GLM-5.2-NVFP4 / README.md
Mapika's picture
Upload README.md with huggingface_hub
5f9c62a verified
|
Raw
History Blame Contribute Delete
3.96 kB
metadata
base_model: zai-org/GLM-5.2
base_model_relation: quantized
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
  - nvfp4
  - fp4
  - quantization
  - modelopt
  - tensorrt
  - moe
  - glm
  - sglang

GLM-5.2-NVFP4

NVFP4 (4-bit) quantization of zai-org/GLM-5.2, produced with NVIDIA TensorRT Model Optimizer 0.44.0. The MoE expert FFNs (routed + shared) are quantized to NVFP4; attention (MLA + the DeepSeek-style DSA lightning indexer), the router, and the LM head are kept in BF16. This shrinks the checkpoint from 1.5 TB β†’ 410 GB (~3.7Γ—) while retaining GSM8K accuracy within ~2 points of BF16.

GLM-5.2 is a glm_moe_dsa model: DeepSeek-V3.2-style MLA attention + DSA sparse-attention indexer, with a 256-routed-expert + 1-shared-expert MoE (8 experts/token), 78 layers, hidden 6144, vocab 154880.

Evaluation

All benchmarks were served via SGLang and scored with lm-evaluation-harness on the same hardware and harness for both NVFP4 and BF16 (generative / chain-of-thought where applicable; max_gen_toks raised to fit the reasoning chains β€” lm-eval's default 256 truncates them and tanks the scores).

Benchmark GLM-5.2-NVFP4 (410 GB) GLM-5.2 BF16 (1507 GB) Ξ”
GPQA-Diamond (CoT, flexible) 69.70 69.70 0.00
MATH-500 (minerva) 86.80 86.60 +0.20
MMLU-Pro (generative, 50/subject) 81.14 82.43 βˆ’1.29
HumanEval (pass@1, instruct) 94.51 95.73 βˆ’1.22
GSM8K (5-shot, flexible) 92.72 94.92 βˆ’2.20

NVFP4 holds up strongly on the hard, non-saturated benchmarks: GPQA-Diamond and MATH-500 are within noise of BF16, and the average degradation across the suite is ~1 point β€” for a 3.7Γ— smaller checkpoint.

Quantization recipe

  • Format: NVFP4 (FP4 weights + FP8 block scales), block/group size 16, modelopt producer.
  • Quantized: mlp.experts.* (256 routed experts) and mlp.shared_experts.*.
  • Kept in BF16 (excluded): all of self_attn.* β€” MLA projections (q/kv) and the DSA indexer β€” plus the MoE router (mlp.gate) and lm_head. The indexer and MLA attention must stay BF16: SGLang's deepseek_v2 MLA path (used for glm_moe_dsa) cannot consume NVFP4 attention weights.
  • KV cache: not quantized.
  • Calibration: 512 samples Γ— 2048 tokens from cnn_dailymail + nvidia/OpenCodeReasoning + nvidia/OpenMathReasoning.

Serving (SGLang)

Requires SGLang β‰₯ v0.5.13.post1 (the version that registers GlmMoeDsaForCausalLM).

docker run --runtime=nvidia --gpus '"device=0,1,2,3"' --ipc=host --shm-size=32g \
  -v /path/to/GLM-5.2-NVFP4:/model -p 30000:30000 \
  lmsysorg/sglang:v0.5.13.post1-cu130 \
  sglang serve --model-path /model --tp 4 \
    --quantization modelopt_fp4 --moe-runner-backend flashinfer_cutlass \
    --context-length 32768 --mem-fraction-static 0.85 \
    --tool-call-parser auto --trust-remote-code --host 0.0.0.0 --port 30000

GPU memory. The weights are ~410 GB, so per-GPU footprint depends on TP:

Tensor parallel Weights / GPU Suitable GPUs
--tp 4 ~110 GB β‰₯128 GB cards β€” H200 (141 GB, tight KV), B200 / B300, MI300X (192 GB)
--tp 8 ~55 GB 80 GB cards β€” 8Γ— H100 or A100-80GB

So 80 GB GPUs need --tp 8, not --tp 4 (110 GB of weights can't fit in an 80 GB card). Lower --mem-fraction-static if KV-cache space is tight. Use a generous max_tokens at inference β€” GLM-5.2 is a reasoning model and its <think> chains can be long.

Notes

  • Quantized with nvfp4 + a small build_quant_cfg exclusion that keeps self_attn.* in BF16 (required for SGLang's MLA path). Same overall pipeline as our MiniMax-M3-NVFP4.
  • License inherited from the base model (MIT, Zhipu AI).