GLM-5.2-NVFP4 / README.md
Mapika's picture
Upload README.md with huggingface_hub
5f9c62a verified
|
Raw
History Blame Contribute Delete
3.96 kB
---
base_model: zai-org/GLM-5.2
base_model_relation: quantized
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- nvfp4
- fp4
- quantization
- modelopt
- tensorrt
- moe
- glm
- sglang
---
# GLM-5.2-NVFP4
**NVFP4 (4-bit) quantization of [zai-org/GLM-5.2](https://huggingface.co/zai-org/GLM-5.2)**, produced with
[NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) 0.44.0. The MoE expert FFNs
(routed + shared) are quantized to NVFP4; attention (MLA + the DeepSeek-style DSA lightning indexer),
the router, and the LM head are kept in BF16. This shrinks the checkpoint from **1.5 TB β†’ 410 GB (~3.7Γ—)**
while retaining GSM8K accuracy within ~2 points of BF16.
GLM-5.2 is a `glm_moe_dsa` model: DeepSeek-V3.2-style **MLA attention + DSA sparse-attention indexer**,
with a **256-routed-expert + 1-shared-expert** MoE (8 experts/token), 78 layers, hidden 6144, vocab 154880.
## Evaluation
All benchmarks were served via SGLang and scored with lm-evaluation-harness on the **same hardware and
harness** for both NVFP4 and BF16 (generative / chain-of-thought where applicable; `max_gen_toks` raised
to fit the reasoning chains β€” lm-eval's default 256 truncates them and tanks the scores).
| Benchmark | GLM-5.2-NVFP4 (410 GB) | GLM-5.2 BF16 (1507 GB) | Ξ” |
|---|---|---|---|
| GPQA-Diamond (CoT, flexible) | **69.70** | 69.70 | **0.00** |
| MATH-500 (minerva) | **86.80** | 86.60 | **+0.20** |
| MMLU-Pro (generative, 50/subject) | **81.14** | 82.43 | βˆ’1.29 |
| HumanEval (pass@1, instruct) | **94.51** | 95.73 | βˆ’1.22 |
| GSM8K (5-shot, flexible) | **92.72** | 94.92 | βˆ’2.20 |
NVFP4 holds up strongly on the **hard, non-saturated** benchmarks: GPQA-Diamond and MATH-500 are within
noise of BF16, and the average degradation across the suite is **~1 point** β€” for a **3.7Γ— smaller** checkpoint.
## Quantization recipe
- **Format:** NVFP4 (FP4 weights + FP8 block scales), block/group size 16, `modelopt` producer.
- **Quantized:** `mlp.experts.*` (256 routed experts) and `mlp.shared_experts.*`.
- **Kept in BF16 (excluded):** all of `self_attn.*` β€” MLA projections (q/kv) **and** the DSA indexer β€”
plus the MoE router (`mlp.gate`) and `lm_head`. The indexer and MLA attention **must** stay BF16:
SGLang's `deepseek_v2` MLA path (used for `glm_moe_dsa`) cannot consume NVFP4 attention weights.
- **KV cache:** not quantized.
- **Calibration:** 512 samples Γ— 2048 tokens from cnn_dailymail + nvidia/OpenCodeReasoning +
nvidia/OpenMathReasoning.
## Serving (SGLang)
Requires **SGLang β‰₯ v0.5.13.post1** (the version that registers `GlmMoeDsaForCausalLM`).
```bash
docker run --runtime=nvidia --gpus '"device=0,1,2,3"' --ipc=host --shm-size=32g \
-v /path/to/GLM-5.2-NVFP4:/model -p 30000:30000 \
lmsysorg/sglang:v0.5.13.post1-cu130 \
sglang serve --model-path /model --tp 4 \
--quantization modelopt_fp4 --moe-runner-backend flashinfer_cutlass \
--context-length 32768 --mem-fraction-static 0.85 \
--tool-call-parser auto --trust-remote-code --host 0.0.0.0 --port 30000
```
**GPU memory.** The weights are ~410 GB, so per-GPU footprint depends on TP:
| Tensor parallel | Weights / GPU | Suitable GPUs |
|---|---|---|
| `--tp 4` | ~110 GB | β‰₯128 GB cards β€” H200 (141 GB, tight KV), B200 / B300, MI300X (192 GB) |
| `--tp 8` | ~55 GB | 80 GB cards β€” 8Γ— H100 or A100-80GB |
So **80 GB GPUs need `--tp 8`, not `--tp 4`** (110 GB of weights can't fit in an 80 GB card). Lower
`--mem-fraction-static` if KV-cache space is tight. Use a generous `max_tokens` at inference β€” GLM-5.2 is
a reasoning model and its `<think>` chains can be long.
## Notes
- Quantized with `nvfp4` + a small `build_quant_cfg` exclusion that keeps `self_attn.*` in BF16 (required
for SGLang's MLA path). Same overall pipeline as our [MiniMax-M3-NVFP4](https://huggingface.co/Mapika/MiniMax-M3-NVFP4).
- License inherited from the base model (MIT, Zhipu AI).