GLM-5.2-NVFP4
NVFP4 (4-bit) quantization of zai-org/GLM-5.2, produced with NVIDIA TensorRT Model Optimizer 0.44.0. The MoE expert FFNs (routed + shared) are quantized to NVFP4; attention (MLA + the DeepSeek-style DSA lightning indexer), the router, and the LM head are kept in BF16. This shrinks the checkpoint from 1.5 TB to 410 GB (~3.7×) while keeping GSM8K accuracy within 2.2 points of BF16 (92.72 vs. 94.92).
GLM-5.2 is a glm_moe_dsa model: DeepSeek-V3.2-style MLA attention + DSA sparse-attention indexer,
with a 256-routed-expert + 1-shared-expert MoE (8 experts/token), 78 layers, hidden 6144, vocab 154880.
Evaluation
All benchmarks were served via SGLang and scored with lm-evaluation-harness on the same hardware and
harness for both NVFP4 and BF16 (generative / chain-of-thought where applicable; max_gen_toks raised
to fit the reasoning chains — lm-eval's default 256 truncates them and tanks the scores).
| Benchmark | GLM-5.2-NVFP4 (410 GB) | GLM-5.2 BF16 (1507 GB) | Δ |
|---|---|---|---|
| GPQA-Diamond (CoT, flexible) | 69.70 | 69.70 | 0.00 |
| MATH-500 (minerva) | 86.80 | 86.60 | +0.20 |
| MMLU-Pro (generative, 50/subject) | 81.14 | 82.43 | −1.29 |
| HumanEval (pass@1, instruct) | 94.51 | 95.73 | −1.22 |
| GSM8K (5-shot, flexible) | 92.72 | 94.92 | −2.20 |
NVFP4 holds up strongly on the hard, non-saturated benchmarks: GPQA-Diamond and MATH-500 are within noise of BF16, and the average degradation across the five benchmarks is ~0.9 points, for a 3.7× smaller checkpoint.
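The summary figures can be rechecked directly from the table. The snippet below is a small sketch that copies the scores and checkpoint sizes from the table above and recomputes the per-benchmark deltas, their mean, and the size ratio; the variable names are illustrative only.

```python
# (NVFP4, BF16) scores copied from the evaluation table above.
scores = {
    "GPQA-Diamond": (69.70, 69.70),
    "MATH-500": (86.80, 86.60),
    "MMLU-Pro": (81.14, 82.43),
    "HumanEval": (94.51, 95.73),
    "GSM8K": (92.72, 94.92),
}

# Per-benchmark delta (NVFP4 minus BF16), rounded to the table's precision.
deltas = {name: round(nvfp4 - bf16, 2) for name, (nvfp4, bf16) in scores.items()}
mean_delta = sum(deltas.values()) / len(deltas)

# Checkpoint sizes in GB, taken from the table header.
size_ratio = 1507 / 410

for name, delta in deltas.items():
    print(f"{name:>13}: {delta:+.2f}")
print(f"mean delta: {mean_delta:+.2f} points, size ratio: {size_ratio:.1f}x")
```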
Quantization recipe
- Format: NVFP4 (FP4 weights + FP8 block scales), block/group size 16, `modelopt` producer.
- Quantized: `mlp.experts.*` (256 routed experts) and `mlp.shared_experts.*`.
- Kept in BF16 (excluded): all of `self_attn.*` (MLA projections q/kv and the DSA indexer), plus the MoE router (`mlp.gate`) and `lm_head`. The indexer and MLA attention must stay BF16: SGLang's `deepseek_v2` MLA path (used for `glm_moe_dsa`) cannot consume NVFP4 attention weights.
- KV cache: not quantized.
- Calibration: 512 samples × 2048 tokens from cnn_dailymail + nvidia/OpenCodeReasoning + nvidia/OpenMathReasoning.
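The include/exclude split in the recipe can be illustrated with plain glob matching. This is only a sketch of the selection logic, not the Model Optimizer config format or the actual `build_quant_cfg` code: the pattern lists, the `target_precision` helper, and the example parameter names are all hypothetical, and the fallback to BF16 for anything unlisted is an assumption.

```python
from fnmatch import fnmatchcase

# Hypothetical glob patterns mirroring the recipe above.
QUANTIZED_PATTERNS = ["*.mlp.experts.*", "*.mlp.shared_experts.*"]
BF16_PATTERNS = ["*.self_attn.*", "*.mlp.gate.*", "lm_head.*"]


def target_precision(param_name: str) -> str:
    """Return 'nvfp4' or 'bf16' for a parameter name (illustrative only)."""
    # Exclusions are checked first so they always win.
    if any(fnmatchcase(param_name, p) for p in BF16_PATTERNS):
        return "bf16"
    if any(fnmatchcase(param_name, p) for p in QUANTIZED_PATTERNS):
        return "nvfp4"
    # Assumption: anything not listed is left unquantized.
    return "bf16"


# Example parameter names are made up for illustration.
for name in [
    "model.layers.10.mlp.experts.3.down_proj.weight",
    "model.layers.10.mlp.shared_experts.up_proj.weight",
    "model.layers.10.self_attn.q_a_proj.weight",
    "model.layers.10.mlp.gate.weight",
    "lm_head.weight",
]:
    print(f"{name} -> {target_precision(name)}")
```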
Serving (SGLang)
Requires SGLang ≥ v0.5.13.post1 (the version that registers GlmMoeDsaForCausalLM).
docker run --runtime=nvidia --gpus '"device=0,1,2,3"' --ipc=host --shm-size=32g \
-v /path/to/GLM-5.2-NVFP4:/model -p 30000:30000 \
lmsysorg/sglang:v0.5.13.post1-cu130 \
sglang serve --model-path /model --tp 4 \
--quantization modelopt_fp4 --moe-runner-backend flashinfer_cutlass \
--context-length 32768 --mem-fraction-static 0.85 \
--tool-call-parser auto --trust-remote-code --host 0.0.0.0 --port 30000
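Once the container is up, the server can be queried over SGLang's OpenAI-compatible chat endpoint. The client below is a minimal standard-library sketch, not part of the original card: the port follows the `-p 30000:30000` mapping above, while the `/model` model name (assumed to default to the `--model-path` value), the `max_tokens` default of 8192, and the helper names are assumptions to adjust for your deployment.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # matches -p 30000:30000 in the command above
MODEL_NAME = "/model"  # assumption: served name defaults to --model-path


def build_chat_request(prompt: str, max_tokens: int = 8192) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        # Generous budget: reasoning chains can be long.
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, max_tokens: int = 8192, timeout: float = 600.0) -> str:
    """Send the prompt to a running server and return the reply text."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` prints the model's reply.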
GPU memory. The weights are ~410 GB, so per-GPU footprint depends on TP:
| Tensor parallel | Weights / GPU | Suitable GPUs |
|---|---|---|
| `--tp 4` | ~110 GB | ≥128 GB cards: H200 (141 GB, tight KV), B200 / B300, MI300X (192 GB) |
| `--tp 8` | ~55 GB | 80 GB cards: 8× H100 or A100-80GB |
So 80 GB GPUs need --tp 8, not --tp 4 (110 GB of weights can't fit in an 80 GB card). Lower
--mem-fraction-static if KV-cache space is tight. Use a generous max_tokens at inference — GLM-5.2 is
a reasoning model and its <think> chains can be long.
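The fit argument can be sketched as a back-of-the-envelope calculation. It takes the per-GPU weight figures from the table and assumes that `--mem-fraction-static` bounds the share of GPU memory reserved for weights plus the KV-cache pool; that reading of the flag is an assumption worth checking against the SGLang documentation, and the estimate ignores activations, CUDA graphs, and other runtime overhead. The `kv_budget_gb` helper is hypothetical.

```python
def kv_budget_gb(card_gb: float, weights_per_gpu_gb: float,
                 mem_fraction_static: float = 0.85) -> float:
    """Rough per-GPU memory (GB) left for the KV-cache pool.

    Assumes the static fraction covers weights + KV pool together.
    A negative result means the weights alone do not fit.
    """
    return card_gb * mem_fraction_static - weights_per_gpu_gb


# (label, card memory in GB, weights per GPU in GB from the table above)
configs = [
    ("H200, --tp 4", 141, 110),
    ("80 GB card, --tp 4", 80, 110),
    ("80 GB card, --tp 8", 80, 55),
]

for label, card_gb, weights_gb in configs:
    budget = kv_budget_gb(card_gb, weights_gb)
    status = "fits" if budget > 0 else "does not fit"
    print(f"{label}: {budget:+.1f} GB for KV cache ({status})")
```

Under these assumptions an H200 at `--tp 4` leaves only about 10 GB per GPU for the KV cache, which is consistent with the "tight KV" remark in the table, while an 80 GB card at `--tp 4` comes out negative.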
Notes
- Quantized with `nvfp4` + a small `build_quant_cfg` exclusion that keeps `self_attn.*` in BF16 (required for SGLang's MLA path). Same overall pipeline as our MiniMax-M3-NVFP4.
- License inherited from the base model (MIT, Zhipu AI).