Instructions for using Mapika/GLM-5.2-NVFP4 with libraries and local apps.

Libraries

How to use Mapika/GLM-5.2-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Mapika/GLM-5.2-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Mapika/GLM-5.2-NVFP4")
model = AutoModelForCausalLM.from_pretrained("Mapika/GLM-5.2-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Mapika/GLM-5.2-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Mapika/GLM-5.2-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Mapika/GLM-5.2-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/Mapika/GLM-5.2-NVFP4

SGLang

How to use Mapika/GLM-5.2-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Mapika/GLM-5.2-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Mapika/GLM-5.2-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Mapika/GLM-5.2-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Mapika/GLM-5.2-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use Mapika/GLM-5.2-NVFP4 with Docker Model Runner:

```
docker model run hf.co/Mapika/GLM-5.2-NVFP4
```

GLM-5.2-NVFP4 / README.md

	---
	base_model: zai-org/GLM-5.2
	base_model_relation: quantized
	license: mit
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- nvfp4
	- fp4
	- quantization
	- modelopt
	- tensorrt
	- moe
	- glm
	- sglang
	---

	# GLM-5.2-NVFP4

	NVFP4 (4-bit) quantization of [zai-org/GLM-5.2](https://huggingface.co/zai-org/GLM-5.2), produced with
	[NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) 0.44.0. The MoE expert FFNs
	(routed + shared) are quantized to NVFP4; attention (MLA + the DeepSeek-style DSA lightning indexer),
	the router, and the LM head are kept in BF16. This shrinks the checkpoint from 1.5 TB to 410 GB (~3.7×)
	while keeping every benchmark below within about 2 points of BF16 (largest drop: GSM8K, −2.20).

	GLM-5.2 is a `glm_moe_dsa` model: DeepSeek-V3.2-style MLA attention + DSA sparse-attention indexer,
	with a 256-routed-expert + 1-shared-expert MoE (8 experts/token), 78 layers, hidden 6144, vocab 154880.

	## Evaluation

	All benchmarks were served via SGLang and scored with lm-evaluation-harness on the **same hardware and
	harness** for both NVFP4 and BF16 (generative / chain-of-thought where applicable; `max_gen_toks` raised
	to fit the reasoning chains — lm-eval's default 256 truncates them and tanks the scores).

	| Benchmark | GLM-5.2-NVFP4 (410 GB) | GLM-5.2 BF16 (1507 GB) | Δ |
	|---|---|---|---|
	| GPQA-Diamond (CoT, flexible) | 69.70 | 69.70 | 0.00 |
	| MATH-500 (minerva) | 86.80 | 86.60 | +0.20 |
	| MMLU-Pro (generative, 50/subject) | 81.14 | 82.43 | −1.29 |
	| HumanEval (pass@1, instruct) | 94.51 | 95.73 | −1.22 |
	| GSM8K (5-shot, flexible) | 92.72 | 94.92 | −2.20 |

	NVFP4 holds up strongly on the hard, non-saturated benchmarks: GPQA-Diamond and MATH-500 are within
	noise of BF16, and the mean degradation across the five benchmarks is about 0.9 points, for a 3.7× smaller checkpoint.
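The Δ column and the suite-wide average can be recomputed directly from the table. The short sketch below only re-does that arithmetic with the numbers printed above; the helper names are illustrative and not part of any evaluation tooling.

```python
# Scores copied from the evaluation table above: (NVFP4, BF16).
SCORES = {
    "GPQA-Diamond": (69.70, 69.70),
    "MATH-500": (86.80, 86.60),
    "MMLU-Pro": (81.14, 82.43),
    "HumanEval": (94.51, 95.73),
    "GSM8K": (92.72, 94.92),
}


def deltas(scores):
    """Per-benchmark delta (NVFP4 minus BF16), rounded to two decimals."""
    return {name: round(q - ref, 2) for name, (q, ref) in scores.items()}


def mean_delta(scores):
    """Unweighted mean of the per-benchmark deltas."""
    d = deltas(scores)
    return sum(d.values()) / len(d)


if __name__ == "__main__":
    for name, d in deltas(SCORES).items():
        print(f"{name:14s} {d:+.2f}")
    print(f"mean delta: {mean_delta(SCORES):+.2f}")
    # Checkpoint sizes are taken from the table header (GB).
    print(f"size ratio: {1507 / 410:.2f}x")
```

The mean is unweighted, so a small benchmark such as GPQA-Diamond counts as much as MMLU-Pro.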

	## Quantization recipe

	- Format: NVFP4 (FP4 weights + FP8 block scales), block/group size 16, `modelopt` producer.
	- Quantized: `mlp.experts.*` (256 routed experts) and `mlp.shared_experts.*`.
	- Kept in BF16 (excluded): all of `self_attn.*`, meaning the MLA projections (q/kv) *and* the DSA indexer,
	plus the MoE router (`mlp.gate`) and `lm_head`. The indexer and MLA attention must stay BF16:
	SGLang's `deepseek_v2` MLA path (used for `glm_moe_dsa`) cannot consume NVFP4 attention weights.
	- KV cache: not quantized.
	- Calibration: 512 samples × 2048 tokens from cnn_dailymail + nvidia/OpenCodeReasoning +
	nvidia/OpenMathReasoning.
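To make the format line of the recipe concrete, here is a minimal pure-Python sketch of block-wise FP4 fake quantization with block size 16. It is a simplification, not ModelOpt's implementation: the per-block scale is kept as an ordinary Python float rather than rounded to FP8, and any additional per-tensor scaling is ignored. The function names are hypothetical.

```python
# Magnitudes representable in FP4 E2M1 (the sign is a separate bit).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # block/group size from the recipe above


def quantize_block(block):
    """Fake-quantize one block: scale so the largest magnitude maps to 6.0,
    snap each value to the nearest FP4 magnitude, then rescale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / FP4_GRID[-1]  # simplification: not rounded to FP8 here
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def fake_quantize(weights):
    """Apply quantize_block to consecutive blocks of BLOCK values."""
    out = []
    for i in range(0, len(weights), BLOCK):
        out.extend(quantize_block(weights[i:i + BLOCK]))
    return out


def bits_per_weight(block=BLOCK, weight_bits=4, scale_bits=8):
    """Storage per weight: the FP4 payload plus one FP8 scale per block."""
    return weight_bits + scale_bits / block
```

With these defaults `bits_per_weight()` evaluates to 4.5, which is the per-weight storage cost of the quantized expert tensors under this simplified accounting.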

	## Serving (SGLang)

	Requires SGLang ≥ v0.5.13.post1 (the version that registers `GlmMoeDsaForCausalLM`).

	```bash
	docker run --runtime=nvidia --gpus '"device=0,1,2,3"' --ipc=host --shm-size=32g \
	-v /path/to/GLM-5.2-NVFP4:/model -p 30000:30000 \
	lmsysorg/sglang:v0.5.13.post1-cu130 \
	sglang serve --model-path /model --tp 4 \
	--quantization modelopt_fp4 --moe-runner-backend flashinfer_cutlass \
	--context-length 32768 --mem-fraction-static 0.85 \
	--tool-call-parser auto --trust-remote-code --host 0.0.0.0 --port 30000
	```
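Once the server is up, it can be queried through its OpenAI-compatible `/v1/chat/completions` endpoint on port 30000. The sketch below uses only the Python standard library. Two things in it are assumptions rather than facts from this card: the `MODEL_NAME` value (check which name your server actually reports) and the default `max_tokens` of 8192, chosen only as an example of a generous limit.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # host/port from the docker command above
MODEL_NAME = "/model"  # assumption: replace with the name your server reports


def build_chat_request(prompt, max_tokens=8192):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # assumption: illustrative generous limit
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, max_tokens=8192):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires a running server):
#   print(chat("What is the capital of France?"))
```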

	GPU memory. The weights are ~410 GB, so per-GPU footprint depends on TP:

	| Tensor parallel | Weights / GPU | Suitable GPUs |
	|---|---|---|
	| `--tp 4` | ~110 GB | ≥128 GB cards: H200 (141 GB, tight KV), B200 / B300, MI300X (192 GB) |
	| `--tp 8` | ~55 GB | 80 GB cards: 8× H100 or A100-80GB |

	So 80 GB GPUs need `--tp 8`, not `--tp 4` (110 GB of weights can't fit in an 80 GB card). Lower
	`--mem-fraction-static` if KV-cache space is tight. Use a generous `max_tokens` at inference — GLM-5.2 is
	a reasoning model and its `<think>` chains can be long.
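The sizing rule above can be sanity-checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: it splits the 410 GB checkpoint evenly across tensor-parallel ranks and treats `--mem-fraction-static` as the share of device memory available for weights plus KV cache. The even split gives 102.5 GB and 51.25 GB per GPU, a little under the ~110 GB and ~55 GB quoted in the table, which presumably include overhead that this sketch does not model.

```python
WEIGHTS_GB = 410  # checkpoint size stated above


def weights_per_gpu(tp, total_gb=WEIGHTS_GB):
    """Naive even split of the checkpoint across tensor-parallel ranks."""
    return total_gb / tp


def kv_headroom_gb(tp, gpu_gb, mem_fraction_static=0.85, total_gb=WEIGHTS_GB):
    """Memory left for KV cache inside the static budget (negative = no fit)."""
    return gpu_gb * mem_fraction_static - weights_per_gpu(tp, total_gb)


def fits(tp, gpu_gb, mem_fraction_static=0.85, total_gb=WEIGHTS_GB):
    """True if the naive per-GPU share fits inside the static budget."""
    return kv_headroom_gb(tp, gpu_gb, mem_fraction_static, total_gb) >= 0


if __name__ == "__main__":
    for tp, gpu_gb in [(4, 80), (8, 80), (4, 141)]:
        print(f"tp={tp} on {gpu_gb} GB: fits={fits(tp, gpu_gb)}, "
              f"KV headroom={kv_headroom_gb(tp, gpu_gb):.1f} GB")
```

A positive but small headroom, as for `--tp 4` on a 141 GB card, is the situation the table labels "tight KV".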

	## Notes

	- Quantized with `nvfp4` + a small `build_quant_cfg` exclusion that keeps `self_attn.*` in BF16 (required
	for SGLang's MLA path). Same overall pipeline as our [MiniMax-M3-NVFP4](https://huggingface.co/Mapika/MiniMax-M3-NVFP4).
	- License inherited from the base model (MIT, Zhipu AI).
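As an illustration of the exclusion rule, the following sketch classifies parameter names by precision using plain glob matching. It is not ModelOpt's configuration API and not the actual `build_quant_cfg` code; the patterns mirror the module prefixes named in this card, the example parameter names in the usage note are hypothetical, and leaving every unlisted module in BF16 is an assumption of the sketch.

```python
from fnmatch import fnmatch

# Patterns derived from the module prefixes listed in the recipe.
QUANTIZED = ("*mlp.experts.*", "*mlp.shared_experts.*")
KEPT_BF16 = ("*self_attn.*", "*mlp.gate", "*mlp.gate.*", "*lm_head*")


def precision_for(name):
    """Return "nvfp4" or "bf16" for a parameter name (exclusions win)."""
    if any(fnmatch(name, pattern) for pattern in KEPT_BF16):
        return "bf16"
    if any(fnmatch(name, pattern) for pattern in QUANTIZED):
        return "nvfp4"
    return "bf16"  # assumption: anything not listed stays unquantized
```

For example, `precision_for("model.layers.10.mlp.experts.3.down_proj.weight")` returns `"nvfp4"`, while `precision_for("model.layers.10.self_attn.q_b_proj.weight")` and `precision_for("model.layers.10.mlp.gate.weight")` both return `"bf16"`.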