GLM-5.2 — W4A16 (INT4) + BF16 MTP

An INT4 weight-only (W4A16) quantization of GLM-5.2 that preserves the BF16 multi-token-prediction (MTP) layer for speculative decoding. Quantized from zai-org/GLM-5.2 with llm-compressor (GPTQ).

Built for Hopper (H200). Matches FP8 quality on half the GPUs (4×H200 vs 8) and is the fastest GLM-5.2 quant for interactive/agentic serving on Hopper in a matched, MTP-on head-to-head — with the lowest time-to-first-token by a wide margin.

Why this model

Half the footprint, FP8 quality. ~405 GB of weights (down from ~1.49 TB BF16) serve one replica on 4×H200 instead of 8 — freeing half the fleet, or two replicas per node — and eval matches the FP8 baseline within noise across reasoning, instruction-following, long-context, and agentic coding.
Fastest interactive serving among GLM-5.2 quants on Hopper. In a matched benchmark (every model with MTP on, same box, same vLLM, same harness): +8% vs nvidia NVFP4 and +33% vs zai FP8 at concurrency 1, with TTFT of 215 ms vs 632/1258 ms.
Honest trade-off. MTP's draft/verify overhead stops paying off once the batch saturates — at c32 the NVFP4/FP8 quants are ~11% faster. If your workload is fully-saturated batch serving, pick by that row.

Throughput — matched MTP-on comparison (8×H200, vLLM v0.23.0, same harness)

All three models served identically (--speculative-config '{"method":"mtp","num_speculative_tokens":5}', TP=8, fp8 KV); benchmarked with vllm bench serve (openai-chat endpoint, random 1024-in/512-out, num-prompts = 8×concurrency). Measured 2026-07-02 on AWS p5e.48xlarge.

concurrency	This (W4A16 + MTP)	nvidia NVFP4 + MTP	zai FP8 + MTP
1 — tok/s (TTFT)	125.7 (215 ms)	116.3 (632 ms)	94.4 (1258 ms)
8	495.7 (319 ms)	455.5 (422 ms)	394.2 (742 ms)
32	828.4 (413 ms)	925.2 (403 ms)	921.4 (412 ms)

MTP spec-decode acceptance was ~28% for all three models on this synthetic workload (higher, ~46–52%, on natural eval traffic) — the draft head performs the same across quants, so this is a clean quant-vs-quant comparison. Note NVFP4 is a Blackwell-native format measured here on Hopper, where it has no FP4 tensor cores; treat its column as a Hopper-deployment number.

Purpose

GLM-5.2 (744B-parameter MoE) in BF16 needs ~1.49 TB of weights — eight 141 GB H200s, fully occupied, to serve one replica. The goal of this artifact is a smaller-footprint variant that matches FP8 quality so the model runs on four H200s instead of eight (freeing half the fleet, or two replicas per node), while keeping the MTP draft head for speculative-decode speedups. It is a deployment-efficiency artifact, not a new model — all capability comes from the base GLM-5.2.

Details

Field	Value
Base model	zai-org/GLM-5.2 (BF16)
Architecture	`GlmMoeDsaForCausalLM` — 744B MoE, ~40B active, MLA + DeepSeek Sparse Attention, 1M context
Weight quantization	W4A16, INT4, asymmetric, group-size 128 (GPTQ, compressed-tensors), routed experts only
Kept in BF16	attention, dense layers (0–2), shared experts, router/gate, embeddings, lm_head, MTP layer 78
MTP	layer 78 preserved at BF16 for spec-decode (`num_speculative_tokens=5`)
Calibration	in-distribution chat/code set; `calibrate_all_experts=True` (visits every expert — see Method)
Size	~405 GB (from ~1488 GB BF16)
License	MIT (inherited from the base model)

The "FP8" sometimes seen in the filename refers to the fp8 KV-cache used at serving time, not the weights — the weights are INT4 (W4A16) and the MTP layer is BF16.

Evaluation — vs the FP8 baseline (same harness, 8×H200)

Measured against zai-org/GLM-5.2-FP8 under an identical setup (generative tasks via chat-completions with a 16,384-token generation budget for the reasoning CoT; matched serve config with --reasoning-parser). Quality is measured with speculative decoding off, where it is exact — MTP changes latency, not outputs.

Task	This (W4A16+MTP)	FP8 baseline
GSM8K (strict)	0.960	0.955
IFEval (prompt-strict / inst-strict)	0.909 / 0.911	0.891 / 0.903
MATH-500 (math-verify)	0.954	0.958
RULER @ 32K	0.832	0.831
RULER @ 64K	0.841	0.813
SWE-bench Verified (mini-SWE-agent + official grading)	82.0% (410/500)	82.2% (411/500)

Quantization preserves quality: scores track the FP8 baseline within run-to-run noise on reasoning, instruction-following, long-context retrieval, and agentic coding. (MMLU-Pro: FP8 full-set = 0.820; the W4A16 subset run was not completed — the verdict was already conclusive from the six tasks above. RULER used 50 samples per sub-task, not the full 500.)

Long context: serves at max_model_len=1,048,576 on 8×H200 and correctly retrieved a needle from a ~936K-token prompt (MLA + DSA compress the KV cache enough to fit 1M in the memory free after weights). On 4×H200 it serves 128K validated (single-stream engine ceiling ~239K at gpu-memory-utilization=0.92; 256K overflows the post-weights KV budget) and retrieved a 64K needle at both mid- and end-placement.

MTP: speculative-decode acceptance ~~46–52% aggregate (~~95% at draft position 0) on natural eval traffic on 8×H200, confirming the injected BF16 MTP layer is healthy. On 4×H200 (TP=4, 128K) aggregate acceptance is ~38% (7,848/20,765 draft tokens, mean accept-length ~2.9) — mildly lower under the tighter memory split but still a net speedup.

Serving (vLLM ≥ 0.23, Hopper / H200)

The asymmetric W4A16 MoE requires expert parallelism (--enable-expert-parallel); plain tensor-parallel trips a Marlin scale-sharding bug. The DSA indexer needs an nvcc ≥ 12.8 toolchain (CUDA_HOME). Validated on vLLM v0.23.0 (newer versions changed the DSA indexer layout — v0.24+ currently fails to load this checkpoint's per-layer indexer weights; pin v0.23.x until upstream support lands).

8×H200 (up to 1M context):

vllm serve <repo> \
  --tensor-parallel-size 8 --enable-expert-parallel \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":5}' \
  --reasoning-parser glm45 --tool-call-parser glm47 --enable-auto-tool-choice \
  --max-model-len 1048576 --gpu-memory-utilization 0.90 --trust-remote-code

4×H200 (the footprint win, 128K validated / ~239K single-stream ceiling — 1M needs all 8):

vllm serve <repo> --tensor-parallel-size 4 --enable-expert-parallel \
  --kv-cache-dtype fp8 --speculative-config '{"method":"mtp","num_speculative_tokens":5}' \
  --reasoning-parser glm45 --tool-call-parser glm47 --enable-auto-tool-choice \
  --max-model-len 32768 --gpu-memory-utilization 0.92 --trust-remote-code

Method

GPTQ W4A16 (group-128, asymmetric) on the routed experts only, with attention/dense/MTP/embeddings/ lm_head held at BF16. calibrate_all_experts=True is required — calibrating only routed experts starves rarely-activated experts and produces a coherent-looking but degenerate model.
MTP preservation (Option-Y): GLM-5.2's MTP/nextn layer (index 78) isn't instantiated by from_pretrained, so quantization never sees it. It is injected back at BF16 from the source checkpoint after quantization and added to the ignore list so the serving stack treats it as unquantized.

The full recipe, evaluation methodology, and a log of the engineering walls hit and overcome are in the companion repository (calibration memory limits, MoE sequential-target OOMs, the MTP-loss-on-save issue, the asymmetric-MoE serving fix, and the Blackwell toolchain gaps).

Limitations

The interactive edge is +8% vs NVFP4+MTP / +33% vs FP8+MTP at c1; at full saturation (c32) those quants are ~11% faster. Pick by your operating point.
1M-context serving requires all 8 H200s; 4×H200 serves up to ~128K (single-stream engine ceiling ~239K), with MTP acceptance ~38% (vs ~46–52% on 8×H200).
Asymmetric weights require --enable-expert-parallel to serve correctly.
Pin vLLM v0.23.x (v0.24+ DSA-indexer layout change breaks loading). Blackwell serving needs additional kernel flags; recommended on Hopper.

Acknowledgements

Built on zai-org/GLM-5.2 (MIT). Quantized with llm-compressor; served with vLLM.

Downloads last month: 138

Safetensors

Model size

116B params

Tensor type

I64

F32

I32

BF16

Model tree for canada-quant/GLM-5.2-W4A16-MTP

Base model

zai-org/GLM-5.2

Quantized

(77)

this model