# GLM-4.6 NVFP4
An NVFP4 quantization of zai-org/GLM-4.6, produced with NVIDIA TensorRT Model Optimizer (ModelOpt).
## Quantization
- Blockwise scaling: every 16 consecutive weight values share a single FP8 scale factor, so each small block is calibrated to its own range rather than to one global range
- Calibration: the scales are computed by running representative text through the model and measuring the actual range of values each weight block produces; good calibration is the main driver of quantization quality
- What is quantized: all Linear layers in the MLP and MoE expert projections (gate, up, down) across all 92 transformer layers, covering the vast majority of parameters. Both weights and input activations are quantized with group_size=16
- What is not quantized: the self-attention layers in all 92 layers and lm_head remain in BF16. Attention weights are precision-sensitive yet account for only a small fraction of total memory, so quantizing them is a poor quality tradeoff
- Result: the model shrinks from ~700 GB in BF16 to ~220 GB in NVFP4 and deploys on 4–8 H100/H200/B200 GPUs, with modest quality degradation thanks to the calibrated blockwise scales
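The blockwise scheme above can be sketched in a few lines of NumPy. This is a simplified fake-quantization illustration, not ModelOpt's implementation: it keeps the per-block scale in float (real NVFP4 stores it in FP8) and rounds each element to the nearest FP4 E2M1 magnitude.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1, the 4-bit element format NVFP4 uses.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_nvfp4(weights, group_size=16):
    """Simulate NVFP4 blockwise quantization: each group of 16 values
    shares one scale, chosen so the group's max magnitude maps to 6.0
    (the FP4 maximum). Simplified sketch: the scale stays in float here,
    whereas real NVFP4 encodes it as FP8."""
    w = np.asarray(weights, dtype=np.float64).reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 6.0
    scales = np.where(scales == 0.0, 1.0, scales)  # avoid divide-by-zero
    scaled = w / scales
    # Round each magnitude to the nearest representable FP4 value, keep sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_VALUES).argmin(axis=-1)
    q = FP4_VALUES[idx] * np.sign(scaled)
    # Dequantize back to float so the rounding error is visible.
    return (q * scales).reshape(np.shape(weights))
```

A block whose values are already FP4-representable after scaling round-trips exactly; arbitrary values land within half the local FP4 step size times the block scale, which is why per-block scales lose far less than one global scale would.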
## Usage
### SGLang

```shell
python3 -m sglang.launch_server \
  --model AH22-neb/GLM-4.6-NVFP4 \
  --trust-remote-code \
  --tp 8 \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --kv-cache-dtype fp8_e4m3 \
  --host 0.0.0.0 --port 8000
```
### vLLM

```shell
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve AH22-neb/GLM-4.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --quantization modelopt_fp4 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000
```
## License
MIT — see base model license.