
GLM-4.6 NVFP4

NVFP4 quantization of zai-org/GLM-4.6, produced using NVIDIA ModelOpt.

Quantization

  • Blockwise scaling — every 16 consecutive weight values share their own float8 scale factor, so each small neighborhood gets its own calibration rather than a single global range
  • Calibration computes those scales by running real text through the model and measuring the actual range of values each weight block produces, so the scales reflect the distributions seen at inference time rather than worst-case static ranges
  • What is quantized: all Linear layers — MLP and MoE expert projections (gate, up, down) across all 92 transformer layers, covering the vast majority of parameters. Both weights and input activations are quantized at group_size=16
  • What is not quantized: all self-attention layers across all 92 layers, and lm_head — these remain in BF16. Attention weights are precision-sensitive and represent a small fraction of total memory, so the quality tradeoff is not worth it
  • Result: the model shrinks from ~700GB in BF16 to ~220GB in NVFP4, deployable on 4–8 H100/H200/B200 GPUs, with modest quality degradation thanks to the calibrated blockwise scales
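The blockwise scheme above can be sketched in NumPy. Each 16-value block gets a shared scale chosen so the block's largest magnitude lands on FP4's maximum representable value (6.0), and every element is then rounded to the nearest FP4 (E2M1) grid point. This is an illustrative simulation, not ModelOpt's implementation: real NVFP4 stores each per-block scale in FP8 E4M3 and packs two 4-bit elements per byte, giving 16×4 bits + one 8-bit scale = 4.5 bits per weight versus 16 bits for BF16.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1, the NVFP4 element format.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_block(block):
    """Quantize one 16-value block: one shared scale + FP4 elements."""
    # The block scale maps the block's max magnitude onto FP4's max (6.0).
    # Real NVFP4 would round this scale to FP8 E4M3; kept float here for clarity.
    amax = np.abs(block).max()
    scale = amax / 6.0 if amax > 0 else 1.0
    scaled = block / scale
    # Snap each element to the nearest signed FP4 grid point.
    candidates = np.sign(scaled)[:, None] * E2M1_GRID        # (16, 8)
    idx = np.abs(scaled[:, None] - candidates).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, scale

def quantize_nvfp4(weights, group_size=16):
    """Blockwise NVFP4-style fake-quantization of a 1-D weight vector.

    Returns the dequantized weights (what the matmul would see) and the
    per-block scales.
    """
    assert weights.size % group_size == 0
    blocks = weights.reshape(-1, group_size)
    out = np.empty_like(blocks)
    scales = np.empty(len(blocks))
    for i, b in enumerate(blocks):
        q, s = quantize_nvfp4_block(b)
        out[i] = q * s  # dequantize: FP4 element times shared block scale
        scales[i] = s
    return out.reshape(weights.shape), scales
```

Because each block is scaled independently, a single outlier only inflates the quantization step of its own 16-value neighborhood instead of the whole tensor, which is the point of blockwise over per-tensor scaling.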

Usage

SGLang

python3 -m sglang.launch_server \
  --model AH22-neb/GLM-4.6-NVFP4 \
  --trust-remote-code \
  --tp 8 \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --kv-cache-dtype fp8_e4m3 \
  --host 0.0.0.0 --port 8000

vLLM

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve AH22-neb/GLM-4.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --quantization modelopt_fp4 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000
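Both servers above expose an OpenAI-compatible API on port 8000. A minimal stdlib-only client sketch, assuming the server is reachable at localhost:8000 (the model name must match the value passed on the command line):

```python
import json
import urllib.request

# Chat-completions payload for the OpenAI-compatible endpoint.
payload = {
    "model": "AH22-neb/GLM-4.6-NVFP4",
    "messages": [{"role": "user", "content": "Explain NVFP4 in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.7,
}

def build_request(host="localhost", port=8000):
    """Construct (but do not send) the chat-completions HTTP request."""
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Against a running server, send it like this:
# with urllib.request.urlopen(build_request()) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```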

License

MIT — see base model license.
