# GLM-4.6 NVFP4
An NVFP4 quantization of zai-org/GLM-4.6, produced with NVIDIA TensorRT Model Optimizer (ModelOpt).
## Quantization
- Blockwise scaling: every 16 consecutive weight values share a single FP8 scale factor, so each small block is calibrated to its own range rather than to one global range
- Calibration: the scales are computed by running representative text through the model and measuring the actual range of values each weight block produces; good calibration is the main driver of quantization quality
- What is quantized: all Linear layers in the MLP and MoE expert projections (gate, up, down) across all 92 transformer layers, covering the vast majority of parameters. Both weights and input activations are quantized with group_size=16
- What is not quantized: the self-attention layers in all 92 layers and lm_head remain in BF16. Attention weights are precision-sensitive yet account for only a small fraction of total memory, so quantizing them is a poor quality tradeoff
- Result: the model shrinks from ~700 GB in BF16 to ~220 GB in NVFP4 and deploys on 4–8 H100/H200/B200 GPUs, with modest quality degradation thanks to the calibrated blockwise scales
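The blockwise scheme above can be sketched in a few lines of NumPy. This is a simplified fake-quantization illustration, not ModelOpt's implementation: it keeps the per-block scale in float (real NVFP4 stores it in FP8) and rounds each element to the nearest FP4 E2M1 magnitude.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1, the 4-bit element format NVFP4 uses.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_nvfp4(weights, group_size=16):
    """Simulate NVFP4 blockwise quantization: each group of 16 values
    shares one scale, chosen so the group's max magnitude maps to 6.0
    (the FP4 maximum). Simplified sketch: the scale stays in float here,
    whereas real NVFP4 encodes it as FP8."""
    w = np.asarray(weights, dtype=np.float64).reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 6.0
    scales = np.where(scales == 0.0, 1.0, scales)  # avoid divide-by-zero
    scaled = w / scales
    # Round each magnitude to the nearest representable FP4 value, keep sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_VALUES).argmin(axis=-1)
    q = FP4_VALUES[idx] * np.sign(scaled)
    # Dequantize back to float so the rounding error is visible.
    return (q * scales).reshape(np.shape(weights))
```

A block whose values are already FP4-representable after scaling round-trips exactly; arbitrary values land within half the local FP4 step size times the block scale, which is why per-block scales lose far less than one global scale would.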
## Usage
### SGLang

```shell
python3 -m sglang.launch_server \
  --model AH22-neb/GLM-4.6-NVFP4 \
  --trust-remote-code \
  --tp 8 \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --kv-cache-dtype fp8_e4m3 \
  --host 0.0.0.0 --port 8000
```
### vLLM

```shell
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve AH22-neb/GLM-4.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --quantization modelopt_fp4 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000
```
## License
MIT — see base model license.