
GLM-4.6 NVFP4

NVFP4 quantization of zai-org/GLM-4.6, produced using NVIDIA ModelOpt.

Model Details

  • Base model: zai-org/GLM-4.6
  • Architecture: Mixture-of-Experts, 357B total parameters (92 transformer layers, 160 routed experts per layer)
  • Quantization: NVFP4 (4-bit float, group_size=16, blockwise scales)
  • Calibration: 512 samples from BAAI/Infinity-Instruct, max sequence length 8192
  • Quantization tool: NVIDIA ModelOpt
  • Attention layers: kept in BF16 (not quantized)
  • Checkpoint format: pre-packed uint8 weights + blockwise float8_e4m3fn scales, directly loadable by SGLang and vLLM
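To make the quantization scheme concrete, here is a minimal sketch of NVFP4-style group quantization with group_size=16: one scale per 16 weights, with values snapped to the FP4 (E2M1) grid. This is illustrative only; ModelOpt's actual calibration and rounding differ in detail.

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group(w, group_size=16):
    """Quantize a 1-D float vector group-by-group onto the FP4 grid."""
    g = w.reshape(-1, group_size)
    # One scale per group, chosen so the group max maps to the grid max (6.0).
    scales = np.abs(g).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales = np.where(scales == 0, 1.0, scales)   # guard all-zero groups
    scaled = g / scales
    # Snap to the nearest grid magnitude, preserving sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scales

w = np.random.default_rng(0).normal(size=64)
q, scales = quantize_group(w)
w_hat = (q * scales).reshape(-1)
print(float(np.abs(w - w_hat).max()))  # reconstruction error, at most one scale unit
```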

Compatibility

Framework   Version                   Status
SGLang      0.4+                      βœ… Tested
vLLM        0.16+                     βœ… Tested
CUDA        SM 100a (B200)            βœ… Tested
CUDA        SM 89 (L40S / RTX 4090)   βœ… Tested (15.1)

Usage

SGLang

python3 -m sglang.launch_server \
  --model ahanley22/GLM-4.6-NVFP4 \
  --trust-remote-code \
  --tp 8 \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.88 \
  --host 0.0.0.0 \
  --port 8000

vLLM

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve ahanley22/GLM-4.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --quantization modelopt_fp4 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000

Inference (OpenAI-compatible API)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="ahanley22/GLM-4.6-NVFP4",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
    temperature=1.0,
)
print(response.choices[0].message.content)

Quantization Notes

  • MoE expert layers (layers 3–91): calibrated blockwise scales from 512-sample calibration run
  • Dense MLP layers (layers 0–2): calibrated blockwise scales
  • Attention projections: excluded from quantization, remain BF16
  • lm_head: excluded from quantization, remains BF16
  • Scale format: float8_e4m3fn, shape [out_features, in_features // 16]
  • Weight format: uint8, shape [out_features, in_features // 2] (two FP4 values per byte)
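The packed layout above can be decoded with a short sketch: each uint8 holds two FP4 (E2M1) codes, and one scale covers 16 weights. The nibble order (low nibble first) and the sign-bit code table are assumptions for illustration; the SGLang/vLLM loaders define the authoritative layout.

```python
import numpy as np

# Code table: codes 0-7 are positive FP4 magnitudes, codes 8-15 their negatives
# (assumed sign-bit-high encoding, for illustration only).
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                       -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize(packed, scales):
    """[out, in//2] uint8 + [out, in//16] scales -> [out, in] floats."""
    lo, hi = packed & 0x0F, packed >> 4
    # Interleave low/high nibbles back into the original weight order.
    codes = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)
    return FP4_VALUES[codes] * np.repeat(scales, 16, axis=1)

packed = np.array([[0x21, 0x73] * 4], dtype=np.uint8)  # 16 codes = one group
scales = np.array([[2.0]])
print(dequantize(packed, scales))  # codes 1,2,3,7 -> 0.5,1,1.5,6, times scale 2
```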

Hardware Requirements

Minimum recommended: 4Γ— H100/H200/B200 (80 GB+) or equivalent at TP=4. For best throughput, and to avoid OOM on large batches, run TP=8 across eight GPUs.
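As a rough check on the 4-GPU recommendation, the quantized weight footprint can be estimated from the formats above (a back-of-the-envelope sketch; it ignores the BF16 attention/lm_head tensors, activations, and the KV cache):

```python
# 357B parameters at 4 bits each, plus one float8_e4m3fn scale per 16 weights.
total_params = 357e9
weight_bytes = total_params * 0.5      # 4 bits per FP4 weight
scale_bytes = total_params / 16        # 1 byte per blockwise scale
total_gb = (weight_bytes + scale_bytes) / 1e9
print(f"weights ~{total_gb:.0f} GB, ~{total_gb / 4:.0f} GB/GPU at TP=4, "
      f"~{total_gb / 8:.0f} GB/GPU at TP=8")
```

At TP=4 this leaves roughly 30 GB per 80 GB GPU for activations and KV cache, which is why larger batches benefit from TP=8.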

About GLM-4.6

GLM-4.6 is the latest flagship model from Z.AI's GLM series with 357B total parameters in a Mixture-of-Experts architecture. Key capabilities include a 200K token context window, strong coding and reasoning performance competitive with Claude Sonnet 4, advanced tool use and agentic capabilities, and refined writing quality.

License

MIT β€” see base model license.
