
GLM-4.6 NVFP4

NVFP4 quantization of zai-org/GLM-4.6, produced using NVIDIA ModelOpt.

Model Details

  • Base model: zai-org/GLM-4.6
  • Architecture: Mixture-of-Experts, 357B total parameters (92 transformer layers, 160 routed experts per layer)
  • Quantization: NVFP4 (4-bit float, group_size=16, blockwise scales)
  • Calibration: 512 samples from BAAI/Infinity-Instruct, max sequence length 8192
  • Quantization tool: NVIDIA ModelOpt
  • Attention layers: kept in BF16 (not quantized)
  • Checkpoint format: pre-packed uint8 weights + blockwise float8_e4m3fn scales, directly loadable by SGLang and vLLM
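To make the quantization scheme concrete, here is a minimal sketch of NVFP4-style group quantization with group_size=16: one scale per 16 weights, with values snapped to the FP4 (E2M1) grid. This is illustrative only; ModelOpt's actual calibration and rounding differ in detail.

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group(w, group_size=16):
    """Quantize a 1-D float vector group-by-group onto the FP4 grid."""
    g = w.reshape(-1, group_size)
    # One scale per group, chosen so the group max maps to the grid max (6.0).
    scales = np.abs(g).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales = np.where(scales == 0, 1.0, scales)   # guard all-zero groups
    scaled = g / scales
    # Snap to the nearest grid magnitude, preserving sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scales

w = np.random.default_rng(0).normal(size=64)
q, scales = quantize_group(w)
w_hat = (q * scales).reshape(-1)
print(float(np.abs(w - w_hat).max()))  # reconstruction error, at most one scale unit
```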

Compatibility

Framework   Version                   Status
SGLang      0.4+                      βœ… Tested
vLLM        0.16+                     βœ… Tested
CUDA        SM 100a (B200)            βœ… Tested
CUDA        SM 89 (L40S / RTX 4090)   βœ… Tested (15.1)

Usage

SGLang

python3 -m sglang.launch_server \
  --model ahanley22/GLM-4.6-NVFP4 \
  --trust-remote-code \
  --tp 8 \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.88 \
  --host 0.0.0.0 \
  --port 8000

vLLM

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve ahanley22/GLM-4.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --quantization modelopt_fp4 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000

Inference (OpenAI-compatible API)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="ahanley22/GLM-4.6-NVFP4",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
    temperature=1.0,
)
print(response.choices[0].message.content)

Quantization Notes

  • MoE expert layers (layers 3–91): calibrated blockwise scales from 512-sample calibration run
  • Dense MLP layers (layers 0–2): calibrated blockwise scales
  • Attention projections: excluded from quantization, remain BF16
  • lm_head: excluded from quantization, remains BF16
  • Scale format: float8_e4m3fn, shape [out_features, in_features // 16]
  • Weight format: uint8, shape [out_features, in_features // 2] (two FP4 values per byte)
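The packed layout above can be decoded with a short sketch: each uint8 holds two FP4 (E2M1) codes, and one scale covers 16 weights. The nibble order (low nibble first) and the sign-bit code table are assumptions for illustration; the SGLang/vLLM loaders define the authoritative layout.

```python
import numpy as np

# Code table: codes 0-7 are positive FP4 magnitudes, codes 8-15 their negatives
# (assumed sign-bit-high encoding, for illustration only).
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                       -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize(packed, scales):
    """[out, in//2] uint8 + [out, in//16] scales -> [out, in] floats."""
    lo, hi = packed & 0x0F, packed >> 4
    # Interleave low/high nibbles back into the original weight order.
    codes = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)
    return FP4_VALUES[codes] * np.repeat(scales, 16, axis=1)

packed = np.array([[0x21, 0x73] * 4], dtype=np.uint8)  # 16 codes = one group
scales = np.array([[2.0]])
print(dequantize(packed, scales))  # codes 1,2,3,7 -> 0.5,1,1.5,6, times scale 2
```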

Hardware Requirements

Minimum recommended: 4Γ— H100/H200/B200 (80 GB+) or equivalent at TP=4. For best throughput, and to avoid OOM on large batches, run TP=8 across eight GPUs.
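As a rough check on the 4-GPU recommendation, the quantized weight footprint can be estimated from the formats above (a back-of-the-envelope sketch; it ignores the BF16 attention/lm_head tensors, activations, and the KV cache):

```python
# 357B parameters at 4 bits each, plus one float8_e4m3fn scale per 16 weights.
total_params = 357e9
weight_bytes = total_params * 0.5      # 4 bits per FP4 weight
scale_bytes = total_params / 16        # 1 byte per blockwise scale
total_gb = (weight_bytes + scale_bytes) / 1e9
print(f"weights ~{total_gb:.0f} GB, ~{total_gb / 4:.0f} GB/GPU at TP=4, "
      f"~{total_gb / 8:.0f} GB/GPU at TP=8")
```

At TP=4 this leaves roughly 30 GB per 80 GB GPU for activations and KV cache, which is why larger batches benefit from TP=8.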

About GLM-4.6

GLM-4.6 is the latest flagship model from Z.AI's GLM series with 357B total parameters in a Mixture-of-Experts architecture. Key capabilities include a 200K token context window, strong coding and reasoning performance competitive with Claude Sonnet 4, advanced tool use and agentic capabilities, and refined writing quality.

License

MIT β€” see base model license.
