# GLM-4.6 NVFP4

An NVFP4 quantization of [zai-org/GLM-4.6](https://huggingface.co/zai-org/GLM-4.6), produced with NVIDIA ModelOpt.
## Model Details
- Base model: zai-org/GLM-4.6
- Architecture: Mixture-of-Experts, 357B total parameters (92 transformer layers, 160 routed experts per layer)
- Quantization: NVFP4 (4-bit float, group_size=16, blockwise scales)
- Calibration: 512 samples from BAAI/Infinity-Instruct, max sequence length 8192
- Quantization tool: NVIDIA ModelOpt
- Attention layers: kept in BF16 (not quantized)
- Checkpoint format: pre-packed uint8 weights + blockwise float8_e4m3fn scales, directly loadable by SGLang and vLLM
## Compatibility

| Framework | Version | Status |
|---|---|---|
| SGLang | 0.4+ | ✅ Tested |
| vLLM | 0.16+ | ✅ Tested |
| CUDA SM | 100a (B200) | ✅ Tested |
| CUDA SM | 89 (L40S / RTX 4090) | ✅ Tested (15.1) |
## Usage

### SGLang

```shell
python3 -m sglang.launch_server \
  --model ahanley22/GLM-4.6-NVFP4 \
  --trust-remote-code \
  --tp 8 \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.88 \
  --host 0.0.0.0 \
  --port 8000
```
### vLLM

```shell
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve ahanley22/GLM-4.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --quantization modelopt_fp4 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000
```
### Inference (OpenAI-compatible API)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="ahanley22/GLM-4.6-NVFP4",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
    temperature=1.0,
)
print(response.choices[0].message.content)
```
## Quantization Notes

- MoE expert layers (layers 3–91): calibrated blockwise scales from the 512-sample calibration run
- Dense MLP layers (layers 0–2): calibrated blockwise scales
- Attention projections: excluded from quantization, remain BF16
- lm_head: excluded from quantization, remains BF16
- Weight format: `uint8`, shape `[out_features, in_features // 2]` (two FP4 values packed per byte)
- Scale format: `float8_e4m3fn`, shape `[out_features, in_features // 16]` (one scale per 16-element block)
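As a sketch of how this packed layout can be decoded: FP4 here is the E2M1 format (16 representable values; bit 3 is the sign), two codes are packed per `uint8`, and each 16-element block along the input dimension shares one scale. The nibble order (low nibble = even-indexed element) is an assumption about the packing convention, and scales are shown as `float32` rather than `float8_e4m3fn` for simplicity.

```python
import numpy as np

# E2M1 (FP4) magnitudes for codes 0..7; codes 8..15 are the negated values.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_LUT = np.concatenate([E2M1, -E2M1])  # lookup table for all 16 codes

def dequantize_nvfp4(packed: np.ndarray, scales: np.ndarray,
                     group_size: int = 16) -> np.ndarray:
    """Unpack uint8 weights [out, in//2] and apply blockwise scales [out, in//16]."""
    lo = packed & 0x0F            # assumed: low nibble holds the even-indexed element
    hi = packed >> 4
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    codes[:, 0::2] = lo
    codes[:, 1::2] = hi
    vals = FP4_LUT[codes]
    # broadcast one scale across each 16-element block of the input dimension
    return vals * np.repeat(scales.astype(np.float32), group_size, axis=1)
```

For example, a packed byte `0x21` decodes to the pair `(0.5, 1.0)` before scaling. In practice SGLang and vLLM consume the packed checkpoint directly via their `modelopt_fp4` kernels; this is only for inspecting tensors offline.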
## Hardware Requirements
Minimum recommended: 4× H100/H200/B200-class GPUs (80 GB+ each) for TP=4. For best throughput, and to avoid OOM on large batches, use 8 GPUs with TP=8.
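A rough back-of-envelope check of these requirements, using the 357B parameter count from above: NVFP4 stores 0.5 bytes per parameter plus one 1-byte float8 scale per 16-parameter block. This deliberately ignores the BF16 attention projections and lm_head, KV cache, and activations, so the real per-GPU footprint is somewhat higher.

```python
def nvfp4_weight_bytes(n_params: float, group_size: int = 16) -> float:
    """Approximate NVFP4 checkpoint size: 0.5 B/param for the 4-bit weights
    plus one float8 scale (1 byte) per group of `group_size` params."""
    return n_params * (0.5 + 1.0 / group_size)

total_gb = nvfp4_weight_bytes(357e9) / 1e9   # ~200.8 GB of quantized weights
per_gpu_tp8_gb = total_gb / 8                # ~25 GB/GPU before KV cache etc.
```

At TP=4 that is roughly 50 GB of weights per GPU, which is why 80 GB-class cards are the practical floor.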
## About GLM-4.6
GLM-4.6 is the latest flagship model from Z.AI's GLM series with 357B total parameters in a Mixture-of-Experts architecture. Key capabilities include a 200K token context window, strong coding and reasoning performance competitive with Claude Sonnet 4, advanced tool use and agentic capabilities, and refined writing quality.
## License

MIT; see the base model's license.