GLM-5.1-AWQ-4bit

INT4 AWQ quantized version of zai-org/GLM-5.1.

This checkpoint preserves the GLM-5.1 MoE + MLA + DSA architecture of the BF16 source, with the experts' MLP gate/up projections quantized to INT4 for ~3x compression.

Quantization Strategy

Group-wise INT4 asymmetric quantization with 5+5 edge layer protection:

| Precision | Layers |
|---|---|
| INT4 | MLP gate_proj/up_proj (256 routed experts + shared expert, layers 5-72) |
| BF16 | First 5 layers, last 5 layers, all down_proj, MLA projections, lm_head, embed_tokens, router gates, norms |
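The group-wise asymmetric scheme above can be sketched in a few lines of numpy. This is an illustration of the quantization math only, not the compressed-tensors kernel; the tensor shape is a hypothetical gate_proj-sized slice.

```python
import numpy as np

def quantize_int4(w, group_size=128):
    """Asymmetric group-wise INT4: each group of 128 weights gets its own
    scale and zero point, mapping the group's [min, max] range onto 0..15."""
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0      # 15 = 2**4 - 1 levels
    zero = np.round(-lo / scale)
    q = np.clip(np.round(g / scale + zero), 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_int4(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

# Hypothetical expert matrix: moe_intermediate_size x hidden_size.
w = np.random.default_rng(0).normal(size=(2048, 6144)).astype(np.float32)
q, scale, zero = quantize_int4(w)
w_hat = dequantize_int4(q, scale, zero).reshape(w.shape)
max_err = np.abs(w_hat - w).max()   # bounded by ~scale/2 per group
```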

Architecture match with the BF16 source:

  • model_type=glm_moe_dsa
  • 78 layers (3 dense + 75 MoE, first_k_dense_replace=3)
  • n_routed_experts=256, num_experts_per_tok=8, n_shared_experts=1
  • max_position_embeddings=202752
  • hidden_size=6144, moe_intermediate_size=2048
  • vocab_size=154880
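As a sanity check, the expert-parameter budget implied by these config values can be reproduced with back-of-envelope arithmetic. This counts expert MLPs only; attention, the dense layers, and embeddings contribute the remainder of the ~40B active figure quoted in the notes below.

```python
# Expert-parameter estimate from the config values above.
hidden, inter = 6144, 2048                  # hidden_size, moe_intermediate_size
per_expert = 3 * hidden * inter             # gate_proj + up_proj + down_proj
moe_layers = 75
total_routed = moe_layers * 256 * per_expert         # n_routed_experts = 256
active_experts = moe_layers * (8 + 1) * per_expert   # 8 routed + 1 shared per token
print(f"routed expert params: {total_routed / 1e9:.0f}B")    # ≈ 725B
print(f"active expert params: {active_experts / 1e9:.1f}B")  # ≈ 25.5B
```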

Calibration

  • 256 calibration samples generated from GLM-5.1 via OpenRouter
  • 8 diverse categories: math, code, logic, analysis, creative writing, general knowledge, agentic/tool-calling, Korean
  • Group size 128 for INT4 quantization

Usage

vLLM

vllm serve mconcat/GLM-5.1-AWQ-4bit \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code
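Once the server is up, it exposes vLLM's OpenAI-compatible API on port 8000 by default. A quick smoke test (prompt is illustrative):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mconcat/GLM-5.1-AWQ-4bit",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```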

transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mconcat/GLM-5.1-AWQ-4bit",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("mconcat/GLM-5.1-AWQ-4bit")
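Continuing from the snippet above, a minimal generation round-trip might look like this (sketch; the prompt and generation arguments are illustrative):

```python
# Assumes `model` and `tokenizer` from the loading snippet above.
messages = [{"role": "user", "content": "Summarize AWQ quantization in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```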

Compatibility

| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.19.0 | Yes | Uniform INT4 on gate/up, BF16 down; fused MoE compatible |
| SGLang >= 0.5.10 | Partial | compressed-tensors INT4 support varies |
| transformers >= 5.4.0 | Yes | Direct loading with device_map="auto" |

Notes

  • This is a 754B-parameter MoE model (~40B active per token); inference requires a multi-GPU setup (8x 80GB+ GPUs recommended).
  • Edge layer protection: the first 5 and last 5 layers are kept in BF16 to preserve quality.
  • INT4 yields ~3x compression vs BF16, a higher ratio than FP8's 2x, with correspondingly larger potential quality trade-offs.
  • GLM-5.1 does not ship MTP weights despite num_nextn_predict_layers=1 in config.

Blackwell SM120 Patch (RTX PRO 6000 / Workstation GPUs)

If running on Blackwell workstation GPUs (SM 12.0), see GLM-5.1-FP8-Dynamic README for vLLM patch instructions.

Quantization Process

  • Method: AWQ (Activation-aware Weight Quantization) with compressed-tensors format
  • Hardware: Single NVIDIA RTX PRO 6000 Blackwell (96 GB), processed one layer at a time
  • Time: ~150 minutes for 78 layers
  • Calibration: 256 samples with per-group scale computation
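The activation-aware part of AWQ can be illustrated with a small numpy sketch: per-input-channel scales derived from calibration activation magnitudes are grid-searched to minimize the quantized layer's output error. This is a toy reconstruction under assumed shapes and random data, not the pipeline used for this checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

def quant_dequant(w, n_bits=4, group_size=128):
    """Round-trip asymmetric group-wise quantization."""
    shape = w.shape
    g = w.reshape(-1, group_size)
    lo, hi = g.min(1, keepdims=True), g.max(1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / (2**n_bits - 1)
    q = np.clip(np.round((g - lo) / scale), 0, 2**n_bits - 1)
    return (q * scale + lo).reshape(shape)

def awq_search(w, x, grid=20):
    """Pick per-input-channel scales s = act_mag**alpha minimizing output MSE."""
    act_mag = np.abs(x).mean(0) + 1e-8      # per-channel activation magnitude
    y_ref = x @ w
    best_err, best_alpha = np.inf, 0.0
    for alpha in np.linspace(0.0, 1.0, grid):
        s = act_mag ** alpha
        # Scale up salient channels, quantize, then undo the scale.
        w_q = quant_dequant(w * s[:, None]) / s[:, None]
        err = np.mean((x @ w_q - y_ref) ** 2)
        if err < best_err:
            best_err, best_alpha = err, alpha
    return best_alpha, best_err

# Toy layer and calibration batch with uneven channel magnitudes.
w = rng.normal(size=(512, 256)).astype(np.float32)
x = (rng.normal(size=(64, 512)) * rng.uniform(0.1, 5.0, size=512)).astype(np.float32)
alpha, err = awq_search(w, x)
```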