GLM-5.1-AWQ-4bit

INT4 AWQ quantized version of zai-org/GLM-5.1.

This checkpoint preserves the GLM-5.1 MoE + MLA + DSA architecture of the BF16 source, with the experts' MLP gate/up projections quantized to INT4 for ~3x compression.

Quantization Strategy

Group-wise INT4 asymmetric quantization with 5+5 edge layer protection:

| Precision | Layers |
|---|---|
| INT4 | MLP gate_proj/up_proj (256 routed experts + shared expert, layers 5-72) |
| BF16 | First 5 layers, last 5 layers, all down_proj, MLA projections, lm_head, embed_tokens, router gates, norms |
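The group-wise asymmetric scheme above can be sketched in a few lines of numpy. This is an illustration of the quantization math only, not the compressed-tensors kernel; the tensor shape is a hypothetical gate_proj-sized slice.

```python
import numpy as np

def quantize_int4(w, group_size=128):
    """Asymmetric group-wise INT4: each group of 128 weights gets its own
    scale and zero point, mapping the group's [min, max] range onto 0..15."""
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0      # 15 = 2**4 - 1 levels
    zero = np.round(-lo / scale)
    q = np.clip(np.round(g / scale + zero), 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_int4(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

# Hypothetical expert matrix: moe_intermediate_size x hidden_size.
w = np.random.default_rng(0).normal(size=(2048, 6144)).astype(np.float32)
q, scale, zero = quantize_int4(w)
w_hat = dequantize_int4(q, scale, zero).reshape(w.shape)
max_err = np.abs(w_hat - w).max()   # bounded by ~scale/2 per group
```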

Architecture match with the BF16 source:

  • model_type=glm_moe_dsa
  • 78 layers (3 dense + 75 MoE, first_k_dense_replace=3)
  • n_routed_experts=256, num_experts_per_tok=8, n_shared_experts=1
  • max_position_embeddings=202752
  • hidden_size=6144, moe_intermediate_size=2048
  • vocab_size=154880
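As a sanity check, the expert-parameter budget implied by these config values can be reproduced with back-of-envelope arithmetic. This counts expert MLPs only; attention, the dense layers, and embeddings contribute the remainder of the ~40B active figure quoted in the notes below.

```python
# Expert-parameter estimate from the config values above.
hidden, inter = 6144, 2048                  # hidden_size, moe_intermediate_size
per_expert = 3 * hidden * inter             # gate_proj + up_proj + down_proj
moe_layers = 75
total_routed = moe_layers * 256 * per_expert         # n_routed_experts = 256
active_experts = moe_layers * (8 + 1) * per_expert   # 8 routed + 1 shared per token
print(f"routed expert params: {total_routed / 1e9:.0f}B")    # ≈ 725B
print(f"active expert params: {active_experts / 1e9:.1f}B")  # ≈ 25.5B
```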

Calibration

  • 256 calibration samples generated from GLM-5.1 via OpenRouter
  • 8 diverse categories: math, code, logic, analysis, creative writing, general knowledge, agentic/tool-calling, Korean
  • Group size 128 for INT4 quantization

Usage

vLLM

vllm serve mconcat/GLM-5.1-AWQ-4bit \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code
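Once the server is up, it exposes vLLM's OpenAI-compatible API on port 8000 by default. A quick smoke test (prompt is illustrative):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mconcat/GLM-5.1-AWQ-4bit",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```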

transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mconcat/GLM-5.1-AWQ-4bit",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("mconcat/GLM-5.1-AWQ-4bit")
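Continuing from the snippet above, a minimal generation round-trip might look like this (sketch; the prompt and generation arguments are illustrative):

```python
# Assumes `model` and `tokenizer` from the loading snippet above.
messages = [{"role": "user", "content": "Summarize AWQ quantization in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```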

Compatibility

| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.19.0 | Yes | Uniform INT4 on gate/up, BF16 down; fused MoE compatible |
| SGLang >= 0.5.10 | Partial | compressed-tensors INT4 support varies |
| transformers >= 5.4.0 | Yes | Direct loading with device_map="auto" |

Notes

  • This is a 754B-parameter MoE model (~40B active per token); inference requires a multi-GPU setup (8x 80GB+ GPUs recommended).
  • Edge layer protection: the first 5 and last 5 layers are kept in BF16 to preserve quality.
  • INT4 yields ~3x compression vs BF16, a higher ratio than FP8's 2x, with correspondingly larger potential quality trade-offs.
  • GLM-5.1 does not ship MTP weights despite num_nextn_predict_layers=1 in config.

Blackwell SM120 Patch (RTX PRO 6000 / Workstation GPUs)

If running on Blackwell workstation GPUs (SM 12.0), see GLM-5.1-FP8-Dynamic README for vLLM patch instructions.

Quantization Process

  • Method: AWQ (Activation-aware Weight Quantization) with compressed-tensors format
  • Hardware: Single NVIDIA RTX PRO 6000 Blackwell (96 GB), processed one layer at a time
  • Time: ~150 minutes for 78 layers
  • Calibration: 256 samples with per-group scale computation
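The activation-aware part of AWQ can be illustrated with a small numpy sketch: per-input-channel scales derived from calibration activation magnitudes are grid-searched to minimize the quantized layer's output error. This is a toy reconstruction under assumed shapes and random data, not the pipeline used for this checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

def quant_dequant(w, n_bits=4, group_size=128):
    """Round-trip asymmetric group-wise quantization."""
    shape = w.shape
    g = w.reshape(-1, group_size)
    lo, hi = g.min(1, keepdims=True), g.max(1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / (2**n_bits - 1)
    q = np.clip(np.round((g - lo) / scale), 0, 2**n_bits - 1)
    return (q * scale + lo).reshape(shape)

def awq_search(w, x, grid=20):
    """Pick per-input-channel scales s = act_mag**alpha minimizing output MSE."""
    act_mag = np.abs(x).mean(0) + 1e-8      # per-channel activation magnitude
    y_ref = x @ w
    best_err, best_alpha = np.inf, 0.0
    for alpha in np.linspace(0.0, 1.0, grid):
        s = act_mag ** alpha
        # Scale up salient channels, quantize, then undo the scale.
        w_q = quant_dequant(w * s[:, None]) / s[:, None]
        err = np.mean((x @ w_q - y_ref) ** 2)
        if err < best_err:
            best_err, best_alpha = err, alpha
    return best_alpha, best_err

# Toy layer and calibration batch with uneven channel magnitudes.
w = rng.normal(size=(512, 256)).astype(np.float32)
x = (rng.normal(size=(64, 512)) * rng.uniform(0.1, 5.0, size=512)).astype(np.float32)
alpha, err = awq_search(w, x)
```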