INT4 AWQ quantized version of zai-org/GLM-5.1.
This checkpoint preserves the GLM-5.1 MoE + MLA + DSA architecture from the BF16 source, with MLP gate/up projections quantized to INT4 for ~3x compression.
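As a refresher on what INT4 quantization of a weight tensor involves, here is a minimal pure-Python sketch of asymmetric group-wise quantization, where each group of weights is mapped onto the integer range [0, 15] with its own scale and zero point. This is illustrative only, not this checkpoint's actual AWQ kernel; the function names and group layout are assumptions.

```python
def quantize_group(values, n_bits=4):
    """Asymmetric quantization of one weight group onto [0, 2^n_bits - 1]."""
    qmax = (1 << n_bits) - 1                 # 15 for INT4
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0          # guard against constant groups
    zero_point = round(-lo / scale)          # integer offset so lo maps near 0
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    """Reconstruct approximate BF16-domain values from INT4 codes."""
    return [(qi - zero_point) * scale for qi in q]

# One hypothetical group of 8 weights
w = [-0.9, -0.3, 0.0, 0.2, 0.5, 0.7, 1.1, 1.4]
q, s, z = quantize_group(w)
w_hat = dequantize_group(q, s, z)
```

Each group stores only the 4-bit codes plus one scale and zero point, which is where the compression comes from; the reconstruction error per weight is bounded by roughly one scale step.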
Group-wise INT4 asymmetric quantization with 5+5 edge layer protection:
| Precision | Layers |
|---|---|
| INT4 | MLP gate_proj/up_proj (256 routed experts + shared expert, layers 5-72) |
| BF16 | First 5 layers, last 5 layers, all down_proj, MLA projections, lm_head, embed_tokens, router gates, norms |
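The precision table above amounts to a simple per-module policy: only gate/up projections in the non-edge layers are INT4, everything else stays BF16. A hypothetical helper expressing that policy (names are illustrative, not this repo's code):

```python
NUM_LAYERS = 78   # 3 dense + 75 MoE
EDGE = 5          # first/last 5 layers protected in BF16

def weight_precision(layer_idx: int, module: str) -> str:
    """Return the storage precision for one weight matrix under the 5+5 policy."""
    if module not in ("gate_proj", "up_proj"):
        return "bf16"   # down_proj, MLA projections, router, norms, embeddings
    if layer_idx < EDGE or layer_idx >= NUM_LAYERS - EDGE:
        return "bf16"   # edge layer protection
    return "int4"       # quantized span: layers 5-72
```

For example, `weight_precision(4, "gate_proj")` returns `"bf16"` while `weight_precision(5, "gate_proj")` returns `"int4"`, matching the layers 5-72 span in the table.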
Architecture match with the BF16 source:
- model_type=glm_moe_dsa
- 78 layers (3 dense + 75 MoE, first_k_dense_replace=3)
- n_routed_experts=256, num_experts_per_tok=8, n_shared_experts=1
- max_position_embeddings=202752
- hidden_size=6144, moe_intermediate_size=2048
- vocab_size=154880

Serve with vLLM:

```shell
vllm serve mconcat/GLM-5.1-AWQ-4bit \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code
```
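The MoE settings above (256 routed experts, 8 active per token, 1 shared expert) imply standard top-k routing: softmax over the router logits, keep the 8 highest-weight experts, and renormalize. A minimal sketch of that mechanism, not the model's actual router code:

```python
import math
import random

N_ROUTED, TOP_K = 256, 8  # n_routed_experts, num_experts_per_tok

def route(router_logits):
    """Top-k expert selection: softmax, keep the best 8, renormalize weights."""
    m = max(router_logits)                          # stabilize the softmax
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(N_ROUTED), key=probs.__getitem__, reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}        # expert index -> mix weight

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(N_ROUTED)]
weights = route(logits)
```

The shared expert runs for every token in addition to the 8 routed ones, so 9 expert MLPs are active per token while 248 routed experts stay idle.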
Or load directly with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mconcat/GLM-5.1-AWQ-4bit",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("mconcat/GLM-5.1-AWQ-4bit")
```
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.19.0 | Yes | Uniform INT4 on gate/up, BF16 down_proj; fused-MoE compatible |
| SGLang >= 0.5.10 | Partial | compressed-tensors INT4 support varies |
| transformers >= 5.4.0 | Yes | Direct loading with device_map="auto" |
num_nextn_predict_layers=1 is set in the config. If running on Blackwell workstation GPUs (SM 12.0), see the GLM-5.1-FP8-Dynamic README for vLLM patch instructions.
Base model: zai-org/GLM-5.1