# GLM-5.1-Alis-MLX-Dynamic-2.7bpw

Unsloth Dynamic 2.0-style per-tensor mixed-precision quantization of zai-org/GLM-5.1 for Apple Silicon via MLX.
| Metric | Value |
|---|---|
| Base model | zai-org/GLM-5.1 (754B MoE, 40B active) |
| Bits per weight | 2.681 |
| Peak memory | 249 GB |
| Generation speed | 18.1 tok/s (M3 Ultra 512GB) |
| Quantization time | 3.9 minutes |
| Format | MLX safetensors |
| License | MIT (same as base model) |
## Why This Model
GLM-5.1 is the #1 open-source model on SWE-Bench Pro (58.4) and excels at long-horizon agentic coding tasks. However, at 1.5TB in BF16, it requires quantization to run locally.
This quantization applies Unsloth Dynamic 2.0-style per-tensor bit allocation natively in MLX:
- Critical layers (embeddings, attention, routers) get higher precision (5-8 bit)
- MoE expert weights (93%+ of parameters) get aggressive compression (2-3 bit)
- Result: near-Unsloth quality at roughly 1.5× the generation speed of GGUF on Apple Silicon
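The headline bits-per-weight figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes MLX's affine quantization stores one fp16 scale and one fp16 bias per group of 64 weights (+0.5 bits/weight of overhead) and uses illustrative, not exact, parameter fractions:

```python
# Back-of-the-envelope check of the effective bits-per-weight.
# Assumption: group_size=64 with fp16 scale + fp16 bias per group
# adds 32/64 = 0.5 bits of overhead per weight.
GROUP_OVERHEAD = 32 / 64

def effective_bpw(bits: float) -> float:
    return bits + GROUP_OVERHEAD

# Illustrative parameter fractions (not exact model statistics):
# ~93% routed experts at 2-bit, ~4% shared experts at 3-bit,
# ~3% attention/embeddings/router averaging ~6-bit.
mix = [(0.93, 2), (0.04, 3), (0.03, 6)]
avg = sum(frac * effective_bpw(bits) for frac, bits in mix)
print(f"{avg:.2f} bpw")  # 2.66 bpw, close to the reported 2.681
```

This shows why aggressive 2-bit expert compression dominates the average: the high-precision tensors are too small a fraction of the parameters to move the total much.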
## Comparison
| | Unsloth GGUF IQ2_M | This model (MLX) | MLX uniform 4bit* |
|---|---|---|---|
| BPW | ~2.7 | 2.681 | ~4.5 |
| Speed (M3 Ultra) | ~12 tok/s | 18.1 tok/s | 15.4 tok/s* |
| Memory | ~255 GB | 249 GB | ~420 GB |
| Format | GGUF (llama.cpp) | MLX native | MLX native |
*MLX uniform 4bit speed is from Awni Hannun's benchmark on GLM-5 (not 5.1). Actual GLM-5.1 uniform 4bit speed may differ slightly.
## Quantization Recipe

Per-tensor mixed precision using `mlx_lm.convert()` with a custom `quant_predicate`:
| Tensor Category | Normal Layers | Sensitive Layers (0-4, 73-77) |
|---|---|---|
| embed_tokens | 8-bit | 8-bit |
| lm_head | 6-bit | 6-bit |
| attention q/k/v | 5-bit | 6-bit |
| attention o_proj | 6-bit | 8-bit |
| DSA indexer | 6-bit | 6-bit |
| MoE router (gate) | 8-bit | 8-bit |
| MoE routed experts | 2-bit | 3-bit |
| Shared experts | 3-bit | 4-bit |
| Dense MLP (L0-2) | 4-bit | 5-bit |
Key design decisions:
- No `isinstance()` filter — MoE experts are `DeepseekV32MoE` modules, not `nn.Linear`
- First/last 5 layers boosted — 10× more sensitive (consistent with oQ, JANG, Moonglade findings)
- MoE router always 8-bit — routing accuracy = overall model quality
- o_proj higher precision — no preceding norm layer, AWQ correction impossible
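The "first/last 5 layers boosted" rule comes down to parsing the layer index out of each module path. A minimal sketch, where the example paths are hypothetical MLX-style module paths rather than exact names from this checkpoint:

```python
# Sensitive-layer test used in the recipe: layers 0-4 and 73-77
# of the 78-layer stack get boosted precision.
NUM_LAYERS = 78
SENSITIVE = set(range(5)) | set(range(NUM_LAYERS - 5, NUM_LAYERS))

def layer_index(path: str):
    """Extract N from a path like 'model.layers.N.mlp...', else None."""
    parts = path.split(".")
    for i, p in enumerate(parts):
        if p == "layers" and i + 1 < len(parts) and parts[i + 1].isdigit():
            return int(parts[i + 1])
    return None  # e.g. embed_tokens / lm_head carry no layer index

def is_sensitive(path: str) -> bool:
    idx = layer_index(path)
    return idx is not None and idx in SENSITIVE

print(is_sensitive("model.layers.2.self_attn.o_proj"))     # True
print(is_sensitive("model.layers.40.mlp.experts"))         # False
print(is_sensitive("model.layers.75.mlp.shared_experts"))  # True
```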
## Usage

### mlx-lm

```python
from mlx_lm import load, generate

model, tokenizer = load("avlp12/GLM-5.1-Alis-MLX-Dynamic-2.7bpw")
messages = [{"role": "user", "content": "Write a Python snake game using pygame"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=2000, verbose=True)
```
### mlx-lm server (OpenAI-compatible API)

```bash
python3 -m mlx_lm.server --model avlp12/GLM-5.1-Alis-MLX-Dynamic-2.7bpw --port 1237
```
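The server exposes the standard OpenAI chat-completions route, so any OpenAI-compatible client works. A minimal sketch of the request shape, assuming the server is running on port 1237 as above (the actual send is commented out so the snippet runs offline):

```python
import json
import urllib.request

payload = {
    "model": "avlp12/GLM-5.1-Alis-MLX-Dynamic-2.7bpw",
    "messages": [{"role": "user", "content": "Write a Python snake game using pygame"}],
    "max_tokens": 2000,
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:1237/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment with the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```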
### oMLX

Place in your oMLX models directory and refresh. Recommended settings:

- Reasoning Parser: `deepseek_r1`
- Temperature: 0.7, Top P: 0.95
- TurboQuant KV Cache: ON (4-bit)
- Index Cache: ON (DSA optimization)
- CTX Window: 131072
- Max Tokens: 8192
- SpecPrefill: OFF (no GLM draft model available)
## Hardware Requirements
| Config | Works? | Notes |
|---|---|---|
| M3 Ultra 512GB | ✅ | 249GB used, 260GB headroom |
| M3 Ultra 192GB | ❌ | Not enough memory |
| M4 Max 128GB | ❌ | Not enough memory |
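The cutoffs in the table follow from the weight footprint alone. A rough estimate from the quoted parameter count and bits-per-weight (runtime overhead such as the KV cache comes on top):

```python
PARAMS = 754e9  # quoted total parameter count
BPW = 2.681     # effective bits per weight

weights_gb = PARAMS * BPW / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~253 GB, near the measured 249 GB peak
# 192 GB and 128 GB machines cannot hold the weights, matching the table.
```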
## Quantization Reproduction

```python
from collections import Counter

from mlx_lm import convert

NUM_LAYERS = 78
SENSITIVE = set(range(5)) | set(range(NUM_LAYERS - 5, NUM_LAYERS))
stats = Counter()

def predicate(path, module):
    # Leave norm layers unquantized.
    if "norm" in path.lower():
        return False
    # Extract the layer index from paths like "model.layers.12.mlp...".
    layer_num = None
    parts = path.split(".")
    for j, p in enumerate(parts):
        if p == "layers" and j + 1 < len(parts):
            try:
                layer_num = int(parts[j + 1])
            except ValueError:
                pass
    sensitive = layer_num is not None and layer_num in SENSITIVE
    if "embed_tokens" in path:
        bits = 8
    elif "lm_head" in path:
        bits = 6
    elif "q_a_proj" in path or "q_b_proj" in path:
        bits = 6 if sensitive else 5
    elif "kv_a_proj_with_mqa" in path:
        bits = 6 if sensitive else 5
    elif "o_proj" in path:
        bits = 8 if sensitive else 6
    elif "indexer" in path:
        bits = 6
    elif "mlp.gate" in path and "gate_proj" not in path:
        bits = 8  # MoE router: always 8-bit, per the recipe table
    elif "mlp" in path and "shared_experts" not in path and layer_num is not None and layer_num >= 3:
        bits = 3 if sensitive else 2  # routed experts
    elif "shared_experts" in path:
        bits = 4 if sensitive else 3
    elif "mlp" in path and layer_num is not None and layer_num < 3:
        bits = 5 if sensitive else 4  # dense MLP in the first layers
    else:
        bits = 4
    stats[bits] += 1
    return {"bits": bits, "group_size": 64}

convert(
    hf_path="zai-org/GLM-5.1",
    mlx_path="./GLM-5.1-Alis-MLX-Dynamic-2.7bpw",
    quantize=True, q_group_size=64, q_bits=4,
    quant_predicate=predicate,
)
print(dict(sorted(stats.items())))  # per-bit-width tensor counts
```
## Credits
- Z.ai / Zhipu AI — GLM-5.1 base model (MIT license)
- Unsloth — Dynamic 2.0 per-tensor quantization methodology
- Apple MLX — framework and `quant_predicate` API
- Moonglade/Brooooooklyn — Unsloth→MLX porting precedent
- oMLX/oQ — MoE-aware quantization insights
- JANG — MLX mixed-precision quantization reference
## Citation

```bibtex
@misc{glm5.1-alis-mlx,
  title={GLM-5.1-Alis-MLX-Dynamic-2.7bpw: Unsloth-style per-tensor quantization for Apple Silicon},
  author={Alis (avlp12)},
  year={2026},
  url={https://huggingface.co/avlp12/GLM-5.1-Alis-MLX-Dynamic-2.7bpw}
}
```