GLM-5.1-Alis-MLX-Dynamic-2.7bpw

Unsloth Dynamic 2.0-style per-tensor mixed-precision quantization of zai-org/GLM-5.1 for Apple Silicon via MLX.

| Metric | Value |
|---|---|
| Base model | zai-org/GLM-5.1 (754B MoE, 40B active) |
| Bits per weight | 2.681 |
| Peak memory | 249 GB |
| Generation speed | 18.1 tok/s (M3 Ultra 512GB) |
| Quantization time | 3.9 minutes |
| Format | MLX safetensors |
| License | MIT (same as base model) |

Why This Model

GLM-5.1 is the #1 open-source model on SWE-Bench Pro (58.4) and excels at long-horizon agentic coding tasks. However, at 1.5TB in BF16, it requires quantization to run locally.

This quantization applies Unsloth Dynamic 2.0-style per-tensor bit allocation natively in MLX:

  • Critical layers (embeddings, attention, routers) get higher precision (5-8 bit)
  • MoE expert weights (93%+ of parameters) get aggressive compression (2-3 bit)
  • Result: near-Unsloth quality at roughly 1.5× the speed of GGUF on Apple Silicon
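The arithmetic behind the headline number is worth sketching. MLX affine quantization stores a 16-bit scale and a 16-bit bias per group of 64 weights, so nominal 2-bit weights cost about 2.5 bits each in practice. A back-of-the-envelope blend (the 93%/7% split and the per-category averages below are illustrative assumptions, not measured tensor counts from this checkpoint):

```python
# Back-of-the-envelope estimate of effective bits per weight.
# Assumptions (not measured from the checkpoint): 93% of parameters are
# routed experts averaging ~2 nominal bits; the remaining 7% average
# ~4.6 nominal bits across embeddings, attention, routers, shared experts.

GROUP_SIZE = 64
OVERHEAD = 2 * 16 / GROUP_SIZE  # 16-bit scale + 16-bit bias per group = +0.5 bpw

categories = [
    (0.93, 2.0),  # routed experts (2-bit, sensitive layers at 3-bit)
    (0.07, 4.6),  # everything else (assumed blended average)
]

bpw = sum(frac * (bits + OVERHEAD) for frac, bits in categories)
print(f"estimated effective bpw: {bpw:.3f}")  # ~2.68, near the reported 2.681
```

The point of the sketch: even at a 2-bit nominal setting, group metadata pushes the experts to ~2.5 effective bits, which is why the aggregate lands near 2.7 rather than lower.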

Comparison

| | Unsloth GGUF IQ2_M | This model (MLX) | MLX uniform 4-bit* |
|---|---|---|---|
| BPW | ~2.7 | 2.681 | ~4.5 |
| Speed (M3 Ultra) | ~12 tok/s | 18.1 tok/s | 15.4 tok/s* |
| Memory | ~255 GB | 249 GB | ~420 GB |
| Format | GGUF (llama.cpp) | MLX native | MLX native |

*MLX uniform 4bit speed is from Awni Hannun's benchmark on GLM-5 (not 5.1). Actual GLM-5.1 uniform 4bit speed may differ slightly.

Quantization Recipe

Per-tensor mixed precision via mlx_lm.convert() with a custom quant_predicate:

| Tensor Category | Normal Layers | Sensitive Layers (0-4, 73-77) |
|---|---|---|
| embed_tokens | 8-bit | 8-bit |
| lm_head | 6-bit | 6-bit |
| attention q/k/v | 5-bit | 6-bit |
| attention o_proj | 6-bit | 8-bit |
| DSA indexer | 6-bit | 6-bit |
| MoE router (gate) | 8-bit | 8-bit |
| MoE routed experts | 2-bit | 3-bit |
| Shared experts | 3-bit | 4-bit |
| Dense MLP (L0-2) | 4-bit | 5-bit |

Key design decisions:

  • No isinstance() filter — MoE experts are DeepseekV32MoE modules, not nn.Linear
  • First/last 5 layers boosted — 10× more sensitive (consistent with oQ, JANG, Moonglade findings)
  • MoE router always 8-bit — routing accuracy = overall model quality
  • o_proj higher precision — no preceding norm layer, AWQ correction impossible

Usage

mlx-lm

```python
from mlx_lm import load, generate

model, tokenizer = load("avlp12/GLM-5.1-Alis-MLX-Dynamic-2.7bpw")

messages = [{"role": "user", "content": "Write a Python snake game using pygame"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

response = generate(model, tokenizer, prompt=prompt, max_tokens=2000, verbose=True)
```

mlx-lm server (OpenAI-compatible API)

```bash
python3 -m mlx_lm.server --model avlp12/GLM-5.1-Alis-MLX-Dynamic-2.7bpw --port 1237
```
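Once the server is up it speaks the standard OpenAI chat-completions protocol at /v1/chat/completions. A minimal client sketch, assuming the server is running locally on port 1237 as started above (the payload shape is the standard OpenAI one; mlx_lm.server accepts the usual fields):

```python
import json
from urllib import request

# Standard OpenAI-style chat payload. Model name and port match the
# server command above.
payload = {
    "model": "avlp12/GLM-5.1-Alis-MLX-Dynamic-2.7bpw",
    "messages": [{"role": "user", "content": "Explain MoE routing in two sentences."}],
    "max_tokens": 256,
    "temperature": 0.7,
}

def chat(payload, host="http://localhost:1237"):
    """POST the payload to the local server and return the reply text."""
    req = request.Request(
        host + "/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (with the server running): print(chat(payload))
```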

oMLX

Place in your oMLX models directory and refresh. Recommended settings:

  • Reasoning Parser: deepseek_r1
  • Temperature: 0.7, Top P: 0.95
  • TurboQuant KV Cache: ON (4-bit)
  • Index Cache: ON (DSA optimization)
  • CTX Window: 131072
  • Max Tokens: 8192
  • SpecPrefill: OFF (no GLM draft model available)

Hardware Requirements

| Config | Works? | Notes |
|---|---|---|
| M3 Ultra 512GB | ✅ | 249GB used, 260GB headroom |
| M3 Ultra 192GB | ❌ | Not enough memory |
| M4 Max 128GB | ❌ | Not enough memory |
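The verdicts above follow from macOS's GPU wired-memory limit, not raw RAM: by default Metal can wire only roughly 75% of unified memory (the exact fraction varies by machine and OS version, and can be raised with the `iogpu.wired_limit_mb` sysctl). A rough feasibility check under that assumption:

```python
# Rough feasibility check behind the table above. Assumes macOS lets
# Metal wire roughly 75% of unified memory by default (an assumption;
# the limit is adjustable via `sudo sysctl iogpu.wired_limit_mb=...`).

PEAK_GB = 249
WIRED_FRACTION = 0.75  # assumed default wired-memory fraction

configs = {"M3 Ultra 512GB": 512, "M3 Ultra 192GB": 192, "M4 Max 128GB": 128}
fits = {name: ram * WIRED_FRACTION >= PEAK_GB for name, ram in configs.items()}

for name, ok in fits.items():
    print(f"{name}: {'fits' if ok else 'not enough memory'}")
```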

Quantization Reproduction

```python
from collections import Counter
from mlx_lm import convert

NUM_LAYERS = 78
SENSITIVE = set(range(5)) | set(range(NUM_LAYERS - 5, NUM_LAYERS))
stats = Counter()

def predicate(path, module):
    # Never quantize norm layers.
    if "norm" in path.lower():
        return False

    # Extract the layer index from paths like "model.layers.12.mlp...".
    layer_num = None
    parts = path.split(".")
    for j, p in enumerate(parts):
        if p == "layers" and j + 1 < len(parts):
            try:
                layer_num = int(parts[j + 1])
            except ValueError:
                pass

    sensitive = layer_num is not None and layer_num in SENSITIVE

    if "embed_tokens" in path:
        bits = 8
    elif "lm_head" in path:
        bits = 6
    elif "q_a_proj" in path or "q_b_proj" in path:
        bits = 6 if sensitive else 5
    elif "kv_a_proj_with_mqa" in path:
        bits = 6 if sensitive else 5
    elif "o_proj" in path:
        bits = 8 if sensitive else 6
    elif "indexer" in path:
        bits = 6
    elif path.endswith("mlp.gate"):
        bits = 8  # MoE router: always full routing precision
    elif "mlp" in path and "shared_experts" not in path and layer_num is not None and layer_num >= 3:
        bits = 3 if sensitive else 2  # routed experts
    elif "shared_experts" in path:
        bits = 4 if sensitive else 3
    elif "mlp" in path and layer_num is not None and layer_num < 3:
        bits = 5 if sensitive else 4  # dense MLP in the first layers
    else:
        bits = 4

    stats[bits] += 1
    return {"bits": bits, "group_size": 64}

convert(
    hf_path="zai-org/GLM-5.1",
    mlx_path="./GLM-5.1-Alis-MLX-Dynamic-2.7bpw",
    quantize=True,
    q_group_size=64,
    q_bits=4,
    quant_predicate=predicate,
)
```
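Before kicking off a full conversion, it is worth replaying the path-parsing logic on a few representative parameter names. The helper below repeats the layer-index and sensitivity parsing inline so it runs standalone; the example paths follow DeepSeek-V3-style module naming, which is an assumption about GLM-5.1's exact layout:

```python
# Standalone sanity check of the layer-index / sensitivity parsing used
# by the predicate above.

SENSITIVE = set(range(5)) | set(range(73, 78))

def layer_index(path):
    """Return the transformer layer index embedded in a parameter path."""
    parts = path.split(".")
    for j, p in enumerate(parts):
        if p == "layers" and j + 1 < len(parts) and parts[j + 1].isdigit():
            return int(parts[j + 1])
    return None

def is_sensitive(path):
    idx = layer_index(path)
    return idx is not None and idx in SENSITIVE

# Representative paths (DeepSeek-V3-style naming -- an assumption here).
assert layer_index("model.layers.0.self_attn.o_proj") == 0
assert layer_index("model.layers.40.mlp.experts.7.down_proj") == 40
assert layer_index("model.embed_tokens") is None
assert is_sensitive("model.layers.76.mlp.shared_experts.up_proj")
assert not is_sensitive("model.layers.40.mlp.experts.7.down_proj")
print("path parsing OK")
```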

Citation

```bibtex
@misc{glm5.1-alis-mlx,
    title={GLM-5.1-Alis-MLX-Dynamic-2.7bpw: Unsloth-style per-tensor quantization for Apple Silicon},
    author={Alis (avlp12)},
    year={2026},
    url={https://huggingface.co/avlp12/GLM-5.1-Alis-MLX-Dynamic-2.7bpw}
}
```