GLM 5.1, optimized to run comfortably on a 512 GB Mac Studio M3 Ultra. This is the quality-first version; a smaller, more compact version is available here.

  • A mixed-precision quant that balances speed, memory, and accuracy.
  • 3-bit baseline, with the most important layers kept at 4-bit, 8-bit, or BF16.
  • Fits in ~350 GB of memory, leaving plenty of room to run other models in parallel (e.g., Minimax M2.7, Qwen 3.6 35B).

Usage

# Start server at http://localhost:8080/chat/completions
uvx --from mlx-lm mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/GLM-5.1-MLX-3.6bit
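
The model can also be used directly from Python. A minimal sketch using the standard mlx_lm API (the prompt is just a placeholder):

from mlx_lm import load, generate

model, tokenizer = load("spicyneuron/GLM-5.1-MLX-3.6bit")

# Build a chat-formatted prompt and generate a response
messages = [{"role": "user", "content": "Write a haiku about quantization."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)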

Benchmarks

metric                      baa-ai/GLM-5.1-RAM-270GB-MLX   2.9 bit           3.6 bit (this model)
bpw                         3.110                          2.906             3.645
base memory (GB)            269.303                        251.702           315.648
peak memory (GB, 1024/512)  291.257                        272.358           341.020
prompt tok/s (1024)         194.958 ± 0.075                194.216 ± 0.167   190.508 ± 0.880
gen tok/s (512)             21.381 ± 0.050                 19.527 ± 0.035    17.873 ± 0.156
kl mean                     0.686 ± 0.054                  0.268 ± 0.009     0.117 ± 0.004
kl p95                      1.478 ± 0.054                  0.537 ± 0.009     0.236 ± 0.004
perplexity                  4.780 ± 0.020                  4.118 ± 0.016     3.945 ± 0.016
piqa                        0.776 ± 0.010                  0.794 ± 0.009     0.820 ± 0.017

Tested on a Mac Studio M3 Ultra with:

# KL divergence vs. a full-precision baseline
mlx_lm.kld --baseline-model path/to/mlx-full-precision
# Perplexity over 2048-token sequences
mlx_lm.perplexity --sequence-length 2048 --seed 123
# Throughput and memory: 1024-token prompt, 512 generated tokens, 5 trials
mlx_lm.benchmark --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
# Zero-shot PIQA accuracy on 500 examples
mlx_lm.evaluate --tasks piqa --seed 123 --num-shots 0 --limit 500

Note:

  • mlx_lm.kld is approximate: it compares top-k probabilities rather than full logits (a rough sketch of the idea follows this list). Here's the code.
  • The GLM 5.1 KL divergence was computed against the largest quant I could run locally (~495 GB) rather than true full precision, so the real KL is somewhat higher.
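
For intuition, here is a minimal numpy sketch of one way to estimate KL divergence from top-k probabilities. It illustrates the idea only; it is not the actual mlx_lm.kld implementation linked above, and the function name and k value are arbitrary:

import numpy as np

def topk_kl(baseline_logits, quant_logits, k=100):
    # Approximate KL(baseline || quant) per token position using only
    # the baseline's k most probable tokens.
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(baseline_logits)  # baseline probabilities
    q = softmax(quant_logits)     # quantized-model probabilities
    idx = np.argpartition(-p, k, axis=-1)[..., :k]  # baseline top-k indices
    p_k = np.take_along_axis(p, idx, axis=-1)
    q_k = np.take_along_axis(q, idx, axis=-1)
    # Probability mass outside the top-k is ignored, which is what
    # makes the estimate approximate.
    return float((p_k * (np.log(p_k) - np.log(q_k))).sum(axis=-1).mean())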

Methodology

Quantized with an mlx-lm fork, drawing inspiration from Unsloth/AesSedai/ubergarm-style mixed-precision GGUFs. MLX's quantization options differ from llama.cpp's, but the principles are the same (a sketch follows the list below):

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision
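
As a rough illustration, stock mlx_lm's convert accepts a quant_predicate callable that assigns per-layer quantization settings; the sketch below shows the shape of such a recipe. The layer-name patterns and bit widths are illustrative guesses, not the exact recipe behind this model (which used a fork):

from mlx_lm import convert

def quant_predicate(path, module, config):
    # Higher precision for sensitive layers (patterns are illustrative):
    # MoE router gates, embeddings, and the output head.
    if path.endswith("mlp.gate") or "embed" in path or "lm_head" in path:
        return {"bits": 8, "group_size": 32}
    # Attention projections at 4-bit.
    if "attn" in path:
        return {"bits": 4, "group_size": 64}
    # Everything else, including MoE experts, at the 3-bit baseline.
    return {"bits": 3, "group_size": 64}

convert(
    "zai-org/GLM-5.1",
    mlx_path="glm-5.1-mixed-3.6bit",
    quantize=True,
    quant_predicate=quant_predicate,
)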