zai-org/GLM-5.2 optimized for running on a Mac Studio M3 512.

  • A mixed-precision quant that balances speed, memory, and accuracy.
  • 4-bit baseline with important layers at higher precision.
  • Fits into ~420 GB memory, leaving enough room for a smaller utility model.

Usage

# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/GLM-5.2-MLX-4.5bit

Benchmarks

metric this model
bpw 4.535
base memory 392.454
peak memory (1024/512) 422.787
prompt tok/s (1024) 194.114 卤 0.079
gen tok/s (512) 17.781 卤 0.028
kl mean* 0.049 卤 0.002
kl p95 0.113 卤 0.002
perplexity 4.642 卤 0.036
arc_challenge 0.690 卤 0.021
hellaswag 0.780 卤 0.019

* KL calculated against the largest quant I could run locally (~5.3 bit). Real KL is against FP will be higher.

Methodology

Quantized with a mlx-lm fork. MLX quantization options differ than llama.cpp, but the principles are the same:

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision
Downloads last month
-
Safetensors
Model size
743B params
Tensor type
BF16
U32
F32
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support

Model tree for spicyneuron/GLM-5.2-MLX-4.5bit

Base model

zai-org/GLM-5.2
Quantized
(26)
this model