GLM-5.2-MLX-4.5bit / README.md
spicyneuron's picture
Update README.md
db00e13 verified
|
Raw
History Blame Contribute Delete
1.49 kB
---
language:
- en
- zh
library_name: mlx
license: mit
pipeline_tag: text-generation
tags:
- mlx
base_model: zai-org/GLM-5.2
---
[zai-org/GLM-5.2](https://huggingface.co/zai-org/GLM-5.2) optimized for running on a Mac Studio M3 512.
- A mixed-precision quant that balances speed, memory, and accuracy.
- 4-bit baseline with important layers at higher precision.
- Fits into ~420 GB memory, leaving enough room for a smaller utility model.
# Usage
NOTE: Run with https://github.com/ml-explore/mlx-lm/pull/1410 until the PR is merged.
```sh
# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
--host 127.0.0.1 \
--port 8080 \
--model spicyneuron/GLM-5.2-MLX-4.5bit
```
# Benchmarks
metric | this model
--- | ---
bpw | 4.535
base memory | 392.454
peak memory (1024/512) | 422.787
prompt tok/s (1024) | 194.114 ± 0.079
gen tok/s (512) | 17.781 ± 0.028
kl mean\* | 0.049 ± 0.002
kl p95 | 0.113 ± 0.002
perplexity | 4.642 ± 0.036
arc_challenge | 0.690 ± 0.021
hellaswag | 0.780 ± 0.019
\* KL calculated against the largest quant I could run locally (~5.3 bit). Real KL is against FP will be higher.
# Methodology
Quantized with a [mlx-lm fork](https://github.com/spicyneuron/mlx-lm/tree/override).
MLX quantization options differ than llama.cpp, but the principles are the same:
- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision