---
language:
- en
- zh
library_name: mlx
license: mit
pipeline_tag: text-generation
tags:
- mlx
base_model: zai-org/GLM-5.2
---

[zai-org/GLM-5.2](https://huggingface.co/zai-org/GLM-5.2) optimized for running on a Mac Studio M3 with 512 GB of memory.

- A mixed-precision quant that balances speed, memory, and accuracy.
- 4-bit baseline with important layers at higher precision.
- Fits into ~420 GB of memory, leaving enough room for a smaller utility model.

# Usage

NOTE: Run with https://github.com/ml-explore/mlx-lm/pull/1410 until the PR is merged.

```sh
# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/GLM-5.2-MLX-4.5bit
```

# Benchmarks

metric | this model
--- | ---
bpw | 4.535
base memory (GB) | 392.454
peak memory (GB, 1024/512) | 422.787
prompt tok/s (1024) | 194.114 ± 0.079
gen tok/s (512) | 17.781 ± 0.028
kl mean\* | 0.049 ± 0.002
kl p95 | 0.113 ± 0.002
perplexity | 4.642 ± 0.036
arc_challenge | 0.690 ± 0.021
hellaswag | 0.780 ± 0.019

\* KL calculated against the largest quant I could run locally (~5.3 bit). Real KL against the full-precision model will be higher.

# Methodology

Quantized with an [mlx-lm fork](https://github.com/spicyneuron/mlx-lm/tree/override). MLX quantization options differ from those in llama.cpp, but the principles are the same:

- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision
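
The precision split described above can be sketched as a small set of path-to-bit-width rules. This is an illustration only, not the fork's actual configuration: the layer-name patterns, the 8-bit choice for sensitive layers, the group size of 64, the 16-bit scale and bias per group, and the toy parameter counts are all assumptions made for the example.

```python
import re
from typing import Mapping

# Hypothetical rules: (regex on the layer path, bits). First match wins.
# Patterns and bit widths are illustrative assumptions, not the fork's settings.
HIGH_PRECISION_RULES = [
    (r"\.mlp\.gate$", 8),  # MoE router
    (r"self_attn", 8),     # attention projections
    (r"lm_head", 8),       # output embeddings
]
DEFAULT_BITS = 4           # everything else, including MoE experts
GROUP_SIZE = 64            # assumed quantization group size


def bits_for(path: str) -> int:
    """Return the bit width chosen for a layer, based on its path."""
    for pattern, bits in HIGH_PRECISION_RULES:
        if re.search(pattern, path):
            return bits
    return DEFAULT_BITS


def effective_bpw(bits: int, group_size: int = GROUP_SIZE) -> float:
    """Bits per weight including per-group overhead.

    Assumes one 16-bit scale and one 16-bit bias per group (32 bits total).
    """
    return bits + 32 / group_size


def average_bpw(param_counts: Mapping[str, int]) -> float:
    """Parameter-weighted average bits per weight over all layers."""
    total_params = sum(param_counts.values())
    total_bits = sum(
        count * effective_bpw(bits_for(path))
        for path, count in param_counts.items()
    )
    return total_bits / total_params


# Toy example with made-up parameter counts: experts dominate the total,
# so the average stays close to the 4-bit baseline.
toy_layers = {
    "model.layers.0.self_attn.q_proj": 10,
    "model.layers.0.mlp.gate": 1,
    "model.layers.0.mlp.switch_mlp.up_proj": 989,
}
print(f"{average_bpw(toy_layers):.3f} bits per weight")
```

Because expert weights make up most of a large MoE model, raising precision on the small sensitive layers adds little to the average, which is consistent with an overall figure only slightly above the 4.5 bits per weight that a plain 4-bit quant with group size 64 would give under these assumptions.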