---
language: en
pipeline_tag: text-generation
library_name: mlx
base_model:
- moonshotai/Kimi-K2.7-Code
base_model_relation: quantized
tags:
- mlx
---

[moonshotai/Kimi-K2.7-Code](https://huggingface.co/moonshotai/Kimi-K2.7-Code) optimized for running on a Mac Studio M3 Ultra.

- A mixed-precision quant that balances speed, memory, and accuracy.
- 3-bit MoE baseline with important always-on layers at higher precision.
- Fits into ~460 GB memory, leaving enough room for a smaller utility model.

# Usage

```sh
# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/Kimi-K2.7-Code-MLX-3.6bit
```

# Benchmarks

metric | this model
--- | ---
bpw | 3.578
base memory (GB) | 427.579
peak memory (GB, 1024/512) | 460.444
prompt tok/s (1024) | 218.851 ± 0.208
gen tok/s (512) | 21.035 ± 0.049
perplexity | 4.462 ± 0.037
arc_challenge | 0.692 ± 0.021
hellaswag | 0.780 ± 0.019

# Methodology

Quantized with an [mlx-lm fork](https://github.com/spicyneuron/mlx-lm/tree/_tools). MLX quantization options differ from llama.cpp's, but the principles are the same:

- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision
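
The sensitive/tolerant split can be sketched as a small rule that maps a weight path to a bit width. This is a minimal illustration, not the fork's actual recipe: the path fragments (`self_attn`, `mlp.gate`, `lm_head`), the 8-bit high setting, and the toy parameter counts are all assumptions made for the example.

```python
# Hypothetical layer-path fragments and bit widths, chosen for illustration.
# They are NOT taken from the fork's actual quantization recipe.
HIGH_PRECISION_SUFFIXES = ("mlp.gate", "lm_head")  # MoE router, output embeddings
HIGH_PRECISION_SEGMENTS = ("self_attn",)           # attention projections


def bits_for_layer(path: str, low_bits: int = 3, high_bits: int = 8) -> int:
    """Pick a bit width for one weight path: sensitive layers high, the rest low."""
    if path.endswith(HIGH_PRECISION_SUFFIXES):
        return high_bits
    if any(segment in HIGH_PRECISION_SEGMENTS for segment in path.split(".")):
        return high_bits
    return low_bits


def average_bits(param_counts: dict[str, int]) -> float:
    """Parameter-weighted mean bit width (ignores per-group scale/bias overhead)."""
    total = sum(param_counts.values())
    weighted = sum(bits_for_layer(path) * count for path, count in param_counts.items())
    return weighted / total


# Toy parameter counts (made up): experts dominate, as in a large MoE model.
toy = {
    "model.layers.1.self_attn.q_proj": 10,
    "model.layers.1.mlp.gate": 1,
    "model.layers.1.mlp.experts.gate_proj": 900,
    "lm_head": 20,
}
print(f"average bits per weight: {average_bits(toy):.3f}")
```

Because expert weights dominate the parameter count, the weighted average stays close to the 3-bit baseline even when the sensitive layers are kept at much higher precision. The sketch ignores the per-group scales and biases stored alongside quantized weights, so a real bpw figure will be somewhat higher than this average.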