spicyneuron's picture
Update README.md
f2f75bb verified
|
Raw
History Blame Contribute Delete
1.3 kB
---
language: en
pipeline_tag: text-generation
library_name: mlx
base_model:
- moonshotai/Kimi-K2.7-Code
base_model_relation: quantized
tags:
- mlx
---
[moonshotai/Kimi-K2.7-Code](https://huggingface.co/moonshotai/Kimi-K2.7-Code)
optimized for running on a Mac Studio M3 Ultra.
- A mixed-precision quant that balances speed, memory, and accuracy.
- 3-bit MoE baseline with important always-on layers at higher precision.
- Fits into ~460 GB memory, leaving enough room for a smaller utility model.
# Usage
```sh
# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
--host 127.0.0.1 \
--port 8080 \
--model spicyneuron/Kimi-K2.7-Code-MLX-3.6bit
```
# Benchmarks
metric | this model
--- | ---
bpw | 3.578
base memory | 427.579
peak memory (1024/512) | 460.444
prompt tok/s (1024) | 218.851 ± 0.208
gen tok/s (512) | 21.035 ± 0.049
perplexity | 4.462 ± 0.037
arc_challenge | 0.692 ± 0.021
hellaswag | 0.780 ± 0.019
# Methodology
Quantized with a [mlx-lm fork](https://github.com/spicyneuron/mlx-lm/tree/_tools).
MLX quantization options differ than llama.cpp, but the principles are the same:
- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision