Instructions to use spicyneuron/Kimi-K2.5-MLX-2.5bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use spicyneuron/Kimi-K2.5-MLX-2.5bit with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("spicyneuron/Kimi-K2.5-MLX-2.5bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use spicyneuron/Kimi-K2.5-MLX-2.5bit with Pi:
Start the MLX server
```sh
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Kimi-K2.5-MLX-2.5bit"
```
Configure the model in Pi
```
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "spicyneuron/Kimi-K2.5-MLX-2.5bit" }
      ]
    }
  }
}
```
Run Pi
```sh
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use spicyneuron/Kimi-K2.5-MLX-2.5bit with Hermes Agent:
Start the MLX server
```sh
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Kimi-K2.5-MLX-2.5bit"
```
Configure Hermes
```sh
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default spicyneuron/Kimi-K2.5-MLX-2.5bit
```
Run Hermes
```sh
hermes
```
- MLX LM
How to use spicyneuron/Kimi-K2.5-MLX-2.5bit with MLX LM:
Generate or start a chat session
```sh
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "spicyneuron/Kimi-K2.5-MLX-2.5bit"
```
Run an OpenAI-compatible server
```sh
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "spicyneuron/Kimi-K2.5-MLX-2.5bit"

# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "spicyneuron/Kimi-K2.5-MLX-2.5bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
Quantization Method
Hey there,
Can you talk about your quantization method? I am curious to learn more about the mixed precision setup. Can you provide any code examples?
Thanks!
Of course! My earliest attempts "cloned" exact settings from Unsloth GGUFs (https://github.com/spicyneuron/gguf-clone), and I've also tried empirical analysis to identify sensitive weights (similar to https://github.com/baa-ai/MINT).
But I eventually found that uploading config.json and safetensors.index.json to Claude / Codex, and then discussing model architecture, arrives at 90%+ of the same recommendations.
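If you want to eyeball the same information yourself before that conversation, here's a minimal sketch that summarizes the index by module prefix (assuming the standard Hugging Face model.safetensors.index.json layout; the file name and grouping depth are just examples):

```python
# Minimal sketch: group the tensors listed in a standard Hugging Face
# model.safetensors.index.json by module prefix, to see where the
# parameters live before discussing the architecture. The grouping
# depth (4) is arbitrary.
import json
from collections import Counter

with open("model.safetensors.index.json") as f:
    index = json.load(f)

groups = Counter()
for tensor_name in index["weight_map"]:
    prefix = ".".join(tensor_name.split(".")[:4])
    groups[prefix] += 1

for prefix, count in sorted(groups.items()):
    print(f"{prefix}: {count} tensors")
```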
The typical process (sketched in code just after this list) is:
- Create a "maximized" 4-bit version where all potentially sensitive layers are BF16
- Incrementally drop precision (BF16 → 8 → 6) for each trial
- Look for a clear best size / speed / quality tradeoff
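As a rough sketch of what a single trial can look like, assuming mlx_lm's Python convert() API and its quant_predicate hook (the layer keywords, bit widths, and paths below are illustrative placeholders, not the exact recipe behind this upload):

```python
# Rough sketch of one mixed-precision trial, assuming mlx_lm's convert()
# and its quant_predicate hook. Layer keywords, bit widths, and paths
# are placeholders, not the recipe used for this model.
from mlx_lm import convert

# Modules treated as potentially sensitive in this hypothetical trial
SENSITIVE = ("embed_tokens", "lm_head", "self_attn")

def quant_predicate(path, module, config):
    # Modules that can't be quantized stay as-is
    if not hasattr(module, "to_quantized"):
        return False
    # "Maximized" trial: leave sensitive modules unquantized (BF16);
    # later trials swap this for {"bits": 8} and then {"bits": 6}
    if any(key in path for key in SENSITIVE):
        return False
    # Everything else gets the default low-bit settings
    return {"bits": 4, "group_size": 64}

convert(
    "path/to/source-model",            # placeholder HF repo or local path
    mlx_path="mixed-precision-trial",  # placeholder output directory
    quantize=True,
    q_bits=4,
    q_group_size=64,
    quant_predicate=quant_predicate,
)
```

Each subsequent trial drops the sensitive modules from BF16 to 8-bit and then 6-bit, re-running the evaluation below every time.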
For evaluation, I run 500 tasks each from hellaswag, piqa, and winogrande, as well as perplexity and throughput tests:
```sh
mlx_lm.benchmark --model "$model" --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.perplexity --model "$model" --sequence-length 1024 --seed 123
mlx_lm.evaluate --model "$model" --task hellaswag --seed 123 --limit 500
mlx_lm.evaluate --model "$model" --task piqa --seed 123 --limit 500
mlx_lm.evaluate --model "$model" --task winogrande --seed 123 --limit 500
```
And depending on model size, I'll sometimes pick both a "best tradeoff" and a "smallest acceptable" version.
Still awaiting maintainer review, but here's the mlx-lm PR for granular quantization overrides: https://github.com/ml-explore/mlx-lm/pull/922
Let me know if you end up trying this yourself!
Thanks for such a descriptive reply! Such a clear style of communication, and the iterative process to find the best model is so scientific! I love it. I won’t pretend to understand it all, but I appreciate your thoroughness and willingness to reply.
The concept makes sense: Preserve the most important layers at full precision, and then use mixed precision for the others to find the best balance of size and quality.
I’d definitely love to give it a try sometime. I’m very new to quantization, so this helps and is encouraging. I’m hoping to gain more MLX memory / power when the new Mac Studios are released so that I can run these huge models at decent speeds.