Instructions to use spicyneuron/Kimi-K2.5-MLX-2.5bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use spicyneuron/Kimi-K2.5-MLX-2.5bit with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("spicyneuron/Kimi-K2.5-MLX-2.5bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use spicyneuron/Kimi-K2.5-MLX-2.5bit with Pi:
Start the MLX server
```sh
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Kimi-K2.5-MLX-2.5bit"
```
Configure the model in Pi
```
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "spicyneuron/Kimi-K2.5-MLX-2.5bit" }
      ]
    }
  }
}
```
Run Pi
```sh
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use spicyneuron/Kimi-K2.5-MLX-2.5bit with Hermes Agent:
Start the MLX server
```sh
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Kimi-K2.5-MLX-2.5bit"
```
Configure Hermes
```sh
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default spicyneuron/Kimi-K2.5-MLX-2.5bit
```
Run Hermes
```sh
hermes
```
- MLX LM
How to use spicyneuron/Kimi-K2.5-MLX-2.5bit with MLX LM:
Generate or start a chat session
```sh
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "spicyneuron/Kimi-K2.5-MLX-2.5bit"
```
Run an OpenAI-compatible server
```sh
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "spicyneuron/Kimi-K2.5-MLX-2.5bit"

# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "spicyneuron/Kimi-K2.5-MLX-2.5bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
Quantization Method
Hey there,
Can you talk about your quantization method? I am curious to learn more about the mixed precision setup. Can you provide any code examples?
Thanks!
Of course! My earliest attempts "cloned" exact settings from Unsloth GGUFs (https://github.com/spicyneuron/gguf-clone), and I've also tried empirical analysis to identify sensitive weights (similar to https://github.com/baa-ai/MINT).
But I eventually found that uploading config.json and safetensors.index.json to Claude / Codex, and then discussing model architecture, arrives at 90%+ of the same recommendations.
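If you want to eyeball the same information yourself before that conversation, here's a minimal sketch that summarizes the index by module prefix (assuming the standard Hugging Face model.safetensors.index.json layout; the file name and grouping depth are just examples):

```python
# Minimal sketch: group the tensors listed in a standard Hugging Face
# model.safetensors.index.json by module prefix, to see where the
# parameters live before discussing the architecture. The grouping
# depth (4) is arbitrary.
import json
from collections import Counter

with open("model.safetensors.index.json") as f:
    index = json.load(f)

groups = Counter()
for tensor_name in index["weight_map"]:
    prefix = ".".join(tensor_name.split(".")[:4])
    groups[prefix] += 1

for prefix, count in sorted(groups.items()):
    print(f"{prefix}: {count} tensors")
```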
The typical process (sketched in code just after this list) is:
- Create a "maximized" 4-bit version where all potentially sensitive layers are BF16
- Incrementally drop precision (BF16 → 8 → 6) for each trial
- Look for a clear best size / speed / quality tradeoff
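As a rough sketch of what a single trial can look like, assuming mlx_lm's Python convert() API and its quant_predicate hook (the layer keywords, bit widths, and paths below are illustrative placeholders, not the exact recipe behind this upload):

```python
# Rough sketch of one mixed-precision trial, assuming mlx_lm's convert()
# and its quant_predicate hook. Layer keywords, bit widths, and paths
# are placeholders, not the recipe used for this model.
from mlx_lm import convert

# Modules treated as potentially sensitive in this hypothetical trial
SENSITIVE = ("embed_tokens", "lm_head", "self_attn")

def quant_predicate(path, module, config):
    # Modules that can't be quantized stay as-is
    if not hasattr(module, "to_quantized"):
        return False
    # "Maximized" trial: leave sensitive modules unquantized (BF16);
    # later trials swap this for {"bits": 8} and then {"bits": 6}
    if any(key in path for key in SENSITIVE):
        return False
    # Everything else gets the default low-bit settings
    return {"bits": 4, "group_size": 64}

convert(
    "path/to/source-model",            # placeholder HF repo or local path
    mlx_path="mixed-precision-trial",  # placeholder output directory
    quantize=True,
    q_bits=4,
    q_group_size=64,
    quant_predicate=quant_predicate,
)
```

Each subsequent trial drops the sensitive modules from BF16 to 8-bit and then 6-bit, re-running the evaluation below every time.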
For evaluation, I run 500 tasks each from hellaswag, piqa, and winogrande, as well as perplexity and throughput tests:
```sh
mlx_lm.benchmark --model "$model" --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.perplexity --model "$model" --sequence-length 1024 --seed 123
mlx_lm.evaluate --model "$model" --task hellaswag --seed 123 --limit 500
mlx_lm.evaluate --model "$model" --task piqa --seed 123 --limit 500
mlx_lm.evaluate --model "$model" --task winogrande --seed 123 --limit 500
```
And depending on model size, I'll sometimes pick both a "best tradeoff" and a "smallest acceptable" version.
Still awaiting maintainer review, but here's the mlx-lm PR for granular quantization overrides: https://github.com/ml-explore/mlx-lm/pull/922
Let me know if you end up trying this yourself!
Thanks for such a descriptive reply! Such a clear style of communication, and the iterative process to find the best model is so scientific! I love it. I won’t pretend to understand it all, but I appreciate your thoroughness and willingness to reply.
The concept makes sense: Preserve the most important layers at full precision, and then use mixed precision for the others to find the best balance of size and quality.
I’d definitely love to give it a try sometime. I’m very new to quantization, so this helps and is encouraging. I’m hoping to gain more MLX memory / power when the new Mac Studios are released so that I can run these huge models at decent speeds.