Kimi K2.6 optimized to run on a 512 GB Mac Studio M3 Ultra. This is the larger, quality-first version. Compact version here.

  • A mixed-precision quant that balances speed, memory, and accuracy.
  • 3-bit baseline, with the most sensitive layers kept at 8-bit or BF16.
  • Fits in ~460 GB of memory, leaving plenty of room to run a smaller, faster utility model (e.g. Qwen 3.6 35B, Gemma 4 26B).
  • This quant does not support image input.

Usage

# Start server at http://localhost:8080/v1/chat/completions
# Kimi K2.6 requires tiktoken + remote code for the tokenizer
uvx --from mlx-lm --with tiktoken \
  mlx_lm.server \
    --host 127.0.0.1 \
    --port 8080 \
    --trust-remote-code \
    --model spicyneuron/Kimi-K2.6-MLX-3.6bit
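
Once the server is running, any OpenAI-compatible client can talk to it. A minimal curl sketch (the model is already selected by --model above; max_tokens and temperature follow the standard chat completions schema):

# Example request against the local server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Summarize the benefits of unified memory in two sentences."}],
    "max_tokens": 256,
    "temperature": 0.6
  }'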

Benchmarks

metric                            3.6 bit (this model)    3.3 bit
bpw                               3.578                   3.331
peak memory, GB (1024/512 tok)    460.444                 428.735
prompt tok/s (1024 tok)           221.704 ± 0.057         223.613 ± 0.098
gen tok/s (512 tok)               21.095 ± 0.070          21.363 ± 0.035
KL mean                           0.022 ± 0.001           0.051 ± 0.002
KL p95                            0.053 ± 0.001           0.113 ± 0.002
perplexity                        3.559 ± 0.021           3.550 ± 0.020
hellaswag                         0.594 ± 0.022           0.590 ± 0.022
piqa                              0.848 ± 0.016           0.852 ± 0.016
winogrande                        0.670 ± 0.021           0.690 ± 0.021

Tested on a Mac Studio M3 Ultra with:

mlx_lm.kld --baseline-model path/to/mlx-full-precision
mlx_lm.perplexity --sequence-length 512 --seed 123
mlx_lm.benchmark --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.evaluate --tasks hellaswag --seed 123 --num-shots 0 --limit 500
mlx_lm.evaluate --tasks piqa --seed 123 --num-shots 0 --limit 500
mlx_lm.evaluate --tasks winogrande --seed 123 --num-shots 0 --limit 500

Note:

  • mlx_lm.kld is approximate: it works from the top_k logits rather than the full distribution (a sketch of this kind of approximation follows below). Here's the code.
  • Kimi K2.6 KL divergence was calculated against the largest quant I could run locally (~490 GB) rather than full precision, so the real KL is somewhat higher.
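
For reference, this style of approximation sums the KL term over only the baseline model's top-k tokens (a generic sketch; the linked code may handle the tail differently):

$$
\mathrm{KL}(P \,\|\, Q) \;\approx\; \sum_{i \in \text{top-}k(P)} p_i \log \frac{p_i}{q_i}
$$

where P is the baseline model's next-token distribution, Q is the quantized model's, and probability mass outside the top-k is dropped or folded into a remainder bucket.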

Methodology

Quantized with an mlx-lm fork, drawing inspiration from Unsloth/AesSedai/ubergarm-style mixed-precision GGUFs. MLX quantization options differ from llama.cpp, but the principles are the same (an illustrative command follows the list below):

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision
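
As an illustration only (the actual quant came from a custom recipe in a fork, so this is not the exact command), stock mlx-lm exposes similar mixed-precision behavior through a quant predicate in mlx_lm.convert; recipe names and flags vary between mlx-lm versions, and path/to/Kimi-K2.6-bf16 is a placeholder for the full-precision weights:

# Illustrative sketch, not the exact recipe used for this model
uvx --from mlx-lm --with tiktoken \
  mlx_lm.convert \
    --hf-path path/to/Kimi-K2.6-bf16 \
    --mlx-path ./Kimi-K2.6-MLX-mixed \
    --quantize \
    --quant-predicate mixed_3_6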