Use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="FinnTheAI/Kimi-K2.6-SmartQuant-GGUF",
	filename="Kimi-K2.6-SmartQuant-V2.gguf",
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
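
The llama.cpp examples further down assume the GGUF file is available locally. One way to fetch it is with huggingface-cli (shown as a sketch; adjust --local-dir to wherever you keep models):

huggingface-cli download FinnTheAI/Kimi-K2.6-SmartQuant-GGUF \
  Kimi-K2.6-SmartQuant-V2.gguf \
  --local-dir .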

Kimi-K2.6-SmartQuant-V2 GGUF

A custom imatrix-guided, mixed-precision quantization of moonshotai/Kimi-K2-Instruct by FinnTheAI.

V2 improves on V1 (352 GB uniform Q4_K_M) with a ~10% size reduction (320 GB) and full imatrix calibration: importance-weighted quantization across all 1,096 tensors, guided by 789 imatrix entries computed from 100 chunks of representative calibration text.


Model Files

| File | Size | Description |
|---|---|---|
| Kimi-K2.6-SmartQuant-V2.gguf | 320 GB | Main language model (imatrix mixed-precision) |
| mmproj-F16.gguf | 908 MB | MoonViT vision encoder (F16, sourced from unsloth/Kimi-K2.6-GGUF) |

Quantization Strategy

Mixed-precision design targeting the DeepSeek-V2 / MLA architecture. Each component type is assigned a quantization level based on its sensitivity to precision loss:

| Component | Quant | Rationale |
|---|---|---|
| Routed expert FFN (ffn_*_exps, 384×) | Q2_K | Only 8/384 experts fire per token; highly sparse and tolerant of aggressive quantization |
| Shared expert FFN (ffn_*_shexp) | Q6_K | Active on every token; requires higher precision |
| MLA value projections (attn_v_b, attn_kv_a_mqa) | Q8_0 | Most sensitive attention weights; quality-critical |
| MLA attention (attn_q_a/b, attn_kv_a/b, attn_output) | Q6_K | Latent compression layers; moderate sensitivity |
| Output (lm_head, token_embd) | Q8_0 | Directly affects the output distribution |
| Boundary layers (blk.0–2, blk.58–60) | Q6_K | First/last decoder layers are empirically more sensitive |
| Router/gate (ffn_gate_inp, exp_probs_b) | F32 | Tiny, routing-critical tensors; kept at full precision |
| All other tensors | Q4_K_M | Base quantization |
| Vision encoder (MoonViT 400M) | F16 | Shipped separately as mmproj-F16.gguf |

Effective bits per weight: ~4.5 bpw (text model)

Note on the quantize command: The Q4_K_M argument to llama-quantize sets the fallback type for any tensor not explicitly listed in --tensor-type-file. Since all 1,096 tensors in this model are covered by the tensor type file, the positional argument is effectively unused — every tensor's quant is determined by the file above.
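
For illustration, here is a hypothetical excerpt of what tensor_types.txt could contain. The rule names follow the strategy table above, but the exact pattern syntax is an assumption for illustration, not the shipped file:

# tensor_types.txt (hypothetical excerpt; one pattern=quant rule per line)
ffn_up_exps=q2_k        # routed experts: aggressive Q2_K
ffn_down_exps=q2_k
ffn_gate_exps=q2_k
ffn_up_shexp=q6_k       # shared expert: Q6_K
attn_v_b=q8_0           # MLA value projections: Q8_0
attn_kv_a_mqa=q8_0
attn_output=q6_k        # remaining MLA attention: Q6_K
token_embd=q8_0         # embeddings and output head: Q8_0
output=q8_0
ffn_gate_inp=f32        # router: full precision, never re-quantize
exp_probs_b=f32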

Comparison to V1

| | V1 | V2 |
|---|---|---|
| Strategy | Uniform Q4_K_M | Mixed-precision per component |
| imatrix | No | Yes (789 entries, 100 chunks) |
| Size | 352 GB | 320 GB |
| Routed experts | Q4_K_M | Q2_K |
| Shared expert | Q4_K_M | Q6_K |
| MLA attention | Q4_K_M | Q6_K / Q8_0 |
| Output/embed | Q4_K_M | Q8_0 |

imatrix Calibration

The importance matrix was computed from the BF16 source model using llama-imatrix on an x86 CPU server:

| Property | Value |
|---|---|
| Source model | Kimi-K2.6 BF16 GGUF shards (46 × ~44 GB) |
| Calibration data | Diverse English + code + reasoning text |
| Chunks | 100 |
| Entries in imatrix | 789 |
| imatrix file size | 1.53 GB |
| Compute time | ~5 days (CPU-only, 64 threads) |

The imatrix guides llama-quantize to allocate more precision to tensor elements that have the highest impact on output quality, independent of the per-component quant type decisions above.
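
As a sketch, the calibration run would have looked roughly like this. The shard name is a placeholder; -m, -f, -o and --chunks are standard llama-imatrix flags, but the exact invocation used for this card is not recorded here:

llama-imatrix \
  -m <first-BF16-shard>.gguf \
  -f calibration.txt \
  -o kimi_v2_bf16.imatrix \
  --chunks 100 \
  -t 64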


Build Details

| Property | Value |
|---|---|
| Source model | moonshotai/Kimi-K2-Instruct (BF16, 46 GGUF shards) |
| Quantized on | x86 server: 2× EPYC 7302, 64 threads, 251 GB DDR4 |
| llama-quantize build | commit 9d34231, GCC 13.3.0, AVX2, 64 threads |
| Quantization date | 2026-04-30 |
| Quantization time | ~3 hours (CPU, 64 threads) |
| Command | llama-quantize --imatrix kimi_v2_bf16.imatrix --tensor-type-file tensor_types.txt <src> <dst> Q4_K_M 64 |

Architecture

Base model: Kimi K2.6 (moonshotai/Kimi-K2-Instruct)

| Property | Value |
|---|---|
| Architecture | DeepSeek-V2 (MoE + MLA) |
| Parameters | ~1T total, ~32B active per token |
| Hidden size | 7,168 |
| Decoder layers | 61 |
| Routed experts | 384 (8 active per token) |
| Shared experts | 1 |
| Attention | Multi-head Latent Attention (MLA) |
| Native context window | 256K tokens (n_ctx_train = 262144) |
| Vision | MoonViT 400M (image input; video not currently supported by llama.cpp) |
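
These values can be read straight from the GGUF header, for example with the gguf-dump script from the gguf Python package (assuming its standard CLI; the --no-tensors flag skips the long per-tensor listing, and output formatting may vary by version):

pip install gguf
gguf-dump --no-tensors Kimi-K2.6-SmartQuant-V2.gguf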

Usage

llama.cpp (text only)

llama-cli \
  --model Kimi-K2.6-SmartQuant-V2.gguf \
  --ctx-size 32768 \
  --temp 0.6 --top-p 0.95 \
  -p "You are a helpful assistant."

llama.cpp (with image input)

llama-cli \
  --model Kimi-K2.6-SmartQuant-V2.gguf \
  --mmproj mmproj-F16.gguf \
  --ctx-size 32768 \
  --temp 0.6 --top-p 0.95

llama-server (API use)

llama-server \
  --model Kimi-K2.6-SmartQuant-V2.gguf \
  --ctx-size 131072 \
  --n-predict 4096 \
  -ngl 999 \
  --flash-attn \
  --cont-batching \
  --port 8080

-np (parallel sequences): Splits your total KV cache budget across N concurrent request slots. Total KV memory = ctx_size × n_parallel, allocated upfront regardless of actual load. -np 1 --ctx-size 524288 costs the same KV memory as -np 4 --ctx-size 131072 — the tradeoff is one large context vs. four smaller concurrent ones. For single-user local inference, -np 1 with the largest context you can afford is usually optimal.
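
A back-of-the-envelope estimator for that tradeoff, as a sketch only: the per-token KV size here is an assumption back-solved from the ~35 GB at 131K-context estimate below, not a measured value.

# Rough KV-cache budget calculator for this model.
# ASSUMPTION: ~267 KB of KV per token, derived from the card's
# "~35 GB KV cache at 131,072 tokens" estimate below.
KV_BYTES_PER_TOKEN = 35e9 / 131_072  # assumed, not measured

def kv_cache_gb(ctx_size: int, n_parallel: int = 1) -> float:
    """Total KV memory allocated up front: ctx_size tokens per slot, n_parallel slots."""
    return ctx_size * n_parallel * KV_BYTES_PER_TOKEN / 1e9

print(kv_cache_gb(524_288, 1))   # one large context slot   -> 140.0 GB
print(kv_cache_gb(131_072, 4))   # four smaller slots       -> 140.0 GB (same total)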

Recommended hardware: 384 GB+ of unified memory or VRAM for full GPU offload. CPU-only inference is possible with sufficient RAM (~350 GB+) at roughly 2 tok/s.


Memory Usage

Real-world measurements pending — will be updated once the model is fully benchmarked under load.

Estimated at 131K context (-np 1): ~330GB model weights + ~35GB KV cache ≈ ~365GB total.


Benchmark Results

Full benchmark suite in progress — results will be updated when complete. Evaluated using lm-evaluation-harness.

Open LLM Leaderboard v2

| Task | Metric | V1 | V2 |
|---|---|---|---|
| IFEval | prompt_level_strict_acc | pending | pending |
| BBH | normalized_acc | pending | pending |
| MATH Level 5 | exact_match | pending | pending |
| GPQA | acc_norm | pending | pending |
| MuSR | acc_norm | pending | pending |
| MMLU-Pro | acc | pending | pending |

Classic Benchmarks

| Task | Metric | V1 | V2 |
|---|---|---|---|
| ARC Challenge (25-shot) | acc_norm | pending | pending |
| ARC Easy (25-shot) | acc_norm | pending | pending |
| HellaSwag (10-shot) | acc_norm | pending | pending |
| TruthfulQA MC2 (0-shot) | acc | pending | pending |
| GSM8K (5-shot) | flexible-extract | pending | pending |
| MMLU (5-shot CoT) | flexible-extract | pending | pending |

Notes

  • Video input is not currently supported by llama.cpp (no upstream video pipeline); image input works via mmproj-F16.gguf
  • ffn_gate_inp and exp_probs_b tensors are kept at F32 — do not re-quantize these
  • For extended context beyond 256K, use YaRN: --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144 (full example command below)
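
For example, extending to 1M tokens (4 × 262,144) with llama-server; note that the KV cache at this length would be very large (see the estimator above):

llama-server \
  --model Kimi-K2.6-SmartQuant-V2.gguf \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 262144 \
  --ctx-size 1048576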

Attribution

This is a quantized derivative of Kimi K2.6 by Moonshot AI.

Changes Made

This repository contains a modified version of the original Kimi-K2-Instruct model weights. The following changes were made:

  1. Quantization: Model weights converted from BF16 to mixed-precision GGUF format using llama-quantize with an importance matrix (imatrix). Different tensor types received different quantization levels (Q2_K through Q8_0 and F32) based on architectural sensitivity — see Quantization Strategy table above.
  2. Format change: Converted from PyTorch/safetensors shards to single-file GGUF format compatible with llama.cpp.
  3. Vision encoder: mmproj-F16.gguf sourced separately from unsloth/Kimi-K2.6-GGUF and included as-is (F16, unmodified).

No fine-tuning, RLHF, or changes to model behavior were made. This is a precision-reduction of the original weights only.

License

This derivative work is released under the same Modified MIT License as the original model. The full license text is reproduced below, as its terms require:

Modified MIT License

Copyright (c) 2025 Moonshot AI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Our only modification part is that, if the Software (or any derivative works
thereof) is used for any of your commercial products or services that have
more than 100 million monthly active users, or more than 20 million US dollars
(or equivalent in other currencies) in monthly revenue, you shall prominently
display "Kimi K2" on the user interface of such product or service.

Quantized by FinnTheAI · 2026-04-30
