Use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="FinnTheAI/Kimi-K2.6-SmartQuant-GGUF",
	filename="Kimi-K2.6-SmartQuant-V2.gguf",
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
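
The llama.cpp examples further down assume the GGUF file is available locally. One way to fetch it is with huggingface-cli (shown as a sketch; adjust --local-dir to wherever you keep models):

huggingface-cli download FinnTheAI/Kimi-K2.6-SmartQuant-GGUF \
  Kimi-K2.6-SmartQuant-V2.gguf \
  --local-dir .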

Kimi-K2.6-SmartQuant-V2 GGUF

A custom imatrix-guided, mixed-precision quantization of moonshotai/Kimi-K2-Instruct by FinnTheAI.

V2 improves on V1 (352 GB uniform Q4_K_M) with a ~10% size reduction (320 GB) and full imatrix calibration: importance-weighted quantization across all 1,096 tensors, guided by 789 imatrix entries computed from 100 chunks of representative calibration text.


Model Files

| File | Size | Description |
|---|---|---|
| Kimi-K2.6-SmartQuant-V2.gguf | 320 GB | Main language model (imatrix mixed-precision) |
| mmproj-F16.gguf | 908 MB | MoonViT vision encoder (F16, sourced from unsloth/Kimi-K2.6-GGUF) |

Quantization Strategy

Mixed-precision design targeting the DeepSeek-V2 / MLA architecture. Each component type is assigned a quantization level based on its sensitivity to precision loss:

| Component | Quant | Rationale |
|---|---|---|
| Routed expert FFN (ffn_*_exps, 384×) | Q2_K | Only 8/384 experts fire per token; highly sparse and tolerant of aggressive quantization |
| Shared expert FFN (ffn_*_shexp) | Q6_K | Active on every token; requires higher precision |
| MLA value projections (attn_v_b, attn_kv_a_mqa) | Q8_0 | Most sensitive attention weights; quality-critical |
| MLA attention (attn_q_a/b, attn_kv_a/b, attn_output) | Q6_K | Latent compression layers; moderate sensitivity |
| Output (lm_head, token_embd) | Q8_0 | Directly affects the output distribution |
| Boundary layers (blk.0–2, blk.58–60) | Q6_K | First/last decoder layers are empirically more sensitive |
| Router/gate (ffn_gate_inp, exp_probs_b) | F32 | Tiny, routing-critical tensors; kept at full precision |
| All other tensors | Q4_K_M | Base quantization |
| Vision encoder (MoonViT 400M) | F16 | Shipped separately as mmproj-F16.gguf |

Effective bits per weight: ~4.5 bpw (text model)

Note on the quantize command: The Q4_K_M argument to llama-quantize sets the fallback type for any tensor not explicitly listed in --tensor-type-file. Since all 1,096 tensors in this model are covered by the tensor type file, the positional argument is effectively unused — every tensor's quant is determined by the file above.
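
For illustration, here is a hypothetical excerpt of what tensor_types.txt could contain. The rule names follow the strategy table above, but the exact pattern syntax is an assumption for illustration, not the shipped file:

# tensor_types.txt (hypothetical excerpt; one pattern=quant rule per line)
ffn_up_exps=q2_k        # routed experts: aggressive Q2_K
ffn_down_exps=q2_k
ffn_gate_exps=q2_k
ffn_up_shexp=q6_k       # shared expert: Q6_K
attn_v_b=q8_0           # MLA value projections: Q8_0
attn_kv_a_mqa=q8_0
attn_output=q6_k        # remaining MLA attention: Q6_K
token_embd=q8_0         # embeddings and output head: Q8_0
output=q8_0
ffn_gate_inp=f32        # router: full precision, never re-quantize
exp_probs_b=f32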

Comparison to V1

| | V1 | V2 |
|---|---|---|
| Strategy | Uniform Q4_K_M | Mixed-precision per component |
| imatrix | No | Yes (789 entries, 100 chunks) |
| Size | 352 GB | 320 GB |
| Routed experts | Q4_K_M | Q2_K |
| Shared expert | Q4_K_M | Q6_K |
| MLA attention | Q4_K_M | Q6_K / Q8_0 |
| Output/embed | Q4_K_M | Q8_0 |

imatrix Calibration

The importance matrix was computed from the BF16 source model using llama-imatrix on an x86 CPU server:

| Property | Value |
|---|---|
| Source model | Kimi-K2.6 BF16 GGUF shards (46 × ~44 GB) |
| Calibration data | Diverse English + code + reasoning text |
| Chunks | 100 |
| Entries in imatrix | 789 |
| imatrix file size | 1.53 GB |
| Compute time | ~5 days (CPU-only, 64 threads) |

The imatrix guides llama-quantize to allocate more precision to tensor elements that have the highest impact on output quality, independent of the per-component quant type decisions above.
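
As a sketch, the calibration run would have looked roughly like this. The shard name is a placeholder; -m, -f, -o and --chunks are standard llama-imatrix flags, but the exact invocation used for this card is not recorded here:

llama-imatrix \
  -m <first-BF16-shard>.gguf \
  -f calibration.txt \
  -o kimi_v2_bf16.imatrix \
  --chunks 100 \
  -t 64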


Build Details

| Property | Value |
|---|---|
| Source model | moonshotai/Kimi-K2-Instruct (BF16, 46 GGUF shards) |
| Quantized on | x86 server: 2× EPYC 7302, 64 threads, 251 GB DDR4 |
| llama-quantize build | commit 9d34231, GCC 13.3.0, AVX2, 64 threads |
| Quantization date | 2026-04-30 |
| Quantization time | ~3 hours (CPU, 64 threads) |
| Command | llama-quantize --imatrix kimi_v2_bf16.imatrix --tensor-type-file tensor_types.txt <src> <dst> Q4_K_M 64 |

Architecture

Base model: Kimi K2.6 (moonshotai/Kimi-K2-Instruct)

| Property | Value |
|---|---|
| Architecture | DeepSeek-V2 (MoE + MLA) |
| Parameters | ~1T total, ~32B active per token |
| Hidden size | 7,168 |
| Decoder layers | 61 |
| Routed experts | 384 (8 active per token) |
| Shared experts | 1 |
| Attention | Multi-head Latent Attention (MLA) |
| Native context window | 256K tokens (n_ctx_train = 262144) |
| Vision | MoonViT 400M (image input; video not currently supported by llama.cpp) |
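
These values can be read straight from the GGUF header, for example with the gguf-dump script from the gguf Python package (assuming its standard CLI; the --no-tensors flag skips the long per-tensor listing, and output formatting may vary by version):

pip install gguf
gguf-dump --no-tensors Kimi-K2.6-SmartQuant-V2.gguf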

Usage

llama.cpp (text only)

llama-cli \
  --model Kimi-K2.6-SmartQuant-V2.gguf \
  --ctx-size 32768 \
  --temp 0.6 --top-p 0.95 \
  -p "You are a helpful assistant."

llama.cpp (with image input)

llama-cli \
  --model Kimi-K2.6-SmartQuant-V2.gguf \
  --mmproj mmproj-F16.gguf \
  --ctx-size 32768 \
  --temp 0.6 --top-p 0.95

llama-server (API use)

llama-server \
  --model Kimi-K2.6-SmartQuant-V2.gguf \
  --ctx-size 131072 \
  --n-predict 4096 \
  -ngl 999 \
  --flash-attn \
  --cont-batching \
  --port 8080

-np (parallel sequences): Splits your total KV cache budget across N concurrent request slots. Total KV memory = ctx_size × n_parallel, allocated upfront regardless of actual load. -np 1 --ctx-size 524288 costs the same KV memory as -np 4 --ctx-size 131072 — the tradeoff is one large context vs. four smaller concurrent ones. For single-user local inference, -np 1 with the largest context you can afford is usually optimal.
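
A back-of-the-envelope estimator for that tradeoff, as a sketch only: the per-token KV size here is an assumption back-solved from the ~35 GB at 131K-context estimate below, not a measured value.

# Rough KV-cache budget calculator for this model.
# ASSUMPTION: ~267 KB of KV per token, derived from the card's
# "~35 GB KV cache at 131,072 tokens" estimate below.
KV_BYTES_PER_TOKEN = 35e9 / 131_072  # assumed, not measured

def kv_cache_gb(ctx_size: int, n_parallel: int = 1) -> float:
    """Total KV memory allocated up front: ctx_size tokens per slot, n_parallel slots."""
    return ctx_size * n_parallel * KV_BYTES_PER_TOKEN / 1e9

print(kv_cache_gb(524_288, 1))   # one large context slot   -> 140.0 GB
print(kv_cache_gb(131_072, 4))   # four smaller slots       -> 140.0 GB (same total)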

Recommended hardware: 384 GB+ of unified memory or VRAM for full GPU offload. CPU-only inference is possible with sufficient RAM (~350 GB+) at roughly 2 tok/s.


Memory Usage

Real-world measurements pending — will be updated once the model is fully benchmarked under load.

Estimated at 131K context (-np 1): ~330GB model weights + ~35GB KV cache ≈ ~365GB total.


Benchmark Results

Full benchmark suite in progress — results will be updated when complete. Evaluated using lm-evaluation-harness.

Open LLM Leaderboard v2

| Task | Metric | V1 | V2 |
|---|---|---|---|
| IFEval | prompt_level_strict_acc | pending | pending |
| BBH | normalized_acc | pending | pending |
| MATH Level 5 | exact_match | pending | pending |
| GPQA | acc_norm | pending | pending |
| MuSR | acc_norm | pending | pending |
| MMLU-Pro | acc | pending | pending |

Classic Benchmarks

| Task | Metric | V1 | V2 |
|---|---|---|---|
| ARC Challenge (25-shot) | acc_norm | pending | pending |
| ARC Easy (25-shot) | acc_norm | pending | pending |
| HellaSwag (10-shot) | acc_norm | pending | pending |
| TruthfulQA MC2 (0-shot) | acc | pending | pending |
| GSM8K (5-shot) | flexible-extract | pending | pending |
| MMLU (5-shot CoT) | flexible-extract | pending | pending |

Notes

  • Video input is not currently supported by llama.cpp (no upstream video pipeline); image input works via mmproj-F16.gguf
  • ffn_gate_inp and exp_probs_b tensors are kept at F32 — do not re-quantize these
  • For extended context beyond 256K, use YaRN: --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144 (full example command below)
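
For example, extending to 1M tokens (4 × 262,144) with llama-server; note that the KV cache at this length would be very large (see the estimator above):

llama-server \
  --model Kimi-K2.6-SmartQuant-V2.gguf \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 262144 \
  --ctx-size 1048576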

Attribution

This is a quantized derivative of Kimi K2.6 by Moonshot AI.

Changes Made

This repository contains a modified version of the original Kimi-K2-Instruct model weights. The following changes were made:

  1. Quantization: Model weights converted from BF16 to mixed-precision GGUF format using llama-quantize with an importance matrix (imatrix). Different tensor types received different quantization levels (Q2_K through Q8_0 and F32) based on architectural sensitivity — see Quantization Strategy table above.
  2. Format change: Converted from PyTorch/safetensors shards to single-file GGUF format compatible with llama.cpp.
  3. Vision encoder: mmproj-F16.gguf sourced separately from unsloth/Kimi-K2.6-GGUF and included as-is (F16, unmodified).

No fine-tuning, RLHF, or changes to model behavior were made. This is a precision-reduction of the original weights only.

License

This derivative work is released under the same Modified MIT License as the original model. The full license text is reproduced below, as its terms require:

Modified MIT License

Copyright (c) 2025 Moonshot AI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Our only modification part is that, if the Software (or any derivative works
thereof) is used for any of your commercial products or services that have
more than 100 million monthly active users, or more than 20 million US dollars
(or equivalent in other currencies) in monthly revenue, you shall prominently
display "Kimi K2" on the user interface of such product or service.

Quantized by FinnTheAI · 2026-04-30
