# Kimi-K2.6-SmartQuant-V2 GGUF

FinnTheAI's custom imatrix-guided mixed-precision quantization of moonshotai/Kimi-K2-Instruct.

V2 improves on V1 (352 GB uniform Q4_K_M) with a 10% size reduction (320 GB) and full imatrix calibration: importance-weighted quantization across all 1,096 tensors, guided by 789 calibration entries computed from 100 chunks of representative text.
## Model Files

| File | Size | Description |
|---|---|---|
| Kimi-K2.6-SmartQuant-V2.gguf | 320 GB | Main language model — imatrix mixed-precision |
| mmproj-F16.gguf | 908 MB | MoonViT vision encoder (F16, sourced from unsloth/Kimi-K2.6-GGUF) |
## Quantization Strategy

Mixed-precision design targeting the DeepSeek-V2 / MLA architecture. Each component type is assigned a quantization level based on its sensitivity to precision loss:

| Component | Quant | Rationale |
|---|---|---|
| Routed expert FFN (`ffn_*_exps`, 384×) | Q2_K | Only 8/384 experts fire per token — highly sparse, tolerant of aggressive quantization |
| Shared expert FFN (`ffn_*_shexp`) | Q6_K | Active on every token — requires higher precision |
| MLA value projections (`attn_v_b`, `attn_kv_a_mqa`) | Q8_0 | Most sensitive attention weights; quality-critical |
| MLA attention (`attn_q_a/b`, `attn_kv_a/b`, `attn_output`) | Q6_K | Latent compression layer; moderate sensitivity |
| Output (`lm_head`, `token_embd`) | Q8_0 | Directly affects output distribution |
| Boundary layers (blk.0–2, blk.58–60) | Q6_K | First/last decoder layers empirically more sensitive |
| Router/gate (`ffn_gate_inp`, `exp_probs_b`) | F32 | Tiny tensors, routing-critical — kept full precision |
| All other tensors | Q4_K_M | Base quantization |
| Vision encoder (MoonViT 400M) | F16 | Separate mmproj-F16.gguf file |

Effective bits per weight: ~4.5 bpw (text model)
Note on the quantize command: the `Q4_K_M` argument to `llama-quantize` sets the fallback type for any tensor not explicitly listed in `--tensor-type-file`. Since all 1,096 tensors in this model are covered by the tensor type file, the positional argument is effectively unused — every tensor's quant is determined by the file above.
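For reference, a hypothetical excerpt of what such a tensor type file might look like. The exact format expected by `--tensor-type-file` is not documented here, so the `pattern=type` syntax below is an assumption modeled on llama-quantize's `--tensor-type` override flag; only the tensor names and types come from the strategy table above:

```text
# Hypothetical excerpt of tensor_types.txt (format assumed: one pattern=type per line)
# Routed expert FFN — aggressive Q2_K (only 8/384 experts fire per token):
blk\.\d+\.ffn_(up|down|gate)_exps\.weight=q2_k
# Shared expert FFN — active on every token, kept at Q6_K:
blk\.\d+\.ffn_(up|down|gate)_shexp\.weight=q6_k
# MLA value projections — quality-critical, Q8_0:
blk\.\d+\.attn_(v_b|kv_a_mqa)\.weight=q8_0
# Router/gate — tiny and routing-critical, full precision:
blk\.\d+\.ffn_gate_inp\.weight=f32
```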
## Comparison to V1

| | V1 | V2 |
|---|---|---|
| Strategy | Uniform Q4_K_M | Mixed-precision per component |
| imatrix | No | Yes (789 entries, 100 chunks) |
| Size | 352 GB | 320 GB |
| Routed experts | Q4_K_M | Q2_K |
| Shared expert | Q4_K_M | Q6_K |
| MLA attention | Q4_K_M | Q6_K / Q8_0 |
| Output/embed | Q4_K_M | Q8_0 |
## imatrix Calibration

The importance matrix was computed from the BF16 source model using `llama-imatrix` on an x86 CPU server:
| Property | Value |
|---|---|
| Source model | Kimi-K2.6 BF16 GGUF shards (46 × ~44 GB) |
| Calibration data | Diverse English + code + reasoning text |
| Chunks | 100 |
| Entries in imatrix | 789 |
| imatrix file size | 1.53 GB |
| Compute time | ~5 days (CPU-only, 64 threads) |
The imatrix guides `llama-quantize` to allocate more precision to tensor elements that have the highest impact on output quality, independent of the per-component quant type decisions above.
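A command of roughly this shape produces such an imatrix. The shard and calibration file names below are illustrative, not the exact paths used for this build; the flags are standard `llama-imatrix` options:

```bash
# Paths are illustrative. Pointing -m at the first shard loads all 46 shards.
llama-imatrix \
  -m Kimi-K2.6-BF16-00001-of-00046.gguf \
  -f calibration.txt \
  -o kimi_v2_bf16.imatrix \
  --chunks 100 \
  -t 64
```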
## Build Details
| Property | Value |
|---|---|
| Source model | moonshotai/Kimi-K2-Instruct (BF16, 46 GGUF shards) |
| Quantized on | x86 server — 2× EPYC 7302, 64 threads, 251 GB DDR4 |
| llama-quantize build | commit 9d34231, GCC 13.3.0, AVX2, 64 threads |
| Quantization date | 2026-04-30 |
| Quantization time | ~3 hours (CPU, 64 threads) |
| Command | `llama-quantize --imatrix kimi_v2_bf16.imatrix --tensor-type-file tensor_types.txt <src> <dst> Q4_K_M 64` |
## Architecture

Base model: Kimi K2.6 (moonshotai/Kimi-K2-Instruct)
| Property | Value |
|---|---|
| Architecture | DeepSeek-V2 (MoE + MLA) |
| Parameters | ~1T total, ~32B active per token |
| Hidden size | 7,168 |
| Decoder layers | 61 |
| Routed experts | 384 (8 active per token) |
| Shared experts | 1 |
| Attention | Multi-head Latent Attention (MLA) |
| Native context window | 256K tokens (n_ctx_train = 262144) |
| Vision | MoonViT 400M (image input; video not currently supported by llama.cpp) |
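To verify these properties against the downloaded file, the `gguf-dump` script from llama.cpp's gguf-py package can print the GGUF metadata. A sketch — the grep patterns assume the usual `deepseek2.*` metadata key names:

```bash
# Install the gguf Python package, then dump metadata only (skip tensor listing)
pip install gguf
gguf-dump --no-tensors Kimi-K2.6-SmartQuant-V2.gguf \
  | grep -E "context_length|expert_count|block_count"
```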
## Usage

### llama.cpp (text only)

```bash
llama-cli \
  --model Kimi-K2.6-SmartQuant-V2.gguf \
  --ctx-size 32768 \
  --temp 0.6 --top-p 0.95 \
  -p "You are a helpful assistant."
```
### llama.cpp (with image input)

```bash
llama-cli \
  --model Kimi-K2.6-SmartQuant-V2.gguf \
  --mmproj mmproj-F16.gguf \
  --ctx-size 32768 \
  --temp 0.6 --top-p 0.95
```

Depending on your llama.cpp build, image files are passed with an `--image <file>` flag or via the dedicated `llama-mtmd-cli` tool; check `--help` for your version.
### llama-server (API use)

```bash
llama-server \
  --model Kimi-K2.6-SmartQuant-V2.gguf \
  --ctx-size 131072 \
  --n-predict 4096 \
  -ngl 999 \
  --flash-attn \
  --cont-batching \
  --port 8080
```
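Once running, llama-server exposes an OpenAI-compatible HTTP API. A minimal request sketch (port as configured above; the prompt is just an example):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Explain MLA in one paragraph."}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": 512
      }'
```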
`-np` (parallel sequences): splits the KV cache budget across N concurrent request slots. In llama-server, `--ctx-size` is the total context, allocated upfront regardless of actual load and divided evenly among slots — each slot gets ctx_size / n_parallel tokens. So `-np 4 --ctx-size 524288` costs the same KV memory as `-np 1 --ctx-size 524288`; the tradeoff is one large context vs. four smaller concurrent ones. For single-user local inference, `-np 1` with the largest context you can afford is usually optimal.
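A minimal illustration of that tradeoff, reusing the 131072-token budget from the server example above (other flags omitted for brevity):

```bash
# One slot with the full 131072-token window (single user):
llama-server --model Kimi-K2.6-SmartQuant-V2.gguf --ctx-size 131072 -np 1 --port 8080

# Same total KV memory, four concurrent slots of 32768 tokens each:
llama-server --model Kimi-K2.6-SmartQuant-V2.gguf --ctx-size 131072 -np 4 --port 8080
```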
Recommended hardware: 384 GB+ unified memory or VRAM for full GPU offload. Can run CPU-only with sufficient RAM (~350 GB+) at ~2 tok/s.
## Memory Usage

⏳ Real-world measurements pending — will be updated once the model is fully benchmarked under load.

Estimated at 131K context (`-np 1`): ~330 GB model weights + ~35 GB KV cache ≈ ~365 GB total.
## Benchmark Results

⏳ Full benchmark suite in progress — results will be updated when complete. Evaluated using lm-evaluation-harness.

### Open LLM Leaderboard v2
| Task | Metric | V1 | V2 |
|---|---|---|---|
| IFEval | prompt_level_strict_acc | pending | pending |
| BBH | normalized_acc | pending | pending |
| MATH Level 5 | exact_match | pending | pending |
| GPQA | acc_norm | pending | pending |
| MuSR | acc_norm | pending | pending |
| MMLU-Pro | acc | pending | pending |
### Classic Benchmarks
| Task | Metric | V1 | V2 |
|---|---|---|---|
| ARC Challenge (25-shot) | acc_norm | pending | pending |
| ARC Easy (25-shot) | acc_norm | pending | pending |
| HellaSwag (10-shot) | acc_norm | pending | pending |
| TruthfulQA MC2 (0-shot) | acc | pending | pending |
| GSM8K (5-shot) | flexible-extract | pending | pending |
| MMLU (5-shot CoT) | flexible-extract | pending | pending |
## Notes

- Video input is not currently supported by llama.cpp (no upstream video pipeline); image input works via `mmproj-F16.gguf`.
- The `ffn_gate_inp` and `exp_probs_b` tensors are kept at F32 — do not re-quantize these.
- For extended context beyond 256K, use YaRN: `--rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144` (see the sketch after this list).
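A hedged sketch of an extended-context server invocation using those flags — the 512K context length here is illustrative, and KV cache memory grows proportionally with it:

```bash
llama-server \
  --model Kimi-K2.6-SmartQuant-V2.gguf \
  --ctx-size 524288 \
  --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144 \
  --flash-attn \
  --port 8080
```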
## Attribution
This is a quantized derivative of Kimi K2.6 by Moonshot AI.
- Original model: moonshotai/Kimi-K2-Instruct
- Original authors: Moonshot AI (Copyright © 2025)
- Quantization by: FinnTheAI
## Changes Made

This repository contains a modified version of the original Kimi-K2-Instruct model weights. The following changes were made:

- Quantization: model weights converted from BF16 to mixed-precision GGUF format using `llama-quantize` with an importance matrix (imatrix). Different tensor types received different quantization levels (Q2_K through Q8_0 and F32) based on architectural sensitivity — see the Quantization Strategy table above.
- Format change: converted from PyTorch/safetensors shards to single-file GGUF format compatible with llama.cpp.
- Vision encoder: `mmproj-F16.gguf` sourced separately from unsloth/Kimi-K2.6-GGUF and included as-is (F16, unmodified).
No fine-tuning, RLHF, or changes to model behavior were made. This is a precision-reduction of the original weights only.
## License

This derivative work is released under the same Modified MIT License as the original model. Full license text (as required by MIT):
Modified MIT License
Copyright (c) 2025 Moonshot AI
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Our only modification part is that, if the Software (or any derivative works
thereof) is used for any of your commercial products or services that have
more than 100 million monthly active users, or more than 20 million US dollars
(or equivalent in other currencies) in monthly revenue, you shall prominently
display "Kimi K2" on the user interface of such product or service.
Quantized by FinnTheAI · 2026-04-30