Kimi K2.6 REAM-25% — INT4 Quantization
This is a 25% expert-merged, INT4-quantized variant of moonshotai/Kimi-K2.6, a ~1T-parameter Mixture-of-Experts (MoE) multimodal model with native vision capabilities.
Why this model exists: The full Kimi-K2.6 model (384 experts per MoE layer, ~555 GB) cannot fit on 4× H200 (4× 141 GB = 564 GB) with any meaningful KV cache, making the full 262,144-token context window unusable. This variant merges 25% of routed experts via REAM and quantizes to INT4, reducing model size to ~422 GB — fitting on 4× H200 with sufficient VRAM headroom for KV cache at full 256K context. All throughput benchmarks below were measured on 4× H200 NVL.
If you have access to larger hardware (8× H200, B200, etc.), you are better off using the full moonshotai/Kimi-K2.6 model instead. This quantized variant is specifically for 4× H200 deployments where the full model cannot run.
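The sizing argument above can be checked with a few lines of arithmetic. This is a rough sketch only: it assumes all quoted sizes use the same unit and ignores activation and workspace memory, so real headroom will be smaller. The `utilization` argument is a hypothetical knob standing in for a cap such as the `gpu_memory_utilization` value used in the vLLM settings in this card.

```python
# Back-of-the-envelope VRAM budget using the sizes quoted in this card.
# Assumption: all figures share one unit (GB) and activation/workspace
# overhead is ignored, so real headroom will be somewhat smaller.

GPU_COUNT = 4
VRAM_PER_GPU_GB = 141
FULL_MODEL_GB = 555      # full Kimi-K2.6, as quoted above
MERGED_INT4_GB = 422     # this variant, as quoted above


def headroom_gb(model_gb: float, utilization: float = 1.0) -> float:
    """VRAM left for KV cache after loading the weights, given a utilization cap."""
    usable = GPU_COUNT * VRAM_PER_GPU_GB * utilization
    return usable - model_gb


for name, size in [("full", FULL_MODEL_GB), ("REAM-25% INT4", MERGED_INT4_GB)]:
    total = headroom_gb(size)
    print(f"{name:>14}: {total:6.1f} GB total, {total / GPU_COUNT:5.1f} GB per GPU")
```

With no cap, the full model leaves about 9 GB across all four GPUs, while this variant leaves about 142 GB.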
The base model has 384 routed experts per MoE layer. REAM (Router-weighted Expert Activation Merging) reduces this to 288 experts (25% merge) based on saliency scores from a multi-domain calibration dataset. The resulting model is quantized to INT4 g32 symmetric pack-quantized format (compressed-tensors), with the full vision encoder preserved in BF16.
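This card does not spell out the merge procedure, so the following is only a toy illustration of saliency-guided expert merging, not the REAM algorithm itself. It assumes each expert is a flat weight vector, that per-expert saliency scores are already available, that a low-saliency expert is folded into its most similar retained expert (cosine similarity), and that the fold is a saliency-weighted average. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_experts(weights, saliency, merge_fraction=0.25):
    """Fold the lowest-saliency experts into their nearest retained expert.

    ASSUMPTION: illustrative scheme only; the real REAM procedure may differ.
    Returns (merged_weights, kept_indices).
    """
    n = len(weights)
    n_drop = int(n * merge_fraction)
    order = sorted(range(n), key=lambda i: saliency[i])
    dropped, kept = order[:n_drop], sorted(order[n_drop:])

    merged = {i: list(weights[i]) for i in kept}
    mass = {i: saliency[i] for i in kept}  # accumulated saliency per kept expert

    for d in dropped:
        # Nearest retained expert, measured against the original weights.
        target = max(kept, key=lambda k: cosine(weights[d], weights[k]))
        total = mass[target] + saliency[d]
        if total > 0:
            merged[target] = [
                (mass[target] * wt + saliency[d] * wd) / total
                for wt, wd in zip(merged[target], weights[d])
            ]
            mass[target] = total
    return [merged[i] for i in kept], kept


# Four toy experts; the one with the lowest saliency (index 2) is merged away.
toy_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
toy_saliency = [4.0, 3.0, 1.0, 2.0]
toy_merged, toy_kept = merge_experts(toy_weights, toy_saliency)
print(toy_kept, toy_merged)
```

Applied to 384 experts with `merge_fraction=0.25`, this scheme removes 96 experts and leaves 288, matching the counts quoted above.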
Model Details
| Property | Value |
|---|---|
| Base Model | moonshotai/Kimi-K2.6 |
| Quantization Method | llm-compressor oneshot (INT4, group_size=32, symmetric) |
| Weight Precision | INT4 (4-bit, symmetric) |
| Activation Precision | BF16 (weight-only quantization) |
| Group Size | 32 |
| Quantization Library | llm-compressor |
| Format | compressed-tensors (pack-quantized) |
| Architecture | KimiK25ForConditionalGeneration |
| Model Type | deepseek_v3 MoE |
| Total Parameters | ~1T |
| Active Parameters | 32B |
| MoE Layers | 60 routed + 1 shared |
| Routed Experts (base) | 384 |
| Routed Experts (merged) | 288 (25% reduction) |
| Shared Experts | 1 |
| Attention Heads | 64 |
| Key/Value Heads | 64 |
| KV LoRA Rank | 512 |
| Hidden Size | 7168 |
| MoE Intermediate Size | 2048 |
| Vocabulary Size | 163,840 |
| Context Window | 262,144 tokens |
| Vision Encoder | MoonViT (27 layers, hidden_size=1152, patch_size=14) |
| Quantized Components | Text decoder Linear layers only |
| Preserved in BF16 | Full vision encoder (all 27 layers) |
| Model Size | ~422 GB |
| Expert Merging | REAM (Router-weighted Expert Activation Merging, 25% merge) |
| License | Modified MIT |
Quantization Details
Quantization Configuration
{
"config_groups": {
"group_0": {
"targets": ["Linear"],
"weights": {
"num_bits": 4,
"type": "int",
"group_size": 32,
"strategy": "group",
"symmetric": true
}
}
},
"format": "pack-quantized",
"quant_method": "compressed-tensors",
"quantization_status": "compressed"
}
Key parameters:
- INT4 symmetric: 4-bit integer weights with symmetric quantization (no zero-point)
- group_size=32: Per-32-element scaling groups (fine-grained, high quality)
- pack-quantized format: Packed INT4 weights for efficient storage and Marlin kernel execution
- Expert merging via REAM: Router-weighted saliency scores from a multi-domain calibration dataset determine which experts to merge; 25% of routed experts are removed by fusing low-saliency experts into the remaining ones
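To make the first three parameters concrete, here is a minimal pure-Python sketch of symmetric INT4 group quantization with `group_size=32`, plus a simple nibble-packing scheme. It is not the llm-compressor or compressed-tensors implementation: the scale convention (`absmax / 7`), the low-nibble-first 32-bit packing, and the 16-bit scale assumed in `bits_per_weight` are illustrative assumptions, and the real on-disk layout may differ.

```python
GROUP_SIZE = 32
QMIN, QMAX = -8, 7  # signed 4-bit integer range


def quantize_group(values):
    """Symmetric INT4 quantization of one group; returns (codes, scale)."""
    absmax = max(abs(v) for v in values)
    scale = absmax / QMAX if absmax > 0 else 1.0  # ASSUMPTION: absmax/7 convention
    codes = [max(QMIN, min(QMAX, round(v / scale))) for v in values]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate real values (no zero-point)."""
    return [c * scale for c in codes]


def pack_int4(codes):
    """Pack eight signed 4-bit codes into each 32-bit word, low nibble first."""
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, c in enumerate(codes[i:i + 8]):
            word |= (c & 0xF) << (4 * j)
        words.append(word)
    return words


def unpack_int4(words, count):
    """Inverse of pack_int4; restores the sign of each 4-bit nibble."""
    codes = []
    for word in words:
        for j in range(8):
            nib = (word >> (4 * j)) & 0xF
            codes.append(nib - 16 if nib >= 8 else nib)
    return codes[:count]


def bits_per_weight(group_size=GROUP_SIZE, scale_bits=16):
    """Effective storage cost: 4-bit code plus one shared scale per group."""
    return 4 + scale_bits / group_size


group = [(i - 16) / 16 for i in range(GROUP_SIZE)]  # 32 values in [-1.0, 0.9375]
codes, scale = quantize_group(group)
restored = dequantize_group(unpack_int4(pack_int4(codes), len(codes)), scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.4f} max_err={max_err:.4f} bits/weight={bits_per_weight()}")
```

Under the 16-bit-scale assumption, a group size of 32 costs 4.5 bits per weight. Smaller groups track local weight ranges more closely at the price of more scale overhead.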
Calibration Dataset
The REAM expert merging and INT4 quantization calibration used a multi-domain composite dataset of ~1,280 batches (batch_size=4, ~5,120 samples, model_max_length=2048):
| Source | Domain | Batches | HF ID / Path |
|---|---|---|---|
| Mixture-of-Thoughts (code subset) | SDLC code reasoning traces | 512 | open-r1/Mixture-of-Thoughts[code] |
| Magicoder-Evol-Instruct | Code instruction (SDLC) | 256 | ise-uiuc/Magicoder-Evol-Instruct-110K |
| Slovak SFT | Slovak language instruction | 256 | Local: slovak_sft_256.jsonl |
| OpenR1 Math | Mathematical reasoning | 256 | Local: openr1_math_256.jsonl |
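The batch counts in the table can be tallied to confirm the totals quoted above and to see the domain mix. This is just arithmetic on the table values; the token figure is an upper bound that assumes every sample fills the full `model_max_length`, which the card does not state.

```python
BATCH_SIZE = 4
MODEL_MAX_LENGTH = 2048

batches = {
    "Mixture-of-Thoughts (code subset)": 512,
    "Magicoder-Evol-Instruct": 256,
    "Slovak SFT": 256,
    "OpenR1 Math": 256,
}

total_batches = sum(batches.values())
total_samples = total_batches * BATCH_SIZE
# Upper bound only: assumes every sample is padded or truncated to the max length.
max_calibration_tokens = total_samples * MODEL_MAX_LENGTH
shares = {name: count / total_batches for name, count in batches.items()}

print(f"{total_batches} batches, {total_samples} samples, "
      f"at most {max_calibration_tokens:,} tokens")
for name, share in shares.items():
    print(f"{name:<36} {share:.0%}")
```

Code-oriented sources make up 60% of the batches, with Slovak instruction data and math reasoning at 20% each.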
Quality Benchmarks
Perplexity was measured on wikitext-2-raw-v1 (test split) in 512-token chunks.
WikiText-2 Perplexity (ctx=512)
| Metric | Value |
|---|---|
| PPL | 5.0325 |
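For readers unfamiliar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows that relationship and a simple non-overlapping chunking step. The evaluation script is not part of this card, so whether chunks overlap and which tokens are scored are assumptions here; the function names are hypothetical.

```python
import math


def chunk(tokens, size=512):
    """Split a token sequence into consecutive, non-overlapping chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over scored tokens."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 10))

# Conversely, the reported PPL corresponds to this mean NLL in nats per token.
print(round(math.log(5.0325), 3))
```

A PPL of 5.0325 therefore corresponds to a mean loss of roughly 1.616 nats per token.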
Throughput Benchmarks
All throughput benchmarks measured on 4× H200 NVL (4× 141 GB, tensor-parallel=4, expert-parallel enabled).
Single Request (concurrency=1)
20 requests, max_tokens=1024, temperature=0.9.
| Metric | Value |
|---|---|
| Aggregate throughput | 69.4 tok/s |
| Per-request throughput | min 63.7 / avg 69.4 / max 70.1 tok/s |
| Average latency | 14.76 s |
| Min / Max latency | 14.60 s / 16.08 s |
| Completion tokens | min 1024 / avg 1024 / max 1024 |
| Success rate | 20/20 (100%) |
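The per-request throughput figures follow directly from the latencies, since every request produced exactly 1024 completion tokens. A quick check, assuming throughput is simply completion tokens divided by end-to-end latency:

```python
def tokens_per_second(completion_tokens, latency_s):
    """End-to-end decode rate for one request."""
    return completion_tokens / latency_s


TOKENS = 1024
latencies = {"average": 14.76, "fastest": 14.60, "slowest": 16.08}

for label, latency in latencies.items():
    print(f"{label:>8}: {tokens_per_second(TOKENS, latency):.1f} tok/s")
```

The fastest request maps to the maximum throughput and the slowest to the minimum.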
16 Concurrent Requests
320 total requests, max_tokens=1024, temperature=0.9.
| Metric | Value |
|---|---|
| Aggregate throughput | 590.2 tok/s |
| Per-request throughput | min 11.2 / avg 37.0 / max 48.5 tok/s |
| Average latency | 20.76 s |
| Min / Max latency | 1.21 s / 31.30 s |
| Completion tokens | min 40 / avg 781 / max 1024 |
| Success rate | 320/320 (100%) |
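A few derived numbers help compare the two runs. This is arithmetic on the reported values only; the average completion length is rounded in the table, so the total token count and implied wall-clock time are approximate.

```python
SINGLE_STREAM_TOK_S = 69.4   # aggregate throughput at concurrency=1
CONCURRENT_TOK_S = 590.2     # aggregate throughput at concurrency=16
CONCURRENCY = 16
REQUESTS = 320
AVG_COMPLETION_TOKENS = 781  # rounded in the table, so totals are approximate

total_tokens = REQUESTS * AVG_COMPLETION_TOKENS
approx_wall_time_s = total_tokens / CONCURRENT_TOK_S
speedup = CONCURRENT_TOK_S / SINGLE_STREAM_TOK_S
per_slot_tok_s = CONCURRENT_TOK_S / CONCURRENCY

print(f"total tokens      ~ {total_tokens:,}")
print(f"implied wall time ~ {approx_wall_time_s:.0f} s")
print(f"aggregate speedup ~ {speedup:.1f}x over a single stream")
print(f"per-slot rate     ~ {per_slot_tok_s:.1f} tok/s")
```

Sixteen concurrent requests give roughly 8.5x the aggregate throughput of a single stream, while each individual stream runs at a bit more than half its single-request speed.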
Hardware Requirements
| GPU | VRAM | Quantization | Notes |
|---|---|---|---|
| 4× H200 NVL | 141 GB each | INT4 compressed-tensors | Target hardware — full 256K context + KV cache |
| 8× H100 / H200 | 80 GB / 141 GB each | — | Use full model instead |
| 8× B200 | — | — | Use full model instead |
| 2× H200 NVL | 141 GB each | INT4 | Insufficient VRAM |
Minimum: 4× GPU with ≥120 GB VRAM each for this quantized model.
Users with 8× H200, B200, or larger setups: the full unquantized moonshotai/Kimi-K2.6 will deliver better quality. This model exists because the full model cannot run on 4× H200 with usable KV cache.
INT4 pack-quantized inference requires a CUDA GPU with compute capability sm80 or newer (Ampere and later). This model uses custom MoE linear layers with INT4 weights and BF16 activations.
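The sm80 requirement can be expressed as a simple comparison on the `(major, minor)` compute capability pair. The helper name below is hypothetical, and the small table lists well-known capabilities for illustration.

```python
def meets_sm80(capability):
    """True if a (major, minor) compute capability is sm80 or newer."""
    return tuple(capability) >= (8, 0)


# Compute capabilities of a few data-center GPUs, for illustration.
EXAMPLES = {"V100": (7, 0), "A100": (8, 0), "H100 / H200": (9, 0)}

for name, cap in EXAMPLES.items():
    status = "ok" if meets_sm80(cap) else "too old"
    print(f"{name:<12} sm{cap[0]}{cap[1]}: {status}")
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability(i)` returns such a pair for device `i` and can be passed straight to the helper.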
When to Use This Model
| Your Hardware | Recommendation |
|---|---|
| 4× H200 (141 GB each) | ✅ This model — full 256K context with KV cache |
| 8× H200 / 8× H100 | ❌ Use full moonshotai/Kimi-K2.6 instead |
| B200 / 8× B200 | ❌ Use full moonshotai/Kimi-K2.6 instead |
| 2× H200 | ❌ Insufficient VRAM |
This model exists because the full Kimi-K2.6 (~555 GB, 384 experts) cannot run on 4× H200 with any usable KV cache. REAM expert merging + INT4 quantization reduces the model to ~422 GB, leaving enough VRAM for KV cache at full 262,144-token context. If your hardware can accommodate the full model, use it — quality will be higher.
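To get a feel for how much KV cache a full-length sequence needs, here is a rough estimate for a latent-attention cache. Only the KV LoRA rank and the context length come from this card. The rotary head dimension (64) and layer count (61) are assumptions not stated here, and one byte per element assumes an FP8 KV cache as in the vLLM settings in this card. Block allocation overhead and any replication across tensor-parallel ranks are not modeled.

```python
KV_LORA_RANK = 512        # from the Model Details table
ROPE_HEAD_DIM = 64        # ASSUMPTION: not stated in this card
NUM_LAYERS = 61           # ASSUMPTION: not stated in this card
BYTES_PER_ELEMENT = 1     # FP8 KV cache
CONTEXT_TOKENS = 262_144  # full context window


def kv_bytes_per_token(rank=KV_LORA_RANK, rope=ROPE_HEAD_DIM,
                       layers=NUM_LAYERS, elem_bytes=BYTES_PER_ELEMENT):
    """Bytes of compressed KV state stored per token across all layers."""
    return (rank + rope) * layers * elem_bytes


def kv_cache_gib(tokens, **kwargs):
    """Cache size in GiB for a single sequence of the given length."""
    return tokens * kv_bytes_per_token(**kwargs) / 2**30


print(f"{kv_bytes_per_token():,} bytes per token")
print(f"{kv_cache_gib(CONTEXT_TOKENS):.2f} GiB for one full-context sequence")
```

Under these assumptions one full-context sequence needs on the order of 8 to 9 GiB of cache, which is why the extra headroom from merging and quantization matters.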
Usage with vLLM
Tested with: vllm/vllm-openai:v0.20.0-cu130-ubuntu2404 on 4× H200 NVL
Docker Deployment
docker run -d --name vllm-kimi \
--gpus all \
--shm-size=16g \
--network host \
--ipc host \
--pid host \
--restart=unless-stopped \
-e CUDA_VISIBLE_DEVICES=0,1,2,3 \
-e NCCL_P2P_DISABLE=0 \
-e NCCL_P2P_LEVEL=NVL \
-e NCCL_IB_DISABLE=1 \
-e NCCL_NET_DISABLE=1 \
-e NCCL_SHM_DISABLE=0 \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
-v /path/to/model:/model \
vllm/vllm-openai:v0.20.0-cu130-ubuntu2404 \
/model \
--served-model-name kimi-k2.6-ream-25 \
--trust-remote-code \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--disable-custom-all-reduce \
--gpu-memory-utilization 0.92 \
--max-model-len 262144 \
--max-num-batched-tokens 8192 \
--max-num-seqs 8 \
--enable-chunked-prefill \
--enable-prefix-caching \
--kv-cache-dtype fp8_e4m3 \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--enable-auto-tool-choice \
--dtype bfloat16 \
--host 0.0.0.0 \
--port 8000
Example vLLM Configuration (YAML)
This configuration is deployed and verified on 4× H200 NVL:
# -- Model & Server ----------------------------------------------------------
model: /model
host: "0.0.0.0"
port: 8000
served_model_name: "kimi-k2.6-ream-25"
trust_remote_code: true
tensor_parallel_size: 4
enable_expert_parallel: true
# -- Quantization ------------------------------------------------------------
quantization: compressed-tensors
# -- Data Type ---------------------------------------------------------------
dtype: bfloat16
# -- Context & Batching ------------------------------------------------------
max_model_len: 262144
max_num_batched_tokens: 8192
max_num_seqs: 8
enable_chunked_prefill: true
# -- Memory ------------------------------------------------------------------
gpu_memory_utilization: 0.92
enable_prefix_caching: true
kv_cache_dtype: fp8_e4m3
# -- Tool Calling & Reasoning ------------------------------------------------
enable_auto_tool_choice: true
tool_call_parser: kimi_k2
reasoning_parser: kimi_k2
# -- NCCL / Multi-GPU --------------------------------------------------------
disable_custom_all_reduce: true
# -- Misc --------------------------------------------------------------------
enforce_eager: false
Inference Test
# Text completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"kimi-k2.6-ream-25","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
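The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server exposes the OpenAI-compatible `/v1/chat/completions` route on `localhost:8000`, as in the curl call above, and returns the usual `choices[0].message.content` shape. The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_payload(prompt, model="kimi-k2.6-ream-25", max_tokens=10):
    """Build the JSON body for a single-turn chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url="http://localhost:8000", timeout=60, **kwargs):
    """POST the request and return the assistant message content."""
    payload = build_chat_payload(prompt, **kwargs)
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires the vLLM server to be running):
# print(chat("What is 2+2? One word."))
```

With a reasoning parser enabled and a very small `max_tokens`, the `content` field may come back empty because the budget is spent on reasoning tokens; raising `max_tokens` is the simplest remedy.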
Transformers / Python
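import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gratex/Kimi-K2.6-REAM-25"

# device_map="auto" shards the weights across all visible GPUs.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# return_dict=True yields input_ids and attention_mask, so the result can be unpacked into generate().
messages = [{"role": "user", "content": "What is 2+2?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

# Generate, then decode only the newly generated tokens.
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))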
Files in This Repository
| File | Description |
|---|---|
| model-*.safetensors | Quantized model weights (INT4 LM + BF16 vision) |
| config.json | Model configuration with quantization_config |
| tokenizer.json | Vocabulary file |
| tokenizer_config.json | Tokenizer configuration |
| generation_config.json | Default generation parameters |
| preprocessor_config.json | Image preprocessor configuration |
License
This quantized model is released under the Modified MIT License, following the base model's license.
The base model moonshotai/Kimi-K2.6 is licensed under Modified MIT. See LICENSE for the full license text.
Citation
If you use this model in your research or production systems, please cite:
@misc{kimi-k2.6-ream-25,
title = {Kimi K2.6 REAM-25% INT4 Quantization},
author = {Gratex International},
year = {2026},
howpublished = {\url{https://huggingface.co/gratex/Kimi-K2.6-REAM-25}},
note = {REAM expert merge + INT4 quantization, 4× H200 deployment}
}
Acknowledgments
This quantization was produced using hardware and infrastructure provided by Gratex International, a.s.
Base Model: moonshotai/Kimi-K2.6
Expert Merging: REAM (Router-weighted Expert Activation Merging)
Quantization Tool: llm-compressor
Quantization Format: compressed-tensors
Deployment Engine: vLLM