Kimi K2.6 REAM-25% — INT4 Quantization

This is a 25% expert-merged, INT4 quantized variant of moonshotai/Kimi-K2.6, a ~1T parameter Mixture-of-Experts (MoE) multimodal model with native vision capabilities.

Why this model exists: The full Kimi-K2.6 model (384 experts per MoE layer, ~555 GB) nearly fills 4× H200 (4× 141 GB = 564 GB), leaving no meaningful room for KV cache and making the full 262,144-token context window unusable. This variant merges 25% of routed experts via REAM and quantizes to INT4, reducing model size to ~422 GB. That fits on 4× H200 with roughly 142 GB of VRAM left over for KV cache and runtime overhead, enough for the full 256K context. All throughput benchmarks below were measured on 4× H200 NVL.
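The memory argument can be checked with a few lines of arithmetic. This is a rough sketch only: it uses the approximate sizes quoted in this card, treats GB loosely, and ignores activations, CUDA graphs and other runtime overhead. The `headroom_gb` helper is a hypothetical name introduced for illustration.

```python
GPU_COUNT = 4
VRAM_PER_GPU_GB = 141          # H200, as stated in this card
FULL_MODEL_GB = 555            # full Kimi-K2.6, approximate
MERGED_INT4_GB = 422           # this variant, approximate
GPU_MEMORY_UTILIZATION = 0.92  # value used in the vLLM configuration in this card


def headroom_gb(model_gb: float, utilization: float = 1.0) -> float:
    """VRAM left after loading the weights, summed over all GPUs."""
    usable = GPU_COUNT * VRAM_PER_GPU_GB * utilization
    return usable - model_gb


for name, size in [("full", FULL_MODEL_GB), ("REAM-25 INT4", MERGED_INT4_GB)]:
    raw = headroom_gb(size)
    capped = headroom_gb(size, GPU_MEMORY_UTILIZATION)
    print(f"{name}: {raw:.1f} GB raw headroom, {capped:.1f} GB at 92% utilization")
```

Under these assumptions the full model has single-digit GB of raw headroom and a negative budget once the 0.92 utilization cap is applied, while this variant keeps a comfortable margin for KV cache.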

If you have access to larger hardware (8× H200, B200, etc.), you are better off using the full moonshotai/Kimi-K2.6 model instead. This quantized variant is specifically for 4× H200 deployments where the full model cannot run.

The base model has 384 routed experts per MoE layer. REAM (Router-weighted Expert Activation Merging) reduces this to 288 experts, a 25% reduction, by merging low-saliency experts into the remaining ones, with saliency scores computed on a multi-domain calibration dataset. The merged model is then quantized to INT4 (group size 32, symmetric) in the compressed-tensors pack-quantized format, with the full vision encoder preserved in BF16.
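To make the idea of saliency-guided merging concrete, here is a toy sketch. It is not the REAM algorithm: the saliency values, the nearest-neighbour assignment by Euclidean distance, and the saliency-weighted average are all simplifying assumptions chosen for illustration, and router weights (which a real method also has to adjust) are ignored. `merge_experts` is a hypothetical helper.

```python
def merge_experts(weights, saliency, keep):
    """Toy merge: keep the `keep` most salient experts, fold the rest into them.

    weights  : list of expert weight vectors (list of lists of floats)
    saliency : one positive score per expert (assumed to be given)
    keep     : number of experts to retain
    """
    order = sorted(range(len(weights)), key=lambda i: saliency[i], reverse=True)
    kept, dropped = order[:keep], order[keep:]

    # Saliency-weighted accumulators for each retained expert.
    acc = {k: [w * saliency[k] for w in weights[k]] for k in kept}
    mass = {k: saliency[k] for k in kept}

    for d in dropped:
        # Assign the dropped expert to the closest retained expert.
        target = min(
            kept,
            key=lambda k: sum((a - b) ** 2 for a, b in zip(weights[k], weights[d])),
        )
        acc[target] = [a + saliency[d] * w for a, w in zip(acc[target], weights[d])]
        mass[target] += saliency[d]

    # Normalise so each merged expert is a weighted average of its members.
    return [[a / mass[k] for a in acc[k]] for k in kept]


experts = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [5.0, 5.0]]
scores = [3.0, 2.0, 1.0, 4.0]
merged = merge_experts(experts, scores, keep=3)
print(len(merged), merged)

# The same 25% reduction applied to the base model's expert count.
print(int(384 * 0.75))
```

In the toy example the least salient expert is folded into its nearest neighbour, so four experts become three; at the scale of this model the same ratio takes 384 routed experts to 288.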

Model Details

Property	Value
Base Model	moonshotai/Kimi-K2.6
Quantization Method	llm-compressor oneshot (INT4, group_size=32, symmetric)
Weight Precision	INT4 (4-bit, symmetric)
Activation Precision	BF16 (weight-only quantization)
Group Size	32
Quantization Library	llm-compressor
Format	compressed-tensors (pack-quantized)
Architecture	KimiK25ForConditionalGeneration
Model Type	deepseek_v3 MoE
Total Parameters	~1T
Active Parameters	32B
MoE Layers	60 routed + 1 shared
Routed Experts (base)	384
Routed Experts (merged)	288 (25% reduction)
Shared Experts	1
Attention Heads	64
Key/Value Heads	64
KV LoRA Rank	512
Hidden Size	7168
MoE Intermediate Size	2048
Vocabulary Size	163,840
Context Window	262,144 tokens
Vision Encoder	MoonViT (27 layers, hidden_size=1152, patch_size=14)
Quantized Components	Text decoder Linear layers only
Preserved in BF16	Full vision encoder (all 27 layers)
Model Size	~422 GB
Expert Merging	REAM (Router-weighted Expert Activation Merging, 25% merge)
License	Modified MIT

Quantization Details

Quantization Configuration

{
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "weights": {
        "num_bits": 4,
        "type": "int",
        "group_size": 32,
        "strategy": "group",
        "symmetric": true
      }
    }
  },
  "format": "pack-quantized",
  "quant_method": "compressed-tensors",
  "quantization_status": "compressed"
}

Key parameters:

INT4 symmetric: 4-bit integer weights with symmetric quantization (no zero-point)
group_size=32: Per-32-element scaling groups (fine-grained, high quality)
pack-quantized format: Packed INT4 weights for efficient storage and Marlin kernel execution
Expert merging via REAM: Router-weighted saliency scores from a multi-domain calibration dataset determine which experts to merge; the routed expert count drops by 25% as low-saliency experts are fused into the remaining ones
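The INT4 symmetric, group_size=32 scheme listed above can be sketched in plain Python. This is an illustration of the general technique, not the llm-compressor or compressed-tensors implementation: the rounding rule, the choice to map the largest magnitude to 7, and the nibble order and signed encoding used for packing are assumptions, and the on-disk layout of the real format may differ.

```python
GROUP_SIZE = 32
QMAX = 7  # signed 4-bit integers span [-8, 7]; the largest magnitude maps to 7


def quantize_group(weights):
    """Quantize one group of floats to signed 4-bit ints sharing a single scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Symmetric quantization has no zero-point, so dequantization is one multiply."""
    return [v * scale for v in q]


def pack_int4(q):
    """Pack eight signed 4-bit values into each 32-bit word, low nibble first."""
    if len(q) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    words = []
    for i in range(0, len(q), 8):
        word = 0
        for j, v in enumerate(q[i:i + 8]):
            word |= (v & 0xF) << (4 * j)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover the signed 4-bit values."""
    out = []
    for word in words:
        for j in range(8):
            nibble = (word >> (4 * j)) & 0xF
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


weights = [(i - 16) / 16 for i in range(GROUP_SIZE)]
q, scale = quantize_group(weights)
packed = pack_int4(q)
restored = dequantize_group(unpack_int4(packed), scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(packed), "words per group, max abs error", max_error)
```

One group of 32 weights packs into four 32-bit words plus one scale. If the scale is stored in 16 bits (an assumption), the cost is about 4 + 16/32 = 4.5 bits per weight, which is why a small group size trades some storage for accuracy.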

Calibration Dataset

The REAM expert merging and INT4 quantization calibration used a multi-domain composite dataset of ~1,280 batches (batch_size=4, ~5,120 samples, model_max_length=2048):

Source	Domain	Batches	HF ID / Path
Mixture-of-Thoughts (code subset)	SDLC code reasoning traces	512	`open-r1/Mixture-of-Thoughts[code]`
Magicoder-Evol-Instruct	Code instruction (SDLC)	256	`ise-uiuc/Magicoder-Evol-Instruct-110K`
Slovak SFT	Slovak language instruction	256	Local: `slovak_sft_256.jsonl`
OpenR1 Math	Mathematical reasoning	256	Local: `openr1_math_256.jsonl`
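The totals quoted above follow directly from the table. The short check below recomputes them and the share of each source; it uses only the batch counts and batch size stated in this section, and the token figure is an upper bound that assumes every sample reaches model_max_length.

```python
BATCH_SIZE = 4
MODEL_MAX_LENGTH = 2048

batches = {
    "open-r1/Mixture-of-Thoughts[code]": 512,
    "ise-uiuc/Magicoder-Evol-Instruct-110K": 256,
    "slovak_sft_256.jsonl": 256,
    "openr1_math_256.jsonl": 256,
}

total_batches = sum(batches.values())
total_samples = total_batches * BATCH_SIZE
max_tokens = total_samples * MODEL_MAX_LENGTH  # upper bound, not a measurement
shares = {name: count / total_batches for name, count in batches.items()}

print(total_batches, total_samples, max_tokens)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Code data therefore makes up 60% of the calibration mix (40% reasoning traces plus 20% instruction data), with Slovak and math at 20% each.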

Quality Benchmarks

Quality was measured on wikitext-2-raw-v1 (test split) in 512-token chunks.

WikiText-2 Perplexity (ctx=512)

Metric	Value
PPL	5.0325
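For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition together with fixed-size chunking. It is a minimal illustration, not the evaluation script used for this card: the card does not say whether chunks overlap or how a short final chunk is handled, so non-overlapping chunks are assumed, and the log-probabilities would come from the model in a real run.

```python
import math


def chunked(tokens, size=512):
    """Split a token sequence into consecutive, non-overlapping chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over all scored tokens (natural log)."""
    nll = -sum(token_logprobs)
    return math.exp(nll / len(token_logprobs))


# Sanity check: a model that assigns probability 1/5 to every token has PPL 5.
uniform = [math.log(1 / 5)] * 100
print(perplexity(uniform))

# Chunking 1,200 tokens yields two full chunks and one partial chunk.
print([len(c) for c in chunked(list(range(1200)))])
```

Read this way, a perplexity of 5.0325 corresponds to an average loss of about 1.62 nats per token.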

Throughput Benchmarks

All throughput benchmarks measured on 4× H200 NVL (4× 141 GB, tensor-parallel=4, expert-parallel enabled).

Single Request (concurrency=1)

20 requests, max_tokens=1024, temperature=0.9.

Metric	Value
Aggregate throughput	69.4 tok/s
Per-request throughput	min 63.7 / avg 69.4 / max 70.1 tok/s
Average latency	14.76 s
Min / Max latency	14.60 s / 16.08 s
Completion tokens	min 1024 / avg 1024 / max 1024
Success rate	20/20 (100%)

16 Concurrent Requests

320 total requests, max_tokens=1024, temperature=0.9.

Metric	Value
Aggregate throughput	590.2 tok/s
Per-request throughput	min 11.2 / avg 37.0 / max 48.5 tok/s
Average latency	20.76 s
Min / Max latency	1.21 s / 31.30 s
Completion tokens	min 40 / avg 781 / max 1024
Success rate	320/320 (100%)
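The throughput figures relate to token counts and latencies as sketched below. The exact definitions used by the benchmark tool are not stated in this card, so this assumes per-request throughput is completion tokens divided by request latency, and aggregate throughput is total completion tokens divided by wall-clock time. Both helper names are hypothetical.

```python
def per_request_throughput(completion_tokens, latency_s):
    """Tokens per second for a single request."""
    return completion_tokens / latency_s


def aggregate_throughput(records, wall_clock_s):
    """Total tokens per second; records is a list of (completion_tokens, latency_s)."""
    return sum(tokens for tokens, _ in records) / wall_clock_s


# Single-request run: 20 sequential requests, 1024 tokens each, 14.76 s average.
single = [(1024, 14.76)] * 20
single_rate = per_request_throughput(1024, 14.76)
single_aggregate = aggregate_throughput(single, wall_clock_s=20 * 14.76)
print(round(single_rate, 1), round(single_aggregate, 1))

# 16-concurrent run: the wall-clock time implied by the reported numbers.
implied_wall_clock_s = (320 * 781) / 590.2
print(round(implied_wall_clock_s))
```

With one request in flight the two numbers coincide, which matches the single-request table. For the concurrent run the reported aggregate implies a wall-clock time of roughly 423 s; this is only approximate because the average completion length is rounded.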

Hardware Requirements

GPU	VRAM	Quantization	Notes
4× H200 NVL	141 GB each	INT4 compressed-tensors	Target hardware — full 256K context + KV cache
8× H100 / H200	80 GB (H100) or 141 GB (H200) each	—	Use full model instead
8× B200	—	—	Use full model instead
2× H200 NVL	141 GB each	INT4	Insufficient VRAM

Minimum: 4× GPU with ≥120 GB VRAM per GPU for this quantized model.

Users with 8× H200, B200, or larger setups: the full unquantized moonshotai/Kimi-K2.6 will deliver better quality. This model exists because the full model cannot run on 4× H200 with usable KV cache.

INT4 pack-quantized requires CUDA GPU sm80+ (Ampere and newer). This model uses custom MoE linear layers with INT4 weights + BF16 activations.

When to Use This Model

Your Hardware	Recommendation
4× H200 (141 GB each)	✅ This model — full 256K context with KV cache
8× H200 / 8× H100	❌ Use full moonshotai/Kimi-K2.6 instead
B200 / 8× B200	❌ Use full moonshotai/Kimi-K2.6 instead
2× H200	❌ Insufficient VRAM

This model exists because the full Kimi-K2.6 (~555 GB, 384 experts) cannot run on 4× H200 with any usable KV cache. REAM expert merging + INT4 quantization reduces the model to ~422 GB, leaving enough VRAM for KV cache at full 262,144-token context. If your hardware can accommodate the full model, use it — quality will be higher.
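A back-of-the-envelope estimate shows how much KV cache one full-length sequence might need. Treat this strictly as a sketch under stated assumptions: the KV LoRA rank of 512 comes from the Model Details table, but the 64-wide RoPE key component and the count of 61 decoder layers are assumptions based on the DeepSeek-V3-style architecture, and one byte per element assumes an FP8 KV cache with no extra scale metadata. How the cache is laid out across tensor-parallel ranks depends on the serving engine and is not modelled here.

```python
KV_LORA_RANK = 512        # from the Model Details table
ROPE_HEAD_DIM = 64        # ASSUMPTION: decoupled RoPE key width, DeepSeek-V3 style
NUM_LAYERS = 61           # ASSUMPTION: total decoder layers
BYTES_PER_ELEMENT = 1     # ASSUMPTION: FP8 KV cache, no per-token scale overhead
CONTEXT_TOKENS = 262_144


def kv_bytes_per_token():
    """Compressed latent plus RoPE key, stored once per layer."""
    return (KV_LORA_RANK + ROPE_HEAD_DIM) * NUM_LAYERS * BYTES_PER_ELEMENT


def kv_bytes_for_context(tokens):
    """KV cache bytes for one sequence of the given length."""
    return kv_bytes_per_token() * tokens


per_token = kv_bytes_per_token()
full_context = kv_bytes_for_context(CONTEXT_TOKENS)
print(per_token, "bytes per token")
print(f"{full_context / 1e9:.2f} GB per full-length sequence")
```

Under these assumptions a single 262,144-token sequence needs on the order of 9 GB of cache, which is why the few GB of headroom left by the full model on 4× H200 is not workable.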

Usage with vLLM

Tested with: vllm/vllm-openai:v0.20.0-cu130-ubuntu2404 on 4× H200 NVL

Docker Deployment

docker run -d --name vllm-kimi \
  --gpus all \
  --shm-size=16g \
  --network host \
  --ipc host \
  --pid host \
  --restart=unless-stopped \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_P2P_LEVEL=NVL \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_NET_DISABLE=1 \
  -e NCCL_SHM_DISABLE=0 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -v /path/to/model:/model \
  vllm/vllm-openai:v0.20.0-cu130-ubuntu2404 \
    /model \
    --served-model-name kimi-k2.6-ream-25 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --disable-custom-all-reduce \
    --gpu-memory-utilization 0.92 \
    --max-model-len 262144 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 8 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --kv-cache-dtype fp8_e4m3 \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 8000

Example vLLM Configuration (YAML)

This configuration is deployed and verified on 4× H200 NVL:

# -- Model & Server ----------------------------------------------------------
model: /model
host: "0.0.0.0"
port: 8000
served_model_name: "kimi-k2.6-ream-25"
trust_remote_code: true
tensor_parallel_size: 4
enable_expert_parallel: true

# -- Quantization ------------------------------------------------------------
quantization: compressed-tensors

# -- Data Type ---------------------------------------------------------------
dtype: bfloat16

# -- Context & Batching ------------------------------------------------------
max_model_len: 262144
max_num_batched_tokens: 8192
max_num_seqs: 8
enable_chunked_prefill: true

# -- Memory ------------------------------------------------------------------
gpu_memory_utilization: 0.92
enable_prefix_caching: true
kv_cache_dtype: fp8_e4m3

# -- Tool Calling & Reasoning ------------------------------------------------
enable_auto_tool_choice: true
tool_call_parser: kimi_k2
reasoning_parser: kimi_k2

# -- NCCL / Multi-GPU --------------------------------------------------------
disable_custom_all_reduce: true

# -- Misc --------------------------------------------------------------------
enforce_eager: false

Inference Test

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"kimi-k2.6-ream-25","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
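The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the Docker example is listening on localhost:8000 and returns the usual OpenAI-style chat completion body. `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "kimi-k2.6-ream-25"  # must match --served-model-name


def build_chat_request(prompt, max_tokens=10):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, max_tokens=10, timeout=60):
    """Send the request and return the assistant message content."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Requires the vLLM server to be running:
# print(chat("What is 2+2? One word."))
```

With a reasoning parser enabled, a very small max_tokens budget may be consumed by reasoning output before any answer text is produced, so the returned content can be empty; raising max_tokens is the simplest remedy.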

Transformers / Python

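from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "gratex/Kimi-K2.6-REAM-25"
# Loading requires enough combined GPU memory for the ~422 GB checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [{"role": "user", "content": "What is 2+2?"}]
# return_tensors must name a framework ("pt"), and return_dict=True is
# needed so the result can be unpacked into generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))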

Files in This Repository

File	Description
`model-*.safetensors`	Quantized model weights (INT4 LM + BF16 vision)
`config.json`	Model configuration with `quantization_config`
`tokenizer.json`	Vocabulary file
`tokenizer_config.json`	Tokenizer configuration
`generation_config.json`	Default generation parameters
`preprocessor_config.json`	Image preprocessor configuration
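After downloading, the quantization settings can be checked against the values documented in this card. The sketch below assumes `quantization_config` sits at the top level of `config.json` and has the same shape as the JSON shown under Quantization Configuration; if it is nested elsewhere (for example under a text sub-config), the lookup needs adjusting. `check_quantization_config` is a hypothetical helper.

```python
import json

EXPECTED_WEIGHTS = {
    "num_bits": 4,
    "type": "int",
    "group_size": 32,
    "strategy": "group",
    "symmetric": True,
}


def check_quantization_config(config):
    """Return a list of mismatches; an empty list means every check passed."""
    problems = []
    qc = config.get("quantization_config", {})
    if qc.get("quant_method") != "compressed-tensors":
        problems.append(f"quant_method is {qc.get('quant_method')!r}")
    if qc.get("format") != "pack-quantized":
        problems.append(f"format is {qc.get('format')!r}")
    for group_name, group in qc.get("config_groups", {}).items():
        weights = group.get("weights", {})
        for key, want in EXPECTED_WEIGHTS.items():
            if weights.get(key) != want:
                problems.append(f"{group_name}.weights.{key} is {weights.get(key)!r}")
    return problems


# Typical use, with the path to the downloaded file:
# with open("config.json") as f:
#     print(check_quantization_config(json.load(f)) or "quantization config OK")
```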

License

This quantized model is released under the Modified MIT License, following the base model's license.

The base model moonshotai/Kimi-K2.6 is licensed under Modified MIT. See LICENSE for the full license text.

Citation

If you use this model in your research or production systems, please cite:

@misc{kimi-k2.6-ream-25,
  title  = {Kimi K2.6 REAM-25% INT4 Quantization},
  author = {Gratex International},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/gratex/Kimi-K2.6-REAM-25}},
  note   = {REAM expert merge + INT4 quantization, 4× H200 deployment}
}

Acknowledgments

This quantization was produced using hardware and infrastructure provided by Gratex International, a.s.

Base Model: moonshotai/Kimi-K2.6
Expert Merging: REAM (Router-weighted Expert Activation Merging)
Quantization Tool: llm-compressor
Quantization Format: compressed-tensors
Deployment Engine: vLLM

Paper: REAM: Merging Improves Pruning of Experts in LLMs (2604.04356, published Apr 6)