Kimi K2.6 REAM-25% — INT4 Quantization

This is a 25% expert-merged, INT4 quantized variant of moonshotai/Kimi-K2.6, a ~1T parameter Mixture-of-Experts (MoE) multimodal model with native vision capabilities.

Why this model exists: The full Kimi-K2.6 model (384 experts per MoE layer, ~555 GB) cannot fit on 4× H200 (4× 141 GB = 564 GB) with any meaningful KV cache, making the full 262,144-token context window unusable. This variant merges 25% of routed experts via REAM and quantizes to INT4, reducing model size to ~422 GB — fitting on 4× H200 with sufficient VRAM headroom for KV cache at full 256K context. All throughput benchmarks below were measured on 4× H200 NVL.
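The sizing argument above can be checked with simple arithmetic. The sketch below uses only figures quoted in this card (141 GB per GPU, ~555 GB and ~422 GB checkpoints) together with the `gpu_memory_utilization` of 0.92 from the deployment section. It ignores activation memory, CUDA graphs, and allocator overhead, so real headroom will be smaller; the function name is illustrative.

```python
def kv_headroom_gb(model_gb: float, gpus: int = 4, gb_per_gpu: float = 141.0,
                   utilization: float = 0.92) -> float:
    """Rough VRAM (GB) left for KV cache after the weights are loaded.

    Assumes the weights are spread evenly across GPUs and that nothing
    else competes for the memory budget, which is optimistic.
    """
    usable_gb = gpus * gb_per_gpu * utilization
    return usable_gb - model_gb


full_model = kv_headroom_gb(555.0)    # full Kimi-K2.6 checkpoint
ream_variant = kv_headroom_gb(422.0)  # this REAM-25 INT4 variant
print(f"full model: {full_model:.1f} GB, REAM-25: {ream_variant:.1f} GB")
```

At 0.92 utilization the full checkpoint leaves a negative budget, while this variant leaves roughly 97 GB before runtime overheads.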

If you have access to larger hardware (8× H200, B200, etc.), you are better off using the full moonshotai/Kimi-K2.6 model instead. This quantized variant is specifically for 4× H200 deployments where the full model cannot run.

The base model has 384 routed experts per MoE layer. REAM (Router-weighted Expert Activation Merging) reduces this to 288 by merging 25% of them (96 experts per layer), guided by saliency scores computed on a multi-domain calibration dataset. The merged model is then quantized to INT4 (group size 32, symmetric, pack-quantized compressed-tensors format), with the full vision encoder preserved in BF16.
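To make the idea of saliency-guided merging concrete, here is a toy sketch. It is not the REAM algorithm and not the code used to produce this checkpoint: the nearest-neighbour grouping, the saliency-weighted average, and the flattened per-expert weight vectors are all simplifying assumptions chosen for illustration.

```python
def merge_experts(weights, saliency, merge_fraction=0.25):
    """Toy merge: fold the lowest-saliency experts into their nearest kept expert.

    weights:  list of equal-length float lists, one flattened vector per expert
    saliency: list of non-negative floats, one score per expert
    Returns the weight vectors of the kept experts, in original order.
    """
    n = len(weights)
    n_drop = int(n * merge_fraction)
    order = sorted(range(n), key=lambda i: saliency[i])  # lowest saliency first
    dropped, kept = order[:n_drop], sorted(order[n_drop:])

    merged = {k: list(weights[k]) for k in kept}
    mass = {k: saliency[k] for k in kept}
    for d in dropped:
        # Nearest kept expert, measured on the original (unmerged) weights.
        target = min(
            kept,
            key=lambda k: sum((a - b) ** 2 for a, b in zip(weights[k], weights[d])),
        )
        total = mass[target] + saliency[d]
        if total == 0:
            continue  # nothing to weight by; leave the kept expert unchanged
        merged[target] = [
            (mass[target] * a + saliency[d] * b) / total
            for a, b in zip(merged[target], weights[d])
        ]
        mass[target] = total
    return [merged[k] for k in kept]


# 384 experts with a 25% merge leaves 288, matching the counts in this card.
toy = merge_experts([[float(i)] for i in range(384)], [1.0] * 384)
print(len(toy))
```

A real implementation would also have to update the router so that tokens previously sent to a merged-away expert are routed to its replacement.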

Model Details

| Property | Value |
|---|---|
| Base Model | moonshotai/Kimi-K2.6 |
| Quantization Method | llm-compressor oneshot (INT4, group_size=32, symmetric) |
| Weight Precision | INT4 (4-bit, symmetric) |
| Activation Precision | BF16 (weight-only quantization) |
| Group Size | 32 |
| Quantization Library | llm-compressor |
| Format | compressed-tensors (pack-quantized) |
| Architecture | KimiK25ForConditionalGeneration |
| Model Type | deepseek_v3 MoE |
| Total Parameters | ~1T |
| Active Parameters | 32B |
| MoE Layers | 60 routed + 1 shared |
| Routed Experts (base) | 384 |
| Routed Experts (merged) | 288 (25% reduction) |
| Shared Experts | 1 |
| Attention Heads | 64 |
| Key/Value Heads | 64 |
| KV LoRA Rank | 512 |
| Hidden Size | 7168 |
| MoE Intermediate Size | 2048 |
| Vocabulary Size | 163,840 |
| Context Window | 262,144 tokens |
| Vision Encoder | MoonViT (27 layers, hidden_size=1152, patch_size=14) |
| Quantized Components | Text decoder Linear layers only |
| Preserved in BF16 | Full vision encoder (all 27 layers) |
| Model Size | ~422 GB |
| Expert Merging | REAM (Router-weighted Expert Activation Merging, 25% merge) |
| License | Modified MIT |

Quantization Details

Quantization Configuration

{
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "weights": {
        "num_bits": 4,
        "type": "int",
        "group_size": 32,
        "strategy": "group",
        "symmetric": true
      }
    }
  },
  "format": "pack-quantized",
  "quant_method": "compressed-tensors",
  "quantization_status": "compressed"
}

Key parameters:

  • INT4 symmetric: 4-bit integer weights with symmetric quantization (no zero-point)
  • group_size=32: Per-32-element scaling groups (fine-grained, high quality)
  • pack-quantized format: Packed INT4 weights for efficient storage and Marlin kernel execution
  • Expert merging via REAM: Router-weighted saliency scores from a multi-domain calibration dataset determine which experts to merge; 25% of routed experts are removed by merging low-saliency experts into the remaining ones
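The group-wise symmetric scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the scale rule (max absolute value mapped to 7) and round-to-nearest are common choices, but the exact scale computation, calibration, and nibble-packing layout used by llm-compressor and compressed-tensors may differ.

```python
def quantize_group(values):
    """Symmetric INT4 quantization of one group: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0  # no zero-point: 0.0 maps to code 0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def quantize(weights, group_size=32):
    """Split a flat weight list into groups, each with its own scale."""
    if len(weights) % group_size:
        raise ValueError("weight count must be a multiple of group_size")
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reconstruct approximate weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]


weights = [((i * 37) % 64 - 32) / 10.0 for i in range(64)]
groups = quantize(weights)
error = max(abs(a - b) for a, b in zip(weights, dequantize(groups)))
print(f"{len(groups)} groups, max abs error {error:.4f}")
```

Smaller groups mean more scales to store but a tighter fit per group. If each scale is stored as a 16-bit value (an assumption here), group size 32 costs about 4.5 bits per weight.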

Calibration Dataset

The REAM expert merging and INT4 quantization calibration used a multi-domain composite dataset of ~1,280 batches (batch_size=4, ~5,120 samples, model_max_length=2048):

| Source | Domain | Batches | HF ID / Path |
|---|---|---|---|
| Mixture-of-Thoughts (code subset) | SDLC code reasoning traces | 512 | open-r1/Mixture-of-Thoughts[code] |
| Magicoder-Evol-Instruct | Code instruction (SDLC) | 256 | ise-uiuc/Magicoder-Evol-Instruct-110K |
| Slovak SFT | Slovak language instruction | 256 | Local: slovak_sft_256.jsonl |
| OpenR1 Math | Mathematical reasoning | 256 | Local: openr1_math_256.jsonl |

Quality Benchmarks

All benchmarks on wikitext-2-raw-v1 (test split), 512-token chunks.

WikiText-2 Perplexity (ctx=512)

| Metric | Value |
|---|---|
| PPL | 5.0325 |
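For readers unfamiliar with the metric, perplexity over fixed-size chunks is the exponential of the mean per-token negative log-likelihood. The sketch below shows that computation only. The card does not specify the evaluation script, so the non-overlapping chunking and the `score_chunk` callable (a hypothetical function returning per-token NLLs in nats for one chunk) are assumptions.

```python
import math


def chunked_perplexity(token_ids, score_chunk, chunk_size=512):
    """Perplexity = exp(mean NLL) over all tokens scored in non-overlapping chunks.

    score_chunk(chunk) is a hypothetical callable that runs the model on one
    chunk and returns a list of per-token negative log-likelihoods (in nats).
    """
    total_nll, count = 0.0, 0
    for start in range(0, len(token_ids), chunk_size):
        nlls = score_chunk(token_ids[start:start + chunk_size])
        total_nll += sum(nlls)
        count += len(nlls)
    if count == 0:
        raise ValueError("no tokens were scored")
    return math.exp(total_nll / count)


# Stand-in scorer: every token gets NLL = ln(5), so perplexity is 5.
fake_scorer = lambda chunk: [math.log(5.0)] * len(chunk)
print(chunked_perplexity(list(range(2000)), fake_scorer))
```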

Throughput Benchmarks

All throughput benchmarks measured on 4× H200 NVL (4× 141 GB, tensor-parallel=4, expert-parallel enabled).

Single Request (concurrency=1)

20 requests, max_tokens=1024, temperature=0.9.

| Metric | Value |
|---|---|
| Aggregate throughput | 69.4 tok/s |
| Per-request throughput | min 63.7 / avg 69.4 / max 70.1 tok/s |
| Average latency | 14.76 s |
| Min / Max latency | 14.60 s / 16.08 s |
| Completion tokens | min 1024 / avg 1024 / max 1024 |
| Success rate | 20/20 (100%) |
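The single-request figures are internally consistent: because every request produced exactly 1024 tokens, per-request throughput is simply tokens divided by latency. A quick check using the reported latencies (the helper name is illustrative):

```python
def tokens_per_second(completion_tokens: int, latency_s: float) -> float:
    """Per-request decode throughput from token count and end-to-end latency."""
    return completion_tokens / latency_s


for label, latency in [("avg", 14.76), ("slowest", 16.08), ("fastest", 14.60)]:
    print(f"{label}: {tokens_per_second(1024, latency):.1f} tok/s")
```

The slowest request corresponds to the minimum throughput and the fastest to the maximum. At concurrency 1, aggregate and average per-request throughput coincide.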

16 Concurrent Requests

320 total requests, max_tokens=1024, temperature=0.9.

| Metric | Value |
|---|---|
| Aggregate throughput | 590.2 tok/s |
| Per-request throughput | min 11.2 / avg 37.0 / max 48.5 tok/s |
| Average latency | 20.76 s |
| Min / Max latency | 1.21 s / 31.30 s |
| Completion tokens | min 40 / avg 781 / max 1024 |
| Success rate | 320/320 (100%) |

Hardware Requirements

| GPU | VRAM | Quantization | Notes |
|---|---|---|---|
| 4× H200 NVL | 141 GB each | INT4 compressed-tensors | Target hardware: full 256K context + KV cache |
| 8× H100 / H200 | 141 GB each (H200) | - | Use full model instead |
| 8× B200 | - | - | Use full model instead |
| 2× H200 NVL | 141 GB each | INT4 | Insufficient VRAM |

Minimum: 4× GPU with ≥120 GB VRAM each for this quantized model (the ~422 GB of weights alone need more than 105 GB per GPU before any KV cache).

Users with 8× H200, B200, or larger setups: the full unquantized moonshotai/Kimi-K2.6 will deliver better quality. This model exists because the full model cannot run on 4× H200 with usable KV cache.

INT4 pack-quantized requires CUDA GPU sm80+ (Ampere and newer). This model uses custom MoE linear layers with INT4 weights + BF16 activations.

When to Use This Model

| Your Hardware | Recommendation |
|---|---|
| 4× H200 (141 GB each) | ✅ This model: full 256K context with KV cache |
| 8× H200 / 8× H100 | ❌ Use full moonshotai/Kimi-K2.6 instead |
| B200 / 8× B200 | ❌ Use full moonshotai/Kimi-K2.6 instead |
| 2× H200 | ❌ Insufficient VRAM |

This model exists because the full Kimi-K2.6 (~555 GB, 384 experts) cannot run on 4× H200 with any usable KV cache. REAM expert merging + INT4 quantization reduces the model to ~422 GB, leaving enough VRAM for KV cache at full 262,144-token context. If your hardware can accommodate the full model, use it — quality will be higher.
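A rough estimate of the KV cache cost per sequence helps put the headroom in perspective. The sketch below assumes an MLA-style cache that stores one compressed latent of size `kv_lora_rank` (512, from the table above) plus a small rotary key per token per layer. The rotary dimension of 64 and the layer count of 61 are assumptions borrowed from DeepSeek-V3-style configurations, not values stated in this card; check `config.json` before relying on them. One byte per element corresponds to an FP8 KV cache.

```python
def mla_kv_bytes_per_token(kv_lora_rank: int = 512, rope_dim: int = 64,
                           num_layers: int = 61, bytes_per_elem: int = 1) -> int:
    """Approximate KV cache bytes per token for an MLA-style attention cache.

    rope_dim and num_layers are assumed values, not taken from this card.
    """
    return (kv_lora_rank + rope_dim) * num_layers * bytes_per_elem


def kv_cache_gb(context_tokens: int, **kwargs) -> float:
    """KV cache size in GB (10^9 bytes) for one sequence of the given length."""
    return context_tokens * mla_kv_bytes_per_token(**kwargs) / 1e9


print(f"{mla_kv_bytes_per_token()} bytes/token")
print(f"{kv_cache_gb(262_144):.2f} GB for one full-context sequence")
```

How this cache is sharded or replicated across tensor-parallel ranks depends on the serving engine, so the per-GPU cost may be higher than the total divided by four.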

Usage with vLLM

Tested with: vllm/vllm-openai:v0.20.0-cu130-ubuntu2404 on 4× H200 NVL

Docker Deployment

docker run -d --name vllm-kimi \
  --gpus all \
  --shm-size=16g \
  --network host \
  --ipc host \
  --pid host \
  --restart=unless-stopped \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_P2P_LEVEL=NVL \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_NET_DISABLE=1 \
  -e NCCL_SHM_DISABLE=0 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -v /path/to/model:/model \
  vllm/vllm-openai:v0.20.0-cu130-ubuntu2404 \
    /model \
    --served-model-name kimi-k2.6-ream-25 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --disable-custom-all-reduce \
    --gpu-memory-utilization 0.92 \
    --max-model-len 262144 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 8 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --kv-cache-dtype fp8_e4m3 \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 8000

Example vLLM Configuration (YAML)

This configuration is deployed and verified on 4× H200 NVL:

# -- Model & Server ----------------------------------------------------------
model: /model
host: "0.0.0.0"
port: 8000
served_model_name: "kimi-k2.6-ream-25"
trust_remote_code: true
tensor_parallel_size: 4
enable_expert_parallel: true

# -- Quantization ------------------------------------------------------------
quantization: compressed-tensors

# -- Data Type ---------------------------------------------------------------
dtype: bfloat16

# -- Context & Batching ------------------------------------------------------
max_model_len: 262144
max_num_batched_tokens: 8192
max_num_seqs: 8
enable_chunked_prefill: true

# -- Memory ------------------------------------------------------------------
gpu_memory_utilization: 0.92
enable_prefix_caching: true
kv_cache_dtype: fp8_e4m3

# -- Tool Calling & Reasoning ------------------------------------------------
enable_auto_tool_choice: true
tool_call_parser: kimi_k2
reasoning_parser: kimi_k2

# -- NCCL / Multi-GPU --------------------------------------------------------
disable_custom_all_reduce: true

# -- Misc --------------------------------------------------------------------
enforce_eager: false

Inference Test

# Chat completion request against the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"kimi-k2.6-ream-25","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
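The same request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names are illustrative, and the response parsing assumes the usual OpenAI-compatible `choices[0].message.content` shape.

```python
import json
import urllib.request


def build_chat_request(prompt, model="kimi-k2.6-ream-25", max_tokens=10,
                       base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120.0, **kwargs):
    """Send the prompt to a running server and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server from the Docker section running, `chat("What is 2+2? One word.")` sends the same payload as the curl example. Because a reasoning parser is enabled, a very small `max_tokens` may be spent on reasoning before any answer text is produced, so the returned content can be empty.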

Transformers / Python

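from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "gratex/Kimi-K2.6-REAM-25"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [{"role": "user", "content": "What is 2+2?"}]
# Render the chat template to tensors; return_dict=True yields input_ids and attention_mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))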

Files in This Repository

| File | Description |
|---|---|
| model-*.safetensors | Quantized model weights (INT4 LM + BF16 vision) |
| config.json | Model configuration with quantization_config |
| tokenizer.json | Vocabulary file |
| tokenizer_config.json | Tokenizer configuration |
| generation_config.json | Default generation parameters |
| preprocessor_config.json | Image preprocessor configuration |

License

This quantized model is released under the Modified MIT License, following the base model's license.

The base model moonshotai/Kimi-K2.6 is licensed under Modified MIT. See LICENSE for the full license text.

Citation

If you use this model in your research or production systems, please cite:

@misc{kimi-k2.6-ream-25,
  title  = {Kimi K2.6 REAM-25\% INT4 Quantization},
  author = {Gratex International},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/gratex/Kimi-K2.6-REAM-25}},
  note   = {REAM expert merge + INT4 quantization, 4$\times$ H200 deployment}
}

Acknowledgments

This quantization was produced using hardware and infrastructure provided by Gratex International, a.s.


Base Model: moonshotai/Kimi-K2.6
Expert Merging: REAM (Router-weighted Expert Activation Merging)
Quantization Tool: llm-compressor
Quantization Format: compressed-tensors
Deployment Engine: vLLM
