| --- |
| base_model: moonshotai/Kimi-K2.5 |
| language: |
| - en |
| license: other |
| license_name: kimi-k2.5-license |
| license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE |
| tags: |
| - quantization |
| - compressed-tensors |
| - gsq |
| - 2-bit |
| - moe |
| - multimodal |
| --- |
| |
| # Kimi-K2.5 — 2-bit GSQ Quantization |
|
|
| This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ**, a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference. |
|
|
| > **Note: simulated quantization.** The weights were optimized for 2-bit precision by GSQ, but they are serialized into a 4-bit packed integer format (`int32`, 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. Each weight takes one of only 4 distinct levels (as true 2-bit weights would), but the on-disk and in-memory representation is 4-bit, so this checkpoint offers no memory or storage saving over INT4. |
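To make the storage point concrete, here is a minimal sketch of packing eight 4-bit codes into one 32-bit word. It assumes the first value goes into the lowest-order nibble; the actual bit layout and any zero-point offset used by compressed-tensors may differ. `pack_nibbles` and `unpack_nibbles` are hypothetical helper names, not library functions.

```python
def pack_nibbles(values):
    """Pack eight unsigned 4-bit codes into one 32-bit word (first value in the lowest nibble)."""
    assert len(values) == 8 and all(0 <= v < 16 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_nibbles(word):
    """Recover the eight 4-bit codes from a 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


# Codes restricted to 4 levels, as a 2-bit quantizer would produce.
codes = [3, 0, 1, 2, 2, 1, 0, 3]
word = pack_nibbles(codes)
restored = unpack_nibbles(word)

bits_needed = 2 * len(codes)   # information content of eight 2-bit codes
bits_stored = 4 * len(codes)   # what the 4-bit container actually occupies
print(restored, bits_needed, bits_stored)
```

Each code only needs 2 bits, yet occupies a 4-bit slot, which is why the container is twice as large as a true 2-bit layout would be.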
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |---|---| |
| | Base model | [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) | |
| | Architecture | MoE multimodal LLM (DeepSeek-V3-style MoE) | |
| | Transformer layers | 61 | |
| | Routed experts | 384 (8 active per token) | |
| | Hidden size | 7168 | |
| | Context length | 262,144 tokens (256K) | |
| | Total parameters | ~547B | |
| | Quantization | 2-bit GSQ (stored as INT4-packed via compressed-tensors) | |
| | Quantized layers | Expert FFN weights, layers 1–60 | |
| | Group size | 128 | |
| | Calibration dataset | [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) | |
| | Weight format | compressed-tensors, `pack-quantized` | |
| | Disk size | ~511 GB | |
|
|
| ## Results |
|
|
| ### Benchmark Results (lm-evaluation-harness) |
|
|
| | Benchmark | Metric | Baseline (BF16) | GSQ 2-bit | Δ | |
| |---|---|---|---|---| |
| | GSM8K | exact_match (strict) | 94.01 | 92.57 | -1.44 | |
| | ARC-Challenge | acc_norm | 70.14 | 62.97 | -7.17 | |
| | ARC-Easy | acc_norm | 88.80 | 85.10 | -3.70 | |
| | PIQA | acc_norm | 86.29 | 82.37 | -3.92 | |
| | WinoGrande | acc | 80.82 | 76.95 | -3.87 | |
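The Δ column is an absolute difference in points. As a small sketch, the relative drop can be derived from the same numbers; the scores are copied from the table above, and the variable names are illustrative only.

```python
# (baseline BF16, GSQ 2-bit) scores copied from the benchmark table.
scores = {
    "GSM8K": (94.01, 92.57),
    "ARC-Challenge": (70.14, 62.97),
    "ARC-Easy": (88.80, 85.10),
    "PIQA": (86.29, 82.37),
    "WinoGrande": (80.82, 76.95),
}

# Absolute change in points and relative change in percent of the baseline.
deltas = {name: round(q - b, 2) for name, (b, q) in scores.items()}
relative = {name: 100.0 * (q - b) / b for name, (b, q) in scores.items()}

largest_drop = min(deltas, key=deltas.get)
smallest_drop = max(deltas, key=deltas.get)

for name in scores:
    print(f"{name:14s} delta={deltas[name]:+.2f}  relative={relative[name]:+.2f}%")
```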
|
|
| ### Perplexity (WikiText-2) |
|
|
| Evaluated on a 128-sample held-out split during quantization. Perplexity was measured every 6 layers as quantization progressed; the table shows a subset of those checkpoints: |
|
|
| | Checkpoint | WikiText-2 PPL | |
| |---|---| |
| | Dense baseline | 1.734 | |
| | After layer 6 | 1.734 | |
| | After layer 12 | 1.733 | |
| | After layer 24 | 1.733 | |
| | After layer 36 | 1.735 | |
| | After layer 48 | 1.741 | |
| | After layer 60 (final) | **1.749** | |
|
|
| The final 2-bit quantized model retains perplexity within 0.015 of the dense baseline (< 1% relative degradation). |
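The "< 1%" figure follows directly from the table. A quick check, using only the perplexity values listed above:

```python
# WikiText-2 perplexities copied from the table above.
ppl = {
    "dense": 1.734,
    "layer_6": 1.734,
    "layer_12": 1.733,
    "layer_24": 1.733,
    "layer_36": 1.735,
    "layer_48": 1.741,
    "layer_60": 1.749,
}

baseline = ppl["dense"]
absolute_gap = ppl["layer_60"] - baseline
relative_gap_pct = 100.0 * absolute_gap / baseline

print(f"absolute gap: {absolute_gap:.3f}")
print(f"relative gap: {relative_gap_pct:.2f}%")
```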
|
|
| ## Quantization Details |
|
|
| This model was quantized using **GSQ**, a learned post-training quantization method. Quantization was applied independently to each transformer layer using 4,096 calibration samples of sequence length 4,096 from the [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) dataset, with group size 128. |
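GSQ is a learned method and is not reproduced here. As a rough illustration of what "2-bit with group size 128" means, the sketch below applies plain round-to-nearest quantization with a per-group min/max grid. This is a swapped-in baseline technique, not the GSQ algorithm, and all function names are hypothetical.

```python
import math


def fake_quantize_group(weights, bits=2):
    """Round-to-nearest onto 2**bits evenly spaced levels spanning the group's range."""
    levels = 2 ** bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (levels - 1) or 1.0  # avoid a zero scale for constant groups
    codes = [min(levels - 1, max(0, round((w - lo) / scale))) for w in weights]
    return [lo + c * scale for c in codes]


def fake_quantize_row(row, group_size=128, bits=2):
    """Quantize one weight row group by group; each group gets its own scale and offset."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(fake_quantize_group(row[start:start + group_size], bits))
    return out


# Toy row with two groups of 128 values.
row = [math.sin(0.1 * i) for i in range(256)]
restored = fake_quantize_row(row)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"max abs error: {max_error:.4f}")
```

Within each group of 128 values only four distinct dequantized values appear, which is the property the note at the top of this card refers to.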
|
|
| Only the MoE expert feed-forward weights (`gate_proj`, `up_proj`, `down_proj`) in layers 1–60 are quantized. The following components are kept in original precision: |
| - Attention projections (`self_attn`) |
| - Embeddings and the LM head |
| - Layer norms |
| - The shared expert |
| - Layer 0's dense MLP |
| - All vision tower and multimodal projector weights |
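The selection rule above can be sketched as a filter over parameter names. The name pattern below (`layers.<i>.mlp.experts.<j>.<proj>.weight`) is an assumption based on DeepSeek-V3-style checkpoints, not something this card specifies, and `is_quantized` is a hypothetical helper.

```python
import re

# Assumed naming scheme for routed-expert weights; the real checkpoint may differ.
_EXPERT_WEIGHT = re.compile(
    r"(?:^|\.)layers\.(\d+)\.mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)\.weight$"
)


def is_quantized(param_name, first_layer=1, last_layer=60):
    """Return True only for routed-expert FFN weights in layers first_layer..last_layer."""
    match = _EXPERT_WEIGHT.search(param_name)
    return bool(match) and first_layer <= int(match.group(1)) <= last_layer


examples = [
    "language_model.model.layers.5.mlp.experts.17.up_proj.weight",   # routed expert
    "language_model.model.layers.5.mlp.shared_experts.up_proj.weight",  # shared expert
    "language_model.model.layers.0.mlp.gate_proj.weight",            # layer 0 dense MLP
    "language_model.model.layers.5.self_attn.q_proj.weight",         # attention
]
for name in examples:
    print(is_quantized(name), name)
```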
|
|
| ## Usage |
|
|
| This model requires **vLLM** for inference. Because Kimi-K2.5 uses a custom model architecture (`kimi_k25`), you must pass `--trust-remote-code`. |
|
|
| Although the MoE expert weights are quantized to 2-bit, they are stored in a 4-bit container, and the attention, embedding, and norm weights remain in bfloat16. As a result, the on-disk size is ~511 GB and the model still requires substantial GPU memory. In our testing, serving required **8× NVIDIA GH200 96 GB GPUs** (2 nodes, tensor parallelism 8). |
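A back-of-the-envelope memory budget helps explain the hardware requirement. The sketch below assumes the weights shard evenly across GPUs, that the in-memory footprint is close to the ~511 GB on-disk size, and that `--gpu-memory-utilization 0.85` (from the serving command in this card) caps what vLLM may use. It ignores activation memory, CUDA context overhead, and any GB/GiB difference, so treat the result as a rough lower bound.

```python
import math

NUM_GPUS = 8
GPU_MEMORY_GB = 96             # per-GPU memory of the GH200 variant named above
CHECKPOINT_GB = 511            # approximate on-disk size from this card
GPU_MEMORY_UTILIZATION = 0.85  # value used in the serving command

total_gb = NUM_GPUS * GPU_MEMORY_GB
budget_gb = total_gb * GPU_MEMORY_UTILIZATION
headroom_gb = budget_gb - CHECKPOINT_GB  # left for KV cache and activations

# Smallest GPU count whose utilization-capped memory holds the weights alone.
min_gpus_for_weights = math.ceil(CHECKPOINT_GB / (GPU_MEMORY_GB * GPU_MEMORY_UTILIZATION))

print(f"total={total_gb} GB, budget={budget_gb:.1f} GB, headroom={headroom_gb:.1f} GB")
print(f"minimum GPUs for weights alone: {min_gpus_for_weights}")
```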
|
|
| ### Installation |
|
|
| ```bash |
| pip install vllm |
| ``` |
|
|
| ### Serving with vLLM |
|
|
| ```bash |
| vllm serve daslab-testing/Kimi-K2.5-2bit-GSQ \ |
| --trust-remote-code \ |
| --tensor-parallel-size 8 \ |
| --distributed-executor-backend ray \ |
| --tokenizer-mode hf \ |
| --mm-encoder-tp-mode data \ |
| --max-model-len 4096 \ |
| --gpu-memory-utilization 0.85 \ |
| --max-num-seqs 4 |
| ``` |
|
|
| **Flag notes:** |
| - `--tokenizer-mode hf`: Required to prevent garbled output during extended serving sessions (vLLM issue [#35718](https://github.com/vllm-project/vllm/issues/35718)). |
| - `--mm-encoder-tp-mode data`: Required for Kimi-K2.5's vision encoder — ViT dimensions are not evenly divisible by the tensor-parallel size, which causes cuBLAS errors without this flag. |
| - `--max-model-len 4096`: Adjust upward if GPU memory permits; 4096 is what was used during our testing. |
| - `--distributed-executor-backend ray`: Required for multi-node serving. |
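Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the default address `http://localhost:8000`; adjust `BASE_URL` if you passed `--host` or `--port`. The helper names are illustrative, and the sampling values are borrowed from the offline example in this card.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default vLLM host and port
MODEL = "daslab-testing/Kimi-K2.5-2bit-GSQ"


def build_chat_request(user_text, max_tokens=512):
    """Build a POST request for the OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,  # keep prompt + output within --max-model-len
        "temperature": 0.6,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is 2+2?")
# print(send(request))  # uncomment once the server is up
```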
|
|
| ### Offline inference with vLLM |
|
|
| ```python |
| from vllm import LLM, SamplingParams |
| |
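| # These arguments mirror the serving flags above; adjust them to your hardware. |
| llm = LLM( |
|     model="daslab-testing/Kimi-K2.5-2bit-GSQ", |
|     trust_remote_code=True,            # custom `kimi_k25` architecture |
|     tensor_parallel_size=8, |
|     distributed_executor_backend="ray",  # only needed for multi-node setups; requires a running Ray cluster |
|     tokenizer_mode="hf", |
|     mm_encoder_tp_mode="data", |
|     max_model_len=4096,                # prompt + generated tokens must fit in this window |
|     gpu_memory_utilization=0.85, |
| ) |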
| |
| sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024) |
| |
| outputs = llm.generate(["Explain the concept of entropy in thermodynamics."], sampling_params) |
| print(outputs[0].outputs[0].text) |
| ``` |
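With `max_model_len=4096` and `max_tokens=1024`, the prompt and the generated tokens share one context window. A minimal sketch of the resulting prompt budget follows; the helper names are hypothetical, and the token count would come from the model's tokenizer.

```python
def max_prompt_tokens(max_model_len=4096, max_new_tokens=1024):
    """Number of prompt tokens that still leaves room for max_new_tokens of output."""
    if max_new_tokens >= max_model_len:
        raise ValueError("max_new_tokens must be smaller than max_model_len")
    return max_model_len - max_new_tokens


def fits_context(prompt_token_count, max_model_len=4096, max_new_tokens=1024):
    """True if a prompt of this length leaves the full generation budget available."""
    return prompt_token_count <= max_prompt_tokens(max_model_len, max_new_tokens)


budget = max_prompt_tokens()
print(budget, fits_context(3000), fits_context(3500))
```

If prompts routinely exceed this budget, either raise `max_model_len` (memory permitting) or lower `max_tokens`.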
|
|
| ### Chat template |
|
|
| Kimi-K2.5 uses its own tokenizer and chat template. Use the tokenizer bundled with this repository: |
|
|
| ```python |
| from transformers import AutoTokenizer |
| |
| tokenizer = AutoTokenizer.from_pretrained( |
| "daslab-testing/Kimi-K2.5-2bit-GSQ", |
| trust_remote_code=True, |
| ) |
| |
| messages = [{"role": "user", "content": "What is 2+2?"}] |
| prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
| ``` |
|
|
| ## Limitations |
|
|
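| - This is a research quantization, not a production-ready release. Expect some quality degradation relative to the full-precision model, particularly on tasks requiring precise arithmetic or complex multi-step reasoning. On the benchmarks reported above, the measured drop ranges from 1.44 points (GSM8K) to 7.17 points (ARC-Challenge). |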
| - Vision/multimodal capabilities have not been evaluated post-quantization (only the language model weights were quantized). |
| - The model uses a custom architecture; some inference frameworks other than vLLM may not support it without modification. |
|
|
| ## License |
|
|
| This model is derived from [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) and is subject to the same [license terms](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE). Please review those terms before use. |
|
|