---
base_model: moonshotai/Kimi-K2.5
language:
  - en
license: other
license_name: kimi-k2.5-license
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
tags:
  - quantization
  - compressed-tensors
  - gsq
  - 2-bit
  - moe
  - multimodal
---

# Kimi-K2.5 — 2-bit GSQ Quantization

This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ**, a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.

> **Note (simulated quantization):** The weights were optimized at 2-bit precision during quantization, but they are serialized via compressed-tensors into a 4-bit packed integer format (eight 4-bit values per `int32` element). At inference time, vLLM loads and dequantizes from this 4-bit container. Each weight takes one of only 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit, so this checkpoint offers no memory or storage saving beyond INT4.
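
To make the container layout concrete, the following is a minimal sketch of how eight 4-bit codes can share one 32-bit word while only four distinct code values are ever used. The nibble order (lowest bits first), the use of `uint32` for the shifts, and the helper names `pack_int4` / `unpack_int4` are assumptions for illustration; the actual compressed-tensors `pack-quantized` layout may differ.

```python
import numpy as np

BITS = 4               # container width per value (INT4)
PER_WORD = 32 // BITS  # 8 values per 32-bit word


def pack_int4(codes: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit codes (0..15) into 32-bit words, lowest nibble first."""
    codes = np.asarray(codes, dtype=np.uint32)
    assert codes.size % PER_WORD == 0 and codes.max() < 2**BITS
    shifts = np.arange(PER_WORD, dtype=np.uint32) * BITS
    return np.bitwise_or.reduce(codes.reshape(-1, PER_WORD) << shifts, axis=1)


def unpack_int4(words: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover the 4-bit codes from each word."""
    shifts = np.arange(PER_WORD, dtype=np.uint32) * BITS
    return ((words[:, None] >> shifts) & 0xF).reshape(-1)


# One group of 128 weights whose codes use only 4 levels (2 bits of information).
rng = np.random.default_rng(0)
codes = rng.integers(0, 4, size=128)
packed = pack_int4(codes)
restored = unpack_int4(packed)
```

Here 128 codes occupy 16 words, i.e. 4 bits per weight, even though each code carries only 2 bits of information. A true 2-bit packing would fit 16 codes per word and halve that footprint.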

## Model Details

| Property | Value |
|---|---|
| Base model | [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) |
| Architecture | MoE multimodal LLM (DeepSeek-V3-style MoE) |
| Transformer layers | 61 |
| Routed experts | 384 (8 active per token) |
| Hidden size | 7168 |
| Context length | 262,144 tokens (256K) |
| Total parameters | ~547B |
| Quantization | 2-bit GSQ (stored as INT4-packed via compressed-tensors) |
| Quantized layers | Expert FFN weights, layers 1–60 |
| Group size | 128 |
| Calibration dataset | [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) |
| Weight format | compressed-tensors, `pack-quantized` |
| Disk size | ~511 GB |

## Results

### Benchmark Results (lm-evaluation-harness)

| Benchmark | Metric | Baseline (BF16) | GSQ 2-bit | Δ |
|---|---|---|---|---|
| GSM8K | exact_match (strict) | 94.01 | 92.57 | -1.44 |
| ARC-Challenge | acc_norm | 70.14 | 62.97 | -7.17 |
| ARC-Easy | acc_norm | 88.80 | 85.10 | -3.70 |
| PIQA | acc_norm | 86.29 | 82.37 | -3.92 |
| WinoGrande | acc | 80.82 | 76.95 | -3.87 |

### Perplexity (WikiText-2)

Evaluated on a 128-sample held-out split during quantization. Perplexity was measured every 6 layers as quantization progressed; the table lists a subset of those checkpoints:

| Checkpoint | WikiText-2 PPL |
|---|---|
| Dense baseline | 1.734 |
| After layer 6 | 1.734 |
| After layer 12 | 1.733 |
| After layer 24 | 1.733 |
| After layer 36 | 1.735 |
| After layer 48 | 1.741 |
| After layer 60 (final) | **1.749** |

The final 2-bit quantized model retains perplexity within 0.015 of the dense baseline (< 1% relative degradation).
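
The absolute and relative figures quoted above follow directly from the first and last rows of the table, as this short check shows (the variable names are illustrative only):

```python
baseline_ppl = 1.734  # dense baseline, from the table
final_ppl = 1.749     # after layer 60, from the table

abs_delta = final_ppl - baseline_ppl
rel_delta = abs_delta / baseline_ppl

print(f"absolute: {abs_delta:.3f}, relative: {rel_delta:.2%}")
```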

## Quantization Details

This model was quantized using **GSQ**, a learned post-training quantization method. Quantization was applied independently to each transformer layer using 4,096 calibration samples of sequence length 4,096 from the [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) dataset, with group size 128.
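
As a rough illustration of what "group size 128" and four quantization levels mean, the sketch below quantizes each run of 128 consecutive weights with its own scale and offset. It uses plain asymmetric round-to-nearest, which is **not** GSQ (GSQ learns its quantization); the function names and the min/max scaling rule are assumptions chosen only to keep the example short.

```python
import numpy as np

GROUP_SIZE = 128
LEVELS = 4  # 2-bit: codes 0..3


def quantize_groups(w: np.ndarray):
    """Round-to-nearest per group of 128 weights (a stand-in, NOT GSQ)."""
    g = w.reshape(-1, GROUP_SIZE)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    # One scale and one offset per group; guard against constant groups.
    scale = np.maximum(hi - lo, 1e-12) / (LEVELS - 1)
    codes = np.clip(np.rint((g - lo) / scale), 0, LEVELS - 1).astype(np.uint8)
    return codes, scale, lo


def dequantize_groups(codes, scale, lo, shape):
    """Map codes back to floating-point weights."""
    return (codes * scale + lo).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256))            # toy weight matrix: 8 groups of 128
codes, scale, lo = quantize_groups(w)
w_hat = dequantize_groups(codes, scale, lo, w.shape)
max_err = np.abs(w - w_hat).max()
```

With this scheme the reconstruction error of any weight is bounded by half of its group's scale; a learned method aims to do better than this baseline at the same bit width.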

Only the MoE expert feed-forward weights (`gate_proj`, `up_proj`, `down_proj`) in layers 1–60 are quantized. The following components are kept in original precision:
- Attention projections (`self_attn`)
- Embeddings and the LM head
- Layer norms
- The shared expert
- Layer 0's dense MLP
- All vision tower and multimodal projector weights
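
The selection rule above can be expressed as a simple name filter. The parameter names below (for example `layers.5.mlp.experts.17.up_proj`) are hypothetical, modeled on DeepSeek-V3-style checkpoints; inspect the actual checkpoint keys before relying on this pattern.

```python
import re

# Assumed naming scheme: layers.<L>.mlp.experts.<E>.<proj>
EXPERT_PATTERN = re.compile(
    r"layers\.(\d+)\.mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)\b"
)


def is_quantized(param_name: str) -> bool:
    """True for routed-expert FFN projections in layers 1..60."""
    m = EXPERT_PATTERN.search(param_name)
    return m is not None and 1 <= int(m.group(1)) <= 60


examples = [
    "language_model.model.layers.5.mlp.experts.17.up_proj.weight",     # routed expert
    "language_model.model.layers.5.mlp.shared_experts.up_proj.weight",  # shared expert
    "language_model.model.layers.0.mlp.gate_proj.weight",               # layer 0 dense MLP
    "language_model.model.layers.5.self_attn.o_proj.weight",            # attention
]
flags = [is_quantized(name) for name in examples]
```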

## Usage

This model requires **vLLM** for inference. Because Kimi-K2.5 uses a custom model architecture (`kimi_k25`), you must pass `--trust-remote-code`.

Although the MoE expert weights use only 2-bit levels, they are stored in a 4-bit container, and the attention, embedding, and norm weights remain in bfloat16. As a result, the on-disk size is ~511 GB and the model still requires substantial GPU memory. In our testing, serving required **8× NVIDIA GH200 96 GB GPUs** (2 nodes, tensor parallelism 8).
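
A back-of-the-envelope estimate shows why eight 96 GB GPUs are a reasonable fit. The sketch assumes the weights shard evenly across GPUs and that the in-memory footprint matches the on-disk size; neither is exact (some tensors may be replicated, and the runtime adds overhead), so treat the numbers as a rough guide. The value 0.85 is the `--gpu-memory-utilization` setting used in this card's serving command.

```python
disk_size_gb = 511             # on-disk size from the model card
num_gpus = 8
gpu_memory_gb = 96
gpu_memory_utilization = 0.85  # fraction of each GPU that vLLM may use

# Assumption: weights are split evenly across the tensor-parallel ranks.
weights_per_gpu = disk_size_gb / num_gpus
budget_per_gpu = gpu_memory_gb * gpu_memory_utilization
headroom = budget_per_gpu - weights_per_gpu  # left for KV cache, activations, etc.

print(f"weights/GPU: {weights_per_gpu:.1f} GB, "
      f"budget/GPU: {budget_per_gpu:.1f} GB, "
      f"headroom: {headroom:.1f} GB")
```

Under these assumptions only a modest share of each GPU remains for the KV cache, which is consistent with the short `--max-model-len` and small `--max-num-seqs` used in testing.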

### Installation

```bash
pip install vllm
```

### Serving with vLLM

```bash
vllm serve daslab-testing/Kimi-K2.5-2bit-GSQ \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --distributed-executor-backend ray \
    --tokenizer-mode hf \
    --mm-encoder-tp-mode data \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 4
```

**Flag notes:**
- `--tokenizer-mode hf`: Required to prevent garbled output during extended serving sessions (vLLM issue [#35718](https://github.com/vllm-project/vllm/issues/35718)).
- `--mm-encoder-tp-mode data`: Required for Kimi-K2.5's vision encoder — ViT dimensions are not evenly divisible by the tensor-parallel size, which causes cuBLAS errors without this flag.
- `--max-model-len 4096`: Adjust upward if GPU memory permits; 4096 is what was used during our testing.
- `--distributed-executor-backend ray`: Required for multi-node serving.
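
Once the server is up, it can be queried through vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server listens on the default `http://localhost:8000`; `build_chat_request` and `chat` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request

# Assumption: default host and port of `vllm serve`.
BASE_URL = "http://localhost:8000/v1"
MODEL = "daslab-testing/Kimi-K2.5-2bit-GSQ"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # prompt + output must fit in --max-model-len
        "temperature": 0.6,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is 2+2?")` against a running server returns the generated reply as a string.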

### Offline inference with vLLM

```python
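from vllm import LLM, SamplingParams

# Engine settings mirror the `vllm serve` command above.
llm = LLM(
    model="daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,  # custom `kimi_k25` architecture
    tensor_parallel_size=8,
    distributed_executor_backend="ray",  # needed when the 8 GPUs span multiple nodes
    tokenizer_mode="hf",
    mm_encoder_tp_mode="data",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

# generate() returns one RequestOutput per prompt; print the first completion.
outputs = llm.generate(["Explain the concept of entropy in thermodynamics."], sampling_params)
print(outputs[0].outputs[0].text)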
```

### Chat template

Kimi-K2.5 uses its own tokenizer and chat template. Use the tokenizer bundled with this repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is 2+2?"}]
# Render the conversation to a prompt string that ends with the assistant turn marker.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```

## Limitations

- This is a research quantization, not a production-ready release. Expect some quality degradation relative to the full-precision model, particularly on tasks requiring precise arithmetic or complex multi-step reasoning. On the benchmarks above, accuracy drops range from 1.44 points (GSM8K) to 7.17 points (ARC-Challenge).
- Vision/multimodal capabilities have not been evaluated post-quantization (only the language model weights were quantized).
- The model uses a custom architecture; some inference frameworks other than vLLM may not support it without modification.

## License

This model is derived from [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) and is subject to the same [license terms](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE). Please review those terms before use.