---
base_model: moonshotai/Kimi-K2.5
language:
  - en
license: other
license_name: kimi-k2.5-license
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
tags:
  - quantization
  - compressed-tensors
  - gsq
  - 2-bit
  - moe
  - multimodal
---

# Kimi-K2.5 — 2-bit GSQ Quantization

This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ**, a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.

> **Note — Simulated quantization:** The quantization was optimized at 2-bit precision during training, but the resulting weights are serialized into a 4-bit packed integer format (`int32` with 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values themselves only use 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit — there is no memory or storage saving beyond INT4 in this checkpoint.
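
To make the storage point in the note concrete, here is a minimal sketch of packing eight 4-bit values into one `int32` word. The nibble order (lowest nibble first) and the use of codes 0 to 3 for the four levels are assumptions for illustration; the actual layout used by compressed-tensors may differ.

```python
import numpy as np

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit values (0..15) into int32 words, 8 values per word."""
    assert values.ndim == 1 and values.size % 8 == 0
    assert values.min() >= 0 and values.max() <= 15
    nibbles = values.astype(np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4  # assumed order: lowest nibble first
    packed = np.bitwise_or.reduce(nibbles << shifts, axis=1)
    return packed.view(np.int32)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover the 4-bit values from int32 words."""
    words = packed.view(np.uint32).reshape(-1, 1)
    shifts = np.arange(8, dtype=np.uint32) * 4
    return ((words >> shifts) & 0xF).astype(np.uint8).reshape(-1)

# 64 weights that use only 4 distinct levels (hypothetical codes 0..3)
codes = np.random.default_rng(0).integers(0, 4, size=64)
packed = pack_int4(codes)
assert np.array_equal(unpack_int4(packed), codes)
print(packed.nbytes, "bytes in the INT4 container vs",
      codes.size * 2 // 8, "bytes at true 2-bit")
```

Each weight occupies a 4-bit slot even though it only takes one of four values, which is why the checkpoint is no smaller than an INT4 one.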

## Model Details

| Property | Value |
|---|---|
| Base model | [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) |
| Architecture | MoE multimodal LLM (DeepSeek-V3-style MoE) |
| Transformer layers | 61 |
| Routed experts | 384 (8 active per token) |
| Hidden size | 7168 |
| Context length | 262,144 tokens (256K) |
| Total parameters | ~547B |
| Quantization | 2-bit GSQ (stored as INT4-packed via compressed-tensors) |
| Quantized layers | Expert FFN weights, layers 1–60 |
| Group size | 128 |
| Calibration dataset | [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) |
| Weight format | compressed-tensors, `pack-quantized` |
| Disk size | ~511 GB |

## Results

### Benchmark Results (lm-evaluation-harness)

| Benchmark | Metric | Baseline (BF16) | GSQ 2-bit | Δ |
|---|---|---|---|---|
| GSM8K | exact_match (strict) | 94.01 | 92.57 | -1.44 |
| ARC-Challenge | acc_norm | 70.14 | 62.97 | -7.17 |
| ARC-Easy | acc_norm | 88.80 | 85.10 | -3.70 |
| PIQA | acc_norm | 86.29 | 82.37 | -3.92 |
| WinoGrande | acc | 80.82 | 76.95 | -3.87 |

### Perplexity (WikiText-2)

Evaluated on a 128-sample held-out split during quantization. Perplexity was measured every 6 layers as quantization progressed; the table lists a subset of those checkpoints:

| Checkpoint | WikiText-2 PPL |
|---|---|
| Dense baseline | 1.734 |
| After layer 6 | 1.734 |
| After layer 12 | 1.733 |
| After layer 24 | 1.733 |
| After layer 36 | 1.735 |
| After layer 48 | 1.741 |
| After layer 60 (final) | **1.749** |

The final 2-bit quantized model stays within 0.015 perplexity of the dense baseline (about 0.9% relative degradation).
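
The relative figure follows directly from the first and last rows of the table; a short check, using only the values reported above:

```python
baseline_ppl = 1.734  # dense baseline, from the table
final_ppl = 1.749     # after layer 60, from the table

abs_gap = final_ppl - baseline_ppl
rel_gap = abs_gap / baseline_ppl
print(f"absolute: {abs_gap:.3f}, relative: {rel_gap:.2%}")
```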

## Quantization Details

This model was quantized using **GSQ**, a learned post-training quantization method. Quantization was applied independently to each transformer layer using 4,096 calibration samples of sequence length 4,096 from the [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) dataset, with group size 128.
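
To illustrate what "2-bit with group size 128" means for a weight tensor, the sketch below applies plain round-to-nearest (RTN) quantization with one scale and offset per group of 128 weights. This is not GSQ, which learns the quantization; it is a simple baseline shown only to make the group structure and the four levels concrete. The function name and the asymmetric min/max scheme are assumptions.

```python
import numpy as np

def quantize_groups_rtn(w: np.ndarray, group_size: int = 128, bits: int = 2):
    """Asymmetric round-to-nearest quantization, one scale/offset per group."""
    assert w.size % group_size == 0
    groups = w.reshape(-1, group_size).astype(np.float32)
    qmax = 2**bits - 1                              # 3 for 2-bit: codes 0..3
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax        # guard against constant groups
    codes = np.clip(np.round((groups - lo) / scale), 0, qmax).astype(np.uint8)
    dequant = codes * scale + lo                    # values the model actually sees
    return codes, dequant.reshape(w.shape)

w = np.random.default_rng(0).standard_normal((4, 256)).astype(np.float32)
codes, w_hat = quantize_groups_rtn(w)
print("distinct codes:", np.unique(codes))
print("mean abs error:", float(np.abs(w - w_hat).mean()))
```

Every group of 128 weights is reconstructed from at most four distinct values; a learned method aims to choose those values (and the assignment of weights to them) better than RTN does.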

Only the MoE expert feed-forward weights (`gate_proj`, `up_proj`, `down_proj`) in layers 1–60 are quantized. The following components are kept in original precision:
- Attention projections (`self_attn`)
- Embeddings and the LM head
- Layer norms
- The shared expert
- Layer 0's dense MLP
- All vision tower and multimodal projector weights
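
The selection rule above can be expressed as a filter over parameter names. The sketch below assumes DeepSeek-V3-style names such as `...layers.<i>.mlp.experts.<j>.up_proj.weight`; these names, and the helper `is_quantized`, are hypothetical and may not match the checkpoint's actual keys.

```python
import re

# Assumed naming pattern for routed-expert FFN weights (hypothetical).
_EXPERT_WEIGHT = re.compile(
    r"(?:^|\.)layers\.(\d+)\.mlp\.experts\.\d+\."
    r"(?:gate_proj|up_proj|down_proj)\.weight$"
)

def is_quantized(param_name: str) -> bool:
    """True for routed-expert FFN weights in layers 1-60, False for everything else."""
    match = _EXPERT_WEIGHT.search(param_name)
    return bool(match) and 1 <= int(match.group(1)) <= 60

examples = [
    "language_model.model.layers.5.mlp.experts.17.up_proj.weight",       # routed expert
    "language_model.model.layers.0.mlp.gate_proj.weight",                # layer 0 dense MLP
    "language_model.model.layers.5.mlp.shared_experts.down_proj.weight", # shared expert
    "language_model.model.layers.5.self_attn.o_proj.weight",             # attention
]
for name in examples:
    print(is_quantized(name), name)
```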

## Usage

This model requires **vLLM** for inference. Because Kimi-K2.5 uses a custom model architecture (`kimi_k25`), you must pass `--trust-remote-code`.

The MoE expert weights use only four levels but are stored in a 4-bit container, and the attention, embedding, and norm weights remain in bfloat16, so the on-disk size is ~511 GB and the model still requires substantial GPU memory. In our testing, serving required **8× NVIDIA GH200 96 GB GPUs** (2 nodes, tensor-parallel size 8).
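
A rough memory budget shows why a setup of this size is needed. The sketch uses the ~511 GB checkpoint size and 96 GB per GPU from this card, together with the `--gpu-memory-utilization 0.85` setting from the serving command in this card. It treats the on-disk size as the in-memory weight size and ignores runtime overhead, so it is an estimate only.

```python
import math

num_gpus = 8
gpu_mem_gb = 96      # per-GPU memory, from this card
utilization = 0.85   # fraction of each GPU that vLLM may use
weights_gb = 511     # approximate checkpoint size, from this card

budget_gb = num_gpus * gpu_mem_gb * utilization
headroom_gb = budget_gb - weights_gb  # left for KV cache, activations, etc.
min_gpus = math.ceil(weights_gb / (gpu_mem_gb * utilization))

print(f"usable budget: {budget_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
print(f"GPUs needed just to hold the weights: {min_gpus}")
```

Under these assumptions the weights alone need at least 7 such GPUs, which is consistent with the 8-GPU configuration used in testing.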

### Installation

```bash
pip install vllm
```

### Serving with vLLM

```bash
vllm serve daslab-testing/Kimi-K2.5-2bit-GSQ \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --distributed-executor-backend ray \
    --tokenizer-mode hf \
    --mm-encoder-tp-mode data \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 4
```

**Flag notes:**
- `--tokenizer-mode hf`: Required to prevent garbled output during extended serving sessions (vLLM issue [#35718](https://github.com/vllm-project/vllm/issues/35718)).
- `--mm-encoder-tp-mode data`: Required for Kimi-K2.5's vision encoder — ViT dimensions are not evenly divisible by the tensor-parallel size, which causes cuBLAS errors without this flag.
- `--max-model-len 4096`: Adjust upward if GPU memory permits; 4096 is what was used during our testing.
- `--distributed-executor-backend ray`: Required for multi-node serving.
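
Once the server is running, it can be queried through vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server listens on the default `http://localhost:8000`; the helper names `build_chat_request` and `send` are illustrative, and the sampling values mirror those used elsewhere in this card.

```python
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://localhost:8000",
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": "daslab-testing/Kimi-K2.5-2bit-GSQ",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires a running server):
#   print(send(build_chat_request("What is 2+2?")))
```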

### Offline inference with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
    tensor_parallel_size=8,
    tokenizer_mode="hf",
    mm_encoder_tp_mode="data",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

outputs = llm.generate(["Explain the concept of entropy in thermodynamics."], sampling_params)
print(outputs[0].outputs[0].text)
```

### Chat template

Kimi-K2.5 uses its own tokenizer and chat template. Use the tokenizer bundled with this repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```

## Limitations

- This is a research quantization, not a production-ready release. Expect some quality degradation relative to the full-precision model, particularly on tasks requiring precise arithmetic or complex multi-step reasoning.
- Vision/multimodal capabilities have not been evaluated post-quantization (only the language model weights were quantized).
- The model uses a custom architecture; some inference frameworks other than vLLM may not support it without modification.

## License

This model is derived from [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) and is subject to the same [license terms](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE). Please review those terms before use.