---
base_model: moonshotai/Kimi-K2.5
language:
- en
license: other
license_name: kimi-k2.5-license
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
tags:
- quantization
- compressed-tensors
- gsq
- 2-bit
- moe
- multimodal
---
# Kimi-K2.5 — 2-bit GSQ Quantization
This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ**, a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.
> **Note (simulated quantization):** GSQ optimized the weights at 2-bit precision, but compressed-tensors serializes them into a 4-bit packed integer format (eight 4-bit values per `int32` element). At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values use only 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit, so this checkpoint saves no memory or storage beyond INT4.
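
The packing described in the note can be illustrated with a small NumPy sketch. It is not the compressed-tensors implementation: the function name `pack_int4` and the bit order (first code in the lowest 4 bits) are assumptions made for illustration only.

```python
import numpy as np

def pack_int4(codes: np.ndarray) -> np.ndarray:
    """Pack 4-bit codes (0..15) into int32 words, eight codes per word."""
    if codes.size % 8 != 0:
        raise ValueError("number of codes must be a multiple of 8")
    if codes.min() < 0 or codes.max() > 15:
        raise ValueError("codes must fit in 4 bits")
    nibbles = codes.astype(np.uint32).reshape(-1, 8)
    # Assumed bit order: the first code occupies the lowest 4 bits of the word.
    shifts = np.arange(8, dtype=np.uint32) * 4
    words = np.bitwise_or.reduce(nibbles << shifts, axis=1)
    return words.view(np.int32)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover the 4-bit codes from int32 words (inverse of pack_int4)."""
    words = packed.view(np.uint32).reshape(-1, 1)
    shifts = np.arange(8, dtype=np.uint32) * 4
    return ((words >> shifts) & 0xF).astype(np.uint8).reshape(-1)

# 2-bit codes take only the values 0..3, yet each still occupies a 4-bit slot.
codes = np.array([0, 1, 2, 3] * 4)
packed = pack_int4(codes)
print(packed.nbytes, "bytes stored;", codes.size * 2 // 8, "bytes would suffice at true 2-bit")
```

The round trip through `unpack_int4` returns the original codes, which shows that the 4-bit container wastes the upper two bits of every slot rather than losing information.
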
## Model Details
| Property | Value |
|---|---|
| Base model | [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) |
| Architecture | MoE multimodal LLM (DeepSeek-V3-style MoE) |
| Transformer layers | 61 |
| Routed experts | 384 (8 active per token) |
| Hidden size | 7168 |
| Context length | 262,144 tokens (256K) |
| Total parameters | ~547B |
| Quantization | 2-bit GSQ (stored as INT4-packed via compressed-tensors) |
| Quantized layers | Expert FFN weights, layers 1–60 |
| Group size | 128 |
| Calibration dataset | [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) |
| Weight format | compressed-tensors, `pack-quantized` |
| Disk size | ~511 GB |
## Results
### Benchmark Results (lm-evaluation-harness)
| Benchmark | Metric | Baseline (BF16) | GSQ 2-bit | Δ |
|---|---|---|---|---|
| GSM8K | exact_match (strict) | 94.01 | 92.57 | -1.44 |
| ARC-Challenge | acc_norm | 70.14 | 62.97 | -7.17 |
| ARC-Easy | acc_norm | 88.80 | 85.10 | -3.70 |
| PIQA | acc_norm | 86.29 | 82.37 | -3.92 |
| WinoGrande | acc | 80.82 | 76.95 | -3.87 |
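
As a quick check on the table above, the Δ column can be recomputed from the two score columns, along with the relative drop per benchmark. The scores are copied from the table; the helper name `score_drop` is only illustrative.

```python
# (baseline BF16, GSQ 2-bit) scores copied from the benchmark table.
results = {
    "GSM8K": (94.01, 92.57),
    "ARC-Challenge": (70.14, 62.97),
    "ARC-Easy": (88.80, 85.10),
    "PIQA": (86.29, 82.37),
    "WinoGrande": (80.82, 76.95),
}

def score_drop(baseline: float, quantized: float) -> tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    delta = quantized - baseline
    return delta, 100.0 * delta / baseline

for name, (baseline, quantized) in results.items():
    delta, relative = score_drop(baseline, quantized)
    print(f"{name:14s} delta={delta:+.2f}  relative={relative:+.2f}%")

# Benchmark with the largest absolute drop.
worst = min(results, key=lambda name: score_drop(*results[name])[0])
```
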
### Perplexity (WikiText-2)
Perplexity was evaluated on a 128-sample held-out split every 6 layers as quantization progressed; the table lists a subset of those checkpoints:
| Checkpoint | WikiText-2 PPL |
|---|---|
| Dense baseline | 1.734 |
| After layer 6 | 1.734 |
| After layer 12 | 1.733 |
| After layer 24 | 1.733 |
| After layer 36 | 1.735 |
| After layer 48 | 1.741 |
| After layer 60 (final) | **1.749** |
The final 2-bit quantized model retains perplexity within 0.015 of the dense baseline (< 1% relative degradation).
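
The relative figure quoted above follows directly from the table. A short sketch of the arithmetic, using only the perplexity values listed there:

```python
# WikiText-2 perplexities copied from the table above.
dense_ppl = 1.734
checkpoints = {
    "layer 6": 1.734,
    "layer 12": 1.733,
    "layer 24": 1.733,
    "layer 36": 1.735,
    "layer 48": 1.741,
    "layer 60": 1.749,
}

def relative_change(ppl: float, baseline: float = dense_ppl) -> float:
    """Perplexity change relative to the dense baseline, in percent."""
    return 100.0 * (ppl - baseline) / baseline

for name, ppl in checkpoints.items():
    print(f"after {name:8s} {ppl:.3f}  ({relative_change(ppl):+.2f}%)")

final_abs = checkpoints["layer 60"] - dense_ppl      # absolute increase
final_rel = relative_change(checkpoints["layer 60"])  # relative increase, percent
```
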
## Quantization Details
This model was quantized using **GSQ**, a learned post-training quantization method. Quantization was applied independently to each transformer layer using 4,096 calibration samples of sequence length 4,096 from the [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) dataset, with group size 128.
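
To make "2-bit with group size 128" concrete, here is a minimal sketch of group-wise quantization to 4 levels. It uses plain round-to-nearest with a per-group minimum and scale, which is an assumption chosen for illustration; it is not GSQ, which learns its quantization, and the function names are hypothetical.

```python
import numpy as np

GROUP_SIZE = 128  # group size stated in this README
LEVELS = 4        # 2 bits -> 4 distinct levels per group

def quantize_groups(weights: np.ndarray):
    """Round-to-nearest quantization with one (scale, offset) pair per group."""
    groups = weights.reshape(-1, GROUP_SIZE)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    # Guard against constant groups, where hi == lo.
    scale = np.maximum(hi - lo, 1e-8) / (LEVELS - 1)
    codes = np.clip(np.rint((groups - lo) / scale), 0, LEVELS - 1).astype(np.uint8)
    return codes, scale, lo

def dequantize_groups(codes: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    """Map integer codes back to approximate weight values."""
    return codes * scale + lo

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 256)).astype(np.float32)  # toy weight matrix
codes, scale, lo = quantize_groups(weights)
restored = dequantize_groups(codes, scale, lo).reshape(weights.shape)
max_error = float(np.abs(weights - restored).max())
```

Each group of 128 weights shares one scale and one offset, so the reconstruction error within a group is bounded by half that group's scale.
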
Only the MoE expert feed-forward weights (`gate_proj`, `up_proj`, `down_proj`) in layers 1–60 are quantized. The following components are kept in original precision:
- Attention projections (`self_attn`)
- Embeddings and the LM head
- Layer norms
- The shared expert
- Layer 0's dense MLP
- All vision tower and multimodal projector weights
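
The selection rule above can be written as a predicate over parameter names. The name pattern below is an assumption modeled on DeepSeek-V3-style checkpoints (`layers.<i>.mlp.experts.<j>.<proj>`); the actual tensor names in this repository may differ, so treat it as a sketch.

```python
import re

# Hypothetical naming scheme: "...layers.<layer>.mlp.experts.<expert>.<proj>.<suffix>"
EXPERT_PATTERN = re.compile(
    r"layers\.(\d+)\.mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)\."
)

def is_quantized(param_name: str) -> bool:
    """True for routed-expert FFN weights in layers 1-60, False otherwise."""
    match = EXPERT_PATTERN.search(param_name)
    return match is not None and 1 <= int(match.group(1)) <= 60

examples = [
    "model.layers.5.mlp.experts.17.up_proj.weight",       # routed expert
    "model.layers.5.mlp.shared_experts.gate_proj.weight",  # shared expert
    "model.layers.0.mlp.gate_proj.weight",                 # layer 0 dense MLP
    "model.layers.5.self_attn.q_proj.weight",              # attention
]
for name in examples:
    print(f"{name}: {'quantized' if is_quantized(name) else 'original precision'}")
```
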
## Usage
This model requires **vLLM** for inference. Because Kimi-K2.5 uses a custom model architecture (`kimi_k25`), you must pass `--trust-remote-code`.
Although the MoE expert weights are quantized to 2-bit, they are stored in a 4-bit container, and the attention, embedding, and norm weights remain in bfloat16. The on-disk size is therefore ~511 GB, and the model still requires substantial GPU memory. In our testing, serving needed **8× NVIDIA GH200 96 GB GPUs** (2 nodes, tensor parallelism 8).
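
A rough memory budget shows why this hardware is needed. The sketch assumes that the in-memory weight footprint roughly equals the ~511 GB on-disk size, that weights shard evenly across GPUs, and that the `0.85` GPU memory utilization from the serving command in this README applies; actual usage depends on vLLM internals and will differ.

```python
NUM_GPUS = 8
GPU_MEMORY_GB = 96             # per-GPU memory of the GH200 configuration above
GPU_MEMORY_UTILIZATION = 0.85  # value passed to vLLM in this README
CHECKPOINT_GB = 511            # approximate on-disk size of this checkpoint

total_gb = NUM_GPUS * GPU_MEMORY_GB
budget_gb = total_gb * GPU_MEMORY_UTILIZATION
# Memory left for KV cache, activations, and runtime overhead (estimate only).
headroom_gb = budget_gb - CHECKPOINT_GB
headroom_per_gpu_gb = headroom_gb / NUM_GPUS

print(f"total {total_gb} GB, vLLM budget {budget_gb:.1f} GB")
print(f"headroom {headroom_gb:.1f} GB (~{headroom_per_gpu_gb:.1f} GB per GPU)")
```

Under these assumptions the weights alone take most of the budget, which is consistent with the small `--max-model-len` and `--max-num-seqs` values used in testing.
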
### Installation
```bash
pip install vllm
```
### Serving with vLLM
```bash
vllm serve daslab-testing/Kimi-K2.5-2bit-GSQ \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray \
  --tokenizer-mode hf \
  --mm-encoder-tp-mode data \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 4
```
**Flag notes:**
- `--tokenizer-mode hf`: Required to prevent garbled output during extended serving sessions (vLLM issue [#35718](https://github.com/vllm-project/vllm/issues/35718)).
- `--mm-encoder-tp-mode data`: Required for Kimi-K2.5's vision encoder — ViT dimensions are not evenly divisible by the tensor-parallel size, which causes cuBLAS errors without this flag.
- `--max-model-len 4096`: Adjust upward if GPU memory permits; 4096 is what was used during our testing.
- `--distributed-executor-backend ray`: Required for multi-node serving.
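
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on vLLM's default port 8000 on the local host; the helper names `build_chat_request` and `chat` are illustrative, and the sampling values are simply the ones used elsewhere in this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM port, local host
MODEL = "daslab-testing/Kimi-K2.5-2bit-GSQ"

def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example (requires the server to be running):
# print(chat("Explain the concept of entropy in thermodynamics."))
```

With `--max-model-len 4096`, the prompt and `max_tokens` together must fit within 4,096 tokens.
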
### Offline inference with vLLM
```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint with tensor parallelism across 8 GPUs.
llm = LLM(
    model="daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
    tensor_parallel_size=8,
    tokenizer_mode="hf",
    mm_encoder_tp_mode="data",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
)

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

outputs = llm.generate(["Explain the concept of entropy in thermodynamics."], sampling_params)
print(outputs[0].outputs[0].text)
```
### Chat template
Kimi-K2.5 uses its own tokenizer and chat template. Use the tokenizer bundled with this repository:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/Kimi-K2.5-2bit-GSQ",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```
## Limitations
- This is a research quantization, not a production-ready release. Expect some quality degradation relative to the full-precision model: on the benchmarks above, scores drop by 1.44 (GSM8K) to 7.17 (ARC-Challenge) points. Tasks requiring precise arithmetic or complex multi-step reasoning may be affected more.
- Vision/multimodal capabilities have not been evaluated post-quantization (only the language model weights were quantized).
- The model uses a custom architecture; some inference frameworks other than vLLM may not support it without modification.
## License
This model is derived from [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) and is subject to the same [license terms](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE). Please review those terms before use.