---
license: mit
---

## Model Overview

**Kimi-K2-Instruct-eagle3** is a specialized draft model designed to accelerate inference for the Kimi-K2-Instruct ecosystem using the **EAGLE3 (Extrapolation Algorithm for Greater Language-model Efficiency)** framework. With EAGLE3, Kimi-K2-Instruct achieves up to **1.8× peak throughput** versus the base model, accelerating generation across all 7 benchmarks, from +24% on MT-Bench to +80% on Math500 (configured with bs=8, steps=3, topk=1, num_draft_tokens=4).

Built on the **Llama architecture**, this model acts as a highly efficient drafter. It was trained on **1.4 million high-quality samples** from the **Open-PerfectBlend** dataset, ensuring strict alignment with the teacher model's distribution.

This model serves as a general-purpose English instruction follower with strong capabilities in:

* **Conversation**
* **Mathematical Reasoning**
* **Code Generation**

## Efficient Download Guide

To minimize download time and storage usage, note the function of each file in the repository:

* **For inference**: you only need to download `config.json` and `model.safetensors`.
* **For continued training**: `training_state.pt` contains optimizer states for resuming training. If you only intend to run inference, you can skip this file.

## Performance & Acceleration

The core value of this EAGLE model is its ability to predict multiple future tokens that are then verified by the base model in a single forward pass. Higher acceptance lengths mean fewer base-model passes per generated token, and therefore lower latency.

**Average Token Acceptance Lengths (MLA):**

| Benchmark | Average Acceptance Length |
| :--- | :--- |
| **HumanEval** (Code) | **3.372** |
| **GSM8K** (Math) | **3.165** |
| **Math500** (Complex Math) | **3.490** |

These metrics demonstrate robust acceleration across diverse and complex domains.
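To build intuition for how acceptance length relates to latency, the sketch below turns the table's numbers into a rough speedup estimate. It assumes each verification pass of the base model yields `avg_acceptance_length` tokens instead of 1, with drafting cost folded into a single `draft_overhead` factor; this is a simplification, not the benchmark methodology used for the reported figures.

```python
# Naive speedup estimate from average acceptance length.
# Assumption: one base-model forward pass per verification step, yielding
# avg_acceptance_length tokens instead of 1; draft_overhead is the relative
# cost of drafting per step (hypothetical knob, not a measured value).
def naive_speedup(avg_acceptance_length: float, draft_overhead: float = 0.0) -> float:
    # Base-model forward passes shrink by the acceptance factor,
    # discounted by whatever the draft model itself costs.
    return avg_acceptance_length / (1.0 + draft_overhead)

acceptance_lengths = {"HumanEval": 3.372, "GSM8K": 3.165, "Math500": 3.490}
for name, acc in acceptance_lengths.items():
    print(f"{name}: ~{naive_speedup(acc, draft_overhead=0.5):.2f}x estimated speedup")
```

With zero overhead this gives the theoretical ceiling (e.g. ~3.4× on Math500); real end-to-end gains, such as the reported 1.8× peak throughput, are lower because drafting and verification are not free.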
![1](https://hackmd.io/_uploads/ryP6cBLXbg.png)
![2](https://hackmd.io/_uploads/S1Da5BLmbl.png)
![3](https://hackmd.io/_uploads/S1v65HIm-e.png)

## Quick Start

### Requirements

- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+

### Installation

```bash
pip install sglang==0.5.6
```

### Inference with SGLang

```bash
python3 -m sglang.launch_server \
  --model-path /models/Kimi-K2-Instruct \
  --host 0.0.0.0 --port 30012 \
  --trust-remote-code \
  --attention-backend flashinfer \
  --mem-fraction-static 0.9 \
  --tp-size 8 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path AQ-MedAI/Kimi-K2-Instruct-eagle3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```

## Training Data

The model was trained on **1.4 million samples** from the **Open-PerfectBlend** dataset. Data selection prioritizes high-quality instruction-following scenarios to maximize the draft model's predictive accuracy relative to the base model.

## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{kimik2eagle3,
  title={Kimi-K2-Instruct-eagle3: Accelerating Instruction Following with EAGLE},
  author={Ant AQ Team},
  year={2025},
}
```
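Once the server is running, it can be queried through SGLang's OpenAI-compatible chat endpoint. The sketch below only constructs the request payload; the endpoint path, the served model name, and the example prompt are assumptions (the port matches the launch command above), so verify them against your SGLang version before use.

```python
import json

# SGLang exposes an OpenAI-style route; port 30012 matches the launch command
# above. The exact path and served model name are assumptions to verify.
url = "http://localhost:30012/v1/chat/completions"

payload = {
    "model": "Kimi-K2-Instruct",  # served model name (assumption)
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
    "temperature": 0.0,
}

print(json.dumps(payload, indent=2))

# To actually send the request against a running server:
#   import requests
#   resp = requests.post(url, json=payload, timeout=120)
#   print(resp.json()["choices"][0]["message"]["content"])
```

Speculative decoding is transparent to the client: the request looks identical to one against the base model, and the draft/verify loop happens entirely server-side.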