---
license: mit
---

## Model Overview

**Kimi-K2-Instruct-eagle3** is a specialized draft model designed to accelerate inference for the Kimi-K2-Instruct ecosystem using the **EAGLE3 (Extrapolation Algorithm for Greater Language-model Efficiency)** framework. With EAGLE3, Kimi-K2-Instruct achieves up to **1.8× peak throughput** versus the base model, accelerating generation across all 7 benchmarks, from +24% on MT-Bench to +80% on Math500 (configured with bs=8, steps=3, topk=1, num_draft_tokens=4).

Built on the **Llama architecture**, this model acts as a highly efficient drafter. It was trained on **1.4 million high-quality samples** from the **Open-PerfectBlend** dataset, ensuring strict alignment with the teacher model's distribution.

This model serves as a general-purpose English instruction follower with strong capabilities in:

* **Conversation**
* **Mathematical Reasoning**
* **Code Generation**

## Efficient Download Guide

To minimize download time and storage usage, note the function of each file in the repository:

* **For inference**: you only need to download `config.json` and `model.safetensors`.
* **For continued training**: `training_state.pt` contains optimizer states for resuming training. If you only intend to run inference, you can skip this file.

## Performance & Acceleration

The core value of this EAGLE model is its ability to predict multiple future tokens that are then verified by the base model in a single forward pass. Higher acceptance lengths mean fewer base-model passes per generated token, and therefore lower latency.

**Average Token Acceptance Lengths (MLA):**

| Benchmark | Average Acceptance Length |
| :--- | :--- |
| **HumanEval** (Code) | **3.372** |
| **GSM8K** (Math) | **3.165** |
| **Math500** (Complex Math) | **3.490** |

These metrics demonstrate robust acceleration across diverse and complex domains.
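To build intuition for how acceptance length relates to latency, the sketch below turns the table's numbers into a rough speedup estimate. It assumes each verification pass of the base model yields `avg_acceptance_length` tokens instead of 1, with drafting cost folded into a single `draft_overhead` factor; this is a simplification, not the benchmark methodology used for the reported figures.

```python
# Naive speedup estimate from average acceptance length.
# Assumption: one base-model forward pass per verification step, yielding
# avg_acceptance_length tokens instead of 1; draft_overhead is the relative
# cost of drafting per step (hypothetical knob, not a measured value).
def naive_speedup(avg_acceptance_length: float, draft_overhead: float = 0.0) -> float:
    # Base-model forward passes shrink by the acceptance factor,
    # discounted by whatever the draft model itself costs.
    return avg_acceptance_length / (1.0 + draft_overhead)

acceptance_lengths = {"HumanEval": 3.372, "GSM8K": 3.165, "Math500": 3.490}
for name, acc in acceptance_lengths.items():
    print(f"{name}: ~{naive_speedup(acc, draft_overhead=0.5):.2f}x estimated speedup")
```

With zero overhead this gives the theoretical ceiling (e.g. ~3.4× on Math500); real end-to-end gains, such as the reported 1.8× peak throughput, are lower because drafting and verification are not free.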
![1](https://hackmd.io/_uploads/ryP6cBLXbg.png)
![2](https://hackmd.io/_uploads/S1Da5BLmbl.png)
![3](https://hackmd.io/_uploads/S1v65HIm-e.png)

## Quick Start

### Requirements

- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+

### Installation

```bash
pip install sglang==0.5.6
```

### Inference with SGLang

```bash
python3 -m sglang.launch_server \
  --model-path /models/Kimi-K2-Instruct \
  --host 0.0.0.0 --port 30012 \
  --trust-remote-code \
  --attention-backend flashinfer \
  --mem-fraction-static 0.9 \
  --tp-size 8 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path AQ-MedAI/Kimi-K2-Instruct-eagle3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```

## Training Data

The model was trained on **1.4 million samples** from the **Open-PerfectBlend** dataset. Data selection prioritizes high-quality instruction-following scenarios to maximize the draft model's predictive accuracy relative to the base model.

## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{kimik2eagle3,
  title={Kimi-K2-Instruct-eagle3: Accelerating Instruction Following with EAGLE},
  author={Ant AQ Team},
  year={2025},
}
```
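Once the server is running, it can be queried through SGLang's OpenAI-compatible chat endpoint. The sketch below only constructs the request payload; the endpoint path, the served model name, and the example prompt are assumptions (the port matches the launch command above), so verify them against your SGLang version before use.

```python
import json

# SGLang exposes an OpenAI-style route; port 30012 matches the launch command
# above. The exact path and served model name are assumptions to verify.
url = "http://localhost:30012/v1/chat/completions"

payload = {
    "model": "Kimi-K2-Instruct",  # served model name (assumption)
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
    "temperature": 0.0,
}

print(json.dumps(payload, indent=2))

# To actually send the request against a running server:
#   import requests
#   resp = requests.post(url, json=payload, timeout=120)
#   print(resp.json()["choices"][0]["message"]["content"])
```

Speculative decoding is transparent to the client: the request looks identical to one against the base model, and the draft/verify loop happens entirely server-side.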