Model Overview

kimi-k2.5-eagle3 is an Eagle3 MTP draft model for accelerating inference of Kimi-K2.5, trained with TorchSpec, an online speculative decoding training framework that runs FSDP training and inference concurrently. If you find this draft model useful, please give our project TorchSpec a 🌟 on GitHub.

Training data is available at lightseekorg/kimi-mtp-dataset.

Training Setup

  • Cluster: 4 nodes × 8 H200 (32 GPUs total)
  • Training: 2 nodes (16 GPUs), FSDP
  • Inference: 2 nodes (16 GPUs), Engine (TP=8 per node)
  • Duration: ~14 hours per phase

Training ran in two phases, each 20k steps (~300k samples):

  • Phase 1: Regenerated open-perfectblend dataset
  • Phase 2: Mixed dataset (English, VL, Chinese, function-call, agent, creative writing)

All training responses were regenerated by Kimi-K2.5 via Engine to match the base model's exact token distribution.
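This regeneration step can be sketched as a sampling call against an OpenAI-compatible chat-completions endpoint. The endpoint URL, sampling parameters, and helper names below are illustrative assumptions, not the exact settings used for this model:

```python
import json
from urllib import request

def build_regen_payload(messages, model="Kimi-K2.5", temperature=1.0, max_tokens=4096):
    """Build a chat-completions payload that samples a fresh response from
    the base model, so the draft model trains on the base model's own
    token distribution. Parameter values here are assumptions."""
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,  # sample rather than greedy-decode
        "max_tokens": max_tokens,
    }

def regenerate(messages, url="http://localhost:30000/v1/chat/completions"):
    """Send the payload to a hypothetical OpenAI-compatible server and
    return the regenerated assistant response."""
    body = json.dumps(build_regen_payload(messages)).encode()
    req = request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

payload = build_regen_payload([{"role": "user", "content": "Explain FSDP."}])
print(payload["model"])
```

Looping this over every prompt in the source dataset yields responses whose token statistics match the verifier, which is what makes the drafted tokens acceptable at high rates.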

Training Curves

The plots show loss, token acceptance accuracy, and simulated accept_length during training. Both eval sets contain 256 samples drawn from each phase's own training corpus.

Phase 1 (steps 0 → 20k):

Phase 1 training curves

Phase 2 (steps 20k → 40k):

Phase 2 training curves


Performance

The primary metric is accept_length: the average number of tokens accepted per speculation step with topk=1, num_steps=3, num_draft_tokens=4. Higher is better.

Benchmarks were run using SpecForge's bench_eagle3.py. BFCL v3 benchmarks (†) use a custom extension to the original script.
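As a rough mental model, accept_length can be approximated with a toy chain-acceptance calculation. This is an illustrative assumption, not how Eagle3 actually scores tokens (real acceptance is position- and context-dependent): with topk=1 and num_steps=3, up to three drafted tokens plus one token the target model always contributes can be accepted per step, capping accept_length at 4.

```python
def simulated_accept_length(accept_probs):
    """Expected tokens accepted per verification step under a toy model
    where drafted token i is accepted with probability accept_probs[i]
    and verification stops at the first rejection. The target model
    always contributes one token, so the result is at least 1.0."""
    expected, chain = 1.0, 1.0
    for p in accept_probs:
        chain *= p           # probability the chain survives to token i
        expected += chain    # that token's contribution to the expectation
    return expected

# Three drafted tokens at a flat 80% per-token acceptance rate
# (an illustrative number, not a measured one):
print(round(simulated_accept_length([0.8, 0.8, 0.8]), 3))
```

Under this toy model, the accept_length values near 3.3 to 3.8 in the table below correspond to per-token acceptance rates well above 80%.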

accept_length by dataset and method

| Category | Dataset | n | Phase 1 (20k steps) | Phase 2 (40k steps) |
|---|---|---:|---:|---:|
| Dialogue | MTBench | 80 | 2.624 | 2.687 |
| Chinese | CEval | 212 | 1.482 | 2.295 |
| Math | GSM8K | 500 | 3.123 | 3.201 |
| Code | HumanEval | 164 | 3.242 | 3.285 |
| Math | MATH500 | 500 | 3.323 | 3.342 |
| Math | AIME | 30 | 2.972 | 3.033 |
| VL | MMStar | 200 | 2.566 | 2.787 |
| Function Call † | BFCL v3 simple | 400 | 3.729 | 3.798 |
| Function Call † | BFCL v3 multiple | 200 | 3.745 | 3.809 |
| Function Call † | BFCL v3 parallel | 200 | 3.596 | 3.669 |
| Function Call † | BFCL v3 parallel_multiple | 200 | 3.525 | 3.601 |
| Function Call † | BFCL v3 live_simple | 1547 | 3.515 | 3.667 |
| Function Call † | BFCL v3 live_multiple | 1030 | 3.407 | 3.453 |
| Function Call † | BFCL v3 live_parallel | 97 | 3.303 | 3.410 |
| Function Call † | BFCL v3 live_parallel_multiple | 170 | 3.070 | 3.159 |

Quick Start

Requirements

  • NVIDIA GPU with CUDA 12.0+
  • SGLang ≥ 0.5.8

Launch Server

python -m sglang.launch_server \
    --model-path /path/to/Kimi-K2.5 \
    --tp 8 \
    --trust-remote-code \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.75 \
    --dtype bfloat16
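The three speculative flags are coupled. Inferring only from the values in this card (an assumption, not documented behavior): with topk=1 the draft is a linear chain, and num_draft_tokens = num_steps + 1, which is why 3 steps pair with 4 draft tokens above. A small sanity check:

```python
def check_spec_flags(num_steps, topk, num_draft_tokens):
    """Sanity-check speculative decoding flags. Assumption inferred from
    this card's values: with topk=1 (linear chain), num_draft_tokens
    should equal num_steps + 1; with topk > 1 (tree drafting), it
    should not exceed num_steps * topk + 1."""
    if topk == 1:
        return num_draft_tokens == num_steps + 1
    return 1 < num_draft_tokens <= num_steps * topk + 1

print(check_spec_flags(3, 1, 4))  # the flags used in the launch command
```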

Run Benchmarks

python bench_eagle3.py \
    --model-path /path/to/Kimi-K2.5 \
    --port 30000 \
    --config-list 1,3,1,4 \
    --benchmark-list <benchmark_name> \
    --skip-launch-server

`--config-list` format: `topk,num_steps,topk,num_draft_tokens`.
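For scripting sweeps over several configurations, the `--config-list` value can be unpacked into named fields following the format above (the helper name is hypothetical, not part of bench_eagle3.py):

```python
def parse_config_list(value):
    """Unpack a --config-list string such as "1,3,1,4" using the
    documented field order: topk, num_steps, topk, num_draft_tokens."""
    topk, num_steps, topk_again, num_draft_tokens = (int(x) for x in value.split(","))
    assert topk == topk_again, "the two topk fields should match"
    return {"topk": topk, "num_steps": num_steps, "num_draft_tokens": num_draft_tokens}

print(parse_config_list("1,3,1,4"))
```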

Model size: 3B params (Safetensors, BF16 / F16)