kimi-k2.7-code-eagle3-mla
Model Overview
kimi-k2.7-code-eagle3-mla is an Eagle3 MTP draft model with MLA (Multi-Latent Attention) for
accelerating inference of Kimi-K2.7-Code under vLLM speculative decoding. The draft proposes
num_speculative_tokens candidate tokens per step; the Kimi-K2.7-Code verifier accepts them in
parallel, so the output distribution is identical to plain autoregressive decoding while decode
throughput improves.
Why an MLA (Multi-Latent Attention) Draft Model
Compared with an MHA draft model, the MLA variant is a better fit for Kimi-K2.7-Code deployment:
- Uses less KV cache, which reduces serving memory pressure.
- Matches Kimi-K2.7-Code's MLA architecture, so it fits more naturally into the inference engine's KV-cache handling under different serving scenarios such as PD-Disaggregation.
Architecture
- Algorithm: EAGLE-3 with MLA, single draft decoder layer.
- Verifier: Kimi-K2.7-Code. The draft reuses the verifier's frozen embedding / lm_head / norm and trains one MLA decoder layer plus an auxiliary-hidden-state fusion layer.
- Draft vocabulary: full 163,840-token vocabulary (no truncation).
Training Setup
- Framework: Camelot, an online speculative-decoding training framework โ FSDP training and vLLM inference run concurrently, with the verifier continuously generating fresh training data.
- Training data: Kimi-K2.7-Code native data (agentic / coding / tool trajectories and re-answered prompts).
- Schedule: cosine LR 2e-5, sequence length 8192,
ttt_steps=4.
Performance
The primary metric is accept_length โ the average number of tokens accepted per speculation
step with num_speculative_tokens=3. Higher is better.
Benchmarks were run on vLLM 0.20.0 (TP=8, greedy decoding, concurrency=1) against the Kimi-K2.7-Code verifier.
| Category | Benchmark | N | Accept Length |
|---|---|---|---|
| Dialogue | MTBench | 80 | 2.427 |
| Chinese | CEval | 212 | 2.348 |
| Math | GSM8K | 500 | 3.201 |
| Code | HumanEval | 164 | 2.738 |
| Math | MATH500 | 500 | 2.918 |
| Math | AIME | 30 | 2.542 |
| Code | LiveCodeBench | 200 | 2.362 |
| Code | SPEED-Bench (coding) | 80 | 2.515 |
Quick Start
Requirements
- NVIDIA GPU with CUDA 12.0+
- vLLM >= 0.20.0
Launch Server (vLLM)
vllm serve moonshotai/Kimi-K2.7-Code \
--tensor-parallel-size 8 \
--speculative-config '{"model": "novita/kimi-k2.7-code-eagle3-mla", "method": "eagle3", "num_speculative_tokens": 3}' \
--trust-remote-code
Launch Server (SGLang)
MLA Eagle3 draft model is not yet supported in SGLang. Will update once support is available.
- Downloads last month
- -
Model tree for novita/kimi-k2.7-code-eagle3-mla
Base model
moonshotai/Kimi-K2.7-Code