kimi-k2.7-code-eagle3-mla

Model Overview

kimi-k2.7-code-eagle3-mla is an Eagle3 MTP draft model with MLA (Multi-head Latent Attention) for accelerating inference of Kimi-K2.7-Code under vLLM speculative decoding. The draft proposes num_speculative_tokens candidate tokens per step; the Kimi-K2.7-Code verifier checks them in a single parallel pass and keeps only the tokens consistent with its own predictions, so the output distribution is identical to plain autoregressive decoding while decode throughput improves.
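
The draft-and-verify loop described above can be illustrated with a toy greedy sketch. The two "models" here (target_next and draft_next) are hypothetical stand-ins that map a token list to the next token; they are not the real draft or verifier, and a real engine scores all proposed positions in one batched forward pass rather than a Python loop.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def target_next(ctx: List[int]) -> int:
    """Toy verifier: a deterministic function of the context (hypothetical)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy draft: agrees with the verifier except when len(ctx) is a multiple of 4."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def autoregressive(prompt: List[int], n_new: int, target: NextToken) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative(prompt: List[int], n_new: int, target: NextToken,
                draft: NextToken, k: int) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    steps = 0
    while len(out) < goal:
        # 1. The draft proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The verifier checks each position; on the first mismatch it
        #    substitutes its own token and stops. If all k match, it adds
        #    one extra token of its own.
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target(out + proposal[:i])
            accepted.append(expected)
            if i == k or proposal[i] != expected:
                break
        out.extend(accepted)
        steps += 1
    return out[:goal], steps


if __name__ == "__main__":
    prompt = [1, 2, 3]
    reference = autoregressive(prompt, 20, target_next)
    output, steps = speculative(prompt, 20, target_next, draft_next, k=3)
    print(output == reference, steps)
```

Every token that is kept is one the verifier would have produced itself, which is why the speculative output matches plain decoding under greedy sampling; the gain comes from needing fewer verifier steps than generated tokens.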

Why an MLA (Multi-head Latent Attention) Draft Model

Compared with an MHA draft model, the MLA variant is a better fit for Kimi-K2.7-Code deployment:

  • Uses less KV cache, which reduces serving memory pressure.
  • Matches Kimi-K2.7-Code's MLA architecture, so it fits more naturally into the inference engine's KV-cache handling across serving scenarios such as prefill/decode (PD) disaggregation.
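
As a rough illustration of the KV-cache point, the sketch below compares per-token cache size for one attention layer. It assumes the usual MLA layout, in which one compressed KV latent plus a small shared RoPE key is cached per token, versus a full key and value per head for MHA. All dimensions are hypothetical placeholders, not values read from this model's config.

```python
def mha_kv_bytes_per_token(num_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """MHA caches a full key and a full value vector for every head."""
    return 2 * num_heads * head_dim * dtype_bytes


def mla_kv_bytes_per_token(kv_lora_rank: int, rope_dim: int, dtype_bytes: int = 2) -> int:
    """MLA caches one compressed KV latent plus one shared RoPE key."""
    return (kv_lora_rank + rope_dim) * dtype_bytes


if __name__ == "__main__":
    # Hypothetical dimensions for illustration only (BF16 = 2 bytes per value).
    mha = mha_kv_bytes_per_token(num_heads=64, head_dim=128)
    mla = mla_kv_bytes_per_token(kv_lora_rank=512, rope_dim=64)
    print(f"MHA: {mha} B/token, MLA: {mla} B/token, ratio: {mha / mla:.1f}x")
```

Since the draft has a single decoder layer, the per-layer figure is also the draft's per-token cache cost under these assumptions.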

Architecture

  • Algorithm: EAGLE-3 with MLA, single draft decoder layer.
  • Verifier: Kimi-K2.7-Code. The draft reuses the verifier's frozen embedding / lm_head / norm and trains one MLA decoder layer plus an auxiliary-hidden-state fusion layer.
  • Draft vocabulary: full 163,840-token vocabulary (no truncation).
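
The auxiliary-hidden-state fusion layer can be pictured as "concatenate, then project". The sketch below is a minimal pure-Python illustration under that assumption: several hidden-state vectors taken from the verifier are concatenated and mapped back to the hidden size by one linear layer. The function name, the number of auxiliary states, and the weights are all hypothetical; the actual layer in this checkpoint may differ.

```python
from typing import List, Sequence

Vector = List[float]


def fuse_hidden_states(aux_states: Sequence[Vector], weight: List[Vector]) -> Vector:
    """Concatenate auxiliary hidden states and project back to hidden size.

    weight has shape [hidden_size][len(aux_states) * hidden_size].
    """
    concat = [x for state in aux_states for x in state]
    if any(len(row) != len(concat) for row in weight):
        raise ValueError("weight rows must match the concatenated input length")
    return [sum(w * x for w, x in zip(row, concat)) for row in weight]


if __name__ == "__main__":
    # Three toy hidden states of size 2 (hypothetical values).
    states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    # A projection that simply averages the three states, for illustration.
    third = 1.0 / 3.0
    weight = [
        [third, 0.0, third, 0.0, third, 0.0],
        [0.0, third, 0.0, third, 0.0, third],
    ]
    print(fuse_hidden_states(states, weight))
```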

Training Setup

  • Framework: Camelot, an online speculative-decoding training framework in which FSDP training and vLLM inference run concurrently, with the verifier continuously generating fresh training data.
  • Training data: Kimi-K2.7-Code native data (agentic / coding / tool trajectories and re-answered prompts).
  • Schedule: cosine LR 2e-5, sequence length 8192, ttt_steps=4.
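
For readers unfamiliar with cosine schedules, here is a minimal sketch of the learning-rate curve. It assumes 2e-5 is the peak rate, that the rate decays to zero, and that there is no warmup; the source does not state any of these, and total_steps is a hypothetical placeholder.

```python
import math


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 2e-5, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for step in (0, 250, 500, 750, 1000):
        print(step, f"{cosine_lr(step, total):.2e}")
```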

Performance

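The primary metric is accept_length: the average number of tokens accepted per speculation step with num_speculative_tokens=3. Higher is better. Because GSM8K scores above 3, the count presumably includes the one token the verifier itself contributes each step, which would make 4 the maximum.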

Benchmarks were run on vLLM 0.20.0 (TP=8, greedy decoding, concurrency=1) against the Kimi-K2.7-Code verifier.

Category  Benchmark             N    Accept Length
Dialogue  MTBench               80   2.427
Chinese   CEval                 212  2.348
Math      GSM8K                 500  3.201
Code      HumanEval             164  2.738
Math      MATH500               500  2.918
Math      AIME                  30   2.542
Code      LiveCodeBench         200  2.362
Code      SPEED-Bench (coding)  80   2.515
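
To compare categories at a glance, the reported numbers can be aggregated with a few lines of Python. This is only arithmetic over the table above (per-category means and a mean weighted by N); it is not an additional benchmark result, and the helper names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str, int, float]  # (category, benchmark, N, accept length)

RESULTS: List[Row] = [
    ("Dialogue", "MTBench", 80, 2.427),
    ("Chinese", "CEval", 212, 2.348),
    ("Math", "GSM8K", 500, 3.201),
    ("Code", "HumanEval", 164, 2.738),
    ("Math", "MATH500", 500, 2.918),
    ("Math", "AIME", 30, 2.542),
    ("Code", "LiveCodeBench", 200, 2.362),
    ("Code", "SPEED-Bench (coding)", 80, 2.515),
]


def weighted_mean(rows: List[Row]) -> float:
    """Mean accept length, weighting each benchmark by its sample count N."""
    total = sum(n for _, _, n, _ in rows)
    return sum(n * accept for _, _, n, accept in rows) / total


def mean_by_category(rows: List[Row]) -> Dict[str, float]:
    """Unweighted mean accept length per category."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for category, _, _, accept in rows:
        groups[category].append(accept)
    return {category: sum(v) / len(v) for category, v in groups.items()}


if __name__ == "__main__":
    print(f"weighted mean: {weighted_mean(RESULTS):.3f}")
    for category, value in mean_by_category(RESULTS).items():
        print(f"{category}: {value:.3f}")
```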

Quick Start

Requirements

  • NVIDIA GPU with CUDA 12.0+
  • vLLM >= 0.20.0

Launch Server (vLLM)

vllm serve moonshotai/Kimi-K2.7-Code \
    --tensor-parallel-size 8 \
    --speculative-config '{"model": "novita/kimi-k2.7-code-eagle3-mla", "method": "eagle3", "num_speculative_tokens": 3}' \
    --trust-remote-code
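
Once the server is up, it can be queried through vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server listens on vLLM's default address (localhost, port 8000) and serves the model under its repository name; adjust BASE_URL and the model field if your deployment differs. Speculative decoding is transparent to the client, so the request looks the same as for a server without a draft model.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL = "moonshotai/Kimi-K2.7-Code"    # model name as passed to `vllm serve`


def build_completion_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding, as in the benchmarks
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 128) -> str:
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# Usage, with the server running:
#   print(complete("def fibonacci(n):"))
```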

Launch Server (SGLang)

MLA Eagle3 draft models are not yet supported in SGLang. This section will be updated once support is available.

Model size: 3B params (Safetensors, tensor type BF16)