kimi-k2.6-eagle3-mla-fp8

This repository is an FP8-converted copy of lightseekorg/kimi-k2.6-eagle3-mla, an Eagle3 MLA draft model for speculative decoding with moonshotai/Kimi-K2.6.

The original draft model weights were converted to vLLM-compatible FP8 tensor format with static activation calibration. The conversion keeps embeddings and the LM head unquantized and stores the FP8 metadata in config.json under quantization_config.
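To check what the conversion recorded, the quantization_config section can be read straight from a downloaded config.json. This is a minimal sketch: the helper name `read_quantization_config` is hypothetical, and it makes no assumption about which fields the section contains.

```python
import json
from pathlib import Path


def read_quantization_config(config_path):
    """Return the `quantization_config` section of a model's config.json.

    Returns None when the key is absent, e.g. for an unquantized checkpoint.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return config.get("quantization_config")
```

Calling `read_quantization_config("config.json")` inside a local snapshot of this repository should return the FP8 metadata as a dictionary.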

Local validation in our vLLM Kimi-K2.6 setup used:

{"model":"festr2/kimi-k2.6-eagle3-mla-fp8","method":"eagle3","num_speculative_tokens":3,"draft_attention_backend":"TRITON_MLA"}

In that setup, this FP8 draft achieved acceptance similar to the original K2.6 draft and was slightly faster in our cc1/cc32 decode checks. Exact throughput depends on the vLLM build, CUDA/NCCL stack, GPU topology, and launch parameters.

The rest of this model card is based on the original LightSeek model card.

Model Overview

kimi-k2.6-eagle3-mla is an Eagle3 MTP draft model with MLA (Multi-head Latent Attention) for accelerating inference of Kimi-K2.6. It was trained with TorchSpec, an online speculative decoding training framework that runs FSDP training and inference concurrently. If you find this draft model useful, please give our project TorchSpec a star on GitHub.

Why an MLA (Multi-head Latent Attention) Draft Model

Compared with an MHA draft model, the MLA variant is a better fit for Kimi-K2.6 deployment:

  • Uses less KV cache, which reduces serving memory pressure.
  • Matches Kimi-K2.6's MLA architecture, so it fits more naturally into the inference engine's KV-cache handling across serving scenarios such as prefill/decode (PD) disaggregation.
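The KV-cache saving can be illustrated with a rough per-token count. This sketch assumes the DeepSeek-style MLA cache layout (one compressed KV latent plus one shared RoPE key per token), and every dimension below is hypothetical, chosen only for illustration. None of them is taken from this draft model's actual configuration.

```python
def mha_kv_elements_per_token(num_heads, head_dim):
    """Standard multi-head attention caches a full key and value per head."""
    return 2 * num_heads * head_dim


def mla_kv_elements_per_token(kv_lora_rank, rope_head_dim):
    """MLA caches one compressed KV latent plus a shared decoupled RoPE key."""
    return kv_lora_rank + rope_head_dim


# Hypothetical dimensions, NOT read from this model's config.
NUM_HEADS, HEAD_DIM = 64, 128
KV_LORA_RANK, ROPE_HEAD_DIM = 512, 64

mha = mha_kv_elements_per_token(NUM_HEADS, HEAD_DIM)
mla = mla_kv_elements_per_token(KV_LORA_RANK, ROPE_HEAD_DIM)
print(f"MHA: {mha} elements/token/layer, MLA: {mla} ({mha / mla:.1f}x smaller)")
```

The real ratio depends on the draft's attention dimensions, which can be read from its config.json.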

Training Setup

  • Cluster: 3 nodes × 8× B200 (24 GPUs total)
  • Training: 1 node (8 GPUs), FSDP
  • Inference: 2 nodes (16 GPUs), vLLM (TP=8 per node)
  • Continual training: Initialized from kimi-k2.5-eagle3-mla checkpoint
  • Iterations: 9,279 steps
  • Learning rate: 2e-5, cosine schedule
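For reference, a cosine schedule with the peak learning rate and step count listed above can be sketched as follows. The model card does not state whether warmup or a non-zero final learning rate was used, so this sketch assumes no warmup and decay to zero. It is not the TorchSpec implementation.

```python
import math

PEAK_LR = 2e-5       # from the training setup above
TOTAL_STEPS = 9_279  # from the training setup above


def cosine_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, min_lr=0.0):
    """Cosine-annealed learning rate at `step`.

    Assumes no warmup and decay to `min_lr`; both are illustrative choices.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at 2e-5, passes through 1e-5 at the halfway point, and reaches 0 at step 9,279.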

Performance

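The primary metric is accept_length: the average number of tokens accepted per speculation step with num_speculative_tokens=3. Higher is better. Several values in the table exceed 3, which suggests the count includes the token the target model itself produces at each verification step in addition to the accepted draft tokens.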
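As a minimal sketch of how such an average is formed, the function below takes one token count per speculation step and returns the mean. The per-step counts are hypothetical, not measured data, and they go up to 4 on the assumption that each step also counts the target model's own token.

```python
def mean_accept_length(tokens_per_step):
    """Average number of tokens emitted per speculation step."""
    if not tokens_per_step:
        raise ValueError("need at least one speculation step")
    return sum(tokens_per_step) / len(tokens_per_step)


# Hypothetical per-step counts, not measured data.
example_steps = [4, 3, 1, 4, 2, 4, 3, 3]
print(f"accept_length = {mean_accept_length(example_steps):.3f}")
```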

Benchmarks were run on vLLM 0.20.0 with 8× B200 GPUs.

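| Category | Benchmark | N | Accept Length |
|----------|-----------|---|---------------|
| Dialogue | MTBench | 80 | 2.624 |
| Chinese | CEval | 212 | 2.494 |
| Math | GSM8K | 500 | 2.987 |
| Code | HumanEval | 164 | 3.241 |
| Math | MATH500 | 500 | 3.245 |
| Math | AIME | 30 | 2.982 |
| Code | LiveCodeBench | 200 | 2.706 |
| Code | SPEED-Bench (coding) | 80 | 3.006 |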
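To summarize the table in a single number, the rows can be aggregated as sketched below. Weighting by N assumes N is the number of prompts per benchmark. That is only a rough proxy, because the underlying average is taken over speculation steps, whose counts are not reported here.

```python
# (category, benchmark, N, accept length) copied from the table above.
RESULTS = [
    ("Dialogue", "MTBench", 80, 2.624),
    ("Chinese", "CEval", 212, 2.494),
    ("Math", "GSM8K", 500, 2.987),
    ("Code", "HumanEval", 164, 3.241),
    ("Math", "MATH500", 500, 3.245),
    ("Math", "AIME", 30, 2.982),
    ("Code", "LiveCodeBench", 200, 2.706),
    ("Code", "SPEED-Bench (coding)", 80, 3.006),
]


def unweighted_mean(rows):
    """Plain average of the per-benchmark accept lengths."""
    return sum(accept for _, _, _, accept in rows) / len(rows)


def sample_weighted_mean(rows):
    """Average of accept lengths weighted by each benchmark's N."""
    total_n = sum(n for _, _, n, _ in rows)
    return sum(n * accept for _, _, n, accept in rows) / total_n


def mean_by_category(rows):
    """Unweighted mean accept length for each category."""
    grouped = {}
    for category, _, _, accept in rows:
        grouped.setdefault(category, []).append(accept)
    return {cat: sum(vals) / len(vals) for cat, vals in grouped.items()}


print(f"unweighted: {unweighted_mean(RESULTS):.3f}")
print(f"weighted by N: {sample_weighted_mean(RESULTS):.3f}")
for category, value in mean_by_category(RESULTS).items():
    print(f"{category}: {value:.3f}")
```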

Quick Start

Requirements

  • NVIDIA GPU with CUDA 12.0+
  • vLLM >= 0.20.0

Launch Server (vLLM)

vllm serve moonshotai/Kimi-K2.6 \
    --tensor-parallel-size 8 \
    --attention-backend TRITON_MLA \
    --kv-cache-dtype fp8 \
    --speculative-config '{"model": "festr2/kimi-k2.6-eagle3-mla-fp8", "method": "eagle3", "num_speculative_tokens": 3, "draft_attention_backend": "TRITON_MLA"}' \
    --trust-remote-code
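The JSON passed to --speculative-config is easy to mis-quote by hand. One option is to build the same command in Python and let json.dumps and shlex.join handle serialization and quoting. This is a sketch using only the flags shown above; the variable names are illustrative.

```python
import json
import shlex

speculative_config = {
    "model": "festr2/kimi-k2.6-eagle3-mla-fp8",
    "method": "eagle3",
    "num_speculative_tokens": 3,
    "draft_attention_backend": "TRITON_MLA",
}

command = [
    "vllm", "serve", "moonshotai/Kimi-K2.6",
    "--tensor-parallel-size", "8",
    "--attention-backend", "TRITON_MLA",
    "--kv-cache-dtype", "fp8",
    "--speculative-config", json.dumps(speculative_config),
    "--trust-remote-code",
]

# shlex.join quotes the JSON argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```

The `command` list can also be handed directly to `subprocess.run(command)`, which avoids shell quoting altogether.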

Launch Server (SGLang)

The MLA Eagle3 draft model is not yet supported in SGLang. This section will be updated once support is available.

Citation

@misc{torchspec2026,
  title={TorchSpec: An Online Speculative Decoding Training Framework},
  url={https://github.com/torchspec-project/TorchSpec},
  year={2026}
}