---
license: other
base_model: moonshotai/Kimi-K2.6
tags:
- text-generation
- speculative-decoding
- eagle3
- kimi-k2.6
- mla
- torchspec
---

# kimi-k2.6-eagle3-mla

Eagle3 MTP draft model with MLA (Multi-head Latent Attention) for accelerating
inference of [Kimi-K2.6](https://huggingface.co/moonshotai/Kimi-K2.6).

This is a fine-tuned draft, anchored to the official
[lightseekorg/kimi-k2.6-eagle3-mla](https://huggingface.co/lightseekorg/kimi-k2.6-eagle3-mla)
initialization. It targets acceptance at later draft positions (multi-hop)
while preserving the first-hop gain. It is evaluated by runtime accept-length
on a frozen full-context held-out set.

## Fine-tune setup

- **Init**: lightseekorg/kimi-k2.6-eagle3-mla (official MLA weights)
- **Objective**: Eagle3 distillation + multi-step TTT supervision
  (`ttt_steps=4`, `ttt_step_loss_decay=1.0`, off-policy downstream tokens)
- **Anti-over-specialization**: L2-SP weight-space anchor toward the init
  (penalizes drift of the trainable parameters; `lambda=1e-4`)
- **Optimizer**: learning rate 2e-5, cosine schedule
- **Checkpoint**: best by held-out validation loss on the K2.6 ruler
  (step 95400; val_loss 5.490, the global minimum of the v3 run)
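
The objective and anchor bullets above can be sketched as follows. This is a minimal illustration in plain Python, not the training code. The function names are hypothetical, the per-step weight is assumed to be `ttt_step_loss_decay ** k`, the step losses are assumed to be summed, and the L2-SP term is assumed to be `lambda` times the summed squared drift from the init (this card does not state the exact scaling convention).

```python
# Hypothetical sketch: plain Python lists stand in for tensors.

def ttt_step_weights(ttt_steps: int, decay: float) -> list[float]:
    """Weight of each TTT step k, assumed here to be decay ** k."""
    return [decay ** k for k in range(ttt_steps)]


def l2_sp_penalty(params: dict[str, list[float]],
                  init: dict[str, list[float]],
                  lam: float) -> float:
    """L2-SP anchor: lam * sum of squared drift from the init weights.

    The scaling (lam vs. lam / 2) is an assumption for illustration.
    """
    drift = 0.0
    for name, values in params.items():
        drift += sum((w - w0) ** 2 for w, w0 in zip(values, init[name]))
    return lam * drift


def total_loss(step_losses: list[float],
               params: dict[str, list[float]],
               init: dict[str, list[float]],
               ttt_steps: int = 4,
               decay: float = 1.0,
               lam: float = 1e-4) -> float:
    """Weighted sum of per-step distillation losses plus the L2-SP anchor."""
    if len(step_losses) != ttt_steps:
        raise ValueError("expected one loss value per TTT step")
    weights = ttt_step_weights(ttt_steps, decay)
    distill = sum(w * loss for w, loss in zip(weights, step_losses))
    return distill + l2_sp_penalty(params, init, lam)


if __name__ == "__main__":
    init = {"proj": [0.10, -0.20]}
    params = {"proj": [0.15, -0.20]}  # one weight has drifted from the init
    print(ttt_step_weights(4, 1.0))
    print(total_loss([2.0, 2.5, 3.0, 3.5], params, init))
```

Under these assumptions, `ttt_step_loss_decay=1.0` gives every TTT step a weight of 1, so later draft positions count as much as the first one.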

## Performance

Primary metric is **accept_length**: the average number of tokens accepted per
speculation step with `num_speculative_tokens=3` (higher is better). Evaluated
with vLLM 0.20.0 on 8x H200, TP=8, max-model-len 32768, greedy decoding.
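
As a rough illustration of the metric, the sketch below averages per-step acceptance counts. It assumes the reported number includes the one token the target model emits at every verification step, so that the value ranges from 1 to 4 with three draft tokens. That convention, the function name, and the `count_bonus_token` switch are assumptions for illustration, not taken from this card or from vLLM's code.

```python
def accept_length(accepted_draft_tokens: list[int],
                  num_speculative_tokens: int = 3,
                  count_bonus_token: bool = True) -> float:
    """Mean number of tokens produced per speculation (verification) step.

    accepted_draft_tokens[i] is how many of the drafted tokens the target
    model accepted at step i (0..num_speculative_tokens).
    """
    if not accepted_draft_tokens:
        raise ValueError("need at least one speculation step")
    for n in accepted_draft_tokens:
        if not 0 <= n <= num_speculative_tokens:
            raise ValueError(f"accepted count {n} is out of range")
    # Assumption: each step also yields one token from the target model.
    bonus = 1 if count_bonus_token else 0
    total = sum(n + bonus for n in accepted_draft_tokens)
    return total / len(accepted_draft_tokens)


if __name__ == "__main__":
    # Four steps accepting 3, 0, 2 and 1 draft tokens respectively.
    print(accept_length([3, 0, 2, 1]))  # 2.5
```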

On a frozen K2.6 full-context held-out judge set (914 prompts):

| Model | accept_length |
|-------|--------------:|
| lightseek (official init) | 2.285 |
| this model | **2.308** |

This draft improves over the official init on the K2.6 held-out distribution
by 0.023 accepted tokens per step (about 1.0% relative).

## Note on distribution shift

This checkpoint is selected by validation loss on the K2.6 teacher
distribution. In cross-version testing against real Kimi-K2.7-Code production
traffic, the official lightseek init currently shows higher accept-length than
this fine-tune, which indicates that the K2.6 fine-tune over-specializes to
its training distribution. If your serving traffic differs substantially from
long multi-turn K2.6 dialogues, benchmark both this draft and the lightseek
init on your own traffic before choosing. (The L2-SP anchor above is intended
to mitigate this; tuning it against real-traffic accept-length is ongoing.)

## Usage

Serve with vLLM as the speculative draft for Kimi-K2.6, with
`num_speculative_tokens=3` in the `--speculative-config`.
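
A minimal sketch of how such a launch command could be assembled, mirroring the evaluation settings listed under Performance (TP=8, max-model-len 32768). `DRAFT_MODEL` is a hypothetical placeholder for wherever this checkpoint lives, and `"method": "eagle3"` is an assumption about how vLLM should load this draft. Check the flags against the documentation of your vLLM version before relying on them.

```python
import json
import shlex

TARGET_MODEL = "moonshotai/Kimi-K2.6"
DRAFT_MODEL = "path/to/this-draft"  # hypothetical placeholder, replace it

# JSON payload for vLLM's --speculative-config flag.
speculative_config = {
    "method": "eagle3",            # assumption: draft loads as an Eagle3 head
    "model": DRAFT_MODEL,
    "num_speculative_tokens": 3,   # value used for the numbers in this card
}

command = [
    "vllm", "serve", TARGET_MODEL,
    "--tensor-parallel-size", "8",
    "--max-model-len", "32768",
    "--speculative-config", json.dumps(speculative_config),
]

if __name__ == "__main__":
    # Print a shell-safe command line; quoting of the JSON is handled by shlex.
    print(shlex.join(command))
```

Building the JSON with `json.dumps` and quoting with `shlex.join` avoids hand-escaping the nested quotes in the speculative config.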