---
license: apache-2.0
base_model: moonshotai/Kimi-K2.7-Code
tags:
- speculative-decoding
- eagle3
- mla
- draft-model
---

# Kimi-K2.7-Code Eagle3-MLA draft (32K-truncated vocab)

Eagle3-MLA speculative-decoding draft model for **Kimi-K2.7-Code**. The output
`lm_head` is truncated from the full 163,840-token vocabulary to **32,000
tokens**: the highest-frequency tokens in real K2.7-Code serving traffic, with
the 256 special/template tokens force-included.

## What changed vs the full-vocab draft
- `lm_head.weight`: `[163840, 7168]` -> `[32000, 7168]`
- added `d2t` (draft-local id -> target global id, delta-encoded) so vLLM can scatter
  the 32K draft logits back into the 163,840-token target space
- `embed_tokens` kept at full size (`[163840, 7168]`), so draft input lookups are unaffected
- `config.json`: `draft_vocab_size: 32000` (was 163840)
- token coverage of the real draft-token distribution: **0.9927**
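
The changes above can be illustrated with a small, pure-Python sketch of how a truncated draft vocabulary, its coverage figure, and a `d2t` table might fit together. This is not the script used to build this checkpoint. The function names are hypothetical, and the delta convention `target_id = draft_id + d2t[draft_id]` is an assumption about what "delta-encoded" means here.

```python
from collections import Counter


def select_draft_vocab(token_counts, draft_vocab_size, forced_ids):
    """Keep all forced ids, then fill the remaining slots by descending frequency."""
    forced = set(forced_ids)
    if len(forced) > draft_vocab_size:
        raise ValueError("more forced ids than draft vocab slots")
    # most_common() returns (token_id, count) pairs sorted by descending count.
    ranked = [tok for tok, _ in Counter(token_counts).most_common() if tok not in forced]
    kept = forced | set(ranked[: draft_vocab_size - len(forced)])
    return sorted(kept)  # sorted so draft-local ids follow target-id order


def coverage(token_counts, kept_ids):
    """Fraction of observed token occurrences that fall inside the kept set."""
    kept = set(kept_ids)
    total = sum(token_counts.values())
    return sum(c for tok, c in token_counts.items() if tok in kept) / total


def build_d2t(kept_ids):
    """Assumed delta encoding: d2t[draft_id] = target_id - draft_id."""
    return [target_id - draft_id for draft_id, target_id in enumerate(kept_ids)]


def scatter_logits(draft_logits, d2t, target_vocab_size):
    """Place draft logits at their target ids; ids outside the draft vocab get -inf."""
    full = [float("-inf")] * target_vocab_size
    for draft_id, logit in enumerate(draft_logits):
        full[draft_id + d2t[draft_id]] = logit
    return full


# Toy example: target vocab of 10 ids, 4 draft slots, id 9 forced in
# (standing in for a special/template token).
counts = {0: 50, 3: 30, 4: 10, 7: 6, 9: 1, 2: 3}
kept = select_draft_vocab(counts, draft_vocab_size=4, forced_ids=[9])
d2t = build_d2t(kept)
full_logits = scatter_logits([1.0, 2.0, 3.0, 4.0], d2t, target_vocab_size=10)
print(kept, d2t, coverage(counts, kept))
```

In the toy run, the forced id takes one of the four slots even though it is rare, which is the same trade-off the 256 force-included tokens make against the coverage figure.
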

## Architecture
Single-layer Eagle3 decoder built on DeepSeek-V2/V3 MLA attention
(`Eagle3DeepseekV2ForCausalLM`), `hidden_size=7168`. Loads in vLLM via
`--speculative-config '{"method":"eagle3","model":"<this repo>","num_speculative_tokens":3}'`.
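
As a convenience, the flag above can be assembled programmatically so the JSON stays valid and correctly shell-quoted. The sketch below only builds and prints a `vllm serve` command line. It assumes the target is the `base_model` listed in this card, leaves `<this repo>` as a placeholder, and omits the parallelism and memory flags a real deployment would need. The helper names are hypothetical.

```python
import json
import shlex

TARGET_MODEL = "moonshotai/Kimi-K2.7-Code"  # taken from this card's base_model field
DRAFT_MODEL = "<this repo>"                 # placeholder, exactly as written in the card


def speculative_config(draft_model, num_speculative_tokens=3):
    """Build the dict that is serialized into --speculative-config."""
    return {
        "method": "eagle3",
        "model": draft_model,
        "num_speculative_tokens": num_speculative_tokens,
    }


def serve_command(target_model, draft_model, num_speculative_tokens=3):
    """Return an argv list for `vllm serve` with the speculative config attached."""
    cfg = json.dumps(
        speculative_config(draft_model, num_speculative_tokens),
        separators=(",", ":"),  # compact JSON, matching the one-liner in the card
    )
    return ["vllm", "serve", target_model, "--speculative-config", cfg]


argv = serve_command(TARGET_MODEL, DRAFT_MODEL)
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(argv))
```
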

## Notes
The 32K truncation reduces the `lm_head` GEMM ~4x in isolation and is a clear win
at batch=1 / on-device decoding, where decoding is memory-bandwidth-bound. In
high-concurrency EP+DP serving (e.g. c=128) the end-to-end gain is small, because
`lm_head` is not the bottleneck there. Output correctness is unaffected: the target
model verifies every speculated token.
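
To make the bandwidth argument concrete, here is a back-of-the-envelope sketch of how many `lm_head` weight bytes one draft step has to read before and after truncation. The shapes come from this card. The 2-byte weight dtype is an assumption, since the card does not state it. The result is a ratio of weight sizes only, not a measured speedup.

```python
HIDDEN_SIZE = 7168     # from this card
FULL_VOCAB = 163_840   # full target vocabulary
DRAFT_VOCAB = 32_000   # truncated draft vocabulary
BYTES_PER_PARAM = 2    # assumption: bf16/fp16 weights; the dtype is not stated in the card


def lm_head_bytes(vocab_size, hidden_size=HIDDEN_SIZE, bytes_per_param=BYTES_PER_PARAM):
    """Size of an lm_head weight of shape [vocab_size, hidden_size] in bytes."""
    return vocab_size * hidden_size * bytes_per_param


full_bytes = lm_head_bytes(FULL_VOCAB)
draft_bytes = lm_head_bytes(DRAFT_VOCAB)
size_ratio = full_bytes / draft_bytes

# At batch=1 each draft step streams the whole lm_head weight once,
# so weight size is a reasonable proxy for the memory traffic of that GEMM.
print(f"full-vocab lm_head: {full_bytes / 1e6:.1f} MB")
print(f"32K lm_head:        {draft_bytes / 1e6:.1f} MB")
print(f"weight-size ratio:  {size_ratio:.2f}x")
```

The weight-size ratio is an upper bound on what truncation can save in that one GEMM; the rest of the draft step is unchanged, which is consistent with the small end-to-end gain at high concurrency.
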