license: apache-2.0
base_model: moonshotai/Kimi-K2.7-Code
tags:
- speculative-decoding
- eagle3
- mla
- draft-model
Kimi-K2.7-Code Eagle3-MLA draft (32K-truncated vocab)
Eagle3-MLA speculative-decoding draft model for Kimi-K2.7-Code, with the
output lm_head truncated from the full 163,840-token vocabulary to 32,000
entries: the highest-frequency tokens (ranked by the token distribution of real
K2.7-Code serving traffic), with the 256 special/template tokens force-included.
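The selection rule described above can be sketched in a few lines. This is an illustrative reconstruction, not the script used to build this checkpoint: the function names, the tie-breaking by token id, and the toy counts are all assumptions, and the sketch assumes the force-included tokens count toward the draft vocabulary budget.

```python
from collections import Counter


def select_draft_vocab(token_counts, draft_vocab_size, forced_ids=()):
    """Return the sorted target ids kept in the truncated draft vocabulary.

    Hypothetical helper: forced ids are always kept, and the remaining
    slots go to the most frequent tokens (ties broken by lower id).
    """
    forced = set(forced_ids)
    if len(forced) > draft_vocab_size:
        raise ValueError("more forced tokens than draft vocabulary slots")
    ranked = sorted(
        (t for t in token_counts if t not in forced),
        key=lambda t: (-token_counts[t], t),
    )
    kept = forced | set(ranked[: draft_vocab_size - len(forced)])
    return sorted(kept)


def coverage(token_counts, kept_ids):
    """Fraction of observed token occurrences that fall inside kept_ids."""
    total = sum(token_counts.values())
    return sum(token_counts.get(t, 0) for t in kept_ids) / total


# Toy example: an 8-token target vocabulary reduced to 4 draft tokens,
# with token 0 standing in for a force-included special token.
counts = Counter({0: 1, 1: 50, 2: 5, 3: 30, 4: 2, 5: 10, 7: 2})
kept = select_draft_vocab(counts, draft_vocab_size=4, forced_ids=[0])
print(kept, coverage(counts, kept))
```

With real traffic counts, `coverage` computed this way is the kind of figure reported as token coverage for a truncated vocabulary.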
What changed vs the full-vocab draft
- lm_head.weight: [163840, 7168] -> [32000, 7168]
- added d2t (draft-local id -> target global id, delta-encoded) so vLLM scatters the 32K draft logits back into the 163,840 target space
- embed_tokens kept full ([163840, 7168]); draft input lookups are unaffected
- config.json: draft_vocab_size: 32000 (was 163840)
- token coverage of the real draft-token distribution: 0.9927
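To make the d2t mapping concrete, here is a minimal sketch of delta encoding and the scatter step. It assumes "delta-encoded" means d2t[i] = target_id - i, so that the target id is recovered as i + d2t[i]; the helper names and the toy vocabulary sizes are hypothetical, and plain Python lists stand in for tensors.

```python
def build_d2t(kept_target_ids):
    """Delta-encode a sorted list of kept target ids.

    Assumption: entry i stores target_id - i, so decoding is i + d2t[i].
    """
    return [target_id - draft_id for draft_id, target_id in enumerate(kept_target_ids)]


def scatter_to_target(draft_logits, d2t, target_vocab_size):
    """Place draft logits at their target ids; all other slots get -inf."""
    full = [float("-inf")] * target_vocab_size
    for draft_id, logit in enumerate(draft_logits):
        full[draft_id + d2t[draft_id]] = logit
    return full


# Toy example: 4 draft tokens mapped into an 8-token target vocabulary.
kept = [0, 1, 3, 5]
d2t = build_d2t(kept)
full_logits = scatter_to_target([0.1, 0.2, 0.3, 0.4], d2t, target_vocab_size=8)
print(d2t)
print(full_logits)
```

Tokens outside the kept set end up with a logit of -inf, so the draft can never propose them; the target model still emits them whenever it needs to.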
Architecture
Single-layer Eagle3 decoder built on DeepSeek-V2/V3 MLA attention
(Eagle3DeepseekV2ForCausalLM), hidden_size=7168. Loads in vLLM via
--speculative-config '{"method":"eagle3","model":"<this repo>","num_speculative_tokens":3}'.
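If the server is launched from a script, the flag value can be built with `json.dumps` so the quoting stays correct. The sketch below only assembles and prints a command line; the helper name is hypothetical, the `<this repo>` placeholder is kept as-is, and any parallelism or memory flags a real deployment needs are omitted.

```python
import json
import shlex


def speculative_flag(draft_repo, num_speculative_tokens=3):
    """Return the --speculative-config flag and its JSON value as argv items."""
    cfg = {
        "method": "eagle3",
        "model": draft_repo,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return ["--speculative-config", json.dumps(cfg, separators=(",", ":"))]


# Assemble an argv list; pass it to subprocess.run(cmd) to actually launch.
flag = speculative_flag("<this repo>")
cmd = ["vllm", "serve", "moonshotai/Kimi-K2.7-Code", *flag]
print(shlex.join(cmd))
```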
Notes
The 32K truncation reduces the lm_head GEMM ~4x in isolation and is a clear win
at batch=1 / on-device decoding (memory-bandwidth-bound). At high-concurrency
EP+DP serving (e.g. c=128) the end-to-end gain is small, because the lm_head is
not the bottleneck there. Output correctness is unaffected — the target model
verifies every speculated token.
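Two points in these notes can be checked with a short sketch: the size of the lm_head before and after truncation, and why verification keeps the output unchanged. The 2-byte parameter size (bf16/fp16) is an assumption, and `accept_greedy` is a simplified greedy-acceptance rule written for illustration, not vLLM's implementation.

```python
HIDDEN_SIZE = 7168
FULL_VOCAB = 163_840
DRAFT_VOCAB = 32_000
BYTES_PER_PARAM = 2  # assumption: bf16/fp16 weights


def lm_head_bytes(vocab_size):
    """Weight bytes read by one lm_head GEMM at the assumed precision."""
    return vocab_size * HIDDEN_SIZE * BYTES_PER_PARAM


size_ratio = lm_head_bytes(FULL_VOCAB) / lm_head_bytes(DRAFT_VOCAB)
print(lm_head_bytes(FULL_VOCAB), lm_head_bytes(DRAFT_VOCAB), size_ratio)


def accept_greedy(draft_tokens, target_argmax):
    """Keep the longest draft prefix the target agrees with, then add one target token.

    target_argmax must hold len(draft_tokens) + 1 entries: the target's greedy
    choice at each speculated position plus the position after the last one.
    """
    accepted = []
    for position, token in enumerate(draft_tokens):
        if token != target_argmax[position]:
            break
        accepted.append(token)
    accepted.append(target_argmax[len(accepted)])
    return accepted


# The draft guessed 2 where the target wanted 7 (for example, a token outside
# the 32K draft vocabulary): the mismatch is rejected and the target's token wins.
print(accept_greedy([5, 9, 2], [5, 9, 7, 4]))
```

By weight size and FLOPs the reduction is 163,840 / 32,000 = 5.12x; the ~4x figure above is presumably a measured kernel speedup, which would be expected to sit below the raw size ratio, though that reading is an assumption. The acceptance rule shows that a token missing from the draft vocabulary only costs a rejected speculation, never a wrong output.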