k-l-lambda/kimi-k2.7-code-eagle3-mla

speculative-decoding


Update README for 32K vocab

cf71d08 verified about 14 hours ago


	---
	license: apache-2.0
	base_model: moonshotai/Kimi-K2.7-Code
	tags:
	- speculative-decoding
	- eagle3
	- mla
	- draft-model
	---

	# Kimi-K2.7-Code Eagle3-MLA draft (32K-truncated vocab)

	Eagle3-MLA speculative-decoding draft model for Kimi-K2.7-Code. The output
	`lm_head` is truncated from the full 163,840-token vocabulary to 32,000 entries:
	the **highest-frequency tokens** (ranked by the token distribution of real
	K2.7-Code serving traffic), with the 256 special/template tokens force-included.
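The selection described above can be sketched roughly as follows. This is an illustration only, not the script used to build this checkpoint: `select_draft_vocab`, `coverage`, and the toy counts are hypothetical, and the sketch assumes the force-included tokens occupy slots inside the draft-vocabulary budget (consistent with the `[32000, 7168]` head shape).

```python
from collections import Counter


def select_draft_vocab(counts, forced_ids, draft_vocab_size):
    """Return the sorted target ids kept in the truncated draft vocabulary.

    counts: mapping of target token id -> observed frequency (hypothetical input).
    forced_ids: ids kept regardless of frequency (e.g. special/template tokens).
    """
    forced = set(forced_ids)
    if len(forced) > draft_vocab_size:
        raise ValueError("more forced tokens than draft vocabulary slots")
    # Rank the remaining tokens by frequency; break ties by id for determinism.
    ranked = sorted((t for t in counts if t not in forced),
                    key=lambda t: (-counts[t], t))
    kept = forced | set(ranked[: draft_vocab_size - len(forced)])
    return sorted(kept)


def coverage(counts, kept_ids):
    """Fraction of observed tokens whose id is in the kept set."""
    kept = set(kept_ids)
    total = sum(counts.values())
    return sum(c for t, c in counts.items() if t in kept) / total


# Toy example: 8-token vocabulary, 4 draft slots, token 7 is a forced special token.
counts = Counter({0: 50, 1: 30, 2: 10, 3: 5, 4: 3, 5: 1, 6: 1})
kept = select_draft_vocab(counts, forced_ids=[7], draft_vocab_size=4)
print(kept)                    # [0, 1, 2, 7]
print(coverage(counts, kept))  # 0.9
```

On real traffic counts, the same `coverage` ratio is the kind of figure a coverage number for a truncated vocabulary would report.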

	## What changed vs the full-vocab draft
	- `lm_head.weight`: `[163840, 7168]` -> `[32000, 7168]`
	- added `d2t` (draft-local id -> target global id, delta-encoded) so that vLLM
	  can scatter the 32K draft logits back into the 163,840-token target space
	- `embed_tokens` kept full (`[163840, 7168]`), so draft input lookups are unaffected
	- `config.json`: `draft_vocab_size: 32000` (was 163840)
	- coverage: the 32,000 kept tokens account for 0.9927 of the real draft-token distribution
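To make the `d2t` item concrete, here is a minimal pure-Python sketch of a delta-encoded map and the scatter step. It assumes "delta-encoded" means `d2t[i] = target_id - i`, so that `target_id = i + d2t[i]`; the helper names are hypothetical and this is not vLLM's implementation.

```python
def build_d2t(kept_target_ids):
    """Delta-encode the draft->target map: d2t[i] = target_id - i (assumed convention)."""
    ids = sorted(kept_target_ids)
    return [target_id - draft_id for draft_id, target_id in enumerate(ids)]


def draft_to_target(draft_id, d2t):
    """Recover the target-vocabulary id for a draft-local id."""
    return draft_id + d2t[draft_id]


def scatter_logits(draft_logits, d2t, target_vocab_size, fill=float("-inf")):
    """Place each draft logit at its target id; ids outside the draft vocab get `fill`."""
    full = [fill] * target_vocab_size
    for draft_id, logit in enumerate(draft_logits):
        full[draft_to_target(draft_id, d2t)] = logit
    return full


# Toy example: the draft keeps target ids 0, 1, 2 and 7 out of an 8-token vocabulary.
d2t = build_d2t([0, 1, 2, 7])
print(d2t)  # [0, 0, 0, 4]

full = scatter_logits([0.1, 2.5, -1.0, 0.7], d2t, target_vocab_size=8)
print(full.index(max(full)))  # 1
print(full[7])                # 0.7
```

Filling unmapped positions with negative infinity means the draft can never propose a token outside its truncated vocabulary, which is why coverage of the kept set matters for acceptance rate.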

	## Architecture
	Single-layer Eagle3 decoder built on DeepSeek-V2/V3 MLA attention
	(`Eagle3DeepseekV2ForCausalLM`), `hidden_size=7168`. Loads in vLLM via
	`--speculative-config '{"method":"eagle3","model":"<this repo>","num_speculative_tokens":3}'`.
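As a hedged illustration of how that flag might be assembled, the snippet below builds a `vllm serve` command line in Python. The target model id is taken from the `base_model` field of this card; `<this repo>` is left as the placeholder it is above, and deployment-specific flags (parallelism, memory, and so on) are deliberately omitted because the card does not specify them.

```python
import json
import shlex

TARGET_MODEL = "moonshotai/Kimi-K2.7-Code"  # base_model from the card metadata
DRAFT_MODEL = "<this repo>"                 # placeholder, kept as written in the card

spec_config = {
    "method": "eagle3",
    "model": DRAFT_MODEL,
    "num_speculative_tokens": 3,
}

# Compact JSON (no spaces) so the value matches the form shown in the card.
cmd = [
    "vllm", "serve", TARGET_MODEL,
    "--speculative-config", json.dumps(spec_config, separators=(",", ":")),
]

# shlex.join quotes the JSON argument so it survives a POSIX shell.
print(shlex.join(cmd))
```

Building the JSON with `json.dumps` rather than by hand avoids quoting mistakes when the config grows beyond these three keys.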

	## Notes
	Truncating to 32K shrinks the `lm_head` weight 5.12x (163,840 -> 32,000 rows) and
	reduces the `lm_head` GEMM ~4x in isolation. This is a clear win at batch=1 /
	on-device decoding, which is memory-bandwidth-bound. In high-concurrency EP+DP
	(expert-parallel + data-parallel) serving, e.g. c=128 concurrent requests, the
	end-to-end gain is small because the `lm_head` is not the bottleneck there.
	Output correctness is unaffected: the target model verifies every speculated token.
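The correctness claim can be illustrated with a toy greedy verification loop. This is a simplified sketch under stated assumptions: `target_next_token` is a hypothetical stand-in for a target-model forward pass, real engines verify all drafted positions in one batched pass, and sampling-based decoding uses a rejection-sampling rule rather than exact matching.

```python
def verify_greedy(draft_tokens, target_next_token, prefix):
    """Accept draft tokens while they match the target's greedy choice.

    Returns the tokens emitted this step: the accepted draft prefix plus one
    token chosen by the target (a correction on mismatch, else a bonus token).
    """
    seq = list(prefix)
    emitted = []
    for tok in draft_tokens:
        expected = target_next_token(seq)
        if tok != expected:
            emitted.append(expected)  # first mismatch: take the target's token
            return emitted
        emitted.append(tok)
        seq.append(tok)
    emitted.append(target_next_token(seq))  # every draft token accepted: bonus token
    return emitted


# Toy target model: always continues one fixed token sequence.
SENTENCE = ["def", "add", "(", "a", ",", "b", ")", ":"]


def toy_target(seq):
    return SENTENCE[len(seq)]


good = verify_greedy(["add", "(", "a"], toy_target, prefix=["def"])
bad = verify_greedy(["add", "[", "a"], toy_target, prefix=["def"])
print(good)  # ['add', '(', 'a', ',']
print(bad)   # ['add', '(']
```

In both cases the emitted tokens are exactly what the target would have produced on its own; a worse draft (for example, one that cannot propose a token outside its 32K vocabulary) only shortens the accepted run.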