---
license: apache-2.0
base_model: moonshotai/Kimi-K2.7-Code
tags:
- speculative-decoding
- eagle3
- mla
- draft-model
---

# Kimi-K2.7-Code Eagle3-MLA draft (32K-truncated vocab)

Eagle3-MLA speculative-decoding draft model for **Kimi-K2.7-Code**. The output
`lm_head` is truncated from the full 163,840-token vocabulary to the **top-32,000
highest-frequency tokens**, ranked by the token distribution of real K2.7-Code
serving traffic, plus the 256 special/template tokens, which are force-included.
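
The selection described above can be sketched roughly as follows. This is a minimal illustration, not the script used to build this checkpoint: the function name `select_draft_vocab`, its arguments, and the choice to count the force-included special tokens against the overall size budget are all assumptions.

```python
def select_draft_vocab(token_counts, special_ids, draft_vocab_size):
    """Pick the draft vocabulary and report its traffic coverage.

    token_counts:     dict mapping target token id -> observed count
                      (hypothetical input; stands in for serving-traffic stats)
    special_ids:      ids that must be kept regardless of frequency
    draft_vocab_size: total number of ids to keep (assumed to include specials)
    """
    kept = set(special_ids)
    if len(kept) > draft_vocab_size:
        raise ValueError("more special tokens than draft vocabulary slots")

    # Walk tokens from most to least frequent; ties are broken by id so the
    # result is deterministic.
    ranked = sorted(token_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for token_id, _count in ranked:
        if len(kept) >= draft_vocab_size:
            break
        kept.add(token_id)

    # Coverage: fraction of observed token occurrences that fall in `kept`.
    total = sum(token_counts.values())
    covered = sum(c for t, c in token_counts.items() if t in kept)
    coverage = covered / total if total else 0.0
    return sorted(kept), coverage
```

With a toy histogram such as `{0: 1, 1: 50, 2: 30, 3: 15, 4: 4}`, one special id `0`, and a budget of 3, the sketch keeps ids `0`, `1`, and `2`.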

## What changed vs the full-vocab draft
- `lm_head.weight`: `[163840, 7168]` -> `[32000, 7168]`
- added `d2t` (draft-local id -> target global id, delta-encoded) so vLLM scatters
  the 32K draft logits back into the 163,840 target space
- `embed_tokens` kept full (`[163840, 7168]`) — draft input lookups are unaffected
- `config.json`: `draft_vocab_size: 32000` (was 163840)
- token coverage of the real draft-token distribution: **0.9927**
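
To make the `d2t` mapping concrete, here is a small pure-Python sketch. It assumes "delta-encoded" means that each entry stores the offset between the target id and the draft-local id, so that `target_id = draft_id + d2t[draft_id]`. The helper names are hypothetical, and vLLM's real implementation operates on tensors rather than lists.

```python
import math


def build_d2t(kept_target_ids):
    """Delta-encode the mapping: d2t[i] = target_id - i.

    Draft-local ids are assumed to be assigned in ascending target-id order.
    """
    return [target_id - i for i, target_id in enumerate(sorted(kept_target_ids))]


def scatter_draft_logits(draft_logits, d2t, target_vocab_size):
    """Place each draft logit at its target id; all other slots get -inf."""
    out = [-math.inf] * target_vocab_size
    for draft_id, logit in enumerate(draft_logits):
        out[draft_id + d2t[draft_id]] = logit
    return out
```

For example, if the kept target ids are `[0, 3, 7]`, the deltas are `[0, 2, 5]`, and three draft logits land at positions 0, 3, and 7 of the target-sized vector.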

## Architecture
Single-layer Eagle3 decoder built on DeepSeek-V2/V3 MLA attention
(`Eagle3DeepseekV2ForCausalLM`), with `hidden_size=7168`. It loads in vLLM via
`--speculative-config '{"method":"eagle3","model":"<this repo>","num_speculative_tokens":3}'`.
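
If the flag is assembled programmatically, a small helper like the one below can avoid shell-quoting mistakes in the JSON payload. The function name `speculative_config_arg` is hypothetical; the keys and values are the ones shown above, and `<this repo>` remains a placeholder for the actual repository id.

```python
import json
import shlex


def speculative_config_arg(draft_model, num_speculative_tokens=3):
    """Build the --speculative-config argument as a shell-safe string."""
    config = {
        "method": "eagle3",
        "model": draft_model,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # Compact separators reproduce the spacing-free JSON used in this card;
    # shlex.quote wraps the payload in single quotes for POSIX shells.
    payload = json.dumps(config, separators=(",", ":"))
    return "--speculative-config " + shlex.quote(payload)
```

The returned string can be appended to whatever vLLM launch command is already in use.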

## Notes
The 32K truncation reduces the `lm_head` GEMM cost by ~4x in isolation, which is
a clear win at batch=1 and for on-device decoding, where the workload is
memory-bandwidth-bound. In high-concurrency EP+DP serving (e.g. c=128) the
end-to-end gain is small, because the `lm_head` is not the bottleneck there.
Output correctness is unaffected, because the target model verifies every
speculated token.
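
As a back-of-the-envelope check on the size of the change, the sketch below counts `lm_head` parameters before and after truncation using the shapes listed in this card. The 2-bytes-per-parameter figure is an assumption (bf16/fp16 weights); the card does not state the checkpoint dtype. Note that this is a weight-count ratio only, and a measured GEMM speedup such as the ~4x quoted above need not match it exactly.

```python
HIDDEN_SIZE = 7168
FULL_VOCAB = 163_840
DRAFT_VOCAB = 32_000
BYTES_PER_PARAM = 2  # assumption: 16-bit weights (bf16 or fp16)


def lm_head_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Number of weights in a [vocab_size, hidden_size] projection."""
    return vocab_size * hidden_size


full_params = lm_head_params(FULL_VOCAB)
draft_params = lm_head_params(DRAFT_VOCAB)
weight_ratio = full_params / draft_params

# Bytes read per decode step for the lm_head alone, under the dtype assumption.
full_bytes = full_params * BYTES_PER_PARAM
draft_bytes = draft_params * BYTES_PER_PARAM

print(f"full lm_head:  {full_params:,} params, {full_bytes / 2**30:.2f} GiB")
print(f"draft lm_head: {draft_params:,} params, {draft_bytes / 2**30:.2f} GiB")
print(f"weight ratio:  {weight_ratio:.2f}x")
```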