	---
	license: other
	base_model: moonshotai/Kimi-K2.6
	tags:
	- text-generation
	- speculative-decoding
	- eagle3
	- kimi-k2.6
	- mla
	- torchspec
	---

	# kimi-k2.6-eagle3-mla

	Eagle3 MTP draft model with MLA (Multi-head Latent Attention) for accelerating
	inference of [Kimi-K2.6](https://huggingface.co/moonshotai/Kimi-K2.6).

	This is a fine-tuned draft, anchored to the official
	[lightseekorg/kimi-k2.6-eagle3-mla](https://huggingface.co/lightseekorg/kimi-k2.6-eagle3-mla)
	initialization. It targets multi-hop (downstream-position) acceptance while
	preserving the first-hop gain, evaluated by runtime accept-length on a frozen
	full-context held-out set.

	## Fine-tune setup

	- Init: lightseekorg/kimi-k2.6-eagle3-mla (official MLA weights)
	- Objective: Eagle3 distillation + multi-step TTT supervision
	  (`ttt_steps=4`, `ttt_step_loss_decay=1.0`, off-policy downstream tokens)
	- Anti-over-specialization: L2-SP weight-space anchor toward the init
	  (penalize trainable-param drift; lambda=1e-4)
	- Optimizer: lr 2e-5, cosine schedule
	- Checkpoint: best by held-out validation loss on the K2.6 ruler
	  (step 95400; val_loss 5.490, the global minimum of the v3 run)
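
The loss structure in the list above can be sketched roughly as follows. This is a minimal illustration, not the training code: the function names `ttt_loss` and `l2sp_penalty` and all toy numbers are hypothetical. It assumes the per-step losses are summed with weight `ttt_step_loss_decay ** k` for step `k`, and that the L2-SP term is `lambda` times the squared distance from the init weights. Whether the actual trainer uses a sum or a mean, or a 1/2 factor on the penalty, is not stated here.

```python
# Hypothetical sketch of the loss structure; names and numbers are illustrative only.

def ttt_loss(step_losses, decay=1.0):
    """Combine per-step TTT losses, weighting step k by decay**k (assumed form)."""
    return sum((decay ** k) * loss for k, loss in enumerate(step_losses))


def l2sp_penalty(params, init_params, lam=1e-4):
    """L2-SP: penalize squared drift of trainable params away from the init weights."""
    return lam * sum((p - p0) ** 2 for p, p0 in zip(params, init_params))


# Toy example: 4 TTT steps (ttt_steps=4) with decay 1.0, so every step has equal weight.
step_losses = [1.0, 1.5, 2.0, 2.5]
params = [0.5, -1.0, 2.0]
init_params = [0.0, -1.0, 1.0]

total = ttt_loss(step_losses, decay=1.0) + l2sp_penalty(params, init_params, lam=1e-4)
```

With `ttt_step_loss_decay=1.0` the later (downstream) steps contribute as much as the first step, which matches the stated goal of improving multi-hop acceptance; a decay below 1 would shift weight back toward the first hop.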

	## Performance

	The primary metric is accept_length (`accept_len` in the table below): the
	average number of tokens accepted per speculation step with
	`num_speculative_tokens=3` (higher is better). Evaluated with
	vLLM 0.20.0 on 8x H200, TP=8, max-model-len 32768, greedy decoding.
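
As a rough sketch of how such a metric is computed from a decoding trace: the helper `accept_length` and the toy trace below are hypothetical. Whether the reported numbers count the extra token the target model emits at every step (the "bonus" token) is an assumption exposed here as a flag. With 3 speculative tokens the maximum is 4 if the bonus token is counted and 3 otherwise.

```python
def accept_length(accepted_per_step, include_bonus=True):
    """Mean tokens emitted per speculation step.

    accepted_per_step: number of draft tokens accepted at each step (0..k).
    include_bonus: also count the one token the target model emits per step
                   (an assumption about the counting convention).
    """
    if not accepted_per_step:
        raise ValueError("need at least one speculation step")
    bonus = 1 if include_bonus else 0
    return sum(a + bonus for a in accepted_per_step) / len(accepted_per_step)


# Toy trace with num_speculative_tokens=3: each entry is in 0..3 (made-up values).
trace = [3, 1, 0, 2, 3, 1]
with_bonus = accept_length(trace)
without_bonus = accept_length(trace, include_bonus=False)
```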

	On a frozen K2.6 full-context held-out judge set (914 prompts):

	| Model | accept_len |
	|-------|-----------:|
	| lightseek (official init) | 2.285 |
	| this model | 2.308 |

	This draft improves accept_len over the official init by 0.023 (about 1%) on
	the K2.6 held-out distribution.

	## Note on distribution shift

	This checkpoint is selected by validation loss on the K2.6 teacher
	distribution. In cross-version testing against real Kimi-K2.7-Code production
	traffic, the official lightseek init currently shows higher accept-length than
	this fine-tune; that is, the K2.6 fine-tune over-specializes to its training
	distribution. If your serving traffic differs substantially from long
	multi-turn K2.6 dialogues, benchmark both this draft and the lightseek init on
	your own traffic before choosing. (The L2-SP anchor above is intended to
	mitigate this; tuning it against real-traffic accept-length is ongoing.)
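
One simple way to act on the advice above is to measure accept-length for both drafts on the same sample of your own prompts and compare the means. The sketch below assumes you have already collected per-prompt accept lengths. The helper `pick_draft`, the dictionary keys, and every number in it are made up for illustration and are not measurements of either model.

```python
from statistics import mean


def pick_draft(accept_lens_by_draft):
    """Return (best_name, mean_by_name) from per-prompt accept lengths per draft."""
    means = {name: mean(vals) for name, vals in accept_lens_by_draft.items()}
    best = max(means, key=means.get)
    return best, means


# Hypothetical per-prompt accept lengths on the SAME prompts (not real results).
measurements = {
    "lightseek_init": [2.4, 2.1, 2.5, 2.2],
    "this_finetune": [2.3, 2.0, 2.6, 2.1],
}
best, means = pick_draft(measurements)
```

Because the gap between the two drafts on the held-out set is small, a sample large enough to separate them from run-to-run noise is worth the extra cost.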

	## Usage

	Serve with vLLM as the speculative draft for Kimi-K2.6, with
	`num_speculative_tokens=3` in the speculative-config.
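
A minimal sketch of building such a launch command is shown below. It only assembles and prints the command; it does not start a server. `<this-repo-id>` is a placeholder for this model's Hub repo ID, and the `"method": "eagle3"` key and the flag names are assumptions based on vLLM's documented `--speculative-config` JSON option, so check them against the vLLM version you run. The TP and context-length values mirror the evaluation setup above.

```python
import json
import shlex

DRAFT_REPO = "<this-repo-id>"  # placeholder: replace with this model's Hub repo ID

speculative_config = {
    "method": "eagle3",  # assumption: method name for Eagle3 drafts in vLLM
    "model": DRAFT_REPO,
    "num_speculative_tokens": 3,
}

cmd = [
    "vllm", "serve", "moonshotai/Kimi-K2.6",
    "--tensor-parallel-size", "8",
    "--max-model-len", "32768",
    "--speculative-config", json.dumps(speculative_config),
]

# Print a shell-quoted command line that can be copied into a terminal.
print(shlex.join(cmd))
```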