	# MiniMax-M2.7-L3H5-DFlash

	DFlash speculative-decoding drafter for [cyankiwi/MiniMax-M2.7-AWQ-4bit](https://huggingface.co/cyankiwi/MiniMax-M2.7-AWQ-4bit).

	> ⚠️ Highly experimental. Trained on a small (~2200-shard) on-policy corpus.
	> Eval m_accept ≈ 1.38 — useful for spec-decode infrastructure validation, **below
	> the break-even point on Strix Halo TP=4** (which needs roughly m_accept ≈ 3 to
	> match no-spec throughput). Inference will currently be slower than no-spec on
	> that hardware.
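
To make the break-even figure concrete, the sketch below uses a deliberately simple cost model. It assumes (this is not part of the measurements above) that each verify round emits `m_accept` drafted tokens plus one token from the target, and that a round costs a fixed multiple of a plain decode step. The function name and the `round_cost` parameter are hypothetical.

```python
def relative_throughput(m_accept: float, round_cost: float) -> float:
    """Throughput with speculation relative to plain decoding (1.0 = parity).

    Assumption: one round yields m_accept drafted tokens plus one target
    token, and costs `round_cost` plain decode steps (draft + verify).
    """
    return (1.0 + m_accept) / round_cost


# A break-even at m_accept ~= 3 implies, under this model, that one
# speculative round costs about 4 plain decode steps on that hardware.
BREAK_EVEN_M_ACCEPT = 3.0
round_cost = 1.0 + BREAK_EVEN_M_ACCEPT

print(f"{relative_throughput(1.38, round_cost):.3f}")  # prints 0.595
```

Under this model the drafter lands at roughly 0.6x of no-spec throughput. Real numbers depend on batch size, context length, and kernel overheads that the model ignores.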

	## Architecture

	- 3 drafter layers, `hidden_size=3072`, 0.38B params (drafter-only; embed + lm_head loaded from target at inference)
	- target taps: layers `[2, 16, 30, 43, 57]` of MiniMax-M2.7's 62-layer target
	- `block_size=16`
	- all `full_attention` (target uses no SWA)
	- `num_attention_heads=24`, `num_key_value_heads=8` (GQA), `head_dim=128`
	- `vocab_size=200064`, `mask_token_id=200063`
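
The numbers in this list can be cross-checked with a few lines of arithmetic. The dict below only restates the values above for illustration; it is not claimed to mirror the shipped `config.json`, and the keys `target_layer_taps` and `target_num_layers` are hypothetical names.

```python
# Values copied from the architecture list; key names are illustrative only.
arch = {
    "hidden_size": 3072,
    "num_attention_heads": 24,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "vocab_size": 200064,
    "mask_token_id": 200063,
    "block_size": 16,
    "target_layer_taps": [2, 16, 30, 43, 57],  # hypothetical key name
    "target_num_layers": 62,                   # hypothetical key name
}

# Query projection width: 24 heads * 128 dims matches hidden_size.
q_width = arch["num_attention_heads"] * arch["head_dim"]
# Key/value projection width under GQA: 8 heads * 128 dims.
kv_width = arch["num_key_value_heads"] * arch["head_dim"]
# Number of query heads sharing each key/value head.
gqa_group = arch["num_attention_heads"] // arch["num_key_value_heads"]
# Distance between consecutive tapped target layers.
taps = arch["target_layer_taps"]
tap_gaps = [b - a for a, b in zip(taps, taps[1:])]

print(q_width, kv_width, gqa_group, tap_gaps)
# prints 3072 1024 3 [14, 14, 13, 14]
```

The mask token occupies the last slot of the vocabulary (`vocab_size - 1`), and the taps sit 13 to 14 layers apart.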

	## Eval

	Greedy-verification proxy on a 246-shard held-out set (1400 blocks), drawn
	from the same on-policy corpus mix as training (agent-sessions + nemotron + codealpaca).

	| step | m_accept | k=1 | k=2 | k=3 | k=4 | val_loss |
	|---:|---:|---:|---:|---:|---:|---:|
	| 26000 | 1.38 | 63.8% | 37.3% | 19.1% | 8.7% | 4.95 |

	`m_accept` = mean leading run of greedy top-1 hits per block (max possible 15).
	The `k=N` columns are cumulative: the % of blocks where positions 1..N all hit top-1.
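
The two definitions above can be sketched as follows. This is a minimal illustration on made-up toy data, not the evaluation script used for the table; `leading_run` and `summarize` are hypothetical helper names.

```python
from typing import Dict, List, Sequence, Tuple


def leading_run(hits: Sequence[bool]) -> int:
    """Count consecutive top-1 hits from position 1 until the first miss."""
    run = 0
    for hit in hits:
        if not hit:
            break
        run += 1
    return run


def summarize(blocks: List[Sequence[bool]], max_k: int = 4) -> Tuple[float, Dict[int, float]]:
    """Return (m_accept, {k: fraction of blocks whose positions 1..k all hit})."""
    runs = [leading_run(block) for block in blocks]
    m_accept = sum(runs) / len(runs)
    cumulative = {
        k: sum(run >= k for run in runs) / len(runs)
        for k in range(1, max_k + 1)
    }
    return m_accept, cumulative


# Toy data: one row per block, one bool per drafted position (made up).
toy_blocks = [
    [True, True, False, True],    # leading run 2 (the late hit does not count)
    [False, True, True, True],    # leading run 0
    [True, False, False, False],  # leading run 1
    [True, True, True, True],     # leading run 4
]

m_accept, cumulative = summarize(toy_blocks)
print(m_accept, cumulative)
# prints 1.75 {1: 0.75, 2: 0.5, 3: 0.25, 4: 0.25}
```

Hits after the first miss never count, which is why `m_accept` can be much lower than the per-position top-1 accuracy.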

	## Use with vLLM

	```bash
	vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
	--tensor-parallel-size 4 \
	--speculative-config '{"method":"dflash","model":"MirecX/MiniMax-M2.7-L3H5-DFlash","num_speculative_tokens":4}'
	```
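
Once the server is up, it can be queried like any other vLLM deployment. The sketch below assumes the default OpenAI-compatible endpoint at `http://localhost:8000/v1` and that the served model name is the target's repo id; both change if `--port` or `--served-model-name` were passed. `build_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"      # assumed default; adjust to your server
MODEL = "cyankiwi/MiniMax-M2.7-AWQ-4bit"   # assumed served model name


def build_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /completions route."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding, matching the greedy eval proxy
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("def fibonacci(n):")` against a running server returns the continuation. Speculation is transparent to the client, so the same request works with or without `--speculative-config`.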

	`num_speculative_tokens=4` is a reasonable upper bound for this drafter: 1.5–2×
	an m_accept of 1.38 gives a depth of ≈ 2–3, and cumulative acceptance is already
	down to 8.7% at k=4. Larger values waste drafter compute on positions that
	rarely accept (k=8 is < 1%).
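
The diminishing return from extra depth can be read straight off the eval table. The expected number of accepted tokens with depth capped at `d` equals the sum of the cumulative acceptance rates for k = 1..d. The sketch below applies that identity to the step-26000 row, assuming the offline greedy proxy carries over to serving.

```python
# Cumulative acceptance from the eval table (step 26000), as fractions.
CUMULATIVE = {1: 0.638, 2: 0.373, 3: 0.191, 4: 0.087}


def expected_accepted(depth: int) -> float:
    """Expected leading run when at most `depth` drafted tokens are verified.

    Uses E[min(run, d)] = sum over k = 1..d of P(run >= k).
    """
    return sum(CUMULATIVE[k] for k in range(1, depth + 1))


for depth in range(1, 5):
    print(depth, round(expected_accepted(depth), 3))
# prints:
# 1 0.638
# 2 1.011
# 3 1.202
# 4 1.289
```

Depth 4 already captures about 1.29 of the 1.38 total, so positions 5 through 15 together contribute less than 0.1 accepted tokens per round.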

	## Training recipe (paper-faithful)

	- 2211 on-policy training shards (mixed agent_sessions + nemotron + codealpaca prompts; target = MiniMax-M2.7-AWQ-4bit), 246 held-out shards
	- 30000 optimizer steps (this checkpoint is from step 26000), batch_size=1, grad_accum=2 (effective bs=2)
	- `anchors_per_seq=6`, `loss_decay=0.85`, uncapped context window
	- `block_size=16`, `mask_token_id=200063`
	- frozen embed_tokens + lm_head (loaded from target's bf16 weights)
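
For readers unfamiliar with position-decayed losses, the sketch below shows one common form: geometric weights over the drafted positions of a block. Whether DFlash applies `loss_decay=0.85` exactly this way is an assumption here, not something confirmed by this card; `decay_weights` and `weighted_block_loss` are hypothetical names.

```python
from typing import List, Sequence


def decay_weights(num_positions: int, decay: float) -> List[float]:
    """Geometric weights decay**i for i = 0..num_positions-1, normalized to sum to 1."""
    raw = [decay ** i for i in range(num_positions)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_block_loss(per_position_loss: Sequence[float], decay: float) -> float:
    """Weighted mean of per-position losses; earlier positions count more."""
    weights = decay_weights(len(per_position_loss), decay)
    return sum(w * loss for w, loss in zip(weights, per_position_loss))


# block_size=16 leaves 15 drafted positions per block (see the m_accept note).
weights = decay_weights(15, 0.85)
print([round(w, 3) for w in weights[:3]])
```

Weighting early positions more heavily matches how acceptance works: a miss at position 1 discards everything after it, so errors there cost the most.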

	## Caveats

	- This is a relatively early checkpoint compared to z-lab's reference drafters
	(those use ~800K samples; this one uses ~2K). Expect substantial gains from
	more training data.
	- Evaluated only on held-out shards from the training distribution. Real-world
	prompts (long contexts, code, multi-turn) will likely show lower acceptance.
	- The 5-tap pattern targets layers spaced roughly uniformly across MiniMax-M2.7's
	60-layer body (taps at ~3%, 26%, 50%, 71%, 94%). This was confirmed against
	M2.5 and M2.7, which share an identical architecture (62 hidden layers, hidden=3072).

	## Companion variants

	- [MirecX/MiniMax-M2.7-L5H5-DFlash](https://huggingface.co/MirecX/MiniMax-M2.7-L5H5-DFlash) — 5-layer (0.60B), slightly higher m_accept at this data scale, ~35% slower per round
	- [MirecX/MiniMax-M2.7-L4H6-DFlash](https://huggingface.co/MirecX/MiniMax-M2.7-L4H6-DFlash) — 4-layer, 6 taps (untrained shell)

	Built using the [DFlash](https://github.com/z-lab/dflash) framework.