| # MiniMax-M2.7-L3H5-DFlash |
|
|
| DFlash speculative-decoding drafter for [cyankiwi/MiniMax-M2.7-AWQ-4bit](https://huggingface.co/cyankiwi/MiniMax-M2.7-AWQ-4bit). |
|
|
| > ⚠️ **Highly experimental.** Trained on a small (~2200-shard) on-policy corpus. |
| > Eval m_accept ≈ 1.38: useful for spec-decode infrastructure validation, but **below |
| > the break-even point on Strix Halo TP=4** (which needs roughly m_accept ≈ 3 to |
| > match no-spec throughput). Inference is currently slower than no-spec on |
| > that hardware. |
|
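The break-even figure quoted above can be turned into a rough throughput estimate. The sketch below rests on assumptions that this card does not state: that each speculative round emits `m_accept` drafted tokens plus one token from the target's own verification pass, and that a round costs a fixed multiple of a plain decode step. `round_cost` is a hypothetical parameter, back-solved from the quoted break-even of m_accept ≈ 3.

```python
def tokens_per_round(m_accept: float) -> float:
    # Assumption: accepted draft tokens plus one token produced by the target itself.
    return m_accept + 1.0


def relative_throughput(m_accept: float, round_cost: float) -> float:
    # round_cost (hypothetical): cost of one draft+verify round,
    # measured in units of one plain (no-spec) decode step.
    return tokens_per_round(m_accept) / round_cost


# Back-solve the round cost from the quoted break-even point (m_accept ≈ 3):
# at break-even, relative throughput is 1.0, so round_cost = tokens_per_round(3).
BREAK_EVEN_M_ACCEPT = 3.0
round_cost = tokens_per_round(BREAK_EVEN_M_ACCEPT)

# Plug in the measured eval acceptance for this drafter.
estimate = relative_throughput(1.38, round_cost)
print(f"estimated throughput vs no-spec: {estimate:.2f}x")
```

Under this simplified model the drafter would land at roughly 0.6× of no-spec throughput, which is consistent with the warning above. Real numbers depend on hardware and batch size, so treat this as an order-of-magnitude check only.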
|
| ## Architecture |
|
|
| - **3 drafter layers**, `hidden_size=3072`, **0.38B params** (drafter-only; embed + lm_head loaded from target at inference) |
| - target taps: layers `[2, 16, 30, 43, 57]` of MiniMax-M2.7's 62-layer target |
| - `block_size=16` |
| - all layers use `full_attention` (the target uses no sliding-window attention, SWA) |
| - `num_attention_heads=24`, `num_key_value_heads=8` (GQA), `head_dim=128` |
| - `vocab_size=200064`, `mask_token_id=200063` |
|
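The hyperparameters above can be sanity-checked against each other. The following is a minimal sketch; `DrafterConfig` and `summarize_config` are hypothetical names for illustration and are not part of the DFlash codebase or the shipped config file.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DrafterConfig:
    # Values copied from the Architecture list; the class itself is illustrative.
    num_layers: int = 3
    hidden_size: int = 3072
    num_attention_heads: int = 24
    num_key_value_heads: int = 8
    head_dim: int = 128
    block_size: int = 16
    vocab_size: int = 200064
    mask_token_id: int = 200063
    target_taps: Tuple[int, ...] = (2, 16, 30, 43, 57)
    target_num_layers: int = 62


def summarize_config(cfg: DrafterConfig) -> dict:
    # Query heads times head_dim should span the hidden size.
    assert cfg.num_attention_heads * cfg.head_dim == cfg.hidden_size
    # GQA requires the query heads to divide evenly across the KV heads.
    assert cfg.num_attention_heads % cfg.num_key_value_heads == 0
    # The mask token must be a valid vocabulary id.
    assert 0 <= cfg.mask_token_id < cfg.vocab_size
    # Taps must be strictly increasing and inside the target's layer range.
    taps = cfg.target_taps
    assert all(a < b for a, b in zip(taps, taps[1:]))
    assert all(0 <= t < cfg.target_num_layers for t in taps)
    return {
        "q_heads_per_kv_head": cfg.num_attention_heads // cfg.num_key_value_heads,
        "kv_width": cfg.num_key_value_heads * cfg.head_dim,
        "tap_gaps": [b - a for a, b in zip(taps, taps[1:])],
    }


summary = summarize_config(DrafterConfig())
print(summary)
```

The tap gaps come out nearly equal, which matches the description of taps spread across the target's depth.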
|
| ## Eval |
|
|
| Greedy-verification proxy on a 246-shard held-out set (1400 blocks), drawn |
| from the same on-policy corpus mix as training (agent-sessions + nemotron + codealpaca). |
|
|
| | step | m_accept | k=1 | k=2 | k=3 | k=4 | val_loss | |
| |---:|---:|---:|---:|---:|---:|---:| |
| | **26000** | **1.38** | **63.8%** | **37.3%** | **19.1%** | **8.7%** | 4.95 | |
|
|
| `m_accept` = mean length of the leading run of greedy top-1 hits per block (max possible 15). |
| The `k=N` columns are cumulative: the % of blocks where positions 1..N all hit top-1. |
|
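To make the two metric definitions concrete, here is a minimal sketch that computes them from per-block hit vectors. The helper names and the toy data are made up for illustration; they are not the evaluation script or the eval set behind the table.

```python
from typing import Dict, List, Sequence, Tuple


def leading_run(hits: Sequence[bool]) -> int:
    """Count consecutive top-1 hits before the first miss."""
    run = 0
    for hit in hits:
        if not hit:
            break
        run += 1
    return run


def summarize(blocks: List[Sequence[bool]], max_k: int = 4) -> Tuple[float, Dict[int, float]]:
    runs = [leading_run(block) for block in blocks]
    # m_accept: mean leading-run length over all blocks.
    m_accept = sum(runs) / len(runs)
    # k=N cumulative: share of blocks whose first N positions all hit.
    cumulative = {
        k: sum(run >= k for run in runs) / len(runs) for k in range(1, max_k + 1)
    }
    return m_accept, cumulative


# Toy hit vectors (hypothetical, 4 drafted positions per block).
toy_blocks = [
    [True, True, False, True],    # run of 2; the hit after the miss does not count
    [False, True, True, True],    # run of 0
    [True, False, False, False],  # run of 1
    [True, True, True, False],    # run of 3
]
m_accept, cumulative = summarize(toy_blocks)
print(m_accept, cumulative)
```

Note that a hit after the first miss contributes nothing, which is why `m_accept` can be low even when per-position top-1 accuracy looks reasonable.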
|
| ## Use with vLLM |
|
|
| ```bash |
| vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \ |
| --tensor-parallel-size 4 \ |
| --speculative-config '{"method":"dflash","model":"MirecX/MiniMax-M2.7-L3H5-DFlash","num_speculative_tokens":4}' |
| ``` |
|
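When sweeping `num_speculative_tokens`, hand-editing the inline JSON is error-prone. The sketch below builds the same command line as above from Python's standard library; `build_serve_command` is a hypothetical helper, and the flags and config keys are simply the ones shown in the command above.

```python
import json
import shlex


def build_serve_command(num_speculative_tokens: int = 4,
                        tensor_parallel_size: int = 4) -> str:
    # Same keys as the inline JSON in the shell command above.
    spec_config = {
        "method": "dflash",
        "model": "MirecX/MiniMax-M2.7-L3H5-DFlash",
        "num_speculative_tokens": num_speculative_tokens,
    }
    args = [
        "vllm", "serve", "cyankiwi/MiniMax-M2.7-AWQ-4bit",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--speculative-config", json.dumps(spec_config, separators=(",", ":")),
    ]
    # shlex.join quotes the JSON argument so the shell passes it through intact.
    return shlex.join(args)


command = build_serve_command(num_speculative_tokens=3)
print(command)
```

The returned string can be pasted into a shell or passed to a launcher script; `shlex.split` recovers the original argument list if needed.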
|
| `num_speculative_tokens=4` is a reasonable choice for this drafter: with m_accept of |
| 1.38, a speculative depth of 3–4 already captures most of the accepted tokens. Larger values waste |
| drafter compute on positions that rarely accept (k=4 acceptance is 8.7%, k=8 is |
| < 1%). |
| |
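The depth argument can be checked against the eval table. The expected leading-run length truncated at depth d equals the sum of the cumulative acceptance rates for k ≤ d, so the table directly gives the expected accepted tokens per round at each depth. The sketch below assumes the drafter's acceptance at a shallower depth matches the eval proxy, which is not verified here.

```python
# Cumulative acceptance from the eval table (step 26000):
# share of blocks where the first k drafted positions all hit top-1.
CUMULATIVE = {1: 0.638, 2: 0.373, 3: 0.191, 4: 0.087}
M_ACCEPT = 1.38  # measured over all 15 drafted positions


def expected_accepted(depth: int) -> float:
    # E[min(run, depth)] = sum over k <= depth of P(run >= k).
    return sum(p for k, p in CUMULATIVE.items() if k <= depth)


for depth in range(1, 5):
    expected = expected_accepted(depth)
    share = expected / M_ACCEPT
    marginal = CUMULATIVE[depth]
    print(f"depth={depth}: expected accepted={expected:.3f} "
          f"({share:.0%} of m_accept), marginal gain={marginal:.3f}")
```

At depth 4 the expected count is about 1.29 of the 1.38 measured over the full block, so positions 5 through 15 together add less than 0.1 tokens per round.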
| ## Training recipe (paper-faithful) |
| |
| - 2211 on-policy training shards (mixed agent_sessions + nemotron + codealpaca prompts; target = MiniMax-M2.7-AWQ-4bit), 246 held-out shards |
| - 30000 optimizer steps, batch_size=1, grad_accum=2 (effective bs=2) |
| - `anchors_per_seq=6`, `loss_decay=0.85`, uncapped context window |
| - `block_size=16`, `mask_token_id=200063` |
| - frozen embed_tokens + lm_head (loaded from target's bf16 weights) |
|
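A few quantities follow directly from the recipe above. The sketch below derives them; the per-position loss weighting is an assumption, since this card does not define how `loss_decay` is applied. It is modeled here as a geometric weight over the 15 predicted positions of a block, and `position_weights` is a hypothetical helper.

```python
RECIPE = {
    "train_shards": 2211,
    "heldout_shards": 246,
    "optimizer_steps": 30000,
    "batch_size": 1,
    "grad_accum": 2,
    "anchors_per_seq": 6,
    "loss_decay": 0.85,
    "block_size": 16,
}

# Effective batch size: micro-batch size times gradient accumulation steps.
effective_batch = RECIPE["batch_size"] * RECIPE["grad_accum"]

# Sequences consumed over the whole run.
sequences_seen = RECIPE["optimizer_steps"] * effective_batch

# Share of all shards held out for evaluation.
total_shards = RECIPE["train_shards"] + RECIPE["heldout_shards"]
heldout_fraction = RECIPE["heldout_shards"] / total_shards


def position_weights(decay: float, num_positions: int) -> list:
    # ASSUMPTION: position k in a block gets weight decay**k (k = 0 is the
    # first predicted position). The actual DFlash weighting may differ.
    return [decay ** k for k in range(num_positions)]


# block_size - 1 predicted positions, matching the "max possible 15" in Eval.
weights = position_weights(RECIPE["loss_decay"], RECIPE["block_size"] - 1)

print(effective_batch, sequences_seen, f"{heldout_fraction:.1%}")
print([round(w, 3) for w in weights[:4]])
```

If the geometric reading is right, later positions contribute much less to the loss than early ones, which fits a metric that only rewards leading runs.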
|
| ## Caveats |
|
|
| - This is a relatively early checkpoint compared to z-lab's reference drafters |
| (those use ~800K samples; we use ~2K). Expect substantial gains from |
| training on more data. |
| - Tested only on the calibration distribution. Real-world prompts (long |
| contexts, code, multi-turn) will likely show lower acceptance. |
| - The 5-tap pattern targets layers spaced roughly uniformly across MiniMax-M2.7's |
| 60-layer body (taps at ~3%, 26%, 50%, 71%, 94%); the pattern was confirmed against |
| M2.5/M2.7, which have an identical architecture (62 hidden layers, hidden=3072). |
|
|
| ## Companion variants |
|
|
| - [MirecX/MiniMax-M2.7-L5H5-DFlash](https://huggingface.co/MirecX/MiniMax-M2.7-L5H5-DFlash): 5-layer (0.60B), slightly higher m_accept at this data scale, ~35% slower per round |
| - [MirecX/MiniMax-M2.7-L4H6-DFlash](https://huggingface.co/MirecX/MiniMax-M2.7-L4H6-DFlash): 4-layer, 6 taps (untrained shell) |
| |
| Built using the [DFlash](https://github.com/z-lab/dflash) framework. |
| |