MiniMax-M2.7-L3H5-DFlash

DFlash speculative-decoding drafter for cyankiwi/MiniMax-M2.7-AWQ-4bit.

⚠️ Highly experimental. Trained on a small (~2200-shard) on-policy corpus. Eval m_accept ≈ 1.38: useful for validating spec-decode infrastructure, but below the break-even point on Strix Halo TP=4, which needs roughly m_accept ≈ 3 to match no-spec throughput. On that hardware, inference with this drafter is currently slower than without speculative decoding.

Architecture

  • 3 drafter layers, hidden_size=3072, 0.38B params (drafter-only; embed + lm_head loaded from target at inference)
  • target taps: layers [2, 16, 30, 43, 57] of MiniMax-M2.7's 62-layer target
  • block_size=16
  • all layers use full_attention (the target uses no sliding-window attention, SWA)
  • num_attention_heads=24, num_key_value_heads=8 (GQA), head_dim=128
  • vocab_size=200064, mask_token_id=200063
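The numbers above can be cross-checked with a few lines of Python. This is a minimal sketch: the `DrafterConfig` class and its field names are hypothetical and need not match the keys in the checkpoint's actual config file; only the values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrafterConfig:
    # Hypothetical container; values copied from the Architecture list.
    num_layers: int = 3
    hidden_size: int = 3072
    num_attention_heads: int = 24
    num_key_value_heads: int = 8
    head_dim: int = 128
    block_size: int = 16
    vocab_size: int = 200064
    mask_token_id: int = 200063
    target_layer_ids: tuple = (2, 16, 30, 43, 57)


cfg = DrafterConfig()

# Query projection width: 24 heads * 128 dims equals hidden_size.
q_width = cfg.num_attention_heads * cfg.head_dim
# Key/value projection width under GQA: 8 heads * 128 dims.
kv_width = cfg.num_key_value_heads * cfg.head_dim
# Number of query heads that share each KV head.
gqa_group = cfg.num_attention_heads // cfg.num_key_value_heads
# Gaps between consecutive tapped target layers.
tap_gaps = [b - a for a, b in zip(cfg.target_layer_ids, cfg.target_layer_ids[1:])]

print(q_width, kv_width, gqa_group, tap_gaps)
```

The checks show that the query width matches hidden_size, that three query heads share each KV head, that the mask token is the last vocabulary id, and that the taps sit 13 to 14 layers apart.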

Eval

Greedy-verification proxy on a 246-shard held-out set (1400 blocks), drawn from the same on-policy corpus mix as training (agent-sessions + nemotron + codealpaca).

| step  | m_accept | k=1   | k=2   | k=3   | k=4  | val_loss |
|-------|----------|-------|-------|-------|------|----------|
| 26000 | 1.38     | 63.8% | 37.3% | 19.1% | 8.7% | 4.95     |

m_accept = mean leading run of greedy top-1 hits per block (max possible 15). k=N cumulative = % of blocks where positions 1..N all hit top-1.
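The two metrics are linked by a standard identity for non-negative integer random variables: the mean equals the sum of the tail probabilities. Writing R for the leading run length in a block (a symbol introduced here for illustration), the reported figures can be checked against each other, up to rounding in the table:

```latex
\begin{align*}
m_{\text{accept}} = \mathbb{E}[R] &= \sum_{k=1}^{15} P(R \ge k) \\
\sum_{k=1}^{4} P(R \ge k) &= 0.638 + 0.373 + 0.191 + 0.087 = 1.289 \\
\sum_{k=5}^{15} P(R \ge k) &\approx 1.38 - 1.289 \approx 0.09
\end{align*}
```

Under this reading, positions 5 through 15 together contribute only about 0.09 accepted tokens per block on average.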

Use with vLLM

vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
  --tensor-parallel-size 4 \
  --speculative-config '{"method":"dflash","model":"MirecX/MiniMax-M2.7-L3H5-DFlash","num_speculative_tokens":4}'
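When launching from a script, the JSON passed to --speculative-config is easy to misquote by hand. The sketch below builds the same command in Python; the keys and values are copied from the command above, and whether a given vLLM build accepts the "dflash" method is taken from this card rather than verified here.

```python
import json
import shlex

# Same settings as the shell command above.
spec_config = {
    "method": "dflash",
    "model": "MirecX/MiniMax-M2.7-L3H5-DFlash",
    "num_speculative_tokens": 4,
}

argv = [
    "vllm", "serve", "cyankiwi/MiniMax-M2.7-AWQ-4bit",
    "--tensor-parallel-size", "4",
    "--speculative-config", json.dumps(spec_config, separators=(",", ":")),
]

# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

The `argv` list can also be passed directly to `subprocess.run`, which avoids shell quoting altogether.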

num_speculative_tokens=4 is a reasonable choice for this drafter. A rule-of-thumb depth of 1.5–2× m_accept gives about 2–3 for m_accept = 1.38, so 4 sits slightly above that range. Larger values spend drafter compute on positions that rarely accept (cumulative acceptance is 8.7% at k=4 and < 1% at k=8).
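The trade-off can be made concrete with the eval table. If drafting is capped at a given depth, the expected number of accepted draft tokens is the sum of the cumulative acceptance rates up to that depth. The sketch below applies this to the greedy-verification proxy figures from the Eval section; it does not model vLLM's runtime acceptance or any timing, and the function name is made up for illustration.

```python
# Cumulative acceptance from the Eval table: share of blocks where
# positions 1..k all match the target's greedy top-1 token.
cumulative_accept = {1: 0.638, 2: 0.373, 3: 0.191, 4: 0.087}
m_accept = 1.38  # uncapped mean leading run reported in the Eval table


def expected_accepted(depth: int) -> float:
    """Expected accepted draft tokens when drafting stops at `depth`."""
    return sum(p for k, p in cumulative_accept.items() if k <= depth)


for depth in range(1, 5):
    e = expected_accepted(depth)
    print(f"depth={depth}: {e:.3f} accepted on average "
          f"({e / m_accept:.0%} of m_accept)")
```

By this estimate a depth of 4 already captures most of the uncapped m_accept, and each additional position adds less than the one before it.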

Training recipe (paper-faithful)

  • 2211 on-policy training shards (mixed agent_sessions + nemotron + codealpaca prompts; target = MiniMax-M2.7-AWQ-4bit), 246 held-out shards
  • 30000 optimizer steps, batch_size=1, grad_accum=2 (effective bs=2)
  • anchors_per_seq=6, loss_decay=0.85, uncapped context window
  • block_size=16, mask_token_id=200063
  • frozen embed_tokens + lm_head (loaded from target's bf16 weights)
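Two of the recipe numbers can be illustrated in code. The effective batch size follows directly from the list above. How loss_decay enters the loss is not stated in this card, so the geometric per-position weighting below is an assumption, and `position_weights` is a hypothetical helper rather than part of the DFlash codebase.

```python
batch_size = 1
grad_accum = 2
# Sequences contributing to each optimizer step.
effective_batch = batch_size * grad_accum


def position_weights(block_size: int = 16, loss_decay: float = 0.85) -> list:
    """ASSUMPTION: position i in a block gets weight loss_decay ** i.

    A block of 16 is treated as 15 predicted positions, matching the
    "max possible 15" noted for m_accept.
    """
    return [loss_decay ** i for i in range(block_size - 1)]


weights = position_weights()
print(effective_batch, len(weights), round(weights[-1], 3))
```

Under that assumption the earliest positions dominate the loss, and these are also the positions that matter most for the leading-run acceptance metric.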

Caveats

  • This is a relatively early checkpoint compared to z-lab's reference drafters (those use ~800K samples; we use ~2K). Expect substantial gains from training on more data.
  • Tested only on the calibration distribution. Real-world prompts (long contexts, code, multi-turn) will likely show lower acceptance.
  • The 5-tap pattern targets layers spaced roughly uniformly across MiniMax-M2.7's 60-layer body (taps at ~3%, 26%, 50%, 71%, 94% of depth). The layer indices were checked against the architecture, which is identical for M2.5 and M2.7 (62 hidden layers, hidden=3072).

Companion variants

Built using the DFlash framework.