Qwen3-4B DFlash — Training Journey: M1 → M1.5 → M1.6 → M1.9

What we trained, on what data, with which recipe, and how the acceptance score moved at each step.

All scores are pooled acceptance τ = E[n] + 1, where n is the number of draft tokens accepted in a verify cycle and the mean is taken over all verify cycles pooled together. Everything is measured on the same harness (scripts/eval_draft_8gpu_qwen3.sh: 8 SGLang servers, triton backend, target Qwen/Qwen3-4B, thinking-OFF, T=0, FULL prompts, block size B = 16). Higher τ means more tokens accepted per verify cycle (ceiling = B + 1 = 17).
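As a rough illustration of what "pooled" means here, the sketch below assumes each verify cycle logs how many draft tokens were accepted. The log layout and the function names are hypothetical, not the harness's actual format. Pooling divides total accepted tokens by total cycles, which differs from averaging per-prompt means.

```python
from typing import List, Sequence


def pooled_tau(accepted_per_prompt: Sequence[Sequence[int]]) -> float:
    """Pooled tau: total accepted draft tokens / total verify cycles, plus 1."""
    total_accepted = sum(sum(cycles) for cycles in accepted_per_prompt)
    total_cycles = sum(len(cycles) for cycles in accepted_per_prompt)
    if total_cycles == 0:
        raise ValueError("no verify cycles recorded")
    return total_accepted / total_cycles + 1.0


def per_prompt_mean_tau(accepted_per_prompt: Sequence[Sequence[int]]) -> float:
    """Alternative (NOT the metric used in this card): mean of per-prompt taus."""
    taus: List[float] = [
        sum(cycles) / len(cycles) + 1.0 for cycles in accepted_per_prompt if cycles
    ]
    return sum(taus) / len(taus)


# Hypothetical logs: one inner list per prompt, one entry per verify cycle.
logs = [[4, 6, 5, 5], [0, 2]]
print(pooled_tau(logs))           # weights every verify cycle equally
print(per_prompt_mean_tau(logs))  # weights every prompt equally
```

Under pooling, prompts that need many verify cycles contribute more to the score than short ones, so the two numbers generally differ.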

The reference bar is z-lab M0 = the published z-lab/Qwen3-4B-DFlash-b16 draft measured on our harness (so it is apples-to-apples; the only thing that differs from our runs is the draft weights).

TL;DR — score progression (pooled τ)

run	gsm8k	math500	humaneval	mbpp	mt-bench	avg	Δ avg
M1 (baseline)	5.86	6.60	5.02	4.69	2.76	4.99	—
M1.5	6.10	6.70	5.28	4.90	2.81	5.16	+0.17
M1.6 ep2 (best)	6.19	6.90	5.87	5.20	3.00	5.43	+0.27
M1.9 ep3 (best)	6.27	7.18	5.65	5.31	2.83	5.45	+0.02
z-lab M0 (target bar)	6.27	8.24	6.72	5.61	3.43	6.05	+0.60 ahead

Net: +0.46 avg from M1 → M1.9 (4.99 → 5.45), closing ~43% of the original 1.07-point gap to z-lab. The gsm8k gap is fully closed (6.27 = 6.27). For M1.9 ep3, the remaining gap is concentrated in math500 (−1.06) and humaneval (−1.07).
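The gap-closure figure can be re-derived from the table. The sketch below recomputes the averages from the per-benchmark scores (variable names are illustrative). Note that the 1.07-point gap comes from the unrounded averages; the rounded ones would give 6.05 − 4.99 = 1.06.

```python
# Per-benchmark pooled tau: gsm8k, math500, humaneval, mbpp, mt-bench.
SCORES = {
    "M1": [5.86, 6.60, 5.02, 4.69, 2.76],
    "M1.9 ep3": [6.27, 7.18, 5.65, 5.31, 2.83],
    "z-lab M0": [6.27, 8.24, 6.72, 5.61, 3.43],
}


def avg(values):
    return sum(values) / len(values)


start = avg(SCORES["M1"])
now = avg(SCORES["M1.9 ep3"])
bar = avg(SCORES["z-lab M0"])

gain = now - start            # improvement from M1 to M1.9 ep3
gap = bar - start             # original distance from M1 to z-lab M0
fraction_closed = gain / gap

print(f"gain={gain:.2f} gap={gap:.2f} closed={fraction_closed:.0%}")
# prints: gain=0.46 gap=1.07 closed=43%
```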

Stage detail

M1 — baseline (fresh, paper-faithful)

field	value
Init	from scratch (fresh draft)
Data	888,784 samples — the v10-balanced Gemma prompt set (891,503 prompts), a broad mix (NuminaMath, OpenMathReasoning, OpenMathInstruct-2, Nemotron-v2 math/code, open_code_instruct, evol_codealpaca, chat)
Labels	regenerated with Qwen/Qwen3-4B, greedy T=0, thinking-OFF, seqlen cap 3072
Loss	soft-label KD (pure forward-KL over full target vocab, α=0), decay-γ=7
Schedule	6 epochs, AdamW, lr 6e-4, cosine, warmup 0.04, grad-clip 1.0
Batch	GBS 64 (8/device × 8 GPU), 512 blocks/seq, ~83K steps total
Eval avg	4.99
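The step count and learning-rate schedule in the table can be sketched as follows. This is an illustration under assumptions not stated in the card: linear warmup, cosine decay to zero, and a final partial batch that still counts as a step. The function names are hypothetical.

```python
import math


def steps_per_epoch(num_samples: int, global_batch_size: int) -> int:
    # Assumption: the last partial batch is kept, hence the ceiling.
    return math.ceil(num_samples / global_batch_size)


def lr_at(step: int, total_steps: int, peak_lr: float = 6e-4,
          warmup_frac: float = 0.04) -> float:
    """Linear warmup to peak_lr, then cosine decay to 0 (assumed floor)."""
    warmup_steps = int(total_steps * warmup_frac)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


per_epoch = steps_per_epoch(888_784, 64)   # 8 per device x 8 GPUs = 64
total_steps = per_epoch * 6                # 6 epochs
print(per_epoch, total_steps)              # 13888 83328, i.e. ~83K steps

for s in (0, 3_333, 41_664, total_steps):
    print(s, f"{lr_at(s, total_steps):.2e}")
```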

M1 already matched z-lab's scale (≈800K samples), epoch count (6), and most hyperparameters, yet it landed 1.07 avg below z-lab. The shape of the gap (small on gsm8k, large on code and math) points to data composition rather than scale: the v10 mix is a Gemma-era curation, not z-lab's clean Nemotron-PTD-v2 + CodeAlpaca sources. Every later run warm-starts from M1, directly or through an intermediate checkpoint, so all of them inherit this base composition.

M1.5 — first continuation (unseen leftover, switch to EAL)

field	value
Init	warm-start from M1 (`qwen3-4b-m1-kd-b16-g7-rope1e6/final_checkpoint.pt`)
Data	374,110 samples — math 149,672 / code 149,803 / chat 74,617
Sources	prompts M1 never saw (hash-subtracted, eval-decontam'd): math = OpenMathInstruct-2 + Nemotron-v2 math + OpenMathReasoning; code = open_code_instruct + evol_codealpaca + ~30K function-completion (MBPP-train 374 + CodeSearchNet, HumanEval-shaped); chat = nem_ifc_chat (mt-bench anchor)
Loss	EAL (Expected-Acceptance-Length; uniform KL position weight)
Schedule	3 epochs (ckpts step5846 / 11692 / final)
Eval avg	5.16 (+0.17 vs M1)

First switch from KD → EAL loss, plus fresh unseen data. Lifted every benchmark a little; humaneval/mbpp still lagged → motivated M1.6's code focus.
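The "hash-subtracted, eval-decontam'd" filtering can be pictured roughly as below. This is a minimal sketch, assuming exact-match hashing after whitespace and case normalization; the actual normalization and decontamination rules are not given in this card, and all names here are hypothetical.

```python
import hashlib
import re
from typing import Iterable, List


def prompt_key(prompt: str) -> str:
    # Assumed normalization: collapse whitespace, strip, lowercase.
    normalized = re.sub(r"\s+", " ", prompt).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def subtract_seen(candidates: Iterable[str],
                  seen_prompts: Iterable[str],
                  eval_prompts: Iterable[str]) -> List[str]:
    """Keep candidates whose hash is in neither the training nor the eval set."""
    blocked = {prompt_key(p) for p in seen_prompts}
    blocked |= {prompt_key(p) for p in eval_prompts}
    kept, emitted = [], set()
    for prompt in candidates:
        key = prompt_key(prompt)
        if key in blocked or key in emitted:
            continue  # already trained on, an eval prompt, or a duplicate
        emitted.add(key)
        kept.append(prompt)
    return kept


fresh = subtract_seen(
    candidates=["What is 2+2?", "what is  2+2?", "New prompt", "Eval q"],
    seen_prompts=["What is 2+2?"],
    eval_prompts=["eval q"],
)
print(fresh)  # ['New prompt']
```

Exact hashing only catches prompts that are identical after normalization; near-duplicates would need a fuzzier check such as n-gram overlap.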

M1.6 — code-completion + survival-EAL (biggest single jump)

field	value
Init	warm-start from M1.5
Data	224,024 samples — math 99,254 / code 99,921 / chat 24,840
Sources	math = NuminaMath-CoT HARD 100K (olympiads/AMC/AoPS + the `math` Hendrycks slice; decontam'd vs math500); code = function-completion 120K→~100K (`jinaai/code_exercises` + `bigcode/self-oss-instruct`, exact HumanEval/MBPP shape); chat retention
Loss	survival-EAL (NEW: KL position weights = survival-based, so deeper block positions are weighted by their probability of still being "alive")
Schedule	3 epochs (ckpts step3501 / 7002 / final). Best = ep2 (ep3 over-trains, esp. humaneval)
Eval avg	5.43 (+0.27 vs M1.5) — largest step

Two levers together: (1) function-completion code (matching HumanEval/MBPP format) → humaneval 5.28→5.87, mbpp 4.90→5.20; (2) survival-EAL → uniform lift everywhere. qwen3_m16_ep2 is the promoted checkpoint and the warm-start base for M1.7/M1.8/M1.9.
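One way to read "survival-based" position weights is sketched below. It assumes a per-position acceptance probability is available and that a position is "alive" only if every earlier position in the block was accepted. How the training code estimates these probabilities, and whether it normalizes the weights, is not stated here, so treat both as assumptions.

```python
from typing import List, Sequence


def survival_weights(accept_probs: Sequence[float]) -> List[float]:
    """weights[k] = probability that all positions before k were accepted."""
    weights = []
    alive = 1.0
    for p in accept_probs:
        weights.append(alive)  # position k is reached with probability `alive`
        alive *= p             # it must itself be accepted to reach k + 1
    return weights


def normalize(weights: Sequence[float]) -> List[float]:
    # Optional (assumed): rescale so the weights sum to 1.
    total = sum(weights)
    return [w / total for w in weights]


accept = [0.9, 0.8, 0.5, 0.5]          # hypothetical per-position acceptance
print(survival_weights(accept))         # roughly [1.0, 0.9, 0.72, 0.36]
print(normalize(survival_weights(accept)))
```

Compared with uniform weighting, deep positions that are rarely reached contribute less to the loss.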

M1.9 — math-spine (unique hard-math volume)

field	value
Init	warm-start from M1.6 ep2 (`checkpoint_step7002.pt`)
Data	223,846 samples — math 163,846 / code 30,000 / chat 30,000
Sources	math = the 143K unused NuminaMath-CoT HARD (never trained in M1/M1.5/M1.6) + 22K synthetic_math; code/chat = anchors reused from M1.8 (instruction-style)
Labels	Qwen3-4B greedy thinking-OFF, max_new_tokens 3072 (~22% length-capped)
Loss	survival-EAL (same as M1.6)
Schedule	3 epochs (ckpts step3498 / 6996 / final). Best math500 = ep3
Eval ep3	5.45 avg; math500 7.18 (new best); gsm8k 6.27 (= z-lab)
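The checkpoint step numbers listed for the three continuation runs are consistent with one checkpoint per epoch at a global batch size of 64. The sketch below checks this; that GBS 64 carried over from M1 is an assumption, since the card only states it for M1.

```python
import math

GBS = 64  # assumption: same global batch size as M1

# run -> (sample count, checkpoint steps listed before the final checkpoint)
RUNS = {
    "M1.5": (374_110, [5846, 11692]),
    "M1.6": (224_024, [3501, 7002]),
    "M1.9": (223_846, [3498, 6996]),
}


def epoch_boundaries(num_samples: int, gbs: int, epochs: int):
    """Step index at the end of every epoch except the last."""
    per_epoch = math.ceil(num_samples / gbs)
    return [per_epoch * e for e in range(1, epochs)]


results = {}
for name, (num_samples, listed) in RUNS.items():
    predicted = epoch_boundaries(num_samples, GBS, epochs=3)
    results[name] = predicted == listed
    print(name, predicted, "matches" if results[name] else "differs")
```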

The diagnosis driving M1.9: a label audit confirmed our math labels already match z-lab's verbose long-LaTeX style (mean ~1,716 tok, ~85 LaTeX markers, 78.8% \boxed), so the math500 gap is unique-hard-math coverage, not label quality. Result: math500 monotonic 6.90 → 7.18, gsm8k closed — but two small regressions from thinning the anchors: mt-bench 3.00→2.83 (chat anchor only 30K) and humaneval 5.87→5.65 (instruction-style code instead of M1.6's function-completion).
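A label audit of this kind could look roughly like the sketch below. It is not the audit script used for M1.9: whitespace splitting stands in for the real tokenizer, and the definition of a "LaTeX marker" (a backslash command or a `$`) is an assumption.

```python
import re
from typing import Dict, Sequence

# Assumed definition of a LaTeX marker: a backslash command or a dollar sign.
LATEX_MARKER = re.compile(r"\\[a-zA-Z]+|\$")


def audit_labels(labels: Sequence[str]) -> Dict[str, float]:
    n = len(labels)
    if n == 0:
        raise ValueError("no labels to audit")
    # Whitespace tokens are a crude stand-in for model tokenizer counts.
    token_counts = [len(text.split()) for text in labels]
    marker_counts = [len(LATEX_MARKER.findall(text)) for text in labels]
    boxed = sum(1 for text in labels if "\\boxed" in text)
    return {
        "mean_tokens": sum(token_counts) / n,
        "mean_latex_markers": sum(marker_counts) / n,
        "boxed_rate": boxed / n,
    }


sample = [r"So $x = 2$. \boxed{2}", "The answer is four."]
stats = audit_labels(sample)
print(stats)
```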

Side experiments off M1.6 ep2 (not on the main line): M1.7 (frequency-capped prompt repetition) and M1.8 (raw nem_v2 diversity expansion) each moved math500 only ~+0.14 (to ~7.04) and did not beat M1.6's avg. The diminishing returns across repetition → diversity → volume (each ~+0.14) are the signature of corpus saturation on this NuminaMath-based math source.

How the loss function evolved

stage	loss	idea
M1	soft-label KD	match the target's full per-position distribution (forward-KL)
M1.5	EAL	directly maximize Expected Acceptance Length (reward-shaped); uniform KL weighting across block positions
M1.6 → M1.9	survival-EAL	EAL + KL position weights scaled by each position's survival probability (chance the block is still accepted at that depth) — focuses capacity on positions that actually get verified

EAL is implemented as a loss that is the negative of the acceptance reward, so minimizing the loss maximizes expected acceptance length. Survival weighting was the M1.6 refinement and, together with the function-completion code data, coincided with the largest single avg jump.
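The ideas in the table can be sketched in a few lines. This is an illustration, not the training code. It assumes that decay-γ means a weight of exp(−k/γ) at block position k, and that expected acceptance length is the sum over positions of the probability that every position up to that point is accepted. How the KL term and the acceptance term are combined in the real loss is not stated in this card.

```python
import math
from typing import List, Sequence


def forward_kl(target: Sequence[float], draft: Sequence[float],
               eps: float = 1e-12) -> float:
    """KL(target || draft) over the vocabulary at one position."""
    return sum(p * (math.log(p + eps) - math.log(q + eps))
               for p, q in zip(target, draft) if p > 0)


def decay_weights(block_size: int, gamma: float = 7.0) -> List[float]:
    # Assumed form of decay-gamma: exp(-k / gamma) for k = 0 .. block_size - 1.
    return [math.exp(-k / gamma) for k in range(block_size)]


def weighted_kd_loss(target_block, draft_block, weights) -> float:
    """Position-weighted mean of per-position forward KL."""
    kls = [forward_kl(p, q) for p, q in zip(target_block, draft_block)]
    return sum(w * kl for w, kl in zip(weights, kls)) / sum(weights)


def expected_acceptance_length(accept_probs: Sequence[float]) -> float:
    """E[n] = sum over k of P(positions 1..k are all accepted)."""
    total, alive = 0.0, 1.0
    for p in accept_probs:
        alive *= p
        total += alive
    return total


def eal_loss(accept_probs: Sequence[float]) -> float:
    # Sign flip: minimizing this maximizes expected acceptance length.
    return -expected_acceptance_length(accept_probs)


target = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]   # toy 3-token vocab, 2 positions
draft = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
print(weighted_kd_loss(target, draft, decay_weights(2)))
print(eal_loss([0.5, 0.5]))                    # -(0.5 + 0.25) = -0.75
print(expected_acceptance_length([1.0] * 16))  # 16.0, so tau ceiling = 17
```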

Where the remaining gap is (vs z-lab M0 = 6.05 avg)

benchmark	best of ours	z-lab M0	gap
gsm8k	6.27	6.27	0.00 ✅
math500	7.18	8.24	−1.06
humaneval	5.87	6.72	−0.85
mbpp	5.35	5.61	−0.26
mt-bench	3.00	3.43	−0.43

Key conclusion: scale and epochs were already matched at M1 (888K samples / 6 epochs), so the residual gap is unlikely to be a volume problem. The evidence points to data composition: M1 (and thus every warm-started descendant) was built on the v10-balanced mix, not z-lab's exact Nemotron-PTD-v2 + CodeAlpaca sources. The cleanest untested experiment is a fresh full-scale run on z-lab's exact source composition, not more fine-tuning on the inherited v10 base.

Last updated: 2026-06-14.
