Qwen3-4B DFlash Training Journey: M1 → M1.5 → M1.6 → M1.9
What we trained, on what data, with which recipe, and how the acceptance score moved at each step.
All scores are pooled acceptance τ = E[n]+1 (pooled over all verify cycles),
measured on the same harness (scripts/eval_draft_8gpu_qwen3.sh: 8 SGLang
servers, triton backend, target Qwen/Qwen3-4B, thinking-OFF, T=0, FULL
prompts, B16). Higher = more tokens accepted per verify cycle (ceiling = B+1 = 17).
The reference bar is z-lab M0 = the published z-lab/Qwen3-4B-DFlash-b16
draft measured on our harness (so it is apples-to-apples; the only thing that
differs from our runs is the draft weights).
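
As a rough illustration of what "pooled" means here, the sketch below contrasts pooling over all verify cycles with averaging per prompt first. It assumes n is the number of draft tokens accepted in one verify cycle and that the +1 is the token the target contributes itself; the function names and the toy logs are hypothetical, not taken from the eval harness.

```python
def pooled_tau(accepted_per_cycle_by_prompt):
    """Pooled tau: mean accepted draft tokens over ALL verify cycles, plus 1."""
    total_accepted = 0
    total_cycles = 0
    for cycles in accepted_per_cycle_by_prompt:
        total_accepted += sum(cycles)
        total_cycles += len(cycles)
    return total_accepted / total_cycles + 1.0


def per_prompt_mean_tau(accepted_per_cycle_by_prompt):
    """Alternative for contrast: compute tau per prompt, then average prompts."""
    taus = [sum(c) / len(c) + 1.0 for c in accepted_per_cycle_by_prompt]
    return sum(taus) / len(taus)


# Hypothetical logs: one list per prompt, one entry per verify cycle.
logs = [[4, 6, 5, 5], [0, 2]]
print(round(pooled_tau(logs), 2))   # 4.67
print(per_prompt_mean_tau(logs))    # 4.0
```

Pooling weights every verify cycle equally, so long generations count for more than short ones; the per-prompt mean would weight every prompt equally instead.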
TL;DR: score progression (pooled τ)
| run | gsm8k | math500 | humaneval | mbpp | mt-bench | avg | Δ avg |
|---|---|---|---|---|---|---|---|
| M1 (baseline) | 5.86 | 6.60 | 5.02 | 4.69 | 2.76 | 4.99 | - |
| M1.5 | 6.10 | 6.70 | 5.28 | 4.90 | 2.81 | 5.16 | +0.17 |
| M1.6 ep2 (best) | 6.19 | 6.90 | 5.87 | 5.20 | 3.00 | 5.43 | +0.27 |
| M1.9 ep3 (best) | 6.27 | 7.18 | 5.65 | 5.31 | 2.83 | 5.45 | +0.02 |
| z-lab M0 (target bar) | 6.27 | 8.24 | 6.72 | 5.61 | 3.43 | 6.05 | +0.60 ahead |
Net: +0.46 avg from M1 → M1.9 (4.99 → 5.45), closing ~43% of the original 1.07-point gap to z-lab. gsm8k is now fully closed (6.27 = 6.27). For M1.9 ep3, the remaining gap is concentrated on math500 (−1.06) and humaneval (−1.07).
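
The avg column and the gap-closure figure can be re-derived from the per-benchmark scores in the table. The snippet below is only an arithmetic check on the published numbers; the variable and function names are illustrative.

```python
BENCHMARKS = ["gsm8k", "math500", "humaneval", "mbpp", "mt-bench"]
SCORES = {
    "M1": [5.86, 6.60, 5.02, 4.69, 2.76],
    "M1.5": [6.10, 6.70, 5.28, 4.90, 2.81],
    "M1.6 ep2": [6.19, 6.90, 5.87, 5.20, 3.00],
    "M1.9 ep3": [6.27, 7.18, 5.65, 5.31, 2.83],
    "z-lab M0": [6.27, 8.24, 6.72, 5.61, 3.43],
}


def avg(run):
    """Unweighted mean over the five benchmarks."""
    return sum(SCORES[run]) / len(SCORES[run])


gain = avg("M1.9 ep3") - avg("M1")          # progress made so far
original_gap = avg("z-lab M0") - avg("M1")  # gap at the M1 baseline
closed = gain / original_gap                # fraction of that gap closed

# Per-benchmark gap of M1.9 ep3 to the reference draft.
gaps = {
    b: ours - ref
    for b, ours, ref in zip(BENCHMARKS, SCORES["M1.9 ep3"], SCORES["z-lab M0"])
}

for run in SCORES:
    print(f"{run:10s} avg = {avg(run):.2f}")
print(f"gain = {gain:.2f}, original gap = {original_gap:.2f}, closed = {closed:.0%}")
print({b: round(g, 2) for b, g in gaps.items()})
```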
Stage detail
M1: baseline (fresh, paper-faithful)
| field | value |
|---|---|
| Init | from scratch (fresh draft) |
| Data | 888,784 samples from the v10-balanced Gemma prompt set (891,503 prompts), a broad mix (NuminaMath, OpenMathReasoning, OpenMathInstruct-2, Nemotron-v2 math/code, open_code_instruct, evol_codealpaca, chat) |
| Labels | regenerated with Qwen/Qwen3-4B, greedy T=0, thinking-OFF, seqlen cap 3072 |
| Loss | soft-label KD (pure forward-KL over full target vocab, α=0), decay-γ=7 |
| Schedule | 6 epochs, AdamW, lr 6e-4, cosine, warmup 0.04, grad-clip 1.0 |
| Batch | GBS 64 (8/device × 8 GPU), 512 blocks/seq, ~83K steps total |
| Eval avg | 4.99 |
This run already matched z-lab's scale (≈800K samples), epoch count (6), and most hyperparameters, yet landed 1.07 below z-lab on the average. The gap shape (small on gsm8k, large on code/math) points at data composition, not scale: the v10 mix is a Gemma-era curation, not z-lab's clean Nemotron-PTD-v2 + CodeAlpaca sources. Every later run warm-starts off M1, so all inherit this base composition.
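
For readers unfamiliar with the Loss row, here is a minimal pure-Python sketch of a position-decayed forward-KL distillation loss over one draft block. The exponential weight exp(-k/γ) is an assumed reading of "decay-γ=7" (the table gives only the value), the hard-label term that α would mix in is omitted to mirror α=0, and all function names are illustrative rather than the training code's.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def forward_kl(p_target, q_draft, eps=1e-12):
    """KL(p || q): the log-ratio is weighted by the target distribution p."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(p_target, q_draft)
        if p > 0.0
    )


def decayed_kd_loss(target_logits, draft_logits, gamma=7.0):
    """Weighted mean of per-position forward KL over one block.

    ASSUMPTION: the weight of 0-based block position k is exp(-k / gamma).
    """
    weights = [math.exp(-k / gamma) for k in range(len(target_logits))]
    kls = [
        forward_kl(softmax(t), softmax(d))
        for t, d in zip(target_logits, draft_logits)
    ]
    return sum(w * kl for w, kl in zip(weights, kls)) / sum(weights)


# Toy block: 4 positions, vocabulary of 3 tokens.
target = [[1.0, 2.0, 3.0]] * 4
mismatch = [3.0, 2.0, 1.0]
print(decayed_kd_loss(target, [mismatch] + target[1:]))  # error at position 0
print(decayed_kd_loss(target, target[:3] + [mismatch]))  # same error at position 3
```

Under this weighting the same mismatch costs more at an early block position than at a late one, which is the intent of a decay term.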
M1.5: first continuation (unseen leftover, switch to EAL)
| field | value |
|---|---|
| Init | warm-start from M1 (qwen3-4b-m1-kd-b16-g7-rope1e6/final_checkpoint.pt) |
| Data | 374,110 samples: math 149,672 / code 149,803 / chat 74,617 |
| Sources | prompts M1 never saw (hash-subtracted, eval-decontam'd): math = OpenMathInstruct-2 + Nemotron-v2 math + OpenMathReasoning; code = open_code_instruct + evol_codealpaca + ~30K function-completion (MBPP-train 374 + CodeSearchNet, HumanEval-shaped); chat = nem_ifc_chat (mt-bench anchor) |
| Loss | EAL (Expected-Acceptance-Length; uniform KL position weight) |
| Schedule | 3 epochs (ckpts step5846 / 11692 / final) |
| Eval avg | 5.16 (+0.17 vs M1) |
First switch from KD → EAL loss, plus fresh unseen data. This lifted every benchmark a little; humaneval/mbpp still lagged, which motivated M1.6's code focus.
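
The "hash-subtracted, eval-decontam'd" selection in the Sources row can be sketched as an exact-match filter on normalized prompt hashes. This is an assumption about the mechanism: the normalization (lowercasing, whitespace collapsing), the choice of SHA-256, and the function names are all hypothetical, and the real decontamination may use fuzzier matching such as n-gram overlap.

```python
import hashlib


def prompt_key(prompt):
    """Hash of a lightly normalized prompt (lowercase, collapsed whitespace)."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def subtract_seen(candidates, seen_prompts, eval_prompts):
    """Keep candidates that are new, not in any eval set, and not duplicated."""
    blocked = {prompt_key(p) for p in seen_prompts}
    blocked |= {prompt_key(p) for p in eval_prompts}
    kept, kept_keys = [], set()
    for prompt in candidates:
        key = prompt_key(prompt)
        if key in blocked or key in kept_keys:
            continue
        kept_keys.add(key)
        kept.append(prompt)
    return kept


candidates = [
    "What is 2+2?",
    "what is  2+2?",           # duplicate after normalization
    "Write a sort function.",  # already trained on
    "Prove X.",                # appears in an eval set
]
print(subtract_seen(candidates, ["Write a sort  function."], ["prove x."]))
# ['What is 2+2?']
```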
M1.6: code-completion + survival-EAL (biggest single jump)
| field | value |
|---|---|
| Init | warm-start from M1.5 |
| Data | 224,024 samples: math 99,254 / code 99,921 / chat 24,840 |
| Sources | math = NuminaMath-CoT HARD 100K (olympiads/AMC/AoPS + the math Hendrycks slice; decontam'd vs math500); code = function-completion 120K→~100K (jinaai/code_exercises + bigcode/self-oss-instruct, exact HumanEval/MBPP shape); chat retention |
| Loss | survival-EAL (NEW: KL position weights = survival-based, so deeper block positions are weighted by their probability of still being "alive") |
| Schedule | 3 epochs (ckpts step3501 / 7002 / final). Best = ep2 (ep3 over-trains, esp. humaneval) |
| Eval avg | 5.43 (+0.27 vs M1.5), the largest step |
Two levers acted together: (1) function-completion code (matching the
HumanEval/MBPP format) lifted humaneval 5.28→5.87 and mbpp 4.90→5.20;
(2) survival-EAL gave a uniform lift everywhere. qwen3_m16_ep2 is the
promoted checkpoint and the warm-start base for M1.7/M1.8/M1.9.
M1.9: math-spine (unique hard-math volume)
| field | value |
|---|---|
| Init | warm-start from M1.6 ep2 (checkpoint_step7002.pt) |
| Data | 223,846 samples: math 163,846 / code 30,000 / chat 30,000 |
| Sources | math = the 143K unused NuminaMath-CoT HARD (never trained in M1/M1.5/M1.6) + 22K synthetic_math; code/chat = anchors reused from M1.8 (instruction-style) |
| Labels | Qwen3-4B greedy thinking-OFF, max_new_tokens 3072 (~22% length-capped) |
| Loss | survival-EAL (same as M1.6) |
| Schedule | 3 epochs (ckpts step3498 / 6996 / final). Best math500 = ep3 |
| Eval ep3 | 5.45 avg; math500 7.18 (new best); gsm8k 6.27 (= z-lab) |
The diagnosis driving M1.9: a label audit confirmed our math labels already match
z-lab's verbose long-LaTeX style (mean ~1,716 tok, ~85 LaTeX markers, 78.8%
\boxed), so the math500 gap is unique-hard-math coverage, not label quality.
Result: math500 rose monotonically 6.90 → 7.18 and gsm8k closed, but with two small
regressions from thinning the anchors: mt-bench 3.00→2.83 (chat anchor only 30K)
and humaneval 5.87→5.65 (instruction-style code instead of M1.6's function-completion).
Side experiments off M1.6 ep2 (not on the main line): M1.7 (frequency-capped prompt repetition) and M1.8 (raw nem_v2 diversity expansion) each moved math500 only ~+0.14 (to ~7.04) and did not beat M1.6's avg. The diminishing returns across repetition → diversity → volume (each ~+0.14) are the signature of corpus saturation on this NuminaMath-based math source.
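
The checkpoint step numbers in the Schedule rows above are consistent with one checkpoint per epoch at a global batch size of 64. The check below assumes the GBS stated for M1 carried over unchanged to M1.5, M1.6, and M1.9 (the later tables do not restate it) and that a partial final batch counts as a step.

```python
import math

GLOBAL_BATCH_SIZE = 64  # stated for M1; ASSUMED unchanged for the later runs

# (number of samples, number of epochs) from the stage tables.
RUNS = {
    "M1": (888_784, 6),
    "M1.5": (374_110, 3),
    "M1.6": (224_024, 3),
    "M1.9": (223_846, 3),
}


def steps_per_epoch(num_samples, gbs=GLOBAL_BATCH_SIZE):
    """Optimizer steps per epoch, counting a partial last batch as one step."""
    return math.ceil(num_samples / gbs)


for name, (num_samples, epochs) in RUNS.items():
    spe = steps_per_epoch(num_samples)
    print(name, [spe * e for e in range(1, epochs + 1)])
```

Under these assumptions the first two epoch boundaries come out at 5846 / 11692 for M1.5, 3501 / 7002 for M1.6, and 3498 / 6996 for M1.9, and M1 totals 83,328 steps, in line with the "~83K" in its Batch row.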
How the loss function evolved
| stage | loss | idea |
|---|---|---|
| M1 | soft-label KD | match the target's full per-position distribution (forward-KL) |
| M1.5 | EAL | directly maximize Expected Acceptance Length (reward-shaped); uniform KL weighting across block positions |
| M1.6 → M1.9 | survival-EAL | EAL + KL position weights scaled by each position's survival probability (chance the block is still accepted at that depth), which focuses capacity on positions that actually get verified |
EAL is implemented as the negated acceptance reward, so minimizing the loss maximizes expected acceptance length; survival weighting was the M1.6 refinement that produced the largest single avg jump.
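
The table gives the ideas but not the formulas, so the following is one plausible reading rather than the training code. It assumes each block position k has an acceptance probability a_k (for example, the draft's probability on the target's greedy token), that a position is only reached if every earlier position was accepted, and that the weighted KL is normalized by the sum of weights. All names are hypothetical.

```python
def survival_weights(accept_probs):
    """S_k = product of a_j for j < k: chance position k is still 'alive'."""
    weights, alive = [], 1.0
    for a in accept_probs:
        weights.append(alive)
        alive *= a
    return weights


def expected_accept_length(accept_probs):
    """E[n] = sum over k of P(first k positions all accepted)."""
    total, alive = 0.0, 1.0
    for a in accept_probs:
        alive *= a
        total += alive
    return total


def eal_loss(accept_probs):
    """Sign-flipped reward: minimizing this maximizes E[n]."""
    return -expected_accept_length(accept_probs)


def weighted_kl(per_position_kl, weights):
    """Weighted mean of per-position KL values (normalization is assumed)."""
    return sum(w * kl for w, kl in zip(weights, per_position_kl)) / sum(weights)


accept = [0.9, 0.5, 0.5]   # hypothetical per-position acceptance probabilities
kl = [0.2, 0.4, 0.8]       # hypothetical per-position KL values

uniform = weighted_kl(kl, [1.0] * len(kl))            # M1.5-style weighting
survival = weighted_kl(kl, survival_weights(accept))  # M1.6-style weighting
print(survival_weights(accept), expected_accept_length(accept))
print(uniform, survival)
```

In this toy case the deep position with the largest KL is reached less than half the time, so survival weighting discounts it relative to uniform weighting.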
Where the remaining gap is (vs z-lab M0 = 6.05 avg)
| benchmark | best of ours | z-lab M0 | gap |
|---|---|---|---|
| gsm8k | 6.27 | 6.27 | 0.00 ✓ |
| math500 | 7.18 | 8.24 | −1.06 |
| humaneval | 5.87 | 6.72 | −0.85 |
| mbpp | 5.35 | 5.61 | −0.26 |
| mt-bench | 3.00 | 3.43 | −0.43 |
Key conclusion: scale and epochs were already matched at M1 (888K / 6 epochs), so the residual gap is not a volume problem. It traces to data composition: M1 (and thus all warm-started descendants) was built on the v10-balanced mix, not z-lab's exact Nemotron-PTD-v2 + evol-codealpaca sources. The cleanest untested experiment is a fresh full-scale run on z-lab's exact source composition, not more fine-tuning on the inherited v10 base.
Last updated: 2026-06-14.