LauraGG/blt-reasoner-pilot1

Safetensors

LauraGG committed 20 days ago

Commit

6e77ac7

verified ·

1 Parent(s): a88ff9b

HANDOFF: add GRPO Phase C null result + revised next steps

Files changed (1)

HANDOFF_BLT_REASONER_2026-05-17.md +27 -5

@@ -200,12 +200,34 @@ The `--resume_from` flag accepts either a local path or `<repo>:<subfolder>` and
 ---
-## Next steps
-1. **GRPO Phase C** (running). Math-verifier reward on GSM8K, 600 steps from the SFT checkpoint. Reference policy = frozen copy of the SFT ckpt; KL anchor with β = 0.02. **Step 5 already shows n_signal = 4/4 every step**, unlike the Abstract-CoT GRPO attempt that stuck at n_signal = 0 for 240 steps. Whether RL crosses the pre-registered Δ threshold is the question.
-2. **7B scale-up** if Phase C succeeds. Same recipe, Qwen2.5-Math-7B base, expected wall ~24 h on a GH200 cluster of 4–8.
-3. **Broader benchmarks.** MATH, AIME, GPQA. Whether the latent-only bottleneck is too restrictive on harder problems is the open empirical question.
-4. **Mechanistic probes** on z: linear-readout of intermediate quantities (operands, intermediate results) from the M latent slots. Direct test of whether the model is "thinking arithmetically" inside z.
 ---

 ---
+## GRPO Phase C result (added 2026-05-17 18:45 UTC)
+**Null on accuracy, robust on latent.** 600 steps × 16 rollouts on GSM8K-train with math-verifier reward, β=0.02, `length_penalty_coef`=0.05, `max_new_tokens`=192. Wall time: 2 h 10 min.
+| Metric | SFT pilot (n=200) | GRPO Phase C (n=100) |
+|---|---|---|
+| normal-z acc | **13.0%** | **13.0%** |
+| random-z acc | 0.0% | 0.0% |
+| zero-z acc | 0.0% | 1.0% |
+| Δ_random | 13 pp | 13 pp |
+| Δ_zero | 13 pp | 12 pp |
+**Training-time reward oscillated** in [−0.34, −0.25] for all 600 steps with no upward trend. Per-token KL spiked to ~9000 at some steps before clamping, which indicates that the K3 estimator was unstable with β=0.02 and `kl_clamp`=20.
+**Three diagnoses, ranked:**
+1. **Sparse reward variance.** Only ~22% of rollouts were correct, so advantages computed over groups of 4 rollouts were high-variance.
+2. **Unstable KL.** Per-token KL spiked to 200–9000 multiple times, which suggests β was too low or the clamp too high. The resulting policy drift was contaminating the policy-gradient (PG) signal.
+3. **Length penalty backfired.** At 0.05 × (n_tokens/192), correct GSM8K answers (~150 tokens) lost ~0.04 reward; the policy may have learned to produce shorter (and therefore wrong) answers.
+**Architecturally robust.** Unlike Abstract-CoT's GRPO (which crushed an already-collapsed z further to reward=0), BLT-Reasoner's GRPO maintained the z-ablation Δ (Δ_random 13 pp, Δ_zero 12 pp after RL, versus 13 pp for both after SFT). The bottleneck + InfoNCE-trained z survives the RL phase intact. This is a positive secondary finding even though the primary goal of lifting accuracy was not met.
+## Next steps (revised in light of GRPO null result)
+2. **Push SFT further before more RL.** The val-LM curve was still descending and the z-ablation Δ was still rising at step 12000 — we hit the budget, not the asymptote. Longer K=16 training (e.g., 4000 more steps) might cross 15 pp at the SFT stage alone.
+3. **Re-tune GRPO** if pursued: set `length_penalty_coef` to 0 (or use a more careful length normalizer), increase β to 0.05–0.1, lower `kl_clamp` to 5, and possibly increase `group_size` to 8 for less noisy advantages.
+4. **7B scale-up.** Same recipe, Qwen2.5-Math-7B base, expected wall time ~24 h on a cluster of 4–8 GH200s.
+5. **Broader benchmarks.** MATH, AIME, GPQA. Whether the latent-only bottleneck is too restrictive on harder problems is the open empirical question.
+6. **Mechanistic probes** on z: linear-readout of intermediate quantities (operands, intermediate results) from the M latent slots. Direct test of whether the model is "thinking arithmetically" inside z.
 ---
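
The three diagnoses in the hunk above can be sanity-checked with a few lines of arithmetic. The sketch below is not code from this repository: the function names are hypothetical, and it assumes (a) a binary correct/incorrect reward with independent rollouts at the reported ~22% success rate, (b) the common K3 form `exp(d) - 1 - d` with `d = logp_ref - logp_policy`, and (c) the length penalty exactly as written in the handoff, `coef * n_tokens / max_new_tokens`. Whether the trainer's clamp applies to `d` or to the KL value itself is not stated in the handoff, so clamping is left out.

```python
import math


def k3_kl(logp_policy: float, logp_ref: float) -> float:
    """K3 estimate of KL(policy || ref) for one token sampled from the policy.

    Assumed form (not confirmed by the handoff): exp(d) - 1 - d,
    with d = logp_ref - logp_policy. It is always >= 0 and grows
    exponentially when the reference assigns the token far more mass.
    """
    d = logp_ref - logp_policy
    return math.exp(d) - 1.0 - d


def mixed_group_prob(p_correct: float, group_size: int) -> float:
    """Probability that a group contains both correct and incorrect rollouts.

    Assumes binary reward and independent rollouts. Groups where every
    rollout gets the same reward give zero mean-centered advantage.
    """
    all_correct = p_correct ** group_size
    all_wrong = (1.0 - p_correct) ** group_size
    return 1.0 - all_correct - all_wrong


def length_penalty(n_tokens: int, coef: float = 0.05, max_new_tokens: int = 192) -> float:
    """Reward subtracted for an answer of n_tokens, per the formula in the handoff."""
    return coef * n_tokens / max_new_tokens


if __name__ == "__main__":
    print(f"mixed-outcome groups, size 4: {mixed_group_prob(0.22, 4):.3f}")
    print(f"mixed-outcome groups, size 8: {mixed_group_prob(0.22, 8):.3f}")
    print(f"penalty on a 150-token answer: {length_penalty(150):.4f}")
    print(f"K3 at log-ratio 9.1: {k3_kl(-9.1, 0.0):.0f}")
```

Under these assumptions, roughly 63% of groups of 4 would contain both outcomes versus roughly 86% of groups of 8, the penalty on a 150-token answer is about 0.039 (matching the ~0.04 quoted above), and a per-token K3 value near 9000 corresponds to a log-ratio of about 9.1.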