LauraGG committed on
Commit 6e77ac7 · verified · 1 Parent(s): a88ff9b

HANDOFF: add GRPO Phase C null result + revised next steps

  1. HANDOFF_BLT_REASONER_2026-05-17.md +27 -5
HANDOFF_BLT_REASONER_2026-05-17.md CHANGED
@@ -200,12 +200,34 @@ The `--resume_from` flag accepts either a local path or `<repo>:<subfolder>` and
 
  ---
 
- ## Next steps
 
- 1. **GRPO Phase C** (running). Math-verifier reward on GSM8K, 600 steps from the SFT checkpoint. Reference policy = frozen copy of the SFT ckpt; KL anchor with β = 0.02. **Step 5 already shows n_signal = 4/4 every step**, unlike the Abstract-CoT GRPO attempt that stuck at n_signal = 0 for 240 steps. Whether RL crosses the pre-registered Δ threshold is the question.
- 2. **7B scale-up** if Phase C succeeds. Same recipe, Qwen2.5-Math-7B base, expected wall ~24 h on a GH200 cluster of 4–8.
- 3. **Broader benchmarks.** MATH, AIME, GPQA. Whether the latent-only bottleneck is too restrictive on harder problems is the open empirical question.
- 4. **Mechanistic probes** on z: linear-readout of intermediate quantities (operands, intermediate results) from the M latent slots. Direct test of whether the model is "thinking arithmetically" inside z.
 
  ---
 
 
  ---
 
+ ## GRPO Phase C result (added 2026-05-17 18:45 UTC)
 
+ **Null on accuracy, robust on the latent ablations.** The run used 600 steps × 16 rollouts on GSM8K-train with math-verifier reward, β=0.02, length_penalty_coef=0.05, and max_new_tokens=192. Wall-clock time was 2 h 10 min.
+
+ | Metric | SFT pilot (n=200) | GRPO Phase C (n=100) |
+ |---|---|---|
+ | normal-z acc | **13.0%** | **13.0%** |
+ | random-z acc | 0.0% | 0.0% |
+ | zero-z acc | 0.0% | 1.0% |
+ | Δ_random | 13 pp | 13 pp |
+ | Δ_zero | 13 pp | 12 pp |
+
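The Δ rows follow directly from the accuracy rows. As a quick sanity check, the sketch below recomputes them from the table values. The helper names and the normal-approximation standard error are illustrative additions, not part of the evaluation code.

```python
import math

# Accuracies (in %) copied from the table above.
results = {
    "SFT pilot": {"n": 200, "normal": 13.0, "random": 0.0, "zero": 0.0},
    "GRPO Phase C": {"n": 100, "normal": 13.0, "random": 0.0, "zero": 1.0},
}


def ablation_deltas(row):
    """Return (delta_random, delta_zero) in percentage points."""
    return row["normal"] - row["random"], row["normal"] - row["zero"]


def binomial_se_pp(acc_pct, n):
    """Normal-approximation standard error of an accuracy, in percentage points."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


for name, row in results.items():
    d_rand, d_zero = ablation_deltas(row)
    se = binomial_se_pp(row["normal"], row["n"])
    print(f"{name}: d_random={d_rand:.0f} pp, d_zero={d_zero:.0f} pp, "
          f"normal-z SE~{se:.1f} pp")
```

At these sample sizes the standard error on normal-z accuracy is a few percentage points, so a 1 pp difference between the two Δ_zero values is plausibly sampling noise rather than a real change.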
+ **Training-time reward oscillated** in [−0.34, −0.25] for all 600 steps with no upward trend. Per-token KL spiked to ~9000 at some steps before clamping, indicating that the K3 estimator was unstable with β=0.02 and clamp=20.
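To see why the K3 estimator can spike this hard, here is a minimal sketch of the per-token form, k3 = r − 1 − log r with r = π_ref/π_θ. The function name is hypothetical, and applying the clamp to the per-token KL value is an assumption: the handoff does not say whether the clamp acts on the KL value or on the log-ratio.

```python
import math


def k3_kl(logp_policy, logp_ref, kl_clamp=None):
    """Per-token K3 estimate of KL(policy || ref).

    k3 = r - 1 - log r, with r = p_ref / p_policy. It is always >= 0, but it
    grows exponentially when the policy assigns much lower probability to a
    sampled token than the reference does.
    """
    log_r = logp_ref - logp_policy
    kl = math.exp(log_r) - 1.0 - log_r
    if kl_clamp is not None:
        # Assumption: the clamp caps the per-token KL value itself.
        kl = min(kl, kl_clamp)
    return kl


# A small log-prob gap gives a small penalty; a gap of ~9 nats explodes.
for gap in (0.1, 1.0, 5.0, 9.1):
    raw = k3_kl(logp_policy=-gap, logp_ref=0.0)
    capped = k3_kl(logp_policy=-gap, logp_ref=0.0, kl_clamp=20.0)
    print(f"log-ratio={gap:4.1f}  k3={raw:10.2f}  clamped={capped:6.2f}")
```

Under this reading, a single token whose log-probability has drifted about 9 nats below the reference is enough to produce a value in the thousands, which is consistent with the spikes reported above.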
+
+ **Three diagnoses, ranked:**
+
+ 1. **Sparse reward variance.** Only ~22% of rollouts were correct, which yields high-variance advantages over groups of 4 rollouts.
+ 2. **Unstable KL.** Per-token KL spiked to 200–9000 multiple times, so β was too low or the clamp too high. Policy drift was contaminating the policy-gradient (PG) signal.
+ 3. **Length penalty backfired.** At 0.05 × (n_tokens/192), correct GSM8K answers (~150 tokens) lost ~0.04 reward; the policy may have learned shorter (and therefore wrong) answers.
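The arithmetic behind diagnoses 1 and 3 can be sketched as follows. The length penalty uses the formula quoted above. The group-relative advantage uses the usual GRPO normalization (reward minus group mean, divided by group standard deviation). The 0/1 rewards are hypothetical placeholders, since the handoff does not state the actual reward values.

```python
import statistics


def length_penalty(n_tokens, coef=0.05, max_new_tokens=192):
    """Penalty subtracted from the reward: coef * (n_tokens / max_new_tokens)."""
    return coef * (n_tokens / max_new_tokens)


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Diagnosis 3: a ~150-token answer pays about 0.04, a 40-token one about 0.01.
print(f"penalty(150 tokens) = {length_penalty(150):.4f}")
print(f"penalty( 40 tokens) = {length_penalty(40):.4f}")

# Diagnosis 1: with hypothetical 0/1 rewards and groups of 4, one correct
# rollout dominates the group, and an all-wrong group yields zero advantages.
print(group_advantages([1.0, 0.0, 0.0, 0.0]))
print(group_advantages([0.0, 0.0, 0.0, 0.0]))
```

With so few rollouts per group, the advantage of every sample swings on whether a single rollout happened to be correct, which is one way to read the "high-variance advantages" diagnosis.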
+
+ **Architecturally robust.** Unlike Abstract-CoT's GRPO (which crushed an already-collapsed z further to reward=0), BLT-Reasoner's GRPO preserved the z-ablation gap: Δ_random stayed at 13 pp and Δ_zero ended at 12 pp. The bottleneck + InfoNCE-trained z survives the RL phase intact. This is a positive secondary finding even though the primary "lift accuracy" goal was not met.
+
+ ## Next steps (revised in light of GRPO null result)
+ 2. **Push SFT further before more RL.** The val-LM curve was still descending and the z-ablation Δ was still rising at step 12000: we hit the budget, not the asymptote. Longer K=16 training (e.g., 4000 more steps) might cross 15 pp at the SFT stage alone.
+ 3. **Re-tune GRPO** if pursued: set `length_penalty_coef` to 0 (or use a more careful length normalizer), increase β to 0.05–0.1, lower `kl_clamp` to 5, and possibly increase `group_size` to 8 for less noisy advantages.
+ 4. **7B scale-up.** Same recipe, Qwen2.5-Math-7B base, expected wall ~24 h on a GH200 cluster of 4–8.
+ 5. **Broader benchmarks.** MATH, AIME, GPQA. Whether the latent-only bottleneck is too restrictive on harder problems is the open empirical question.
+ 6. **Mechanistic probes** on z: linear-readout of intermediate quantities (operands, intermediate results) from the M latent slots. Direct test of whether the model is "thinking arithmetically" inside z.
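The proposed GRPO re-tune can be written down as a config diff. This is only a sketch: the `GRPOConfig` class and the `beta` field name are hypothetical, `group_size=4` for Phase C is inferred from "groups of 4 rollouts", and `beta=0.05` is simply the low end of the suggested 0.05–0.1 range.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class GRPOConfig:
    """Hypothetical container for the hyperparameters named in this handoff."""
    beta: float                 # KL coefficient (β in the text)
    kl_clamp: float             # cap on the KL term
    length_penalty_coef: float
    group_size: int             # rollouts per prompt group
    max_new_tokens: int = 192


# Values reported for the Phase C run.
phase_c = GRPOConfig(beta=0.02, kl_clamp=20.0,
                     length_penalty_coef=0.05, group_size=4)

# One point inside the suggested re-tune ranges.
retuned = replace(phase_c, beta=0.05, kl_clamp=5.0,
                  length_penalty_coef=0.0, group_size=8)

# Print only the fields that change between the two configs.
old, new = asdict(phase_c), asdict(retuned)
for key in old:
    if old[key] != new[key]:
        print(f"{key}: {old[key]} -> {new[key]}")
```

Keeping the two configs side by side makes it easy to ablate one change at a time if the re-tuned run also comes back null.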
 
  ---