HANDOFF: add GRPO Phase C null result + revised next steps
HANDOFF_BLT_REASONER_2026-05-17.md
@@ -200,12 +200,34 @@ The `--resume_from` flag accepts either a local path or `<repo>:<subfolder>` and

---

## GRPO Phase C result (added 2026-05-17 18:45 UTC)

**Null on accuracy, robust on latent.** 600 steps × 16 rollouts on GSM8K-train with math-verifier reward, β=0.02, length_penalty_coef=0.05, max_new_tokens=192. Wall-clock time: 2 h 10 min.

| Metric | SFT pilot (n=200) | GRPO Phase C (n=100) |
|---|---|---|
| normal-z acc | **13.0%** | **13.0%** |
| random-z acc | 0.0% | 0.0% |
| zero-z acc | 0.0% | 1.0% |
| Δ_random | 13 pp | 13 pp |
| Δ_zero | 13 pp | 12 pp |
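
The two evaluations use different sample sizes, so it may help to see how much sampling noise surrounds a 13.0% accuracy at each n. The following is a minimal sketch using a Wilson score interval; the correct-answer counts (26 of 200 and 13 of 100) are back-computed from the reported percentages rather than taken from logs, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = n_correct / n_total
    denom = 1.0 + z * z / n_total
    center = (p + z * z / (2 * n_total)) / denom
    half = z * math.sqrt(p * (1 - p) / n_total + z * z / (4 * n_total ** 2)) / denom
    return center - half, center + half


# Counts are assumptions derived from the reported 13.0% figures.
sft_ci = wilson_interval(26, 200)   # SFT pilot, n=200
grpo_ci = wilson_interval(13, 100)  # GRPO Phase C, n=100

print(f"SFT  13.0% (n=200): [{sft_ci[0]:.3f}, {sft_ci[1]:.3f}]")
print(f"GRPO 13.0% (n=100): [{grpo_ci[0]:.3f}, {grpo_ci[1]:.3f}]")
```

Under these assumptions the n=100 interval is noticeably wider than the n=200 one, so a change of a few points in either direction would be hard to distinguish from noise at this evaluation size.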

**Training-time reward oscillated** in [−0.34, −0.25] for all 600 steps with no upward trend. KL spiked to ~9000 at some steps before clamping, indicating that the per-token K3 estimator was unstable at β=0.02 with clamp=20.
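
To make the scale of those spikes concrete, here is a minimal sketch of the per-token K3 estimator, k3 = exp(δ) − δ − 1 with δ = log π_ref − log π_θ. The handoff does not say where the clamp is applied; this sketch assumes it bounds the log-ratio δ before exponentiation, and `k3_kl` and `log_ratio_clamp` are hypothetical names, not the training script's.

```python
import numpy as np


def k3_kl(logp_policy, logp_ref, log_ratio_clamp=None):
    """Per-token K3 KL estimate: exp(d) - d - 1, where d = logp_ref - logp_policy.

    log_ratio_clamp (assumption): if given, d is clipped to [-c, +c] before
    the exponential. The real run may clamp the KL value itself instead.
    """
    delta = np.asarray(logp_ref, dtype=np.float64) - np.asarray(logp_policy, dtype=np.float64)
    if log_ratio_clamp is not None:
        delta = np.clip(delta, -log_ratio_clamp, log_ratio_clamp)
    return np.exp(delta) - delta - 1.0


# A single token where the policy assigns far less probability than the reference.
logp_policy = [-10.0]
logp_ref = [-0.9]  # log-ratio of 9.1 nats

unclamped = k3_kl(logp_policy, logp_ref)
clamped_20 = k3_kl(logp_policy, logp_ref, log_ratio_clamp=20.0)
clamped_5 = k3_kl(logp_policy, logp_ref, log_ratio_clamp=5.0)
print(unclamped[0], clamped_20[0], clamped_5[0])
```

Because the estimator grows exponentially in δ, a log-ratio of roughly 9 nats on one token is already enough to produce a value near 9000, and under this reading a clamp of 20 would not intervene there while a clamp of 5 would cap the term at about 142.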

**Three diagnoses, ranked:**

1. **Sparse reward variance.** Only ~22% of rollouts were correct, which gives high-variance advantages over groups of 4 rollouts.
2. **Unstable KL.** Per-token KL spiked to 200–9000 multiple times, so either β was too low or the clamp was too high. Policy drift was contaminating the policy-gradient (PG) signal.
3. **Length penalty backfired.** At 0.05 × (n_tokens/192), correct GSM8K answers (~150 tokens) lost ~0.04 reward; the policy may have learned shorter (and therefore wrong) answers.
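
Diagnoses 1 and 3 can be checked with back-of-the-envelope arithmetic. The sketch below assumes binary 0/1 correctness rewards, independent rollouts, and the usual GRPO mean/std normalization within a group; none of this is confirmed against the run's actual code, and all function names are hypothetical.

```python
import numpy as np


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


def p_uninformative_group(p_correct, group_size):
    """Probability that every rollout in a group gets the same 0/1 reward.

    Such a group yields all-zero advantages (assumes independent rollouts).
    """
    return p_correct ** group_size + (1.0 - p_correct) ** group_size


def length_penalty(n_tokens, coef=0.05, max_new_tokens=192):
    """Penalty of the form coef * (n_tokens / max_new_tokens), as in diagnosis 3."""
    return coef * n_tokens / max_new_tokens


print(group_advantages([1, 0, 0, 0]))    # one correct rollout out of four
print(group_advantages([0, 0, 0, 0]))    # no correct rollout: no signal
print(p_uninformative_group(0.22, 4))    # group_size used in this run
print(p_uninformative_group(0.22, 8))    # proposed larger group
print(length_penalty(150))               # ~150-token correct answer
```

With a 22% success rate, about 37% of size-4 groups would carry no correctness signal at all under these assumptions, against about 14% for size-8 groups. The penalty on a 150-token answer works out to 0.05 × 150/192 ≈ 0.039, matching the ~0.04 quoted above.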

**Architecturally robust.** Unlike Abstract-CoT's GRPO (which crushed an already-collapsed z further to reward=0), BLT-Reasoner's GRPO maintained the z-ablation gap throughout (Δ_random 13 pp, Δ_zero 12 pp). The bottleneck + InfoNCE-trained z survives the RL phase intact. This is a positive secondary finding even though the primary "lift accuracy" goal was not met.

## Next steps (revised in light of GRPO null result)

2. **Push SFT further before more RL.** The val-LM curve was still descending and the z-ablation Δ was still rising at step 12000; we hit the budget, not the asymptote. Longer K=16 training (e.g., 4000 more steps) might cross 15 pp at the SFT stage alone.
3. **Re-tune GRPO** if pursued: drop `length_penalty_coef` to 0 (or use a more careful length normalizer), increase β to 0.05–0.1, lower `kl_clamp` to 5, and possibly increase `group_size` to 8 for less noisy advantages.
4. **7B scale-up.** Same recipe with a Qwen2.5-Math-7B base; expected wall-clock time ~24 h on a cluster of 4–8 GH200s.
5. **Broader benchmarks.** MATH, AIME, GPQA. The open empirical question is whether the latent-only bottleneck is too restrictive on harder problems.
6. **Mechanistic probes** on z: linear readout of intermediate quantities (operands, intermediate results) from the M latent slots. This is a direct test of whether the model is "thinking arithmetically" inside z.
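
The linear-readout idea in item 6 could be prototyped roughly as follows. This is a sketch on purely synthetic data: the slot count, slot width, sample sizes, and the `fit_linear_probe` helper are all hypothetical stand-ins, and real latents and operand labels would have to be extracted from the model. A shuffled-label control is included because a high-dimensional probe can fit noise.

```python
import numpy as np


def fit_linear_probe(z_train, y_train, z_test, y_test, l2=1e-3):
    """Ridge-regression probe from flattened latents to a scalar target.

    Returns (weights, held-out R^2).
    """
    def with_bias(z):
        return np.hstack([z, np.ones((z.shape[0], 1))])

    x_train, x_test = with_bias(z_train), with_bias(z_test)
    gram = x_train.T @ x_train + l2 * np.eye(x_train.shape[1])
    w = np.linalg.solve(gram, x_train.T @ y_train)
    pred = x_test @ w
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return w, 1.0 - ss_res / ss_tot


# Synthetic stand-in for latents: n examples, M slots of width d (all hypothetical).
rng = np.random.default_rng(0)
n, m_slots, d = 400, 8, 16
z = rng.normal(size=(n, m_slots * d))

# A target that IS linearly encoded in z, plus a shuffled-label control.
y_linear = z @ rng.normal(size=m_slots * d) + 0.1 * rng.normal(size=n)
y_shuffled = rng.permutation(y_linear)

_, r2_signal = fit_linear_probe(z[:300], y_linear[:300], z[300:], y_linear[300:])
_, r2_control = fit_linear_probe(z[:300], y_shuffled[:300], z[300:], y_shuffled[300:])
print(f"held-out R^2, real labels: {r2_signal:.3f}; shuffled control: {r2_control:.3f}")
```

On real data, the interesting comparison would be held-out R² for an intermediate quantity against the shuffled control; a probe that only beats the control on training data would not count as evidence of arithmetic structure in z.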

---