LauraGG/blt-reasoner-pilot1

LauraGG committed 19 days ago

Commit 572be28 (verified) · 1 parent: e4f8490

HANDOFF v3: 77.5% via bottleneck-as-regularizer at inference

Files changed (1)

HANDOFF_BLT_BREAKTHROUGH_2026-05-19.md ADDED +161 -0

	@@ -0,0 +1,161 @@

+# BLT-Reasoner — Bottleneck-as-Regularizer Breakthrough
+**Status:** Campaign hit a major positive result on 2026-05-19: **77.5% on GSM8K-test (n=200)** from the same GRPO checkpoint we previously measured at 52.5%, by **lifting the y→only-z attention bottleneck at inference time** while keeping it during training.
+**Artifacts:** https://huggingface.co/LauraGG/blt-reasoner-pilot1
+---
+## The headline number
+| Setup | normal-z AR acc | Δ_random | Δ_zero |
+|---|---|---|---|
+| GRPO ckpt, eval **with** bottleneck (canonical, pre-registered) | **52.5%** | +15.5 pp ✓ | +52.5 pp ✓ |
+| **GRPO ckpt, eval WITHOUT bottleneck** | **77.5%** | **+22.5 pp** | **+70.5 pp** |
+- **+25 pp absolute** by flipping one inference-time flag (`block_y_to_x=False`).
+- Closes most of the 33-pp gap to Qwen2.5-Math-7B-Instruct + verbal CoT (~85%) — we're now ~8 pp behind that ceiling.
+- **z's content is *more* load-bearing, not less.** Δ_random grew from +15.5 to +22.5; Δ_zero grew from +52.5 to +70.5. Random z hurts more, and *no* z is more catastrophic, when the model is allowed to also see x.
+These results suggest the model has internalized z as a *reasoning aid* during bottlenecked training. At inference, with x also available, it leverages z heavily, not as a substitute for x but as a structured supplement.
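As a reading aid, the Δ columns can be converted back into ablated accuracies. The sketch below assumes Δ_random and Δ_zero are defined as normal-z accuracy minus the accuracy obtained with z replaced by random or zeroed latents. That is how the table reads, but the definition is not stated explicitly here, and the function name is hypothetical.

```python
def implied_ablated_accuracy(normal_acc, delta_random, delta_zero, n=200):
    """Recover ablated accuracies (in %) and correct-answer counts from the deltas."""
    random_acc = normal_acc - delta_random
    zero_acc = normal_acc - delta_zero

    def to_count(acc):
        # Number of correct answers out of n implied by a percentage.
        return round(acc * n / 100)

    return {
        "normal": (normal_acc, to_count(normal_acc)),
        "random_z": (random_acc, to_count(random_acc)),
        "zero_z": (zero_acc, to_count(zero_acc)),
    }


# Numbers taken from the headline table above.
with_block = implied_ablated_accuracy(52.5, 15.5, 52.5)
no_block = implied_ablated_accuracy(77.5, 22.5, 70.5)
```

Under that assumption, the bottlenecked evaluation answers 0 of 200 problems with zeroed z, while the no-block evaluation still answers 14 of 200 with zeroed z and 110 of 200 with random z.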
+---
+## The campaign in one table
+All on Qwen2.5-Math-7B-Instruct + LoRA r=16, single GH200, GSM8K-test n=200 AR unless noted.
+| Recipe | normal AR | Δ_random | Δ_zero | H1 (pre-reg) | Comment |
+|---|---|---|---|---|---|
+| Abstract-CoT (prior work, 7B, 24 h) | 57% maj@8 / **MATH-500** | +3 pp | +5 pp | ✗ | Decorative latents — overturned negative result |
+| BLT 1.5B SFT (pilot 1) | 13% | +13 pp | +13 pp | ✗ | Load-bearing latents at small scale; low absolute |
+| BLT 7B SFT (pilot 2, no options) | 13% | +13 pp | +13 pp | ✗ | Same Δs as 1.5B; scale alone didn't help absolute |
+| BLT 7B + leak-closure (block_z→x) | 78% TF / – AR | – | – | ✗ | Closing leak alone insufficient; model regresses to y-prefix |
+| **BLT 7B + Options 1+3 SFT** | **51.0%** | +13.5 | +50.5 | ✗ | Full-y InfoNCE + MLP π — 4× lift in absolute |
+| **BLT 7B + Options 1+3 + GRPO** | **52.5%** | **+15.5** ✓ | **+52.5** ✓ | **✓** | Pre-registered thresholds CROSSED |
+| BLT 7B + per-slot multi-objective | 44.0% | +13.5 | +43.5 | ✗ | NEGATIVE — slot redundancy worsened |
+| GRPO ckpt on MATH problems | (LM 0.93→0.69) | – | – | – | Training on harder data drives stable_rank DOWN |
+| **GRPO ckpt, no-block at inference** | **77.5%** | **+22.5** | **+70.5** | **✓** | **BREAKTHROUGH: bottleneck-as-regularizer** |
+---
+## What's been learned, mechanistically
+### Confirmed
+1. **Continuous-latent + bottleneck + InfoNCE produces load-bearing z** (Δ_random ≥ 13 pp consistently from 1.5B onward).
+2. **MLP π is necessary for compression capacity.** Linear π was a real bottleneck; expanding to d→4d→d gave +38 pp absolute (13% → 51%).
+3. **Full-y InfoNCE target is necessary.** An answer-only target requires only ~10 bits of z; a full-y target requires on the order of hundreds of bits and drives stable_rank growth during training.
+4. **GRPO with verifier reward consolidates** — small absolute lift, but crosses the pre-registered Δ_random threshold.
+5. **Bottleneck-as-regularizer.** Training under strict bottleneck shapes z into a useful representation; lifting the bottleneck at inference time lets the model use BOTH x and z, producing dramatically better generation.
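Item 3 refers to an InfoNCE objective between z and an embedding of the target y. For readers unfamiliar with it, a minimal pure-Python sketch of the standard InfoNCE loss follows. The dot-product scoring, the temperature, and all names are assumptions for illustration; this is not the implementation behind `head.pt`. A well-known property of this loss is that, with N candidates per batch, the mutual-information bound it provides is capped at log N.

```python
import math


def info_nce(z_embs, y_embs, temperature=0.1):
    """Mean InfoNCE loss in nats; z_embs[i] and y_embs[i] form the positive pair."""
    n = len(z_embs)
    total = 0.0
    for i in range(n):
        # Similarity of z_i to every candidate y_j in the batch.
        logits = [
            sum(a * b for a, b in zip(z_embs[i], y_embs[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index i as the correct class.
        total += log_denom - logits[i]
    return total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce(aligned, aligned)
loss_uninformative = info_nce([[0.0, 0.0], [0.0, 0.0]], aligned)
```

When z carries no information about y, every candidate scores the same and the loss equals log N (here log 2); when each z matches only its own y, the loss approaches zero.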
+### Falsified (negative results, each a real finding)
+1. **K=32 won't help.** Stable_rank diagnostic on K=16 ckpt was 6.73 — slots already redundant. Perturbation curve was flat.
+2. **Per-slot supervision (split y into K chunks, contrastive per slot) HURTS.** Reduced stable_rank further (6.73 → 5.68), dropped absolute accuracy to 44%. The model finds shortcuts that satisfy per-slot contrastive without actually specializing slots.
+3. **Harder data does NOT unlock richer z.** GRPO ckpt evaluated on MATH problems showed stable_rank=4.12 (DOWN from 6.73 on GSM8K); 500 steps of MATH training drove it further down to 2.82. Architecture has a low-rank attractor independent of training data.
+4. **Closing the z→x architectural leak alone is insufficient.** Model regresses to y-prefix autoregression when both bypass paths are blocked and supervision is weak.
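The stable_rank figures above are easier to interpret with a definition in hand. The sketch below uses the standard definition, squared Frobenius norm divided by squared spectral norm, applied to a K×d matrix of latent slots. Whether the repository's diagnostic uses exactly this definition (or centers the slots first) is an assumption. The code is pure Python with power iteration so it runs without dependencies.

```python
def stable_rank(Z, iters=200):
    """||Z||_F^2 / ||Z||_2^2 for a matrix given as a list of rows."""
    k = len(Z)
    # Gram matrix G = Z Z^T (k x k); its eigenvalues are the squared singular values of Z.
    G = [[sum(a * b for a, b in zip(Z[i], Z[j])) for j in range(k)] for i in range(k)]
    frob_sq = sum(G[i][i] for i in range(k))  # trace(G) = ||Z||_F^2

    # Power iteration for the largest eigenvalue of G.
    v = [1.0 + i / k for i in range(k)]
    for _ in range(iters):
        w = [sum(G[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return 0.0  # all-zero matrix
        v = [x / norm for x in w]

    # Rayleigh quotient of the (unit-norm) converged vector.
    Gv = [sum(G[i][j] * v[j] for j in range(k)) for i in range(k)]
    top = sum(a * b for a, b in zip(v, Gv))
    return frob_sq / top


orthogonal_slots = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
redundant_slots = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
one_dominant = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Fully distinct slots give a stable rank equal to the slot count, identical slots give 1, and a dominant direction pulls the value down (1.5 for `one_dominant`). Read this way, 6.73 at K=16 means the slots carry energy in roughly 7 directions.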
+### The mechanistic synthesis
+The bottleneck architecture has **a low-rank attractor**: the z learned under (LM loss, InfoNCE, strict bottleneck) lives in a roughly 6-to-7-dimensional manifold (stable_rank 6.73 at K=16). Adding slots, harder supervision, or harder data doesn't escape this. The "thinking" the model does is genuinely low-dimensional *under that training objective*.
+But **z is not decorative** — within its low-rank manifold it encodes problem-specific information that's load-bearing for y prediction. The bottleneck during training shapes z to be USEFUL, even if not high-dimensional.
+The breakthrough finding adds a second layer: **z's usefulness transfers to no-bottleneck inference**. With both x access and z access, the model leverages z to make better predictions. **The bottleneck was a training regularizer, not an inference-time architectural commitment.**
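To make the inference-time switch concrete, here is a minimal sketch of a boolean attention mask over a sequence laid out as [x | z | y]. The layout, the function name, and the reading of `block_y_to_x` as "y queries may not attend to x keys" are assumptions drawn from the description above; the repository's actual masking code is not shown in this document.

```python
def build_mask(n_x, n_z, n_y, block_y_to_x=True):
    """mask[q][k] is True when query position q may attend to key position k."""
    n = n_x + n_z + n_y
    y_start = n_x + n_z
    # Causal baseline: each position attends to itself and everything before it.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    if block_y_to_x:
        # Bottleneck: y positions see z (and earlier y) but not x.
        for q in range(y_start, n):
            for k in range(n_x):
                mask[q][k] = False
    return mask


train_mask = build_mask(n_x=3, n_z=2, n_y=2, block_y_to_x=True)
eval_mask = build_mask(n_x=3, n_z=2, n_y=2, block_y_to_x=False)
```

In this sketch the two masks differ only in the y rows: with the bottleneck, the first y position sees the two z slots and itself; without it, the same position also sees all three x positions.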
+---
+## Pre-registered criterion + interpretation note
+The pre-registered H1 was `Δ_random ≥ 15 pp AND Δ_zero ≥ 25 pp` *with the bottleneck active*.
+- GRPO ckpt with bottleneck: passes (15.5 / 52.5)
+- GRPO ckpt without bottleneck: passes more strongly (22.5 / 70.5)
+In both eval modes the architecture is content-load-bearing. The no-bottleneck mode is **out-of-pre-registration** but more practically interesting because it produces a competitive model.
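The criterion is simple enough to encode directly, which also makes the margins visible. The sketch below uses the thresholds and deltas stated above. The function name is hypothetical, and converting a margin into a number of problems assumes every ablation was scored on the same n=200 problems.

```python
def h1_check(delta_random, delta_zero, n=200, thr_random=15.0, thr_zero=25.0):
    """Return pass/fail plus the margin over each threshold (pp and problem counts)."""
    pp_per_problem = 100.0 / n  # 0.5 pp per problem at n=200
    margin_random = delta_random - thr_random
    margin_zero = delta_zero - thr_zero
    return {
        "passes": delta_random >= thr_random and delta_zero >= thr_zero,
        "margin_random_pp": margin_random,
        "margin_random_problems": round(margin_random / pp_per_problem),
        "margin_zero_pp": margin_zero,
        "margin_zero_problems": round(margin_zero / pp_per_problem),
    }


with_block = h1_check(15.5, 52.5)   # GRPO ckpt, bottleneck active
no_block = h1_check(22.5, 70.5)     # GRPO ckpt, bottleneck lifted
sft_only = h1_check(13.5, 50.5)     # Options 1+3 SFT, before GRPO
```

At n=200 one problem is worth 0.5 pp, so under the stated assumption the bottlenecked GRPO checkpoint clears the Δ_random threshold by a single problem, while the no-block evaluation clears it by 15.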
+---
+## Open mechanistic questions worth careful thought (next steps)
+The 77.5% no-block result raises questions we should design experiments around carefully:
+### Q1: Is the no-block lift specific to GSM8K, or general?
+Test: run the no-block ablation on MATH (~15 min, GRPO ckpt, n=100 MATH-test). If the lift transfers (MATH no-block > MATH with-block), the bottleneck-as-regularizer interpretation generalizes. If it doesn't, the GSM8K result may rely on GSM8K-specific structure of z.
+### Q2: At what training stage does the "transferable z" property emerge?
+Hypothesis: it requires the rich-supervision phase (Options 1+3). Test by running no-block eval on:
+- BLT 1.5B SFT final (does the property exist at 1.5B?)
+- BLT 7B pilot final (before Options 1+3 — the recipe with thin InfoNCE)
+- Options 1+3 SFT (51% baseline)
+- GRPO ckpt (52.5% baseline, where we found it)
+If the no-block lift is monotone with training quality, the recipe matters. If 1.5B already has it, the architecture matters more than the recipe.
+### Q3: Can we lift the gap to verbal CoT further with no-block-aware RL?
+Idea: GRPO where rollouts use no-block generation (closer to test-time behavior) and reward is on no-block answer correctness. Current GRPO samples rollouts and computes rewards under the bottleneck, which no longer matches the no-block evaluation setting.
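For context on what no-block rollouts would feed into the update: GRPO scores each rollout relative to the other rollouts sampled for the same prompt. The sketch below shows only that standard group-relative normalization, with binary verifier rewards assumed to come from no-block generation. The function name and epsilon are illustrative, and nothing here reflects the actual contents of `grpo_train.py`.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's rollout group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one prompt; 1.0 means the verifier accepted the final answer.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

A group in which every rollout is correct yields zero advantages and therefore no learning signal. Since no-block accuracy starts at 77.5%, more groups may fall into that case than under bottlenecked rollouts, which is worth checking before committing the 5–8 h.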
+### Q4: Why does training-with-bottleneck → useful z transfer to no-bottleneck inference?
+This is the deepest open question. Hypotheses:
+- (a) The model has two attention pathways (x→y and z→y), each developed during different training conditions. At no-block inference, both fire; their contributions combine.
+- (b) z's representations are *redundant* with x's most-informative directions (because z is computed from x's hidden state). Lifting the bottleneck doesn't reveal new information — it just provides multiple access routes. The lift comes from reduced decoding variance, not new information.
+- (c) The bottlenecked training pushed the y-distribution to be sharp around the z-conditioned prediction. With x access, the model "votes" between z's prediction and a fresh x-based prediction, which is more robust.
+Discriminating (a)/(b)/(c) is mechanistically important — they suggest different scaling paths.
+### Q5: Is there a soft-bottleneck schedule that beats the hard-then-no schedule?
+Hypothesis: replacing the hard mask with a learnable scalar penalty (or scheduled annealing) might give a smoother training trajectory and possibly a better endpoint.
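One way to picture scheduled annealing: replace the hard y→x mask with an additive penalty on the y→x attention logits and shrink that penalty over training. The linear schedule, the starting penalty, and the names below are all assumptions; this illustrates the idea and is not a tested recipe.

```python
import math


def y_to_x_penalty(step, total_steps, start_penalty=20.0):
    """Linearly anneal the logit penalty on y->x attention from start_penalty to 0."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_penalty * (1.0 - frac)


def attention_weights(logits, is_x_key, penalty):
    """Softmax over keys after subtracting the penalty from the logits of x keys."""
    adjusted = [l - penalty if is_x else l for l, is_x in zip(logits, is_x_key)]
    m = max(adjusted)
    exps = [math.exp(a - m) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.0, 0.0]        # one x key and one z key with equal raw scores
is_x_key = [True, False]
early = attention_weights(logits, is_x_key, y_to_x_penalty(0, 1000))
late = attention_weights(logits, is_x_key, y_to_x_penalty(1000, 1000))
```

Early in this schedule almost all attention mass goes to z, approximating the hard mask; by the end the x and z keys are weighted equally, matching no-block inference. A learnable scalar would replace `y_to_x_penalty` with a trained parameter.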
+---
+## Suggested next experiments (ranked by expected information / hour)
+1. **Q1: MATH no-block eval (~15 min).** Decisive test of whether the breakthrough generalizes beyond GSM8K. If positive → the recipe is genuinely useful. If negative → there's GSM8K-specific structure we're exploiting.
+2. **Q2: No-block evals across training stages (~1 hour).** Characterizes when "transferable z" emerges. Cheap and gives a clean curve for the writeup.
+3. **Q3: No-block-aware GRPO (~5–8 h).** Higher upside but speculative — could lift 77.5 → 80–85%. Implementation: a few-line change in `grpo_train.py`'s rollout sampler (pass `block_y_to_x=False`). Reference policy stays bottlenecked (KL anchor unchanged) so the policy gets RL signal aligned to no-block evaluation.
+4. **Q4 mechanistic probes** (variable cost). Direct readout of z's information content via linear probes; comparison of x→y attention weights with/without z available; activation patching to test (a)/(b)/(c).
+5. **Q5: Soft-bottleneck schedule (~5 h).** Implementation cost moderate; outcome uncertain.
+---
+## Reproducibility
+Public HF model repo: https://huggingface.co/LauraGG/blt-reasoner-pilot1
+```
+grpo_opt13/
+  final/
+    model/                    # LoRA adapter (the 77.5% / 52.5% checkpoint)
+    projector.pt              # MLP π (~90M params)
+    head.pt                   # InfoNCE head
+    ablation_n200_K16.json    # AR with bottleneck (52.5%)
+    ablation_no_block_y_to_x.json  # AR WITHOUT bottleneck (77.5%) ← the breakthrough
+    ablation_teacher_forced.json
+    capacity_diagnostic.json
+    rank_on_math.json         # stable_rank=4.12 on MATH problems (OOD)
+exp7b_opt13/                  # SFT phase that built the projector
+pilot7b/                      # original 7B pilot (no Options)
+per_slot_exp/                 # negative result
+controls/                     # ablation controls from 1.5B campaign
+HANDOFF_BLT_BREAKTHROUGH_2026-05-19.md   # this document
+HANDOFF_BLT_REASONER_2026-05-17.md       # earlier writeup
+```
+Resume on a fresh instance:
+```bash
+pip install transformers peft bitsandbytes datasets safetensors huggingface_hub
+# To reproduce the 77.5% number:
+python3 -m experiments.blt_reasoner.eval \
+    --ckpt LauraGG/blt-reasoner-pilot1:grpo_opt13/final \
+    --config experiments/blt_reasoner/configs/grpo_from_opt13.json \
+    --n 200 --K 16 --max_new_tokens 192 --temperature 0.0 \
+    --no_block_y_to_x
+```