LauraGG committed on
Commit 572be28 · verified · 1 Parent(s): e4f8490

HANDOFF v3: 77.5% via bottleneck-as-regularizer at inference

HANDOFF_BLT_BREAKTHROUGH_2026-05-19.md ADDED
@@ -0,0 +1,161 @@
+ # BLT-Reasoner: Bottleneck-as-Regularizer Breakthrough
+
+ **Status:** The campaign hit a major positive result on 2026-05-19: **77.5% on GSM8K-test (n=200)** from the same GRPO checkpoint previously measured at 52.5%, obtained by **lifting the y→only-z attention bottleneck at inference time** while keeping it during training.
+
+ **Artifacts:** https://huggingface.co/LauraGG/blt-reasoner-pilot1
+
+ ---
+
+ ## The headline number
+
+ | Setup | normal-z AR acc | Δ_random | Δ_zero |
+ |---|---|---|---|
+ | GRPO ckpt, eval **with** bottleneck (canonical, pre-registered) | **52.5%** | +15.5 pp ✓ | +52.5 pp ✓ |
+ | **GRPO ckpt, eval WITHOUT bottleneck** | **77.5%** | **+22.5 pp** | **+70.5 pp** |
+
+ - **+25 pp absolute** from flipping one inference-time flag (`block_y_to_x=False`).
+ - Closes most of the 33-pp gap to Qwen2.5-Math-7B-Instruct + verbal CoT (~85%); we're now ~8 pp behind that ceiling.
+ - **z's content is *more* load-bearing, not less.** Δ_random grew from +15.5 to +22.5 pp; Δ_zero grew from +52.5 to +70.5 pp. Random z hurts more, and *no* z is more catastrophic, when the model is also allowed to see x.
+
+ The model has internalized z as a *reasoning aid* during bottlenecked training. At inference, with x also available, it leverages z heavily: not as a substitute for x, but as a structured supplement.
+
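To make the table's arithmetic explicit, here is a minimal sketch that recovers the accuracies implied under each z ablation. It assumes Δ_random and Δ_zero are defined as (normal-z accuracy) minus (accuracy with random z, or with zeroed z); the document does not spell out that definition, and the function and variable names are illustrative only.

```python
def implied_ablation_accuracy(normal_acc_pct, delta_pp):
    """Accuracy (in %) under an ablated z, ASSUMING delta = normal - ablated."""
    return normal_acc_pct - delta_pp


# (normal-z accuracy %, delta_random pp, delta_zero pp), copied from the headline table.
HEADLINE_ROWS = {
    "with bottleneck": (52.5, 15.5, 52.5),
    "without bottleneck": (77.5, 22.5, 70.5),
}


def summarize(rows):
    """Map each eval setup to the accuracies implied by its two deltas."""
    summary = {}
    for name, (normal, d_random, d_zero) in rows.items():
        summary[name] = {
            "random_z_acc": implied_ablation_accuracy(normal, d_random),
            "zero_z_acc": implied_ablation_accuracy(normal, d_zero),
        }
    return summary


if __name__ == "__main__":
    for name, accs in summarize(HEADLINE_ROWS).items():
        print(name, accs)
```

Under that assumed definition, the implied accuracies are 37.0% (random z) and 0.0% (zero z) with the bottleneck, and 55.0% (random z) and 7.0% (zero z) without it.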
+ ---
+
+ ## The campaign in one table
+
+ All runs use Qwen2.5-Math-7B-Instruct + LoRA r=16 on a single GH200, evaluated on GSM8K-test (n=200, AR) unless noted.
+
+ | Recipe | normal AR | Δ_random | Δ_zero | H1 (pre-reg) | Comment |
+ |---|---|---|---|---|---|
+ | Abstract-CoT (prior work, 7B, 24 h) | 57% maj@8 / **MATH-500** | +3 pp | +5 pp | ✗ | Decorative latents; overturned negative result |
+ | BLT 1.5B SFT (pilot 1) | 13% | +13 pp | +13 pp | ✗ | Load-bearing latents at small scale; low absolute |
+ | BLT 7B SFT (pilot 2, no options) | 13% | +13 pp | +13 pp | ✗ | Same Δs as 1.5B; scale alone didn't help absolute |
+ | BLT 7B + leak-closure (block_z→x) | 78% TF / – AR | – | – | ✗ | Closing leak alone insufficient; model regresses to y-prefix |
+ | **BLT 7B + Options 1+3 SFT** | **51.0%** | +13.5 pp | +50.5 pp | ✗ | Full-y InfoNCE + MLP π; ~4× lift in absolute |
+ | **BLT 7B + Options 1+3 + GRPO** | **52.5%** | **+15.5 pp** ✓ | **+52.5 pp** ✓ | **✓** | Pre-registered thresholds CROSSED |
+ | BLT 7B + per-slot multi-objective | 44.0% | +13.5 pp | +43.5 pp | ✗ | NEGATIVE: slot redundancy worsened |
+ | GRPO ckpt on MATH problems | (LM 0.93→0.69) | – | – | – | Training on harder data drives stable_rank DOWN |
+ | **GRPO ckpt, no-block at inference** | **77.5%** | **+22.5 pp** | **+70.5 pp** | **✓** | **BREAKTHROUGH: bottleneck-as-regularizer** |
+
+ ---
+
+ ## What's been learned, mechanistically
+
+ ### Confirmed
+ 1. **Continuous-latent + bottleneck + InfoNCE produces load-bearing z** (Δ_random ≥ 13 pp consistently from 1.5B onward).
+ 2. **MLP π is necessary for compression capacity.** Linear π was a real bottleneck; expanding to d→4d→d, together with the full-y target in item 3, gave +38 pp absolute (13% → 51%).
+ 3. **Full-y InfoNCE target is necessary.** The answer-only target required only ~10 bits of z; the full-y target requires hundreds of bits and drives stable_rank growth during training.
+ 4. **GRPO with verifier reward consolidates.** The absolute lift is small (51.0% → 52.5%), but it crosses the pre-registered Δ_random threshold (+13.5 → +15.5 pp).
+ 5. **Bottleneck-as-regularizer.** Training under a strict bottleneck shapes z into a useful representation; lifting the bottleneck at inference time lets the model use BOTH x and z, producing dramatically better generation (52.5% → 77.5%).
+
+ ### Falsified (negative results, each a real finding)
+ 1. **K=32 won't help.** The stable_rank diagnostic on the K=16 ckpt was 6.73, so slots are already redundant. The perturbation curve was flat.
+ 2. **Per-slot supervision (split y into K chunks, contrastive per slot) HURTS.** It reduced stable_rank further (6.73 → 5.68) and dropped absolute accuracy to 44%. The model finds shortcuts that satisfy the per-slot contrastive loss without actually specializing slots.
+ 3. **Harder data does NOT unlock richer z.** The GRPO ckpt evaluated on MATH problems showed stable_rank=4.12 (DOWN from 6.73 on GSM8K); 500 steps of MATH training drove it further down to 2.82. The architecture has a low-rank attractor independent of training data.
+ 4. **Closing the z→x architectural leak alone is insufficient.** The model regresses to y-prefix autoregression when both bypass paths are blocked and supervision is weak.
+
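For readers unfamiliar with the stable_rank diagnostic quoted above, the following is a minimal sketch of the standard definition, stable rank = ||A||_F^2 / ||A||_2^2. The document does not say which matrix the diagnostic is computed on; treating A as a K×d matrix whose rows are the z slot vectors is an assumption here, and the pure-Python power iteration is only meant for small illustrative inputs.

```python
import math


def stable_rank(matrix, iters=200):
    """Stable rank ||A||_F^2 / ||A||_2^2 of a matrix given as a list of rows.

    ASSUMPTION: rows would be the K latent slot vectors; the real diagnostic
    may use a different matrix (e.g. slots stacked over many problems).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    frob_sq = sum(v * v for row in matrix for v in row)
    # Power iteration on A^T A, starting from a uniform unit vector.
    vec = [1.0 / math.sqrt(n_cols)] * n_cols
    top_sq = 0.0
    for _ in range(iters):
        av = [sum(row[j] * vec[j] for j in range(n_cols)) for row in matrix]
        # For unit vec, ||A vec||^2 is the Rayleigh quotient of A^T A.
        top_sq = sum(v * v for v in av)
        atav = [sum(matrix[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in atav))
        if norm == 0.0:
            break
        vec = [v / norm for v in atav]
    return frob_sq / top_sq if top_sq > 0.0 else 0.0


if __name__ == "__main__":
    print(stable_rank([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]))
    print(stable_rank([[1.0, 2.0], [2.0, 4.0]]))
```

A matrix with k equally strong orthogonal directions has stable rank k, and a rank-1 matrix has stable rank 1, so a value of 6.73 for K=16 slots is consistent with the "slots already redundant" reading.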
+ ### The mechanistic synthesis
+ The bottleneck architecture has **a low-rank attractor**: the optimal z under (LM loss, InfoNCE, strict bottleneck) lives in a roughly 6-7 dimensional manifold (stable_rank 6.73 on GSM8K). Adding slots, harder supervision, or harder data doesn't escape this. The "thinking" the model does is genuinely low-dimensional *under that training objective*.
+
+ But **z is not decorative**: within its low-rank manifold it encodes problem-specific information that's load-bearing for y prediction. The bottleneck during training shapes z to be USEFUL, even if not high-dimensional.
+
+ The breakthrough finding adds a second layer: **z's usefulness transfers to no-bottleneck inference**. With access to both x and z, the model leverages z to make better predictions. **The bottleneck was a training regularizer, not an inference-time architectural commitment.**
+
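The train-with-bottleneck, infer-without idea can be pictured as toggling one region of the attention mask. The sketch below is illustrative only: it assumes a token layout of [x | z | y] with causal attention, where `block_y_to_x=True` stops y positions from attending to x positions. Only the flag name comes from this document; the function, the layout, and the rule that z may see x are assumptions, not the repository's implementation.

```python
def build_attention_mask(n_x, n_z, n_y, block_y_to_x=True):
    """Return mask[q][k] == True when query position q may attend to key k.

    ASSUMED layout: positions [0, n_x) are x, the next n_z are z, the last n_y are y.
    """
    total = n_x + n_z + n_y
    y_start = n_x + n_z
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            allowed = k <= q  # ordinary causal attention
            if block_y_to_x and q >= y_start and k < n_x:
                allowed = False  # bottleneck: y cannot read x, only z and earlier y
            row.append(allowed)
        mask.append(row)
    return mask


if __name__ == "__main__":
    train_mask = build_attention_mask(3, 2, 2, block_y_to_x=True)
    infer_mask = build_attention_mask(3, 2, 2, block_y_to_x=False)
    print(train_mask[5])  # first y position under the bottleneck
    print(infer_mask[5])  # same position with the bottleneck lifted
```

In this picture the two masks differ only in the y-rows over the x-columns, which is why a single inference-time flag is enough to switch modes.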
+ ---
+
+ ## Pre-registered criterion + interpretation note
+
+ The pre-registered H1 was `Δ_random ≥ 15 pp AND Δ_zero ≥ 25 pp` *with the bottleneck active*.
+
+ - GRPO ckpt with bottleneck: passes (15.5 / 52.5)
+ - GRPO ckpt without bottleneck: passes more strongly (22.5 / 70.5)
+
+ In both eval modes the architecture is content-load-bearing. The no-bottleneck mode is **out-of-pre-registration** but more practically interesting because it produces a competitive model.
+
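Because every number here comes from n=200 problems, it can help to attach a sampling interval to each accuracy. The sketch below computes a 95% Wilson score interval; it assumes the 200 test items are independent Bernoulli trials and that 77.5% and 52.5% correspond to 155 and 105 correct answers. It is a generic statistical aid, not part of the pre-registration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # ASSUMPTION: 77.5% and 52.5% of n=200 mean 155 and 105 correct answers.
    for label, correct in (("no-block", 155), ("with-block", 105)):
        low, high = wilson_interval(correct, 200)
        print(f"{label}: {low:.3f} to {high:.3f}")
```

At this sample size each interval spans roughly ±6 to ±7 pp, so the 25 pp no-block lift sits well outside sampling noise, while a 0.5 pp margin over a threshold is better read with caution.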
+ ---
+
+ ## Open mechanistic questions worth careful thought (next steps)
+
+ The 77.5% no-block result raises questions that deserve carefully designed experiments:
+
+ ### Q1: Is the no-block lift specific to GSM8K, or general?
+
+ Test: run the no-block ablation on MATH (~15 min, GRPO ckpt, n=100 MATH-test). If the lift transfers (i.e., MATH no-block > MATH with-block), the bottleneck-as-regularizer interpretation generalizes. If it doesn't, the GSM8K result may rely on GSM8K-specific structure of z.
+
+ ### Q2: At what training stage does the "transferable z" property emerge?
+
+ Hypothesis: it requires the rich-supervision phase (Options 1+3). Test by running no-block eval on:
+ - BLT 1.5B SFT final (does the property exist at 1.5B?)
+ - BLT 7B pilot final (before Options 1+3: the recipe with thin InfoNCE)
+ - Options 1+3 SFT (51% baseline)
+ - GRPO ckpt (52.5% baseline, where we found it)
+
+ If the no-block lift is monotone with training quality, the recipe matters. If 1.5B already has it, the architecture matters more than the recipe.
+
+ ### Q3: Can we close the gap to verbal CoT further with no-block-aware RL?
+
+ Idea: run GRPO where rollouts use no-block generation (closer to test-time behavior) and the reward is no-block answer correctness. Current GRPO trains and rewards under the bottleneck, which is now suboptimal given the train/eval distribution shift.
+
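As a reminder of what the reward signal in Q3 would feed into, here is a minimal sketch of the group-relative advantage that GRPO is generally described as using: rewards for a group of rollouts on the same prompt are standardized within the group. Whether `grpo_train.py` computes it exactly this way is not stated in this document; `verifier_reward` is a hypothetical exact-match checker added purely for illustration.

```python
import math


def verifier_reward(predicted_answer, gold_answer):
    """HYPOTHETICAL verifier: 1.0 on exact match after trimming whitespace."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one problem; under Q3 these would be no-block generations.
    rollouts = ["18", "20", " 18 ", "17"]
    rewards = [verifier_reward(ans, "18") for ans in rollouts]
    print(rewards)
    print(group_relative_advantages(rewards))
```

A group in which every rollout gets the same reward yields all-zero advantages, so switching rollouts to no-block generation would change which problems carry gradient signal.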
+ ### Q4: Why does the useful z produced by training with the bottleneck transfer to no-bottleneck inference?
+
+ This is the deepest open question. Hypotheses:
+ - (a) The model has two attention pathways (x→y and z→y), each developed under different training conditions. At no-block inference, both fire and their contributions combine.
+ - (b) z's representations are *redundant* with x's most-informative directions (because z is computed from x's hidden state). Lifting the bottleneck doesn't reveal new information; it just provides multiple access routes. The lift comes from reduced decoding variance, not new information.
+ - (c) The bottlenecked training pushed the y-distribution to be sharp around the z-conditioned prediction. With x access, the model "votes" between z's prediction and a fresh x-based prediction, which is more robust.
+
+ Discriminating (a)/(b)/(c) is mechanistically important: they suggest different scaling paths.
+
+ ### Q5: Is there a soft-bottleneck schedule that beats the hard-then-no schedule?
+
+ Hypothesis: replacing the hard mask with a learnable scalar penalty (or scheduled annealing) might give a smoother training trajectory and possibly a better endpoint.
+
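One way to picture the "scheduled annealing" variant of Q5 is an additive bias on the y→x attention logits that starts strongly negative (approximating the hard mask) and relaxes linearly to zero. This is a sketch under assumptions: the function name, the linear shape, and the default start value of -20 are all illustrative choices, not something the campaign has run.

```python
def y_to_x_logit_bias(step, total_steps, start_bias=-20.0, end_bias=0.0):
    """Linearly anneal an additive bias applied to y->x attention logits.

    ASSUMPTION: a bias of -20 stands in for the hard mask (whose bias would be
    -inf); an end bias of 0 means the bottleneck is fully lifted.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start_bias + fraction * (end_bias - start_bias)


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, y_to_x_logit_bias(step, 1000))
```

A learnable scalar penalty, the other option named above, would replace this fixed schedule with a trained parameter; the sketch covers only the scheduled case.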
+ ---
+
+ ## Suggested next experiments (ranked by expected information / hour)
+
+ 1. **Q1: MATH no-block eval (~15 min).** Decisive test of whether the breakthrough generalizes beyond GSM8K. If positive → the recipe is genuinely useful. If negative → there's GSM8K-specific structure we're exploiting.
+
+ 2. **Q2: No-block evals across training stages (~1 hour).** Characterizes when "transferable z" emerges. Cheap and gives a clean curve for the writeup.
+
+ 3. **Q3: No-block-aware GRPO (~5–8 h).** Higher upside but speculative: could lift 77.5% → 80–85%. Implementation: a few-line change in `grpo_train.py`'s rollout sampler (pass `block_y_to_x=False`). The reference policy stays bottlenecked (KL anchor unchanged), so the policy gets an RL signal aligned to no-block evaluation.
+
+ 4. **Q4: Mechanistic probes (variable cost).** Direct readout of z's information content via linear probes; comparison of x→y attention weights with/without z available; activation patching to test (a)/(b)/(c).
+
+ 5. **Q5: Soft-bottleneck schedule (~5 h).** Implementation cost moderate; outcome uncertain.
+
+ ---
+
+ ## Reproducibility
+
+ Public HF model repo: https://huggingface.co/LauraGG/blt-reasoner-pilot1
+
+ ```
+ grpo_opt13/
+ final/
+ model/ # LoRA adapter (the 77.5% / 52.5% checkpoint)
+ projector.pt # MLP π (~90M params)
+ head.pt # InfoNCE head
+ ablation_n200_K16.json # AR with bottleneck (52.5%)
+ ablation_no_block_y_to_x.json # AR WITHOUT bottleneck (77.5%) ← the breakthrough
+ ablation_teacher_forced.json
+ capacity_diagnostic.json
+ rank_on_math.json # stable_rank=4.12 on MATH problems (OOD)
+ exp7b_opt13/ # SFT phase that built the projector
+ pilot7b/ # original 7B pilot (no Options)
+ per_slot_exp/ # negative result
+ controls/ # ablation controls from 1.5B campaign
+ HANDOFF_BLT_BREAKTHROUGH_2026-05-19.md # this document
+ HANDOFF_BLT_REASONER_2026-05-17.md # earlier writeup
+ ```
+
+ Resume on a fresh instance:
+
+ ```bash
+ pip install transformers peft bitsandbytes datasets safetensors huggingface_hub
+ # To reproduce the 77.5% number:
+ python3 -m experiments.blt_reasoner.eval \
+     --ckpt LauraGG/blt-reasoner-pilot1:grpo_opt13/final \
+     --config experiments/blt_reasoner/configs/grpo_from_opt13.json \
+     --n 200 --K 16 --max_new_tokens 192 --temperature 0.0 \
+     --no_block_y_to_x
+ ```