HANDOFF v3: 77.5% via bottleneck-as-regularizer at inference
HANDOFF_BLT_BREAKTHROUGH_2026-05-19.md
# BLT-Reasoner: Bottleneck-as-Regularizer Breakthrough

**Status:** Campaign hit a major positive result on 2026-05-19: **77.5% on GSM8K-test (n=200)** from the same GRPO checkpoint we previously measured at 52.5%, by **lifting the y→only-z attention bottleneck at inference time** while keeping it during training.

**Artifacts:** https://huggingface.co/LauraGG/blt-reasoner-pilot1

---
## The headline number

| Setup | normal-z AR acc | Δ_random | Δ_zero |
|---|---|---|---|
| GRPO ckpt, eval **with** bottleneck (canonical, pre-registered) | **52.5%** | +15.5 pp ✓ | +52.5 pp ✓ |
| **GRPO ckpt, eval WITHOUT bottleneck** | **77.5%** | **+22.5 pp** | **+70.5 pp** |

- **+25 pp absolute** by flipping one inference-time flag (`block_y_to_x=False`).
- Closes most of the 33-pp gap to Qwen2.5-Math-7B-Instruct + verbal CoT (~85%); we're now ~8 pp behind that ceiling.
- **z's content is *more* load-bearing, not less.** Δ_random grew from +15.5 to +22.5; Δ_zero grew from +52.5 to +70.5. Random z hurts more, and *no* z is more catastrophic, when the model is allowed to also see x.

The model has internalized z as a *reasoning aid* during bottlenecked training. At inference, with x also available, it leverages z heavily: not as a substitute for x, but as a structured supplement.

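
To make the flag concrete, here is a minimal sketch of how a `block_y_to_x` switch could change a causal attention mask. It assumes a token layout of `[x | z | y]` (problem tokens, latent slots, answer tokens) and a boolean mask in which `True` means the query position may attend to the key position. The function name `build_mask` and the layout are illustrative assumptions, not the repository's actual implementation.

```python
def build_mask(n_x, n_z, n_y, block_y_to_x=True):
    """Causal mask over [x | z | y]; mask[q][k] is True if query q may attend to key k."""
    n = n_x + n_z + n_y
    y_start = n_x + n_z
    # Start from a plain causal mask: each position sees itself and everything before it.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    if block_y_to_x:
        # Bottleneck (assumed form): answer positions may not look back at the problem tokens.
        for q in range(y_start, n):
            for k in range(n_x):
                mask[q][k] = False
    return mask

# Toy sizes: 3 problem tokens, 2 latent slots, 2 answer tokens.
trained = build_mask(3, 2, 2, block_y_to_x=True)     # mask assumed during training
evaluated = build_mask(3, 2, 2, block_y_to_x=False)  # mask assumed for the no-block eval
first_y = 3 + 2
print(trained[first_y])    # first answer token: z and itself visible, x hidden
print(evaluated[first_y])  # first answer token: x, z, and itself visible
```
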

---

## The campaign in one table

All on Qwen2.5-Math-7B-Instruct + LoRA r=16, single GH200, GSM8K-test n=200 AR unless noted.

| Recipe | normal AR | Δ_random | Δ_zero | H1 (pre-reg) | Comment |
|---|---|---|---|---|---|
| Abstract-CoT (prior work, 7B, 24 h) | 57% maj@8 / **MATH-500** | +3 pp | +5 pp | ✗ | Decorative latents; overturned negative result |
| BLT 1.5B SFT (pilot 1) | 13% | +13 pp | +13 pp | ✗ | Load-bearing latents at small scale; low absolute |
| BLT 7B SFT (pilot 2, no options) | 13% | +13 pp | +13 pp | ✗ | Same Δs as 1.5B; scale alone didn't help absolute |
| BLT 7B + leak-closure (block_z→x) | 78% TF / n/a AR | n/a | n/a | n/a | Closing leak alone insufficient; model regresses to y-prefix |
| **BLT 7B + Options 1+3 SFT** | **51.0%** | +13.5 | +50.5 | ✗ | Full-y InfoNCE + MLP φ → 4× lift in absolute |
| **BLT 7B + Options 1+3 + GRPO** | **52.5%** | **+15.5** ✓ | **+52.5** ✓ | **✓** | Pre-registered thresholds CROSSED |
| BLT 7B + per-slot multi-objective | 44.0% | +13.5 | +43.5 | ✗ | NEGATIVE: slot redundancy worsened |
| GRPO ckpt on MATH problems | (LM 0.93→0.69) | n/a | n/a | n/a | Training on harder data drives stable_rank DOWN |
| **GRPO ckpt, no-block at inference** | **77.5%** | **+22.5** | **+70.5** | **✓** | **BREAKTHROUGH: bottleneck-as-regularizer** |

---
## What's been learned, mechanistically

### Confirmed
1. **Continuous-latent + bottleneck + InfoNCE produces load-bearing z** (Δ_random ≥ 13 pp consistently from 1.5B onward).
2. **MLP φ is necessary for compression capacity.** Linear φ was a real bottleneck; expanding to d→4d→d gave +38 pp absolute (13% → 51%).
3. **Full-y InfoNCE target is necessary.** Answer-only target only required ~10 bits of z; full-y target requires ~hundreds of bits and drives stable_rank growth during training.
4. **GRPO with verifier reward consolidates:** small absolute lift, but crosses the pre-registered Δ_random threshold.
5. **Bottleneck-as-regularizer.** Training under strict bottleneck shapes z into a useful representation; lifting the bottleneck at inference time lets the model use BOTH x and z, producing dramatically better generation.

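
For reference, the InfoNCE objective named in items 1 and 3 can be sketched as a batch-contrastive cross-entropy: each latent summary should score its own target higher than the other targets in the batch. The version below is a generic pure-Python sketch under assumed names (`info_nce`, a dot-product score, a temperature `tau`). The campaign's actual head, pooling, and target encoding are not specified in this note.

```python
import math

def info_nce(z_vecs, y_vecs, tau=0.1):
    """Mean InfoNCE loss for paired rows: z_vecs[i] should match y_vecs[i]."""
    losses = []
    for i, z in enumerate(z_vecs):
        # Similarity of this latent to every target in the batch (dot product / temperature).
        logits = [sum(a * b for a, b in zip(z, y)) / tau for y in y_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching target as the positive class.
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)

aligned = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(aligned, aligned))        # small: every z picks out its own target
print(info_nce(aligned, aligned[::-1]))  # large: every z prefers the wrong target
```

With a batch of N pairs, a z that carries no information about its target gives a loss of log N, so the loss can only drop below that by encoding target-specific content.
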
### Falsified (negative results, each a real finding)
1. **K=32 won't help.** Stable_rank diagnostic on K=16 ckpt was 6.73, so slots are already redundant. Perturbation curve was flat.
2. **Per-slot supervision (split y into K chunks, contrastive per slot) HURTS.** Reduced stable_rank further (6.73 → 5.68), dropped absolute accuracy to 44%. The model finds shortcuts that satisfy per-slot contrastive without actually specializing slots.
3. **Harder data does NOT unlock richer z.** GRPO ckpt evaluated on MATH problems showed stable_rank=4.12 (DOWN from 6.73 on GSM8K); 500 steps of MATH training drove it further down to 2.82. Architecture has a low-rank attractor independent of training data.
4. **Closing the z→x architectural leak alone is insufficient.** Model regresses to y-prefix autoregression when both bypass paths are blocked and supervision is weak.

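
The `stable_rank` figures quoted above are not defined in this note. A common definition, assumed here, is the squared Frobenius norm divided by the squared spectral norm of a matrix whose rows are latent vectors. A minimal pure-Python sketch follows, using power iteration for the top singular value; `stable_rank` and its arguments are illustrative names, and the campaign's diagnostic script may compute the quantity differently.

```python
import math
import random

def stable_rank(rows, iters=200, seed=0):
    """Estimate ||A||_F^2 / ||A||_2^2 for a matrix given as a list of equal-length rows."""
    fro_sq = sum(x * x for row in rows for x in row)
    if fro_sq == 0.0:
        return 0.0
    n_cols = len(rows[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    # Power iteration on A^T A converges to the top right singular vector.
    for _ in range(iters):
        av = [sum(a * b for a, b in zip(row, v)) for row in rows]                 # A v
        w = [sum(row[j] * s for row, s in zip(rows, av)) for j in range(n_cols)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("power iteration collapsed; try another seed")
        v = [x / norm for x in w]
    av = [sum(a * b for a, b in zip(row, v)) for row in rows]
    top_sq = sum(x * x for x in av)  # squared top singular value, since v has unit norm
    return fro_sq / top_sq

# Toy matrix with one dominant direction: three slots, but a stable rank close to 1.
slots = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(stable_rank(slots))  # roughly 11 / 9
```

Under this definition the value ranges from 1 (all energy in one direction) up to the matrix rank (energy spread evenly), which is why a reading of about 6.7 for K=16 slots is interpreted above as redundancy.
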
### The mechanistic synthesis

The bottleneck-architecture has **a low-rank attractor**: the optimal z under (LM loss, InfoNCE, strict bottleneck) lives in a ~6-7 dimensional manifold. Adding slots, harder supervision, or harder data doesn't escape this. The "thinking" the model does is genuinely low-dimensional *under that training objective*.

But **z is not decorative**: within its low-rank manifold it encodes problem-specific information that's load-bearing for y prediction. The bottleneck during training shapes z to be USEFUL, even if not high-dimensional.

The breakthrough finding adds a second layer: **z's usefulness transfers to no-bottleneck inference**. With both x access and z access, the model leverages z to make better predictions. **The bottleneck was a training regularizer, not an inference-time architectural commitment.**

---
## Pre-registered criterion + interpretation note

The pre-registered H1 was `Δ_random ≥ 15 pp AND Δ_zero ≥ 25 pp` *with the bottleneck active*.

- GRPO ckpt with bottleneck: passes (15.5 / 52.5)
- GRPO ckpt without bottleneck: passes more strongly (22.5 / 70.5)

In both eval modes the architecture is content-load-bearing. The no-bottleneck mode is **out-of-pre-registration** but more practically interesting because it produces a competitive model.

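
The criterion can be written down as a small check. This sketch takes the Δ values exactly as reported in the tables, in percentage points, and only encodes the two thresholds. `passes_h1` is an illustrative name, not a function from the codebase.

```python
def passes_h1(delta_random, delta_zero, min_random=15.0, min_zero=25.0):
    """Pre-registered H1: both ablation deltas (in pp) must clear their thresholds."""
    return delta_random >= min_random and delta_zero >= min_zero

# (delta_random, delta_zero) as reported in this note.
reported = {
    "Options 1+3 SFT, with bottleneck": (13.5, 50.5),
    "GRPO ckpt, with bottleneck": (15.5, 52.5),
    "GRPO ckpt, without bottleneck": (22.5, 70.5),
}
for name, (d_rand, d_zero) in reported.items():
    print(f"{name}: H1 {'passes' if passes_h1(d_rand, d_zero) else 'fails'}")
```
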

---

## Open mechanistic questions worth careful thought (next steps)

The 77.5% no-block result raises questions we should design experiments around carefully:

### Q1: Is the no-block lift specific to GSM8K, or general?

Hypothesis to test: run no-block ablation on MATH (~15 min, GRPO ckpt, n=100 MATH-test). If the lift transfers (e.g., MATH no-block > MATH with-block), the bottleneck-as-regularizer interpretation generalizes. If it doesn't, the GSM8K result may rely on GSM8K-specific structure of z.

### Q2: At what training stage does the "transferable z" property emerge?

Hypothesis: it requires the rich-supervision phase (Options 1+3). Test by running no-block eval on:
- BLT 1.5B SFT final (does the property exist at 1.5B?)
- BLT 7B pilot final (before Options 1+3; the recipe with thin InfoNCE)
- Options 1+3 SFT (51% baseline)
- GRPO ckpt (52.5% baseline, where we found it)

If the no-block lift is monotone with training quality, the recipe matters. If 1.5B already has it, the architecture matters more than the recipe.

### Q3: Can we lift the gap to verbal CoT further with no-block-aware RL?

Idea: GRPO where rollouts use no-block generation (closer to test-time behavior) and reward is on no-block answer correctness. Current GRPO trains and rewards under bottleneck; that's now suboptimal given the eval distribution shift.

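
A minimal sketch of the reward side of this idea, assuming the standard GRPO recipe in which each rollout's advantage is its reward standardized within the group sampled for the same prompt. The verifier is a stand-in exact-match check, and `sample_fn` with its `block_y_to_x` keyword is a hypothetical interface, not the signature used in `grpo_train.py`.

```python
import statistics

def verifier_reward(predicted, gold):
    """Stand-in verifier: 1.0 for an exact answer match, else 0.0."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: rewards standardized within one prompt's rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def no_block_rollout_advantages(sample_fn, prompt, gold, group_size=8):
    """Sample a group WITHOUT the bottleneck and score it on no-block correctness."""
    answers = [sample_fn(prompt, block_y_to_x=False) for _ in range(group_size)]
    rewards = [verifier_reward(a, gold) for a in answers]
    return group_advantages(rewards)

# Toy sampler that replays canned answers; a real one would call the model.
canned = iter(["42", "41", "42", "7"])
def fake_sampler(prompt, block_y_to_x):
    return next(canned)

print(no_block_rollout_advantages(fake_sampler, "toy prompt", "42", group_size=4))
```

If every rollout in a group gets the same reward, all advantages are zero and that prompt contributes no gradient, so the share of prompts with mixed outcomes under no-block sampling is worth logging.
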
### Q4: Why does training-with-bottleneck → useful z transfer to no-bottleneck inference?

This is the deepest open question. Hypotheses:
- (a) The model has two attention pathways (x→y and z→y), each developed during different training conditions. At no-block inference, both fire; their contributions combine.
- (b) z's representations are *redundant* with x's most-informative directions (because z is computed from x's hidden state). Lifting the bottleneck doesn't reveal new information; it just provides multiple access routes. The lift comes from reduced decoding variance, not new information.
- (c) The bottlenecked training pushed the y-distribution to be sharp around the z-conditioned prediction. With x access, the model "votes" between z's prediction and a fresh x-based prediction, which is more robust.

Discriminating (a)/(b)/(c) is mechanistically important: they suggest different scaling paths.

### Q5: Is there a soft-bottleneck schedule that beats the hard-then-no schedule?

Hypothesis: replacing the hard mask with a learnable scalar penalty (or scheduled annealing) might give a smoother training trajectory and possibly a better endpoint.

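
One way to picture the scheduled variant: replace the hard mask on y→x attention logits with a finite additive penalty that is annealed toward zero over training. The sketch below assumes a linear schedule and an additive-bias formulation. `penalty_at`, `start_penalty`, and the step counts are illustrative choices, and nothing here has been tried in the campaign.

```python
import math

def penalty_at(step, total_steps, start_penalty=20.0, end_penalty=0.0):
    """Linearly anneal the y->x logit penalty from start_penalty to end_penalty."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_penalty + frac * (end_penalty - start_penalty)

def y_attention_weights(logits, is_x_key, penalty):
    """Softmax over keys for one answer-token query, with x keys penalized."""
    biased = [l - penalty if is_x else l for l, is_x in zip(logits, is_x_key)]
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Toy query with equal raw logits on two x keys and two z keys.
logits = [1.0, 1.0, 1.0, 1.0]
is_x_key = [True, True, False, False]
for step in (0, 500, 1000):
    p = penalty_at(step, 1000)
    w = y_attention_weights(logits, is_x_key, p)
    print(f"step={step} penalty={p:.1f} attention mass on x={sum(w[:2]):.6f}")
```
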

---

## Suggested next experiments (ranked by expected information / hour)

1. **Q1: MATH no-block eval (~15 min).** Decisive test of whether the breakthrough generalizes beyond GSM8K. If positive → the recipe is genuinely useful. If negative → there's GSM8K-specific structure we're exploiting.

2. **Q2: No-block evals across training stages (~1 hour).** Characterizes when "transferable z" emerges. Cheap and gives a clean curve for the writeup.

3. **Q3: No-block-aware GRPO (~5–8 h).** Higher upside but speculative: could lift 77.5 → 80–85%. Implementation: a few-line change in `grpo_train.py`'s rollout sampler (pass `block_y_to_x=False`). Reference policy stays bottlenecked (KL anchor unchanged) so the policy gets RL signal aligned to no-block evaluation.

4. **Q4 mechanistic probes** (variable cost). Direct readout of z's information content via linear probes; comparison of x→y attention weights with/without z available; activation patching to test (a)/(b)/(c).

5. **Q5: Soft-bottleneck schedule (~5 h).** Implementation cost moderate; outcome uncertain.

---
## Reproducibility

Public HF model repo: https://huggingface.co/LauraGG/blt-reasoner-pilot1

```
grpo_opt13/
final/
model/ # LoRA adapter (the 77.5% / 52.5% checkpoint)
projector.pt # MLP φ (~90M params)
head.pt # InfoNCE head
ablation_n200_K16.json # AR with bottleneck (52.5%)
ablation_no_block_y_to_x.json # AR WITHOUT bottleneck (77.5%), the breakthrough
ablation_teacher_forced.json
capacity_diagnostic.json
rank_on_math.json # stable_rank=4.12 on MATH problems (OOD)
exp7b_opt13/ # SFT phase that built the projector
pilot7b/ # original 7B pilot (no Options)
per_slot_exp/ # negative result
controls/ # ablation controls from 1.5B campaign
HANDOFF_BLT_BREAKTHROUGH_2026-05-19.md # this document
HANDOFF_BLT_REASONER_2026-05-17.md # earlier writeup
```

Resume on a fresh instance:

```bash
pip install transformers peft bitsandbytes datasets safetensors huggingface_hub
# To reproduce the 77.5% number:
python3 -m experiments.blt_reasoner.eval \
  --ckpt LauraGG/blt-reasoner-pilot1:grpo_opt13/final \
  --config experiments/blt_reasoner/configs/grpo_from_opt13.json \
  --n 200 --K 16 --max_new_tokens 192 --temperature 0.0 \
  --no_block_y_to_x
```

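
If only the GRPO checkpoint is needed, its files can be fetched with `huggingface_hub.snapshot_download`, which accepts an `allow_patterns` filter. The repo id and subfolder come from the eval command above. Whether the eval script also accepts a local directory for `--ckpt` is not stated in this note, so treat the helper names below (`checkpoint_patterns`, `fetch_checkpoint`) as a hypothetical convenience rather than part of the codebase.

```python
def checkpoint_patterns(subfolder="grpo_opt13/final"):
    """Glob patterns that select one checkpoint subfolder of the repo."""
    return [f"{subfolder.rstrip('/')}/*"]

def fetch_checkpoint(repo_id="LauraGG/blt-reasoner-pilot1", subfolder="grpo_opt13/final"):
    """Download only the requested subfolder and return the local snapshot path."""
    # Imported lazily so the helper above works without the library; this call needs network access.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, allow_patterns=checkpoint_patterns(subfolder))

print(checkpoint_patterns())
# local_dir = fetch_checkpoint()   # uncomment on a machine with network access
```
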