# Synth ALiBi × {offset, placement} ablation — explains 75M transfer failure

## Setup

E1-Bk0 (softmax + ±1 + K-shift), d=128/4H/4L, n_symbols=24, 3000 steps,
batch=64, seed=42. CPU run (no GPU available locally; VM down).

### Offset axis (placement = L0)

| ALiBi | offset | copy_acc | Global IPMR | Top-per-head |
|---|---|---|---|---|
| ON | 1 | 1.6% | 0.00% | (collapsed) |
| ON | 2 | 1.7% | 0.00% | (collapsed) |
| OFF | 1 | 88-92% | 95.86% | L3H3 69%, L0 ~46% |
| **OFF** | **2** | **97-98%** | **99.86%** | **L2H2 97%, L3H2 93%** |

### Placement axis (offset = 1)

| ALiBi | K-shift layer | copy_acc | Global IPMR | Top-per-head |
|---|---|---|---|---|
| ON | L0 | 1.6% | 0.00% | (collapsed) |
| ON | L3 (last) | 0.9-1.6% | 0.00% | (collapsed) |
| OFF | L0 | 88-92% | 95.86% | L3H3 69% |
| **OFF** | **L3 (last)** | **90%** | **98.98%** | **L1H3 82%, L1H0 81%** |

### Combined finding

**ALiBi flattens BOTH offset and placement axes** — at synth scale, every
ALiBi cell collapses to 0% IPMR regardless of K-shift configuration. The
"late placement is better" and "offset>1 is better" findings exist ONLY
in --no-alibi runs.

## Key findings

### ALiBi flattens the offset axis at synth scale

With ALiBi enabled, BOTH offsets collapse to ~2% copy_acc, 0% IPMR. The
offset advantage seen in prior synth experiments was an artifact of
running with `--no-alibi`. ALiBi blocks long-distance attention regardless
of K-shift offset.

### Without ALiBi, offset=2 still wins (replicates prior)

offset=2 → 99.86% IPMR with L2H2 at 97% top-per-head. offset=1 → 95.86%
IPMR with L3H3 at only 69%. Mechanism: offset=1's K-vector smearing
(K[t] = W_K(x[t] + x[t-1])) creates near-identical K vectors at adjacent
positions, which sign-STE can't disambiguate. offset=2 spaces them out.

### Why this explains the 75M transfer failure

| Regime | Distance pattern | offset=2 advantage | Empirical |
|---|---|---|---|
| Synth + no-ALiBi (fixed dist 24) | long fixed | LARGE (+30pp top-head) | known |
| **Synth + ALiBi (no induction at all)** | **suppressed** | **none (both 0%)** | **this expt** |
| 75M + ALiBi (natural variable dist 1-15) | short variable | none/reversed | known |

The 75M empirical result (offset=1 beats offset=2 by 0.014 BPC at 5K steps)
makes sense: at 75M with ALiBi, induction work happens at short distances
where offset=1 (= prev-token, Olsson canonical) is genuinely correct.
offset=2's advantage was specific to the no-ALiBi long-fixed-distance synth
task — it doesn't transfer to natural text + ALiBi.

## Full 9-seed × 4-placement sweep — LATE PLACEMENT WINS MONOTONICALLY

Sweep (seeds 0, 1, 2, 7, 13, 21, 42, 99, 100; placements L0-L3; 4-layer
softmax+K-shift+ALiBi, n_sym=8, offset=1, 3000 steps).

### Per-seed results

| seed | L0 IPMR | L1 IPMR | L2 IPMR | L3 IPMR |
|---|---|---|---|---|
| 0 | 0.7% | 0.2% | **100.0%** | **92.2%** |
| 1 | 13.6% | 43.3% | 8.0% | 2.5% |
| 2 | 24.3% | 19.9% | **80.6%** | **100.0%** |
| 7 | 0.0% | 0.0% | 0.2% | **100.0%** |
| 13 | 3.8% | 2.5% | **100.0%** | **99.5%** |
| 21 | 11.4% | 0.0% | 0.5% | **100.0%** |
| 42 | 0.0% | **96.2%** | 0.2% | 0.9% |
| 99 | 0.0% | **74.1%** | 0.7% | **100.0%** |
| 100 | 0.2% | **99.8%** | 9.6% | 4.5% |

### Aggregate

| L | mean copy | mean IPMR | seeds with >50% IPMR |
|---|---|---|---|
| L0 (first) | 42.2% | 6.0% | **0/9** |
| L1 | 64.2% | 37.3% | 3/9 |
| L2 | 55.6% | 33.3% | 3/9 |
| **L3 (last)** | **80.7%** | **66.6%** | **6/9** |

### Conclusions (4L)

**Monotonic improvement from L0 to L3.** L3 (last layer) reaches strong
induction on 6/9 seeds; L0 reaches it on 0/9. Mid-layers L1/L2 are
intermediate and lottery-distributed.

## 8-LAYER VERIFICATION — L0 is uniquely bad, ALL others work

3 seeds × {L0, L4, L7}:

| seed | L0 IPMR | L4 IPMR | L7 IPMR |
|---|---|---|---|
| 42 | 21.65% | **100.00%** | **99.78%** |
| 7 | 1.79% | **98.66%** | **99.33%** |
| 100 | 14.29% | **100.00%** | **99.11%** |

| Placement | Mean IPMR | Seeds with >50% IPMR |
|---|---|---|
| L0 | 12.6% | **0/3** |
| L4 (middle) | **99.6%** | **3/3 deterministic** |
| L7 (last) | **99.4%** | **3/3 deterministic** |

### Full 8L sweep at seed=42 — locates the cliff

| Layer | IPMR |
|---|---|
| L0 | 21.65% |
| **L1** | **99.78%** ← cliff |
| L2 | 87.05% |
| L3 | 100.00% |
| L4 | 100.00% |
| L5 | 100.00% |
| L6 | 95.98% |
| L7 | 99.78% |

**L0 is uniquely bad. Every other placement (L1-L7) works at seed=42.**
The cliff is sharp: between L0 (22% IPMR) and L1 (99.8% IPMR).

### L1 reliability check (4 more seeds at 8L L=1)

| seed | 8L L0 IPMR | 8L L1 IPMR |
|---|---|---|
| 0 | (3.8% at 4L L0) | **100.0%** |
| 1 | (13.6% at 4L L0) | 61.8% |
| 7 | 1.8% | **95.3%** |
| 42 | 21.7% | **99.8%** |
| 100 | 14.3% | **17.4%** ← FAILS |

L1 works on 4/5 seeds at 8L (>50% IPMR), but seed=100 hits a near-L0
trough (17.4% — barely better than its L0 14.3%). So **L1 is mostly
reliable but still has lottery-fail tail**. L4 and L7 work on 3/3 seeds
tested — strictly more reliable than L1.

### Updated picture at 8L

| Placement | Seeds with >50% IPMR | Reliability |
|---|---|---|
| L0 | 0/3 | always fails |
| L1 | 4/5 | mostly reliable, occasional fail |
| L4 (middle) | 3/3 | deterministic |
| L7 (last) | 3/3 | deterministic |

The cliff between L0 and L1 holds (L1 is much better than L0 on average),
but **L4-L7 are strictly more reliable than L1**. For 75M, the simplest
"1-line change" L0 → L1 might still hit a bad seed; safer to go L0 → L5+.

This matches the original synth-no-ALiBi finding ("K-shift on any non-L0
layer reaches ~98-99% copy_acc"). The pattern is robust across:
- ALiBi vs no-ALiBi
- Long (n=24) vs short (n=8) induction distance
- 4 layers vs 8 layers (4L lottery vanishes at 8L)

**Conclusion: K-shift L0 is uniquely suboptimal under sign-STE.** ANY
other placement enables sign-STE attention to form induction.

For 75M (12 layers): K-shift L1 (or any of L1-L11) should reliably beat
L0. Single-seed 75M comparison should show clear ordering.

## Placement axis prediction — SHARPENED at synth+ALiBi+short-distance (UNRELIABLE)

The pathological synth+ALiBi+n=24 result (everything = 0% IPMR) doesn't
predict 75M behavior, where induction lives at distances 1-15 (ALiBi-
permitted range).

Re-ran synth at **n_symbols=8** (mimicking 75M's distance regime) with
ALiBi enabled, sweeping placement and offset:

### Placement sweep (offset=1, n=8, ALiBi ON, seed=42)

| K-shift layer | copy_acc | Global IPMR |
|---|---|---|
| L0 (current v76) | 9.2% | 0.00% |
| **L1** | **98.0%** | **96.21%** |
| L2 | 15.8% | 0.22% |
| L3 (last) | 32.6% | 0.89% |

### Offset sweep (placement=L0, n=8, ALiBi ON, seed=42)

| offset | copy_acc | IPMR |
|---|---|---|
| 1 | 9.2% | 0.00% |
| 2 | 37.9% | 0.67% |
| 3 | 32.4% | 0.00% |
| 5 | 17.6% | 4.02% |

### Key insight

At synth+ALiBi+short-distance, **K-shift L1 is dramatically better than L0**
(96% vs 0% IPMR; 98% vs 9% copy_acc). L0 is uniquely bad. L2 and L3 are
mediocre. **The story is "L1 specifically wins", not "later layers win".**

This is a sharper prediction than the prior outline's "L_(n-1) wins" claim
(which came from synth+no-ALiBi+long-distance). For 75M (n_layers=12), the
analog is K-shift L1 (second layer), NOT L11 (last layer).

### Mechanism hypothesis

L0 receives raw token embeddings — limited positional structure. K-shift
on L1 sees post-L0-attention features, which are content-mixed and provide
richer K-vector geometry for sign-STE to discriminate.

L_late layers don't help because by then the residual stream has lost
local positional structure; K-shift offset=1 there shifts a high-level
content vector rather than a position-marked one.

## Bottom line (post 9-seed sweep)

Synth+ALiBi+short-distance shows a real but lottery-distributed L1
advantage:
- L1 IPMR > L0 IPMR on 4/9 seeds, mean Δ = +31pp (p≈0.04 one-sided)
- L0 NEVER reaches strong induction (>50% IPMR), L1 reaches it 3/9 times
- High variance: stdev of Δ_IPMR = 45.9 across seeds

This is consistent with "L1 unlocks a more capable regime that L0 cannot
enter, but only ~33% of seed lotteries hit the jackpot at this scale".

### Implication for 75M

If the monotonic pattern transfers, **K-shift L11 (last) at 75M should
robustly outperform L0**:
- Synth: L3 reaches strong induction on 6/9 seeds; L0 on 0/9
- Mid-layers (L1, L2 analog of 75M's L1-L10) lottery-distributed
- L_(n-1) is the most reliable placement

To verify at 75M (currently VM-blocked):
- Single-seed K-shift L11 vs L0 head-to-head, 5K-10K steps
- Even single seeds may show clear ordering since L11 hits strong induction
  more reliably than L0 (66% vs 0% in synth)

The current v76 recipe (RWSP + K-shift L0 + offset=1) may be leaving
val_bpc on the table by ~0.05-0.20 BPC at 75M — testable when VM returns.