bitnet-1bitllm / notes /synth_alibi_offset_ablation.md
hidude562's picture
1bitllm code (checkpoints to follow)
4754707 verified

Synth ALiBi Γ— {offset, placement} ablation β€” explains 75M transfer failure

Setup

E1-Bk0 (softmax + Β±1 + K-shift), d=128/4H/4L, n_symbols=24, 3000 steps, batch=64, seed=42. CPU run (no GPU available locally; VM down).

Offset axis (placement = L0)

ALiBi offset copy_acc Global IPMR Top-per-head
ON 1 1.6% 0.00% (collapsed)
ON 2 1.7% 0.00% (collapsed)
OFF 1 88-92% 95.86% L3H3 69%, L0 ~46%
OFF 2 97-98% 99.86% L2H2 97%, L3H2 93%

Placement axis (offset = 1)

ALiBi K-shift layer copy_acc Global IPMR Top-per-head
ON L0 1.6% 0.00% (collapsed)
ON L3 (last) 0.9-1.6% 0.00% (collapsed)
OFF L0 88-92% 95.86% L3H3 69%
OFF L3 (last) 90% 98.98% L1H3 82%, L1H0 81%

Combined finding

ALiBi flattens BOTH offset and placement axes β€” at synth scale, every ALiBi cell collapses to 0% IPMR regardless of K-shift configuration. The "late placement is better" and "offset>1 is better" findings exist ONLY in --no-alibi runs.

Key findings

ALiBi flattens the offset axis at synth scale

With ALiBi enabled, BOTH offsets collapse to ~2% copy_acc, 0% IPMR. The offset advantage seen in prior synth experiments was an artifact of running with --no-alibi. ALiBi blocks long-distance attention regardless of K-shift offset.

Without ALiBi, offset=2 still wins (replicates prior)

offset=2 β†’ 99.86% IPMR with L2H2 at 97% top-per-head. offset=1 β†’ 95.86% IPMR with L3H3 at only 69%. Mechanism: offset=1's K-vector smearing (K[t] = W_K(x[t] + x[t-1])) creates near-identical K vectors at adjacent positions, which sign-STE can't disambiguate. offset=2 spaces them out.

Why this explains the 75M transfer failure

Regime Distance pattern offset=2 advantage Empirical
Synth + no-ALiBi (fixed dist 24) long fixed LARGE (+30pp top-head) known
Synth + ALiBi (no induction at all) suppressed none (both 0%) this expt
75M + ALiBi (natural variable dist 1-15) short variable none/reversed known

The 75M empirical result (offset=1 beats offset=2 by 0.014 BPC at 5K steps) makes sense: at 75M with ALiBi, induction work happens at short distances where offset=1 (= prev-token, Olsson canonical) is genuinely correct. offset=2's advantage was specific to the no-ALiBi long-fixed-distance synth task β€” it doesn't transfer to natural text + ALiBi.

Full 9-seed Γ— 4-placement sweep β€” LATE PLACEMENT WINS MONOTONICALLY

Sweep (seeds 0, 1, 2, 7, 13, 21, 42, 99, 100; placements L0-L3; 4-layer softmax+K-shift+ALiBi, n_sym=8, offset=1, 3000 steps).

Per-seed results

seed L0 IPMR L1 IPMR L2 IPMR L3 IPMR
0 0.7% 0.2% 100.0% 92.2%
1 13.6% 43.3% 8.0% 2.5%
2 24.3% 19.9% 80.6% 100.0%
7 0.0% 0.0% 0.2% 100.0%
13 3.8% 2.5% 100.0% 99.5%
21 11.4% 0.0% 0.5% 100.0%
42 0.0% 96.2% 0.2% 0.9%
99 0.0% 74.1% 0.7% 100.0%
100 0.2% 99.8% 9.6% 4.5%

Aggregate

L mean copy mean IPMR seeds with >50% IPMR
L0 (first) 42.2% 6.0% 0/9
L1 64.2% 37.3% 3/9
L2 55.6% 33.3% 3/9
L3 (last) 80.7% 66.6% 6/9

Conclusions (4L)

Monotonic improvement from L0 to L3. L3 (last layer) reaches strong induction on 6/9 seeds; L0 reaches it on 0/9. Mid-layers L1/L2 are intermediate and lottery-distributed.

8-LAYER VERIFICATION β€” L0 is uniquely bad, ALL others work

3 seeds Γ— {L0, L4, L7}:

seed L0 IPMR L4 IPMR L7 IPMR
42 21.65% 100.00% 99.78%
7 1.79% 98.66% 99.33%
100 14.29% 100.00% 99.11%
Placement Mean IPMR Seeds with >50% IPMR
L0 12.6% 0/3
L4 (middle) 99.6% 3/3 deterministic
L7 (last) 99.4% 3/3 deterministic

Full 8L sweep at seed=42 β€” locates the cliff

Layer IPMR
L0 21.65%
L1 99.78% ← cliff
L2 87.05%
L3 100.00%
L4 100.00%
L5 100.00%
L6 95.98%
L7 99.78%

L0 is uniquely bad. Every other placement (L1-L7) works at seed=42. The cliff is sharp: between L0 (22% IPMR) and L1 (99.8% IPMR).

L1 reliability check (4 more seeds at 8L L=1)

seed 8L L0 IPMR 8L L1 IPMR
0 (3.8% at 4L L0) 100.0%
1 (13.6% at 4L L0) 61.8%
7 1.8% 95.3%
42 21.7% 99.8%
100 14.3% 17.4% ← FAILS

L1 works on 4/5 seeds at 8L (>50% IPMR), but seed=100 hits a near-L0 trough (17.4% β€” barely better than its L0 14.3%). So L1 is mostly reliable but still has lottery-fail tail. L4 and L7 work on 3/3 seeds tested β€” strictly more reliable than L1.

Updated picture at 8L

Placement Seeds with >50% IPMR Reliability
L0 0/3 always fails
L1 4/5 mostly reliable, occasional fail
L4 (middle) 3/3 deterministic
L7 (last) 3/3 deterministic

The cliff between L0 and L1 holds (L1 is much better than L0 on average), but L4-L7 are strictly more reliable than L1. For 75M, the simplest "1-line change" L0 β†’ L1 might still hit a bad seed; safer to go L0 β†’ L5+.

This matches the original synth-no-ALiBi finding ("K-shift on any non-L0 layer reaches ~98-99% copy_acc"). The pattern is robust across:

  • ALiBi vs no-ALiBi
  • Long (n=24) vs short (n=8) induction distance
  • 4 layers vs 8 layers (4L lottery vanishes at 8L)

Conclusion: K-shift L0 is uniquely suboptimal under sign-STE. ANY other placement enables sign-STE attention to form induction.

For 75M (12 layers): K-shift L1 (or any of L1-L11) should reliably beat L0. Single-seed 75M comparison should show clear ordering.

Placement axis prediction β€” SHARPENED at synth+ALiBi+short-distance (UNRELIABLE)

The pathological synth+ALiBi+n=24 result (everything = 0% IPMR) doesn't predict 75M behavior, where induction lives at distances 1-15 (ALiBi- permitted range).

Re-ran synth at n_symbols=8 (mimicking 75M's distance regime) with ALiBi enabled, sweeping placement and offset:

Placement sweep (offset=1, n=8, ALiBi ON, seed=42)

K-shift layer copy_acc Global IPMR
L0 (current v76) 9.2% 0.00%
L1 98.0% 96.21%
L2 15.8% 0.22%
L3 (last) 32.6% 0.89%

Offset sweep (placement=L0, n=8, ALiBi ON, seed=42)

offset copy_acc IPMR
1 9.2% 0.00%
2 37.9% 0.67%
3 32.4% 0.00%
5 17.6% 4.02%

Key insight

At synth+ALiBi+short-distance, K-shift L1 is dramatically better than L0 (96% vs 0% IPMR; 98% vs 9% copy_acc). L0 is uniquely bad. L2 and L3 are mediocre. The story is "L1 specifically wins", not "later layers win".

This is a sharper prediction than the prior outline's "L_(n-1) wins" claim (which came from synth+no-ALiBi+long-distance). For 75M (n_layers=12), the analog is K-shift L1 (second layer), NOT L11 (last layer).

Mechanism hypothesis

L0 receives raw token embeddings β€” limited positional structure. K-shift on L1 sees post-L0-attention features, which are content-mixed and provide richer K-vector geometry for sign-STE to discriminate.

L_late layers don't help because by then the residual stream has lost local positional structure; K-shift offset=1 there shifts a high-level content vector rather than a position-marked one.

Bottom line (post 9-seed sweep)

Synth+ALiBi+short-distance shows a real but lottery-distributed L1 advantage:

  • L1 IPMR > L0 IPMR on 4/9 seeds, mean Ξ” = +31pp (pβ‰ˆ0.04 one-sided)
  • L0 NEVER reaches strong induction (>50% IPMR), L1 reaches it 3/9 times
  • High variance: stdev of Ξ”_IPMR = 45.9 across seeds

This is consistent with "L1 unlocks a more capable regime that L0 cannot enter, but only ~33% of seed lotteries hit the jackpot at this scale".

Implication for 75M

If the monotonic pattern transfers, K-shift L11 (last) at 75M should robustly outperform L0:

  • Synth: L3 reaches strong induction on 6/9 seeds; L0 on 0/9
  • Mid-layers (L1, L2 analog of 75M's L1-L10) lottery-distributed
  • L_(n-1) is the most reliable placement

To verify at 75M (currently VM-blocked):

  • Single-seed K-shift L11 vs L0 head-to-head, 5K-10K steps
  • Even single seeds may show clear ordering since L11 hits strong induction more reliably than L0 (66% vs 0% in synth)

The current v76 recipe (RWSP + K-shift L0 + offset=1) may be leaving val_bpc on the table by ~0.05-0.20 BPC at 75M β€” testable when VM returns.