bitnet-1bitllm / notes /synth_alibi_offset_ablation.md

1bitllm code (checkpoints to follow)

4754707 verified 22 days ago

preview code

raw

history blame contribute delete

8.78 kB

Synth ALiBi × {offset, placement} ablation — explains 75M transfer failure

Setup

E1-Bk0 (softmax + ±1 + K-shift), d=128/4H/4L, n_symbols=24, 3000 steps, batch=64, seed=42. CPU run (no GPU available locally; VM down).

Offset axis (placement = L0)

ALiBi	offset	copy_acc	Global IPMR	Top-per-head
ON	1	1.6%	0.00%	(collapsed)
ON	2	1.7%	0.00%	(collapsed)
OFF	1	88-92%	95.86%	L3H3 69%, L0 ~46%
OFF	2	97-98%	99.86%	L2H2 97%, L3H2 93%

Placement axis (offset = 1)

ALiBi	K-shift layer	copy_acc	Global IPMR	Top-per-head
ON	L0	1.6%	0.00%	(collapsed)
ON	L3 (last)	0.9-1.6%	0.00%	(collapsed)
OFF	L0	88-92%	95.86%	L3H3 69%
OFF	L3 (last)	90%	98.98%	L1H3 82%, L1H0 81%

Combined finding

ALiBi flattens BOTH offset and placement axes — at synth scale, every ALiBi cell collapses to 0% IPMR regardless of K-shift configuration. The "late placement is better" and "offset>1 is better" findings exist ONLY in --no-alibi runs.

Key findings

ALiBi flattens the offset axis at synth scale

With ALiBi enabled, BOTH offsets collapse to ~2% copy_acc, 0% IPMR. The offset advantage seen in prior synth experiments was an artifact of running with --no-alibi. ALiBi blocks long-distance attention regardless of K-shift offset.

Without ALiBi, offset=2 still wins (replicates prior)

offset=2 → 99.86% IPMR with L2H2 at 97% top-per-head. offset=1 → 95.86% IPMR with L3H3 at only 69%. Mechanism: offset=1's K-vector smearing (K[t] = W_K(x[t] + x[t-1])) creates near-identical K vectors at adjacent positions, which sign-STE can't disambiguate. offset=2 spaces them out.

Why this explains the 75M transfer failure

Regime	Distance pattern	offset=2 advantage	Empirical
Synth + no-ALiBi (fixed dist 24)	long fixed	LARGE (+30pp top-head)	known
Synth + ALiBi (no induction at all)	suppressed	none (both 0%)	this expt
75M + ALiBi (natural variable dist 1-15)	short variable	none/reversed	known

The 75M empirical result (offset=1 beats offset=2 by 0.014 BPC at 5K steps) makes sense: at 75M with ALiBi, induction work happens at short distances where offset=1 (= prev-token, Olsson canonical) is genuinely correct. offset=2's advantage was specific to the no-ALiBi long-fixed-distance synth task — it doesn't transfer to natural text + ALiBi.

Full 9-seed × 4-placement sweep — LATE PLACEMENT WINS MONOTONICALLY

Sweep (seeds 0, 1, 2, 7, 13, 21, 42, 99, 100; placements L0-L3; 4-layer softmax+K-shift+ALiBi, n_sym=8, offset=1, 3000 steps).

Per-seed results

seed	L0 IPMR	L1 IPMR	L2 IPMR	L3 IPMR
0	0.7%	0.2%	100.0%	92.2%
1	13.6%	43.3%	8.0%	2.5%
2	24.3%	19.9%	80.6%	100.0%
7	0.0%	0.0%	0.2%	100.0%
13	3.8%	2.5%	100.0%	99.5%
21	11.4%	0.0%	0.5%	100.0%
42	0.0%	96.2%	0.2%	0.9%
99	0.0%	74.1%	0.7%	100.0%
100	0.2%	99.8%	9.6%	4.5%

Aggregate

L	mean copy	mean IPMR	seeds with >50% IPMR
L0 (first)	42.2%	6.0%	0/9
L1	64.2%	37.3%	3/9
L2	55.6%	33.3%	3/9
L3 (last)	80.7%	66.6%	6/9

Conclusions (4L)

Monotonic improvement from L0 to L3. L3 (last layer) reaches strong induction on 6/9 seeds; L0 reaches it on 0/9. Mid-layers L1/L2 are intermediate and lottery-distributed.

8-LAYER VERIFICATION — L0 is uniquely bad, ALL others work

3 seeds × {L0, L4, L7}:

seed	L0 IPMR	L4 IPMR	L7 IPMR
42	21.65%	100.00%	99.78%
7	1.79%	98.66%	99.33%
100	14.29%	100.00%	99.11%

Placement	Mean IPMR	Seeds with >50% IPMR
L0	12.6%	0/3
L4 (middle)	99.6%	3/3 deterministic
L7 (last)	99.4%	3/3 deterministic

Full 8L sweep at seed=42 — locates the cliff

Layer	IPMR
L0	21.65%
L1	99.78% ← cliff
L2	87.05%
L3	100.00%
L4	100.00%
L5	100.00%
L6	95.98%
L7	99.78%

L0 is uniquely bad. Every other placement (L1-L7) works at seed=42. The cliff is sharp: between L0 (22% IPMR) and L1 (99.8% IPMR).

L1 reliability check (4 more seeds at 8L L=1)

seed	8L L0 IPMR	8L L1 IPMR
0	(3.8% at 4L L0)	100.0%
1	(13.6% at 4L L0)	61.8%
7	1.8%	95.3%
42	21.7%	99.8%
100	14.3%	17.4% ← FAILS

L1 works on 4/5 seeds at 8L (>50% IPMR), but seed=100 hits a near-L0 trough (17.4% — barely better than its L0 14.3%). So L1 is mostly reliable but still has lottery-fail tail. L4 and L7 work on 3/3 seeds tested — strictly more reliable than L1.

Updated picture at 8L

Placement	Seeds with >50% IPMR	Reliability
L0	0/3	always fails
L1	4/5	mostly reliable, occasional fail
L4 (middle)	3/3	deterministic
L7 (last)	3/3	deterministic

The cliff between L0 and L1 holds (L1 is much better than L0 on average), but L4-L7 are strictly more reliable than L1. For 75M, the simplest "1-line change" L0 → L1 might still hit a bad seed; safer to go L0 → L5+.

This matches the original synth-no-ALiBi finding ("K-shift on any non-L0 layer reaches ~98-99% copy_acc"). The pattern is robust across:

ALiBi vs no-ALiBi
Long (n=24) vs short (n=8) induction distance
4 layers vs 8 layers (4L lottery vanishes at 8L)

Conclusion: K-shift L0 is uniquely suboptimal under sign-STE. ANY other placement enables sign-STE attention to form induction.

For 75M (12 layers): K-shift L1 (or any of L1-L11) should reliably beat L0. Single-seed 75M comparison should show clear ordering.

Placement axis prediction — SHARPENED at synth+ALiBi+short-distance (UNRELIABLE)

The pathological synth+ALiBi+n=24 result (everything = 0% IPMR) doesn't predict 75M behavior, where induction lives at distances 1-15 (ALiBi- permitted range).

Re-ran synth at n_symbols=8 (mimicking 75M's distance regime) with ALiBi enabled, sweeping placement and offset:

Placement sweep (offset=1, n=8, ALiBi ON, seed=42)

K-shift layer	copy_acc	Global IPMR
L0 (current v76)	9.2%	0.00%
L1	98.0%	96.21%
L2	15.8%	0.22%
L3 (last)	32.6%	0.89%

Offset sweep (placement=L0, n=8, ALiBi ON, seed=42)

offset	copy_acc	IPMR
1	9.2%	0.00%
2	37.9%	0.67%
3	32.4%	0.00%
5	17.6%	4.02%

Key insight

At synth+ALiBi+short-distance, K-shift L1 is dramatically better than L0 (96% vs 0% IPMR; 98% vs 9% copy_acc). L0 is uniquely bad. L2 and L3 are mediocre. The story is "L1 specifically wins", not "later layers win".

This is a sharper prediction than the prior outline's "L_(n-1) wins" claim (which came from synth+no-ALiBi+long-distance). For 75M (n_layers=12), the analog is K-shift L1 (second layer), NOT L11 (last layer).

Mechanism hypothesis

L0 receives raw token embeddings — limited positional structure. K-shift on L1 sees post-L0-attention features, which are content-mixed and provide richer K-vector geometry for sign-STE to discriminate.

L_late layers don't help because by then the residual stream has lost local positional structure; K-shift offset=1 there shifts a high-level content vector rather than a position-marked one.

Bottom line (post 9-seed sweep)

Synth+ALiBi+short-distance shows a real but lottery-distributed L1 advantage:

L1 IPMR > L0 IPMR on 4/9 seeds, mean Δ = +31pp (p≈0.04 one-sided)
L0 NEVER reaches strong induction (>50% IPMR), L1 reaches it 3/9 times
High variance: stdev of Δ_IPMR = 45.9 across seeds

This is consistent with "L1 unlocks a more capable regime that L0 cannot enter, but only ~33% of seed lotteries hit the jackpot at this scale".

Implication for 75M

If the monotonic pattern transfers, K-shift L11 (last) at 75M should robustly outperform L0:

Synth: L3 reaches strong induction on 6/9 seeds; L0 on 0/9
Mid-layers (L1, L2 analog of 75M's L1-L10) lottery-distributed
L_(n-1) is the most reliable placement

To verify at 75M (currently VM-blocked):

Single-seed K-shift L11 vs L0 head-to-head, 5K-10K steps
Even single seeds may show clear ordering since L11 hits strong induction more reliably than L0 (66% vs 0% in synth)

The current v76 recipe (RWSP + K-shift L0 + offset=1) may be leaving val_bpc on the table by ~0.05-0.20 BPC at 75M — testable when VM returns.