	# Synth ALiBi × {offset, placement} ablation — explains 75M transfer failure

	## Setup

	E1-Bk0 (softmax + ±1 + K-shift), d=128/4H/4L, n_symbols=24, 3000 steps,
	batch=64, seed=42. CPU run (no GPU available locally; VM down).

	### Offset axis (placement = L0)

	| ALiBi | offset | copy_acc | Global IPMR | Top-per-head |
	|---|---|---|---|---|
	| ON | 1 | 1.6% | 0.00% | (collapsed) |
	| ON | 2 | 1.7% | 0.00% | (collapsed) |
	| OFF | 1 | 88-92% | 95.86% | L3H3 69%, L0 ~46% |
	| OFF | 2 | 97-98% | 99.86% | L2H2 97%, L3H2 93% |

	### Placement axis (offset = 1)

	| ALiBi | K-shift layer | copy_acc | Global IPMR | Top-per-head |
	|---|---|---|---|---|
	| ON | L0 | 1.6% | 0.00% | (collapsed) |
	| ON | L3 (last) | 0.9-1.6% | 0.00% | (collapsed) |
	| OFF | L0 | 88-92% | 95.86% | L3H3 69% |
	| OFF | L3 (last) | 90% | 98.98% | L1H3 82%, L1H0 81% |

	### Combined finding

	At n_symbols=24, ALiBi flattens BOTH the offset and placement axes: every
	ALiBi cell collapses to 0% IPMR regardless of K-shift configuration. The
	"late placement is better" and "offset>1 is better" findings exist ONLY
	in --no-alibi runs at this induction distance.

	## Key findings

	### ALiBi flattens the offset axis at synth scale

	With ALiBi enabled, BOTH offsets collapse to 1.6-1.7% copy_acc and 0% IPMR.
	The offset advantage seen in prior synth experiments was an artifact of
	running with `--no-alibi`. At this induction distance (n_symbols=24), ALiBi
	blocks long-distance attention regardless of K-shift offset.
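To make the distance penalty concrete, here is a minimal sketch of how an ALiBi bias scales with distance. It assumes the standard geometric slopes from the ALiBi paper for a power-of-two head count; these notes do not state which slopes the run actually used, so `alibi_slopes` and `relative_weight` are illustrative names, not the repo's code.

```python
import math

def alibi_slopes(n_heads: int) -> list[float]:
    """Standard ALiBi slopes for a power-of-two head count (assumed, not confirmed)."""
    # Head i (1-indexed) gets slope 2^(-8 * i / n_heads).
    return [2.0 ** (-8.0 * i / n_heads) for i in range(1, n_heads + 1)]

def relative_weight(slope: float, distance: int) -> float:
    """Pre-normalization multiplier exp(-slope * distance) that the ALiBi bias
    applies to the attention weight of a key `distance` positions back."""
    return math.exp(-slope * distance)

slopes = alibi_slopes(4)  # 4 heads, as in the synth setup
for d in (8, 24):
    weights = [relative_weight(m, d) for m in slopes]
    print(f"distance {d}: " + ", ".join(f"{w:.4f}" for w in weights))
```

Under these assumed slopes the penalty is head-dependent: the steepest head suppresses a key 24 positions back far more than one 8 positions back, while the flattest head penalizes both only mildly.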

	### Without ALiBi, offset=2 still wins (replicates prior)

	offset=2 → 99.86% IPMR with L2H2 at 97% top-per-head. offset=1 → 95.86%
	IPMR with L3H3 at only 69%. Mechanism: offset=1's K-vector smearing
	(K[t] = W_K(x[t] + x[t-1])) creates near-identical K vectors at adjacent
	positions, which sign-STE can't disambiguate. offset=2 spaces them out.
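The smearing argument can be checked on a toy model. The sketch below makes several assumptions: W_K is the identity, the x[t] are i.i.d. Gaussian vectors, and the offset generalization is K[t] = x[t] + x[t-offset] (the notes only spell out the offset=1 form). It compares keys at adjacent positions by cosine similarity and by the fraction of matching signs after binarization.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sign_agreement(u, v):
    """Fraction of coordinates whose signs match after binarizing to +1/-1."""
    return sum((a >= 0) == (b >= 0) for a, b in zip(u, v)) / len(u)

def shifted_keys(x, offset):
    """Key inputs x[t] + x[t-offset]; W_K is taken as identity (assumption)."""
    return [[a + b for a, b in zip(x[t], x[t - offset])] for t in range(offset, len(x))]

def adjacent_similarity(offset, d=128, seq_len=64, seed=0):
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(seq_len)]
    k = shifted_keys(x, offset)
    pairs = list(zip(k[:-1], k[1:]))  # keys at adjacent positions
    cos = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    agree = sum(sign_agreement(u, v) for u, v in pairs) / len(pairs)
    return cos, agree

cos1, agree1 = adjacent_similarity(offset=1)
cos2, agree2 = adjacent_similarity(offset=2)
print(f"offset=1: cosine={cos1:.2f}, sign agreement={agree1:.2f}")
print(f"offset=2: cosine={cos2:.2f}, sign agreement={agree2:.2f}")
```

In this toy setting adjacent offset=1 keys share the term x[t], so their expected correlation is 0.5, whereas adjacent offset=2 keys share no term and are uncorrelated in expectation. Real residual-stream vectors are not i.i.d., so this only illustrates the direction of the effect.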

	### Why this explains the 75M transfer failure

	| Regime | Distance pattern | offset=2 advantage | Empirical |
	|---|---|---|---|
	| Synth + no-ALiBi (fixed dist 24) | long fixed | LARGE (+30pp top-head) | known |
	| Synth + ALiBi (no induction at all) | suppressed | none (both 0%) | this expt |
	| 75M + ALiBi (natural variable dist 1-15) | short variable | none/reversed | known |

	The 75M empirical result (offset=1 beats offset=2 by 0.014 BPC at 5K steps)
	makes sense: at 75M with ALiBi, induction work happens at short distances,
	where offset=1 (the previous-token shift, canonical in Olsson et al.) is
	genuinely correct. offset=2's advantage was specific to the no-ALiBi,
	long-fixed-distance synth task and does not transfer to natural text + ALiBi.

	## Full 9-seed × 4-placement sweep: LAST-LAYER PLACEMENT WINS (NOT STRICTLY MONOTONIC)

	Sweep: seeds 0, 1, 2, 7, 13, 21, 42, 99, 100; placements L0-L3; 4-layer
	softmax+K-shift+ALiBi, n_symbols=8, offset=1, 3000 steps.

	### Per-seed results

	| seed | L0 IPMR | L1 IPMR | L2 IPMR | L3 IPMR |
	|---|---|---|---|---|
	| 0 | 0.7% | 0.2% | 100.0% | 92.2% |
	| 1 | 13.6% | 43.3% | 8.0% | 2.5% |
	| 2 | 24.3% | 19.9% | 80.6% | 100.0% |
	| 7 | 0.0% | 0.0% | 0.2% | 100.0% |
	| 13 | 3.8% | 2.5% | 100.0% | 99.5% |
	| 21 | 11.4% | 0.0% | 0.5% | 100.0% |
	| 42 | 0.0% | 96.2% | 0.2% | 0.9% |
	| 99 | 0.0% | 74.1% | 0.7% | 100.0% |
	| 100 | 0.2% | 99.8% | 9.6% | 4.5% |

	### Aggregate

	| K-shift layer | mean copy_acc | mean IPMR | seeds with >50% IPMR |
	|---|---|---|---|
	| L0 (first) | 42.2% | 6.0% | 0/9 |
	| L1 | 64.2% | 37.3% | 3/9 |
	| L2 | 55.6% | 33.3% | 3/9 |
	| L3 (last) | 80.7% | 66.6% | 6/9 |
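The mean IPMR and ">50% IPMR" columns can be re-derived from the per-seed table. A minimal sketch, with the values copied by hand from that table (the `aggregate` helper is illustrative, not part of the experiment code):

```python
# Per-seed IPMR (%) copied from the per-seed table: seed -> (L0, L1, L2, L3).
IPMR = {
    0: (0.7, 0.2, 100.0, 92.2),
    1: (13.6, 43.3, 8.0, 2.5),
    2: (24.3, 19.9, 80.6, 100.0),
    7: (0.0, 0.0, 0.2, 100.0),
    13: (3.8, 2.5, 100.0, 99.5),
    21: (11.4, 0.0, 0.5, 100.0),
    42: (0.0, 96.2, 0.2, 0.9),
    99: (0.0, 74.1, 0.7, 100.0),
    100: (0.2, 99.8, 9.6, 4.5),
}

def aggregate(table, threshold=50.0):
    """Return (layer, mean IPMR rounded to 0.1, seeds above threshold, total seeds)."""
    n_layers = len(next(iter(table.values())))
    rows = []
    for layer in range(n_layers):
        values = [row[layer] for row in table.values()]
        mean = sum(values) / len(values)
        strong = sum(v > threshold for v in values)
        rows.append((f"L{layer}", round(mean, 1), strong, len(values)))
    return rows

for name, mean, strong, total in aggregate(IPMR):
    print(f"{name}: mean IPMR {mean}%, {strong}/{total} seeds above 50%")
```

The mean copy_acc column cannot be checked this way because per-seed copy_acc values are not listed in these notes.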

	### Conclusions (4L)

	Mean IPMR rises from 6.0% at L0 to 66.6% at L3, but not strictly
	monotonically: L1 (37.3%) slightly exceeds L2 (33.3%). L3 (last layer)
	reaches strong induction (>50% IPMR) on 6/9 seeds; L0 reaches it on 0/9.
	Mid-layers L1/L2 are intermediate and lottery-distributed.

	## 8-LAYER VERIFICATION — L0 is uniquely bad, ALL others work

	3 seeds × {L0, L4, L7}:

	| seed | L0 IPMR | L4 IPMR | L7 IPMR |
	|---|---|---|---|
	| 42 | 21.65% | 100.00% | 99.78% |
	| 7 | 1.79% | 98.66% | 99.33% |
	| 100 | 14.29% | 100.00% | 99.11% |

	| Placement | Mean IPMR | Seeds with >50% IPMR |
	|---|---|---|
	| L0 | 12.6% | 0/3 |
	| L4 (middle) | 99.6% | 3/3 deterministic |
	| L7 (last) | 99.4% | 3/3 deterministic |

	### Full 8L sweep at seed=42 — locates the cliff

	| Layer | IPMR |
	|---|---|
	| L0 | 21.65% |
	| L1 | 99.78% ← cliff |
	| L2 | 87.05% |
	| L3 | 100.00% |
	| L4 | 100.00% |
	| L5 | 100.00% |
	| L6 | 95.98% |
	| L7 | 99.78% |

	L0 is uniquely bad. Every other placement (L1-L7) works at seed=42.
	The cliff is sharp: between L0 (22% IPMR) and L1 (99.8% IPMR).

	### L1 reliability check (4 more seeds at 8L, K-shift at L1)

	| seed | 8L L0 IPMR | 8L L1 IPMR |
	|---|---|---|
	| 0 | (0.7% at 4L L0) | 100.0% |
	| 1 | (13.6% at 4L L0) | 61.8% |
	| 7 | 1.8% | 95.3% |
	| 42 | 21.7% | 99.8% |
	| 100 | 14.3% | 17.4% ← FAILS |

	L1 works on 4/5 seeds at 8L (>50% IPMR), but seed=100 hits a near-L0
	trough (17.4%, barely better than its L0 result of 14.3%). So **L1 is
	mostly reliable but still has a lottery-fail tail**. L4 and L7 work on
	3/3 seeds tested, so they look more reliable than L1 on this small sample.

	### Updated picture at 8L

	| Placement | Seeds with >50% IPMR | Reliability |
	|---|---|---|
	| L0 | 0/3 | always fails |
	| L1 | 4/5 | mostly reliable, occasional fail |
	| L4 (middle) | 3/3 | deterministic |
	| L7 (last) | 3/3 | deterministic |

	The cliff between L0 and L1 holds (L1 is much better than L0 on average),
	but L4 and L7, the only deeper placements tested on multiple seeds, were
	more reliable than L1. For 75M, the simplest "1-line change" L0 → L1
	might still hit a bad seed; it is safer to go L0 → L5+.

	This matches the original synth-no-ALiBi finding ("K-shift on any non-L0
	layer reaches ~98-99% copy_acc"). The pattern is robust across:
	- ALiBi vs no-ALiBi
	- Long (n=24) vs short (n=8) induction distance
	- 4 layers vs 8 layers (the 4L lottery mostly vanishes at 8L; L1 at seed=100 is the exception)

	Conclusion: K-shift at L0 is uniquely suboptimal under sign-STE. Every
	other placement tested lets sign-STE attention form induction on most seeds.

	For 75M (12 layers): K-shift at L1 (or any of L1-L11) should beat L0 on
	most seeds. A single-seed 75M comparison should show a clear ordering.

	## Placement axis prediction — SHARPENED at synth+ALiBi+short-distance (UNRELIABLE)

	The pathological synth+ALiBi+n=24 result (everything = 0% IPMR) doesn't
	predict 75M behavior, where induction lives at distances 1-15 (ALiBi-
	permitted range).

	Re-ran synth at n_symbols=8 (mimicking 75M's distance regime) with
	ALiBi enabled, sweeping placement and offset:

	### Placement sweep (offset=1, n=8, ALiBi ON, seed=42)

	| K-shift layer | copy_acc | Global IPMR |
	|---|---|---|
	| L0 (current v76) | 9.2% | 0.00% |
	| L1 | 98.0% | 96.21% |
	| L2 | 15.8% | 0.22% |
	| L3 (last) | 32.6% | 0.89% |

	### Offset sweep (placement=L0, n=8, ALiBi ON, seed=42)

	| offset | copy_acc | IPMR |
	|---|---|---|
	| 1 | 9.2% | 0.00% |
	| 2 | 37.9% | 0.67% |
	| 3 | 32.4% | 0.00% |
	| 5 | 17.6% | 4.02% |

	### Key insight

	At synth+ALiBi+short-distance (single seed, seed=42), K-shift L1 is
	dramatically better than L0 (96% vs 0% IPMR; 98% vs 9% copy_acc). L0 is
	uniquely bad. L2 and L3 are mediocre. At this seed the story is "L1
	specifically wins", not "later layers win".

	This is a sharper prediction than the prior outline's "L_(n-1) wins" claim
	(which came from synth+no-ALiBi+long-distance). For 75M (n_layers=12), the
	analog is K-shift L1 (second layer), NOT L11 (last layer).

	### Mechanism hypothesis

	L0 receives raw token embeddings — limited positional structure. K-shift
	on L1 sees post-L0-attention features, which are content-mixed and provide
	richer K-vector geometry for sign-STE to discriminate.

	L_late layers don't help because by then the residual stream has lost
	local positional structure; K-shift offset=1 there shifts a high-level
	content vector rather than a position-marked one.

	## Bottom line (post 9-seed sweep)

	Synth+ALiBi+short-distance shows a real but lottery-distributed L1
	advantage:
	- L1 IPMR > L0 IPMR on 4/9 seeds, mean Δ = +31pp (p≈0.04 one-sided)
	- L0 NEVER reaches strong induction (>50% IPMR), L1 reaches it 3/9 times
	- High variance: stdev of Δ_IPMR = 45.9 across seeds

	This is consistent with "L1 unlocks a more capable regime that L0 cannot
	enter, but only ~33% of seed lotteries hit the jackpot at this scale".
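The paired L1-vs-L0 summary can be reproduced from the per-seed IPMR values of the 9-seed sweep. The sketch below recomputes the win count, mean Δ, and sample standard deviation, plus a paired t-statistic. Which test produced the quoted p-value is not stated in these notes, so the t-statistic is an assumption about the method.

```python
import math
import statistics

# Per-seed IPMR (%) for K-shift at L0 and L1; seeds 0, 1, 2, 7, 13, 21, 42, 99, 100.
L0 = [0.7, 13.6, 24.3, 0.0, 3.8, 11.4, 0.0, 0.0, 0.2]
L1 = [0.2, 43.3, 19.9, 0.0, 2.5, 0.0, 96.2, 74.1, 99.8]

deltas = [b - a for a, b in zip(L0, L1)]
wins = sum(d > 0 for d in deltas)
mean_delta = statistics.mean(deltas)
sd_delta = statistics.stdev(deltas)  # sample standard deviation (n - 1)
# Paired t-statistic with len(deltas) - 1 degrees of freedom (assumed test).
t_stat = mean_delta / (sd_delta / math.sqrt(len(deltas)))

print(f"L1 > L0 on {wins}/{len(deltas)} seeds")
print(f"mean delta = {mean_delta:.1f} pp, stdev = {sd_delta:.1f}, t = {t_stat:.2f}")
```

With 8 degrees of freedom, the one-sided critical values are 1.860 at p=0.05 and 2.306 at p=0.025. A t-statistic between them is consistent with the quoted p≈0.04, if a paired t-test was indeed the method used.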

	### Implication for 75M

	If the late-placement pattern transfers, **K-shift L11 (last) at 75M should
	robustly outperform L0**:
	- Synth: L3 reaches strong induction on 6/9 seeds; L0 on 0/9
	- Mid-layers (L1 and L2, the analog of 75M's L1-L10) are lottery-distributed
	- L_(n-1) is the most reliable placement

	To verify at 75M (currently VM-blocked):
	- Single-seed K-shift L11 vs L0 head-to-head, 5K-10K steps
	- Even single seeds may show a clear ordering, since L11 hits strong induction
	more reliably than L0 (6/9 vs 0/9 seeds in synth)

	The current v76 recipe (RWSP + K-shift L0 + offset=1) may be leaving
	~0.05-0.20 val_bpc on the table at 75M; this is testable when the VM returns.