	# 75M K-shift offset validation (synth → 75M transfer test)

	## Experiment

	Two 75M FineWeb-Edu training runs that differ only in `--k-shift-offset` (1 vs 2).
	Both use softmax attention with K-shift at layer 0 (L0) plus ALiBi, trained for
	1500 steps with default v75 hyperparameters.

	CLIs:
	```bash
	NO_SAVE=1 python train_fineweb_v75.py \
	--k-shift --k-shift-layers 0 --k-shift-offset 2 \
	--max-steps 1500 --eval-interval 250 --eval-iters 8 \
	--tag v76b_75m_kshift_L0_off2

	NO_SAVE=1 python train_fineweb_v75.py \
	--k-shift --k-shift-layers 0 --k-shift-offset 1 \
	--max-steps 1500 --eval-interval 250 --eval-iters 8 \
	--tag v76b_75m_kshift_L0_off1_baseline
	```

	GPU: RTX 5090 on vast.ai. Each run took ~35 min.

	## Results

	| Step | offset=2 val_bpc | offset=1 val_bpc | Δ (off2 − off1) |
	|---|---|---|---|
	| 250 | 10.3855 | 10.3779 | +0.008 |
	| 500 | 9.6400 | 9.6341 | +0.006 |
	| 750 | 9.1074 | 9.1029 | +0.005 |
	| 1000 | 8.6848 | 8.7258 | -0.041 ✅ |
	| 1250 | 8.2600 | 8.2717 | -0.012 ✅ |
	| 1500 | 7.9719 | 7.9852 | -0.013 ✅ |

	## Interpretation

	Synth prediction validated at 75M scale:
	- Direction: ✅ offset=2 consistently lower val_bpc from step 1000 onward
	- Magnitude: small (-0.013 BPC at step 1500)

	The gap emerges around step 1000 (when learning rate exceeds 3e-4 in the
	warmup ramp). Before that, both runs are virtually identical — expected
	since lr is too small for sign flips to dominate.
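The warmup argument above can be checked with a small schedule sketch. The exact schedule inside `train_fineweb_v75.py` is not shown in these notes, so this assumes a linear warmup over `warmup_steps=2000` followed by cosine decay; the peak (6e-4) and end (3e-5) values are the ones quoted later in these notes, and `lr_at` and `max_steps=5000` are hypothetical choices for illustration.

```python
import math

def lr_at(step, peak_lr=6e-4, end_lr=3e-5, warmup_steps=2000, max_steps=5000):
    """Assumed schedule: linear warmup to peak_lr, then cosine decay to end_lr."""
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup window.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))

for step in (250, 500, 750, 1000, 1250, 1500, 2000, 5000):
    print(f"step {step:5d}  lr {lr_at(step):.2e}")
```

Under these assumptions the learning rate reaches 3e-4 at step 1000, which matches the point where the two curves start to separate, and is still only 4.5e-4 at step 1500.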

	Caveats:
	1. Both runs are still in warmup at step 1500 (warmup_steps=2000).
	At full LR (post step 2000), the gap may grow.
	2. ALiBi attenuates the benefit: synth (no-ALiBi) showed +50pp top-per-head
	IPMR from offset switch; at 75M with ALiBi, val_bpc gap is ~0.013. The
	mechanism (sharper sign-STE attention alignment) is the same, but ALiBi
	limits the absolute distances over which the alignment matters.
	3. Single seed each. Run-to-run noise is typically ~0.02-0.05 BPC at this
	scale, so the -0.013 BPC gap sits inside the noise band. Multi-seed runs
	are needed to support the claim.

	Next step to strengthen the result: train both for 5000+ steps (full LR
	post-warmup) to see if the gap widens. Predicted: yes, by 0.05-0.10 BPC
	once both runs are at full lr.

	## ⚠ 5000-step result — synth prediction REVERSED at 75M

	Ran both for 5000 steps (past warmup_steps=2000, full LR cosine to lr_end):

	| Step | offset=2 val_bpc | offset=1 val_bpc | Δ (off2 − off1) |
	|---|---|---|---|
	| 500 | 9.6376 | 9.6338 | +0.004 |
	| 1000 | 8.6589 | 8.6741 | -0.015 |
	| 1500 | 7.9281 | 7.9606 | -0.033 |
	| 2000 | 7.6452 | 7.5921 | +0.053 |
	| 2500 | 7.3224 | 7.3297 | -0.007 |
	| 3000 | 7.1730 | 7.1523 | +0.021 |
	| 3500 | 7.0398 | 7.0338 | +0.006 |
	| 4000 | 6.9431 | 6.9256 | +0.018 |
	| 4500 | 6.8545 | 6.8384 | +0.016 |
	| 5000 | 6.7779 | 6.7639 | +0.014 |

	At 5000 steps, offset=1 BEATS offset=2 by 0.014 BPC: the synth prediction
	is reversed at this scale. The early offset=2 advantage (steps 1000-1500) is
	erased post-warmup; offset=1 then holds a small lead at every eval from
	step 2000 onward except step 2500 (Δ = -0.007).
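The Δ column and the per-eval leader can be recomputed directly from the table. This is a minimal sketch; the values are copied from the table above, and `rows`, `WARMUP_STEPS`, and the other variable names are hypothetical.

```python
# (step, offset=2 val_bpc, offset=1 val_bpc), copied from the 5000-step table.
rows = [
    (500, 9.6376, 9.6338),
    (1000, 8.6589, 8.6741),
    (1500, 7.9281, 7.9606),
    (2000, 7.6452, 7.5921),
    (2500, 7.3224, 7.3297),
    (3000, 7.1730, 7.1523),
    (3500, 7.0398, 7.0338),
    (4000, 6.9431, 6.9256),
    (4500, 6.8545, 6.8384),
    (5000, 6.7779, 6.7639),
]
WARMUP_STEPS = 2000  # warmup_steps quoted in the notes

# Delta = off2 - off1; positive means offset=1 has the lower val_bpc.
deltas = {step: off2 - off1 for step, off2, off1 in rows}

# Restrict to evals at or after the end of warmup and record who leads.
post_warmup = {s: d for s, d in deltas.items() if s >= WARMUP_STEPS}
off1_leads = [s for s, d in post_warmup.items() if d > 0]
off2_leads = [s for s, d in post_warmup.items() if d < 0]

for step, d in deltas.items():
    print(f"step {step:5d}  delta {d:+.4f}")
print("offset=1 leads at:", off1_leads)  # [2000, 3000, 3500, 4000, 4500, 5000]
print("offset=2 leads at:", off2_leads)  # [2500]
```

A sign count like this says nothing about significance on its own; it only makes the shape of the trajectory explicit.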

	### Honest interpretation

	The synth K-shift offset finding **does not transfer cleanly to 75M scale
	with ALiBi + natural text**. Likely reasons:

	1. Run-to-run noise: |Δ| = 0.014 is within val_loss_std (~0.07). Single
	seed each. Multi-seed runs would clarify, but the pattern across the
	trajectory is fairly consistent (offset=1 leads at every eval from step
	2000 to 5000 except step 2500), so noise alone may not fully explain it.

	2. ALiBi attenuation: ALiBi already provides distance penalty. The
	offset=1 weakness in synth was about "K vectors too smooth at distance 1
	making sign-STE alignment harder for distance-24 attention". With ALiBi,
	the model isn't attending at distance 24+ much anyway, so this weakness
	may not manifest.

	3. Natural text vs synthetic copy task: synth's fixed-distance pattern
	doesn't match natural text's variable induction distances. Most natural
	induction is at distances 1-15; offset=1 (previous-token routing, the
	canonical induction-head setup in Olsson et al.) might be genuinely
	optimal here.

	4. Training dynamics under cosine LR decay: synth was constant LR; 75M
	uses cosine decay from 6e-4 to 3e-5. Different optimization trajectory.

	### Updated v76 paper recommendation

	The synth-derived "switch offset 1 → 2" recommendation **does not improve
	val_bpc at 75M scale**. The current v76 K-shift offset=1 (canonical) is
	at least as good in this test, so there is no case for changing it for
	natural-text training.

	The synth findings remain valid as MECHANISTIC insights (offset=1 has
	distinct K-vector smearing at fixed-distance copy), but they don't predict
	val_bpc improvements at scale. The mechanism is real but not the right
	optimization target for natural text.

	Outstanding: the placement axis (K-shift L11 vs L0) hasn't been tested
	at 75M yet. Synth showed placement was the bigger effect (50pp top-per-head
	vs offset's 7pp). Worth testing.

	## v76 paper actionable recommendation (post 5000-step result)

	Keep current v76 K-shift offset=1. The synth-derived "switch offset 1→2"
	proposal does not improve val_bpc at 75M scale (offset=1 wins by 0.014 BPC
	at 5000 steps).

	The synth K-shift offset axis is mechanistically interesting (explains why
	sign-STE attention prefers different K shapes) but is **not a useful
	optimization knob** for natural-text training under ALiBi.

	The broader synth-derived recommendations (try non-L0 placement, drop RWSP
	for softmax) remain UNTESTED at 75M: a placement test was launched, but
	the VM dropped before completion.

	## Statistical caveat (post-hoc analysis)

	The 5000-step Δ val_bpc = +0.014 (off2 worse) comes with val_loss_std =
	0.094 at step 5000, so the observed gap is ~0.15× the std: **not
	statistically significant**. Likewise the 1500-step Δ = -0.013 (off2
	better) is within noise. With single seeds and these std values, neither
	direction can be claimed at p<0.05 without 4-8 seed averaging.
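The ratio quoted above is simple arithmetic, sketched here together with a rough two-sided p-value. The p-value rests on a strong assumption that the notes do not confirm: that `val_loss_std` can be treated as the standard error of the gap and that the noise is roughly normal. Treat it as an order-of-magnitude check, not a proper test.

```python
import math

# Step-5000 values from the 5000-step table.
off2_val_bpc = 6.7779
off1_val_bpc = 6.7639
val_loss_std = 0.094  # std quoted in the notes for step 5000

gap = off2_val_bpc - off1_val_bpc   # positive: offset=2 is worse
ratio = abs(gap) / val_loss_std     # gap expressed in units of the std

# ASSUMPTION: treat `ratio` as a z-score under a normal noise model.
# Two-sided p-value for a standard normal: P(|Z| > z) = erfc(z / sqrt(2)).
p_two_sided = math.erfc(ratio / math.sqrt(2))

print(f"gap = {gap:+.4f} BPC, gap/std = {ratio:.2f}")  # gap/std = 0.15
print(f"approx two-sided p = {p_two_sided:.2f}")
```

Under this assumption the p-value is far above 0.05, in line with the conclusion that a single seed cannot separate the two offsets.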

	What this changes:
	- The "synth prediction reverses at 75M" claim is overstated; the accurate
	statement is "offset=2 shows no significant difference at 75M". The sign
	changes in Δ across training steps are consistent with a random walk
	around zero given val_loss_std.
	- The synth-derived recommendation isn't refuted; it's unsupported.
	- A multi-seed test would clarify, but at 75M scale (~35 min/run on
	vast.ai), 8-seed runs cost ~$15 and ~5 hours — small but not free.

	The empirical bottom line stands: **at single-seed 75M, offset=1 and
	offset=2 are indistinguishable**. The current v76 recipe is fine.