75M K-shift offset validation (synth → 75M transfer test)

Experiment

Two 75M FineWeb-Edu training runs differing only in --k-shift-offset (1 vs 2). Both use softmax attention with K-shift on layer 0 (L0), ALiBi, and 1500 steps with default v75 hyperparameters.

CLIs:

NO_SAVE=1 python train_fineweb_v75.py \
    --k-shift --k-shift-layers 0 --k-shift-offset 2 \
    --max-steps 1500 --eval-interval 250 --eval-iters 8 \
    --tag v76b_75m_kshift_L0_off2

NO_SAVE=1 python train_fineweb_v75.py \
    --k-shift --k-shift-layers 0 --k-shift-offset 1 \
    --max-steps 1500 --eval-interval 250 --eval-iters 8 \
    --tag v76b_75m_kshift_L0_off1_baseline

GPU: RTX 5090 on vast.ai. Each run took ~35 min.

Results

Step	offset=2 val_bpc	offset=1 val_bpc	Δ (off2 − off1)
250	10.3855	10.3779	+0.008
500	9.6400	9.6341	+0.006
750	9.1074	9.1029	+0.005
1000	8.6848	8.7258	-0.041 ✅
1250	8.2600	8.2717	-0.012 ✅
1500	7.9719	7.9852	-0.013 ✅

Interpretation

Synth prediction validated at 75M scale:

Direction: ✅ offset=2 consistently lower val_bpc from step 1000 onward
Magnitude: small (-0.013 BPC at step 1500)

The gap emerges around step 1000, when the learning rate approaches 3e-4 on the warmup ramp (about half of the 6e-4 peak, assuming linear warmup over warmup_steps=2000). Before that, both runs are virtually identical, which is expected if the lr is still too small for sign flips to dominate.
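To make the "lr at step 1000" reasoning concrete, here is a minimal sketch of the schedule. It assumes linear warmup from 0 and a cosine decay that ends at step 5000; the peak (6e-4), floor (3e-5), and warmup_steps=2000 come from these notes, but the exact schedule shape in train_fineweb_v75.py is not shown here, so treat `lr_at` as a hypothetical stand-in.

```python
import math

PEAK_LR = 6e-4       # from the notes: cosine decay "from 6e-4 to 3e-5"
END_LR = 3e-5
WARMUP_STEPS = 2000  # warmup_steps=2000 per the notes
MAX_STEPS = 5000     # assumed decay horizon (the longer run)

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to END_LR (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    return END_LR + 0.5 * (PEAK_LR - END_LR) * (1.0 + math.cos(math.pi * progress))

# Print the lr at each eval checkpoint of the 1500-step run, plus the
# end of warmup and the end of the assumed decay.
for step in (250, 500, 750, 1000, 1250, 1500, 2000, 5000):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Under these assumptions the lr is 2.25e-4 at step 750 and 3e-4 at step 1000, which brackets the point where the two curves first separate.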

Caveats:

Both runs are still in warmup at step 1500 (warmup_steps=2000). At full LR (post step 2000), the gap may grow.
ALiBi attenuates the benefit: synth (no-ALiBi) showed +50pp top-per-head IPMR from offset switch; at 75M with ALiBi, val_bpc gap is ~0.013. The mechanism (sharper sign-STE attention alignment) is the same, but ALiBi limits the absolute distances over which the alignment matters.
Single seed each. Run-to-run noise is typically ~0.02-0.05 BPC at this scale, so the -0.013 BPC gap is below the noise floor rather than at its edge. Multi-seed runs are needed before the direction can be claimed.

Next step to strengthen the result: train both for 5000+ steps (full LR post-warmup) to see if the gap widens. Predicted: yes, by 0.05-0.10 BPC once both runs are at full lr.

⚠ 5000-step result — synth prediction REVERSED at 75M

Ran both for 5000 steps (past warmup_steps=2000, full LR cosine to lr_end):

Step	offset=2 val_bpc	offset=1 val_bpc	Δ (off2 − off1)
500	9.6376	9.6338	+0.004
1000	8.6589	8.6741	-0.015
1500	7.9281	7.9606	-0.033
2000	7.6452	7.5921	+0.053
2500	7.3224	7.3297	-0.007
3000	7.1730	7.1523	+0.021
3500	7.0398	7.0338	+0.006
4000	6.9431	6.9256	+0.018
4500	6.8545	6.8384	+0.016
5000	6.7779	6.7639	+0.014

At 5000 steps, offset=1 BEATS offset=2 by 0.014 BPC, so the synth prediction is reversed at this scale. The early offset=2 advantage (steps 1000-1500) is erased post-warmup; offset=1 then takes a small lead at six of the seven checkpoints from step 2000 to 5000 (the exception is step 2500, Δ = -0.007).
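The two tables overlap at steps 500, 1000, and 1500, which gives a rough empirical handle on run-to-run variation for the same offset. The sketch below only re-uses the val_bpc values listed above. It assumes the 1500-step and 5000-step runs are comparable at those steps; the 5000-step CLI and seeds are not recorded here, so they may not be exact replicates.

```python
# val_bpc at shared checkpoints, copied from the two tables above.
short_run = {  # 1500-step runs
    "offset=2": {500: 9.6400, 1000: 8.6848, 1500: 7.9719},
    "offset=1": {500: 9.6341, 1000: 8.7258, 1500: 7.9852},
}
long_run = {  # 5000-step runs
    "offset=2": {500: 9.6376, 1000: 8.6589, 1500: 7.9281},
    "offset=1": {500: 9.6338, 1000: 8.6741, 1500: 7.9606},
}

def rerun_gaps(a, b):
    """|val_bpc difference| between two runs of the same config, per step."""
    return {cfg: {s: abs(a[cfg][s] - b[cfg][s]) for s in a[cfg]} for cfg in a}

gaps = rerun_gaps(short_run, long_run)
for cfg, per_step in gaps.items():
    for step, gap in per_step.items():
        print(f"{cfg} step {step}: rerun gap = {gap:.4f}")

max_rerun_gap = max(g for per_step in gaps.values() for g in per_step.values())
print(f"largest rerun gap: {max_rerun_gap:.4f}")
```

The same-offset gaps at steps 1000 and 1500 (roughly 0.025 to 0.05 BPC) are as large as or larger than the offset=2 vs offset=1 gaps at those steps, which is worth keeping in mind when reading either table.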

Honest interpretation

The synth K-shift offset finding does not transfer cleanly to 75M scale with ALiBi + natural text. Likely reasons:

Run-to-run noise: |Δ| = 0.014 is within val_loss_std (~0.07). Single seed each. Multi-seed runs would clarify, but the pattern across the trajectory is fairly consistent (offset=1 wins at every checkpoint from step 2000 to 5000 except step 2500), so noise alone probably can't explain it.
ALiBi attenuation: ALiBi already provides distance penalty. The offset=1 weakness in synth was about "K vectors too smooth at distance 1 making sign-STE alignment harder for distance-24 attention". With ALiBi, the model isn't attending at distance 24+ much anyway, so this weakness may not manifest.
Natural text vs synthetic copy task: synth's fixed-distance pattern doesn't match natural text's variable induction distances. Most natural induction is at distances 1-15; offset=1 (= prev-token routing, Olsson canonical) might be genuinely optimal here.
Training dynamics under cosine LR decay: synth was constant LR; 75M uses cosine decay from 6e-4 to 3e-5. Different optimization trajectory.
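The ALiBi attenuation argument above can be illustrated with a small sketch of how strongly a linear distance bias suppresses a key at distance 24 relative to distance 1. It uses the standard ALiBi slope recipe for a power-of-two head count; `N_HEADS = 8` is a hypothetical value, since the head count and actual slopes of the 75M model are not stated in these notes.

```python
import math

def alibi_slopes(n_heads: int) -> list[float]:
    """Standard ALiBi slopes for a power-of-two head count: 2^(-8*(i+1)/n)."""
    return [2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)]

def relative_weight(slope: float, distance: int) -> float:
    """Multiplicative factor the bias -slope*distance puts on an
    unnormalized attention weight, relative to distance 0."""
    return math.exp(-slope * distance)

N_HEADS = 8  # hypothetical; the notes do not state the 75M head count
for head, slope in enumerate(alibi_slopes(N_HEADS)):
    near = relative_weight(slope, 1)
    far = relative_weight(slope, 24)
    print(f"head {head}: slope={slope:.4f}  d=1 -> {near:.3f}  d=24 -> {far:.2e}")
```

With these assumed slopes the steepest heads suppress distance-24 keys by several orders of magnitude while the shallowest heads barely penalize them, so any attenuation of the offset effect would be head-dependent rather than uniform.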

Updated v76 paper recommendation

The synth-derived "switch offset 1 → 2" recommendation does not improve val_bpc at 75M scale. The current v76 K-shift offset=1 (canonical) is at least as good as offset=2 for natural-text training in this test.

The synth findings remain valid as MECHANISTIC insights (offset=1 has distinct K-vector smearing at fixed-distance copy), but they don't predict val_bpc improvements at scale. The mechanism is real but not the right optimization target for natural text.

Outstanding: the placement axis (K-shift L11 vs L0) hasn't been tested at 75M yet. Synth showed placement was the bigger effect (50pp top-per-head vs offset's 7pp). Worth testing.

V76 paper actionable recommendation (post 5000-step result)

Keep current v76 K-shift offset=1. The synth-derived "switch offset 1→2" proposal does not improve val_bpc at 75M scale (offset=1 wins by 0.014 BPC at 5000 steps).

The synth K-shift offset axis is mechanistically interesting (explains why sign-STE attention prefers different K shapes) but is not a useful optimization knob for natural-text training under ALiBi.

The broader synth-derived recommendations (try non-L0 placement, drop RWSP for softmax) remain UNTESTED at 75M — placement test was launched but VM dropped before completion.

Statistical caveat (post-hoc analysis)

At step 5000, Δ_val_bpc = +0.014 (off2 worse) while val_loss_std = 0.094, so the observed gap is ~0.15× the std and not statistically significant. Likewise, the 1500-step Δ = -0.013 (off2 better) is within noise. With single seeds and these std values, neither direction can be claimed at p<0.05 without 4-8 seed averaging.
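As a rough check on how much seed averaging helps, the sketch below computes the smallest gap a two-sample z-test could resolve for a given number of seeds per offset. It assumes val_loss_std can be read as the seed-to-seed standard deviation of val_bpc, which these notes do not confirm (it may instead be a per-batch spread, or be in loss units rather than BPC), so the numbers are indicative only.

```python
import math
from statistics import NormalDist

def min_detectable_gap(sigma: float, n_seeds: int, alpha: float = 0.05) -> float:
    """Smallest mean gap that reaches two-sided significance `alpha` in a
    two-sample z-test with n_seeds per arm and per-run std `sigma`."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * sigma * math.sqrt(2.0 / n_seeds)

def seeds_needed(sigma: float, gap: float, alpha: float = 0.05) -> int:
    """Seeds per arm so that `gap` just reaches significance (no power margin)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil(2.0 * (z * sigma / gap) ** 2)

SIGMA = 0.094  # val_loss_std at step 5000, treated as seed-to-seed std (assumption)
GAP = 0.014    # observed |delta val_bpc| at step 5000

for n in (4, 8):
    print(f"{n} seeds/arm: min detectable gap = {min_detectable_gap(SIGMA, n):.3f}")
print(f"seeds/arm for a {GAP} gap: {seeds_needed(SIGMA, GAP)}")
```

Under that reading, 8 seeds per offset would resolve gaps of roughly 0.09 BPC, so a true gap near 0.014 would stay undetectable unless the actual seed-to-seed spread is much smaller than val_loss_std.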

What this changes:

The "offset=2 reverses at 75M" claim is overstated — it's "offset=2 shows no significant difference at 75M". The reversal across train steps is consistent with random-walk-around-zero given val_loss_std.
The synth-derived recommendation isn't refuted; it's unsupported.
A multi-seed test would clarify, but at 75M scale (~35 min per 1500-step run on vast.ai), 8 seeded runs cost ~$15 and ~5 hours (8 × 35 min). Small, but not free.

The empirical bottom line stands: at single-seed 75M, offset=1 and offset=2 are indistinguishable in val_bpc. The current v76 recipe is fine.