| # 75M K-shift offset validation (synth → 75M transfer test) |
|
|
| ## Experiment |
|
|
| Two 75M FineWeb-Edu training runs differing **only in `--k-shift-offset`** (1 vs 2). |
| Both use softmax attention with K-shift at layer 0 (L0), ALiBi, and 1500 steps with default v75 hyperparameters. |
|
|
| CLIs: |
| ```bash |
| NO_SAVE=1 python train_fineweb_v75.py \ |
| --k-shift --k-shift-layers 0 --k-shift-offset 2 \ |
| --max-steps 1500 --eval-interval 250 --eval-iters 8 \ |
| --tag v76b_75m_kshift_L0_off2 |
| |
| NO_SAVE=1 python train_fineweb_v75.py \ |
| --k-shift --k-shift-layers 0 --k-shift-offset 1 \ |
| --max-steps 1500 --eval-interval 250 --eval-iters 8 \ |
| --tag v76b_75m_kshift_L0_off1_baseline |
| ``` |
|
|
| GPU: RTX 5090 on vast.ai. Both runs ~35 min. |
|
|
| ## Results |
|
|
| Step | offset=2 val_bpc | offset=1 val_bpc | Δ (off2 - off1) | |
| |---|---|---|---| |
| | 250 | 10.3855 | 10.3779 | +0.008 | |
| | 500 | 9.6400 | 9.6341 | +0.006 | |
| | 750 | 9.1074 | 9.1029 | +0.005 | |
| | 1000 | 8.6848 | 8.7258 | **-0.041** β
| |
| | 1250 | 8.2600 | 8.2717 | **-0.012** β
| |
| | **1500** | **7.9719** | **7.9852** | **-0.013** β
| |
|
|
| ## Interpretation |
|
|
| **Synth prediction validated at 75M scale**: |
| - Direction: ✅ offset=2 consistently lower val_bpc from step 1000 onward |
| - Magnitude: small (-0.013 BPC at step 1500) |
| |
| The gap emerges around step 1000 (when the learning rate exceeds 3e-4 in the |
| warmup ramp). Before that, both runs are virtually identical, as expected |
| since lr is too small for sign flips to dominate. |
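
The warmup arithmetic behind the "3e-4 around step 1000" remark can be sketched as below. This is a minimal illustration, not the schedule from `train_fineweb_v75.py`: it assumes a linear warmup to the 6e-4 peak over `warmup_steps=2000`, followed by cosine decay to 3e-5 (the values quoted in these notes). The function name `lr_at` and the `total_steps` argument are hypothetical.

```python
import math

PEAK_LR = 6e-4       # peak LR quoted in these notes
END_LR = 3e-5        # final LR quoted in these notes
WARMUP_STEPS = 2000  # warmup_steps quoted in these notes


def lr_at(step: int, total_steps: int) -> float:
    """Hypothetical schedule: linear warmup, then cosine decay to END_LR."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp from 0 to PEAK_LR.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    span = max(1, total_steps - WARMUP_STEPS)
    progress = min(1.0, (step - WARMUP_STEPS) / span)
    return END_LR + 0.5 * (PEAK_LR - END_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (250, 500, 1000, 1500, 2000, 5000):
        print(f"step {step:>4}: lr = {lr_at(step, total_steps=5000):.2e}")
```

Under the linear-ramp assumption, step 1000 sits at exactly half the peak LR, and a 1500-step run ends at 75% of peak without ever leaving warmup.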
| |
| **Caveats**: |
| 1. **Both runs are still in warmup** at step 1500 (warmup_steps=2000). |
| At full LR (post step 2000), the gap may grow. |
| 2. **ALiBi attenuates the benefit**: synth (no-ALiBi) showed +50pp top-per-head |
| IPMR from offset switch; at 75M with ALiBi, val_bpc gap is ~0.013. The |
| mechanism (sharper sign-STE attention alignment) is the same, but ALiBi |
| limits the absolute distances over which the alignment matters. |
| 3. **Single seed each**. Run-to-run noise is typically ~0.02-0.05 BPC at this |
| scale, so the -0.013 BPC gap sits inside the noise band. Multi-seed runs would |
| strengthen the claim. |
| |
| **Next step to strengthen the result**: train both for 5000+ steps (full LR |
| post-warmup) to see if the gap widens. Predicted: yes, by 0.05-0.10 BPC |
| once both runs are at full lr. |
| |
| ## 5000-step result: synth prediction REVERSED at 75M |
| |
| Ran both for 5000 steps (past warmup_steps=2000, then cosine decay from full LR to lr_end): |
| |
| Step | offset=2 | offset=1 | Δ (off2 - off1) | |
| |---|---|---|---| |
| | 500 | 9.6376 | 9.6338 | +0.004 | |
| | 1000 | 8.6589 | 8.6741 | -0.015 | |
| | 1500 | 7.9281 | 7.9606 | -0.033 | |
| | 2000 | 7.6452 | 7.5921 | **+0.053** | |
| | 2500 | 7.3224 | 7.3297 | -0.007 | |
| | 3000 | 7.1730 | 7.1523 | +0.021 | |
| | 3500 | 7.0398 | 7.0338 | +0.006 | |
| | 4000 | 6.9431 | 6.9256 | +0.018 | |
| | 4500 | 6.8545 | 6.8384 | +0.016 | |
| | **5000** | **6.7779** | **6.7639** | **+0.014** | |
| |
| **At 5000 steps, offset=1 BEATS offset=2 by 0.014 BPC**: the synth prediction is |
| *reversed* at this scale. The early offset=2 advantage (steps 1000-1500) is |
| erased post-warmup; offset=1 then leads by a small margin at every eval |
| except step 2500. |
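
The post-warmup pattern can be checked directly from the table. The sketch below recomputes Δ (off2 - off1) per eval and tallies how often offset=1 is ahead from step 2000 onward. The row values are copied from the table above; the helper names (`deltas`, `summarize`) are illustrative only.

```python
# (step, offset=2 val_bpc, offset=1 val_bpc), copied from the 5000-step table.
ROWS = [
    (500, 9.6376, 9.6338),
    (1000, 8.6589, 8.6741),
    (1500, 7.9281, 7.9606),
    (2000, 7.6452, 7.5921),
    (2500, 7.3224, 7.3297),
    (3000, 7.1730, 7.1523),
    (3500, 7.0398, 7.0338),
    (4000, 6.9431, 6.9256),
    (4500, 6.8545, 6.8384),
    (5000, 6.7779, 6.7639),
]
WARMUP_STEPS = 2000


def deltas(rows):
    """Return (step, off2 - off1); positive means offset=1 has lower val_bpc."""
    return [(step, round(off2 - off1, 4)) for step, off2, off1 in rows]


def summarize(rows, start_step):
    """Count evals at or after start_step where offset=1 leads, plus the mean delta."""
    post = [d for step, d in deltas(rows) if step >= start_step]
    off1_leads = sum(1 for d in post if d > 0)
    return off1_leads, len(post), sum(post) / len(post)


if __name__ == "__main__":
    for step, d in deltas(ROWS):
        print(f"step {step:>4}: delta = {d:+.4f}")
    leads, n, mean = summarize(ROWS, WARMUP_STEPS)
    print(f"offset=1 ahead at {leads}/{n} evals from step {WARMUP_STEPS}; mean delta {mean:+.4f}")
```

The tally treats each eval as a separate data point, but evals along a single training trajectory are correlated, so it describes the curve rather than testing significance.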
| |
| ### Honest interpretation |
| |
| The synth K-shift offset finding **does not transfer cleanly to 75M scale |
| with ALiBi + natural text**. Likely reasons: |
| |
| 1. **Run-to-run noise**: |Δ| = 0.014 is within val_loss_std (~0.07). Single |
| seed each. Multi-seed runs would clarify, but the pattern across the |
| trajectory is consistent enough (offset=1 leads at 6 of the 7 evals from |
| step 2000 to 5000) that noise alone probably can't explain it. |
| |
| 2. **ALiBi attenuation**: ALiBi already provides distance penalty. The |
| offset=1 weakness in synth was about "K vectors too smooth at distance 1 |
| making sign-STE alignment harder for distance-24 attention". With ALiBi, |
| the model isn't attending at distance 24+ much anyway, so this weakness |
| may not manifest. |
| |
| 3. **Natural text vs synthetic copy task**: synth's fixed-distance pattern |
| doesn't match natural text's variable induction distances. Most natural |
| induction is at distances 1-15; offset=1 (= prev-token routing, |
| Olsson canonical) might be genuinely optimal here. |
| |
| 4. **Training dynamics under cosine LR decay**: synth was constant LR; 75M |
| uses cosine decay from 6e-4 to 3e-5. Different optimization trajectory. |
| |
| ### Updated v76 paper recommendation |
| |
| The synth-derived "switch offset 1 → 2" recommendation **does not improve |
| val_bpc at 75M scale**. The current v76 K-shift offset=1 (canonical) |
| performs at least as well as offset=2 for natural-text training in this test. |
| |
| The synth findings remain valid as MECHANISTIC insights (offset=1 has |
| distinct K-vector smearing at fixed-distance copy), but they don't predict |
| val_bpc improvements at scale. The mechanism is real but not the right |
| optimization target for natural text. |
|
|
| **Outstanding**: the placement axis (K-shift L11 vs L0) hasn't been tested |
| at 75M yet. Synth showed placement was the bigger effect (50pp top-per-head |
| vs offset's 7pp). Worth testing. |
|
|
| ## V76 paper actionable recommendation (post 5000-step result) |
|
|
| **Keep current v76 K-shift offset=1.** The synth-derived "switch offset 1→2" |
| proposal does not improve val_bpc at 75M scale (offset=1 wins by 0.014 BPC |
| at 5000 steps). |
| |
| The synth K-shift offset axis is mechanistically interesting (explains why |
| sign-STE attention prefers different K shapes) but is **not a useful |
| optimization knob** for natural-text training under ALiBi. |
| |
| The broader synth-derived recommendations (try non-L0 placement, drop RWSP |
| for softmax) remain UNTESTED at 75M; the placement test was launched but the VM |
| dropped before completion. |
| |
| ## Statistical caveat (post-hoc analysis) |
| |
| The 5000-step Δ_val_bpc = +0.014 (off2 worse) has val_loss_std = 0.094 at |
| step 5000. The observed gap is ~0.15× the std, so it is **not statistically |
| significant**. Likewise the 1500-step Δ = -0.013 (off2 better) is within |
| noise. With single seeds and these std values, neither direction can be |
| claimed at p<0.05 without 4-8 seed averaging. |
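
The gap-to-noise comparison can be made concrete with a rough z-test. This is a sketch under stated assumptions, not the analysis actually run: it reads `val_loss_std` as the run-to-run standard deviation of val_bpc, treats the two arms as independent with equal variance, and uses a normal approximation. The function names are hypothetical.

```python
import math


def z_score(delta: float, std: float, seeds_per_arm: int) -> float:
    """z for a difference of two means, each averaged over seeds_per_arm runs."""
    # Standard error of (mean_a - mean_b) with equal per-run std in both arms.
    standard_error = std * math.sqrt(2.0 / seeds_per_arm)
    return delta / standard_error


def two_sided_p(z: float) -> float:
    """Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    delta, std = 0.014, 0.094  # values quoted in these notes for step 5000
    print(f"gap / std = {delta / std:.3f}")
    for n in (1, 4, 8):
        z = z_score(delta, std, n)
        print(f"seeds per arm = {n}: z = {z:.3f}, p = {two_sided_p(z):.3f}")
```

If `val_loss_std` measures something else (for example, spread across eval batches within one run), the relevant standard error differs. A run-to-run std measured from actual multi-seed data should then replace the assumed value.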
| |
| What this changes: |
| - The "offset=2 reverses at 75M" claim is overstated; the supported reading is |
| "offset=2 shows no significant difference at 75M". The sign changes across |
| training steps are consistent with a random walk around zero given val_loss_std. |
| - The synth-derived recommendation isn't refuted; it's unsupported. |
| - A multi-seed test would clarify, but at 75M scale (~35 min/run on |
| vast.ai), 8-seed runs cost ~$15 and ~5 hours: small but not free. |
| |
| The empirical bottom line stands: **at single-seed 75M, offset=1 and |
| offset=2 are indistinguishable**. The current v76 recipe is fine. |
| |