Buckets:

thomas-schweich
/

pretraining-ablations

Files

xet

thomas-schweich/pretraining-ablations / lab-notes.md

thomas-schweich

about 1 month ago

preview code

download

raw

12.6 kB

	# Pretraining Ablations Lab Notes

	## Experiment: mate-boost, no-outcome, discard-ply-limit
	Started: 2026-04-03
	Pod: 3x RTX 6000 Ada (48GB each), $2.31/hr

	## Phase 1: LR Exploration (5K steps each, no_compile)

	### Batch 1: mate-boost (DONE)

	\| Trial \| LR \| val_loss \| val_acc \| top5 \| ppl \| legal% \| Note \|
	\|-------\|----\|---------\|---------\|------\|-----\|--------\|------\|
	\| 15 \| 1e-4 \| 4.712 \| 3.6% \| 15.5% \| 111.3 \| 73.7% \| Dominated, still in warmup \|
	\| 16 \| 3e-4 \| 4.019 \| 4.4% \| 19.3% \| 55.6 \| 89.1% \| Solid \|
	\| 17 \| 1e-3 \| 3.629 \| 5.2% \| 21.6% \| 37.7 \| 93.3% \| Winner \|

	### Batch 2: no-outcome (DONE)

	\| Trial \| LR \| Warmup \| val_loss \| val_acc \| top5 \| ppl \| legal% \| Note \|
	\|-------\|----\|--------\|---------\|---------\|------\|-----\|--------\|------\|
	\| 18 \| 1e-4 \| 5% \| 4.585 \| 4.3% \| 18.4% \| 98.0 \| 76.6% \| Dominated, still in warmup \|
	\| 19 \| 3e-4 \| 0% \| 3.444 \| 5.8% \| 24.3% \| 31.3 \| 94.2% \| Winner (no-warmup) \|
	\| 20 \| 1e-3 \| 5% \| 3.529 \| 5.6% \| 24.3% \| 34.1 \| 94.5% \| Close second, still warming \|

	### Batch 3: discard-ply-limit (DONE)

	\| Trial \| LR \| val_loss \| val_acc \| top5 \| ppl \| legal% \| Note \|
	\|-------\|----\|---------\|---------\|------\|-----\|--------\|------\|
	\| 21 \| 1e-4 \| 4.743 \| 3.7% \| 16.3% \| 114.7 \| 72.8% \| Dominated, still in warmup \|
	\| 22 \| 3e-4 \| 4.054 \| 4.6% \| 20.1% \| 57.7 \| 87.4% \| Middle ground \|
	\| 23 \| 1e-3 \| 3.617 \| 5.4% \| 22.6% \| 37.2 \| 93.2% \| Winner \|

	## Phase 2: Full Training (200K steps, torch.compile ON)

	\| Trial \| Ablation \| LR \| Resumed From \| GPU \| Status \|
	\|-------\|----------\|-----\|-------------\|-----\|--------\|
	\| 24 \| mate-boost \| 1e-3 \| trial 17 @ 5K \| 0 \| RUNNING \|
	\| 25 \| no-outcome \| 3e-4 \| trial 19 @ 5K \| 1 \| RUNNING \|
	\| 26 \| discard-ply-limit \| 1e-3 \| trial 23 @ 5K \| 2 \| RUNNING \|

	torch.compile will take 15-30 min to compile. After that, expect ~2.5 sps → ~22h for 195K steps.
	ETA for completion: ~2026-04-04 04:30 UTC.

	## Known Issues

	- pause_after_steps not working: Trials 15-17 continued past 5K steps. Had to kill manually. Will need to kill batch 2 manually too after 5K eval.

	## Log

	### 2026-04-03 04:38 UTC
	- Disabled MPS (was funneling all trials to GPU 0)
	- Removed sdpa_math from configs (NVIDIA GPUs use flash attention)
	- Launched batch 1: mate-boost x3 LRs

	### 2026-04-03 05:22 UTC
	- Batch 1 complete. lr=1e-3 dominates across all metrics.
	- pause_after_steps bug: trials continued past 5K, had to kill at ~5200.
	- Killed batch 1, launched batch 2: no-outcome x3 LRs.
	- Trial 19 (lr=3e-4) uses warmup_frac=0 to test no-warmup per user request.
	- ETA for batch 2 eval: ~05:57 UTC

	### 2026-04-03 05:50 UTC
	- New session picked up. Batch 2 (no-outcome) running: trials 18/19/20 at step ~4000.
	- Trial 18 (lr=1e-4): step 4000, train_loss=4.876, still in warmup (lr=4e-5), val_loss@2.5K=5.455
	- Trial 19 (lr=3e-4, no-warmup): step 3900, train_loss=3.514, lr=3e-4 (flat), val_loss@2.5K=3.710
	- Trial 20 (lr=1e-3): step 3900, train_loss=3.700, still warming (lr=3.9e-4), val_loss@2.5K=4.102
	- Set up 5-min cron to catch 5K eval, kill trials, launch batch 3.
	- Cost so far: $3.36 (1h27m elapsed)

	### 2026-04-03 05:57 UTC
	- Batch 2 (no-outcome) complete at 5K steps. Results:
	- lr=1e-4: val_loss=4.585 (dominated, warmup too slow)
	- lr=3e-4 (no warmup): val_loss=3.444, acc=5.8% — WINNER
	- lr=1e-3: val_loss=3.529 — close second but still warming up
	- Killed all 3, launched batch 3 (discard-ply-limit): trials 21/22/23
	- Note: discard-ply-limit discards ~60% of games. Watch step times for slowdown.
	- ETA for batch 3 eval: ~06:30 UTC

	### 2026-04-03 06:39 UTC
	- Batch 3 (discard-ply-limit) complete at 5K steps. Results:
	- lr=1e-4: val_loss=4.743 (dominated)
	- lr=3e-4: val_loss=4.054 (middle)
	- lr=1e-3: val_loss=3.617, acc=5.4% — WINNER
	- Phase 1 COMPLETE. All 3 ablations: lr=1e-3 wins for mate-boost and discard-ply-limit, lr=3e-4 wins for no-outcome.
	- Interesting: no-outcome at 3e-4 (val_loss=3.444) beats both other ablations at 1e-3 at 5K steps.
	- Launched Phase 2: trials 24/25/26 resuming from 5K checkpoints, torch.compile ON.
	- Switched cron from 5-min to hourly (long runs, ~22h remaining).
	- Note: 1e-4 consistently dominated across ALL ablations. 5% warmup on 200K steps means lr only reaches 5e-5 by step 5K. Consider shorter warmup for future experiments.

	## Phase 1 Winners
	\| Ablation \| Winner Trial \| LR \| val_loss@5K \| acc@5K \|
	\|----------\|-------------\|-----\|-------------\|--------\|
	\| mate-boost \| 17 \| 1e-3 \| 3.629 \| 5.2% \|
	\| no-outcome \| 19 \| 3e-4 \| 3.444 \| 5.8% \|
	\| discard-ply-limit \| 23 \| 1e-3 \| 3.617 \| 5.4% \|

	### 2026-04-03 07:10 UTC — Phase 2 check-in (30 min in)
	All 3 trials stable, approaching 10K eval. Grad norms low (<1.2), no anomalies.

	\| Trial \| Ablation \| Step \| train_loss \| acc \| LR \| g/s \|
	\|-------\|----------\|------\|-----------\|------\|------\|-----\|
	\| 24 \| mate-boost \| 9600 \| 3.407 \| 5.8% \| 9.6e-4 \| 640 \|
	\| 25 \| no-outcome \| 9400 \| 3.276 \| 6.3% \| 3.0e-4 \| 634 \|
	\| 26 \| discard-ply-limit \| 9500 \| 3.399 \| 6.1% \| 9.5e-4 \| 652 \|

	- Trial 25 (no-outcome) leading — lowest loss, highest acc. Interesting given it uses 3x lower LR.
	- Trials 24 & 26 still in warmup (lr=9.5e-4, peak 1e-3 at step 10K). Should accelerate after warmup.
	- Cost: ~$6.30 (~2h45m elapsed). Budget on track.
	- Fixed `pause_after_steps` and `lab_resume` bugs in code.
	- HF bucket synced (metrics only, checkpoints excluded).

	### 2026-04-03 08:10 UTC — Phase 2 check-in (~1.5h in, ~9% done)

	\| Trial \| Ablation \| Step \| train_loss \| acc \| LR \| gn \|
	\|-------\|----------\|------\|-----------\|------\|------\|-----\|
	\| 24 \| mate-boost \| 18700 \| 3.283 \| 6.2% \| 9.95e-4 \| 0.22 \|
	\| 25 \| no-outcome \| 18300 \| 3.160 \| 6.7% \| 2.94e-4 \| 0.53 \|
	\| 26 \| discard-ply-limit \| 18700 \| 3.254 \| 6.9% \| 9.95e-4 \| 0.20 \|

	- All past warmup (peaked at 1e-3), now in cosine decay. Very stable.
	- Trial 25 (no-outcome) still lowest loss. Trial 26 (discard-ply-limit) best accuracy.
	- ~2.5 sps sustained. ETA unchanged: ~2026-04-04 04:30 UTC.

	### 2026-04-03 09:10 UTC — Phase 2 check-in (~2.5h in, ~14%)

	\| Trial \| Ablation \| Step \| train_loss \| acc \| LR \| gn \|
	\|-------\|----------\|------\|-----------\|------\|------\|-----\|
	\| 24 \| mate-boost \| 27800 \| 3.242 \| 6.4% \| 9.81e-4 \| 0.13 \|
	\| 25 \| no-outcome \| 27200 \| 3.123 \| 7.1% \| 2.88e-4 \| 0.32 \|
	\| 26 \| discard-ply-limit \| 27800 \| 3.248 \| 6.8% \| 9.81e-4 \| 0.14 \|

	- Steady progress. Trial 25 still leading on loss; trial 26 competitive on accuracy.
	- Grad norms extremely low (0.12-0.32) — stable regime, no risk of divergence.

	### 2026-04-03 10:10 UTC — Phase 2 check-in (~3.5h in, ~18%)

	\| Trial \| Ablation \| Step \| train_loss \| acc \| LR \|
	\|-------\|----------\|------\|-----------\|------\|------\|
	\| 24 \| mate-boost \| 37000 \| 3.246 \| 6.5% \| 9.56e-4 \|
	\| 25 \| no-outcome \| 36100 \| 3.133 \| 6.7% \| 2.79e-4 \|
	\| 26 \| discard-ply-limit \| 36900 \| 3.191 \| 7.4% \| 9.56e-4 \|

	- Trial 26 (discard-ply-limit) now best accuracy at 7.4%, overtaking trial 25.
	- Trial 25 still lowest loss but accuracy plateauing. Different distribution?
	- HF bucket synced (metrics + lab notes).

	### 2026-04-03 11:10 UTC — Phase 2 check-in (~4.5h in, ~23%)

	\| Trial \| Ablation \| Step \| train_loss \| acc \| LR \|
	\|-------\|----------\|------\|-----------\|------\|------\|
	\| 24 \| mate-boost \| 46100 \| 3.213 \| 6.5% \| 9.22e-4 \|
	\| 25 \| no-outcome \| 44900 \| 3.115 \| 6.9% \| 2.68e-4 \|
	\| 26 \| discard-ply-limit \| 46000 \| 3.215 \| 7.1% \| 9.23e-4 \|

	- Stable. Trial 26 still best accuracy, trial 25 still lowest loss.
	- All losses still gradually decreasing — no signs of plateau yet.

	### 2026-04-03 20:10 UTC — Phase 2 check-in (~13.5h in, ~65%)

	\| Trial \| Ablation \| Step \| train_loss \| acc \| LR \|
	\|-------\|----------\|------\|-----------\|------\|------\|
	\| 24 \| mate-boost \| 129300 \| 3.200 \| 6.5% \| 3.74e-4 \|
	\| 25 \| no-outcome \| 125000 \| 3.078 \| 6.9% \| 1.13e-4 \|
	\| 26 \| discard-ply-limit \| 128500 \| 3.193 \| 7.6% \| 3.80e-4 \|

	- Trial 25 val eval at 125K: val_loss=3.097, acc=6.8%, top5=27.8%, legal=99.6%
	- Trial 26 train acc now at 7.6-7.8% — clearly best accuracy.
	- Accuracy ceiling computation running concurrently on CPUs (~65% done).
	- ETA for training: ~04:30 UTC. ETA for ceiling: ~22:00 UTC.

	### 2026-04-04 01:10 UTC — Phase 2 check-in (~18.5h in, ~87%)

	\| Trial \| Ablation \| Step \| train_loss \| acc \| LR \|
	\|-------\|----------\|------\|-----------\|------\|------\|
	\| 24 \| mate-boost \| 174500 \| 3.170 \| 6.7% \| 1.39e-4 \|
	\| 25 \| no-outcome \| 169800 \| 3.082 \| 7.2% \| 4.49e-5 \|
	\| 26 \| discard-ply-limit \| 173600 \| 3.171 \| 7.6% \| 1.42e-4 \|

	- ~26K steps remaining, ETA ~04:00 UTC.
	- Ceiling computation DONE (1024 rollouts, 88K positions, 5.7h):
	- Unconditional: 6.43% [6.36, 6.50]
	- MC corrected: 6.68% [6.60, 6.75]
	- MC naive: 6.89% [6.81, 6.96]
	- Bracket: 0.21pp (was 0.66pp at 128 rollouts)
	- PAWN-Base (6.90%) at 103% of corrected ceiling — essentially at theoretical max.
	- Results saved to /workspace/data/theoretical_ceiling_1024.json and synced to HF.

	### 2026-04-04 02:10 UTC — Trial 24 NaN! Probes started.

	- Trial 24 (mate-boost) hit NaN between step 175K-180K. Killed at 184K.
	- Last good checkpoint: step 175K, val_loss=3.1860, acc=6.59%
	- Loss was flat (3.189→3.186 over last 20K steps) — 175K is effectively final.
	- Possible cause: bfloat16 AMP instability. Mate-boost games are shorter (~134 ply), concentrating loss on fewer tokens per batch.
	- Trials 25/26 healthy at step ~179K/183K.
	- Started post-training for trial 24:
	- Uploading 175K checkpoint to HF bucket (background)
	- Running linear probes on GPU 0 (background)
	- GPU 0 now doing probes. GPUs 1/2 still training.

	### 2026-04-04 04:10 UTC — Trial 26 COMPLETE. Trial 25 finishing.

	- Trial 26 (discard-ply-limit) COMPLETED at 200K steps:
	- val_loss=3.147, acc=7.85%, top5=27.8%, legal=99.4%
	- Best accuracy of all 3 ablations
	- Trial 25 (no-outcome) at step 196K — ~25 min remaining.
	- Started post-training for trial 26: checkpoint upload + probes on GPU 2.
	- Trial 24 probes still running on GPU 0.

	## Final Results (as trials complete)

	\| Trial \| Ablation \| Steps \| Best val_loss \| Best acc \| Status \|
	\|-------\|----------\|-------\|--------------\|----------\|--------\|
	\| 24 \| mate-boost \| 175K (NaN@177K) \| 3.186 \| 6.59% \| done (NaN) \|
	\| 25 \| no-outcome \| 196K/200K \| TBD \| TBD \| running \|
	\| 26 \| discard-ply-limit \| 200K \| 3.147 \| 7.85% \| done \|
	\| 25 \| no-outcome \| 200K \| 3.089 \| 6.83% \| done \|
	\| 26 \| discard-ply-limit \| 200K \| 3.147 \| 7.85% \| done \|
	\| — \| baseline (pawn-base) \| 100K \| ~3.06 \| ~6.90% \| reference \|

	### 2026-04-04 04:45 UTC — ALL TRAINING COMPLETE

	- Trial 25 (no-outcome) completed at 200K steps: val_loss=3.089, acc=6.83%
	- All GPUs idle. Uploading final checkpoints to HF.
	- Probes skipped (performance issues — runs on CPU, too slow on pod).
	- Drafting final report.

	### 2026-04-03 12:10 UTC — Phase 2 check-in (~5.5h in, ~28%)

	\| Trial \| Ablation \| Step \| train_loss \| acc \| LR \|
	\|-------\|----------\|------\|-----------\|------\|------\|
	\| 24 \| mate-boost \| 55300 \| 3.237 \| 6.2% \| 8.80e-4 \|
	\| 25 \| no-outcome \| 53800 \| 3.122 \| 6.8% \| 2.55e-4 \|
	\| 26 \| discard-ply-limit \| 55200 \| 3.214 \| 7.0% \| 8.80e-4 \|

	- Cruising. Loss improvement slowing (expected as cosine decay kicks in).
	- Relative rankings unchanged: T25 best loss, T26 best acc, T24 trailing slightly.

	### 2026-04-03 14:10 UTC — Phase 2 check-in (~7.5h in, ~37%)

	\| Trial \| Ablation \| Step \| train_loss \| acc \| LR \|
	\|-------\|----------\|------\|-----------\|------\|------\|
	\| 24 \| mate-boost \| 73600 \| 3.204 \| 6.6% \| 7.73e-4 \|
	\| 25 \| no-outcome \| 71600 \| 3.094 \| 7.0% \| 2.23e-4 \|
	\| 26 \| discard-ply-limit \| 73500 \| 3.192 \| 7.5% \| 7.74e-4 \|

	- Steady. Trial 26 acc now at 7.5%. All losses still slowly decreasing.
	- HF bucket synced (metrics + lab notes + transcript).

	### 2026-04-03 17:10 UTC — Phase 2 HALFWAY (~10.5h in, ~50%)

	\| Trial \| Ablation \| Step \| train_loss \| acc \| LR \|
	\|-------\|----------\|------\|-----------\|------\|------\|
	\| 24 \| mate-boost \| 101600 \| 3.206 \| 6.5% \| 5.75e-4 \|
	\| 25 \| no-outcome \| 98300 \| 3.090 \| 6.9% \| 1.69e-4 \|
	\| 26 \| discard-ply-limit \| 101000 \| 3.179 \| 7.4% \| 5.80e-4 \|

	- 100K milestone reached. This is where baseline PAWN-Base stopped.
	- Baseline reference: PAWN-Base val_loss ~3.06 at 100K steps.
	- Trial 25 (no-outcome) at 3.09 loss — nearly matching baseline despite no outcome token.
	- Trial 26 (discard-ply-limit) best accuracy at 7.4%.
	- LR decaying steadily (cosine). Trials 24/26 at ~5.8e-4, trial 25 at ~1.7e-4.
	- ETA unchanged: ~2026-04-04 04:30 UTC. Cost so far: ~$26 (~11.3h × $2.31).

Xet Storage Details

Size:: 12.6 kB
Xet hash:: b1c447b86e7834339925d889251a0a3dedacab8f79b5d138635bea8afab1cb1a

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.