Buckets:
| # Pretraining Ablations Lab Notes | |
| ## Experiment: mate-boost, no-outcome, discard-ply-limit | |
| Started: 2026-04-03 | |
| Pod: 3x RTX 6000 Ada (48GB each), $2.31/hr | |
| ## Phase 1: LR Exploration (5K steps each, no_compile) | |
| ### Batch 1: mate-boost (DONE) | |
| | Trial | LR | val_loss | val_acc | top5 | ppl | legal% | Note | | |
| |-------|----|---------|---------|------|-----|--------|------| | |
| | 15 | 1e-4 | 4.712 | 3.6% | 15.5% | 111.3 | 73.7% | Dominated, still in warmup | | |
| | 16 | 3e-4 | 4.019 | 4.4% | 19.3% | 55.6 | 89.1% | Solid | | |
| | **17** | **1e-3** | **3.629** | **5.2%** | **21.6%** | **37.7** | **93.3%** | **Winner** | | |
| ### Batch 2: no-outcome (DONE) | |
| | Trial | LR | Warmup | val_loss | val_acc | top5 | ppl | legal% | Note | | |
| |-------|----|--------|---------|---------|------|-----|--------|------| | |
| | 18 | 1e-4 | 5% | 4.585 | 4.3% | 18.4% | 98.0 | 76.6% | Dominated, still in warmup | | |
| | **19** | **3e-4** | **0%** | **3.444** | **5.8%** | **24.3%** | **31.3** | **94.2%** | **Winner** (no-warmup) | | |
| | 20 | 1e-3 | 5% | 3.529 | 5.6% | 24.3% | 34.1 | 94.5% | Close second, still warming | | |
| ### Batch 3: discard-ply-limit (DONE) | |
| | Trial | LR | val_loss | val_acc | top5 | ppl | legal% | Note | | |
| |-------|----|---------|---------|------|-----|--------|------| | |
| | 21 | 1e-4 | 4.743 | 3.7% | 16.3% | 114.7 | 72.8% | Dominated, still in warmup | | |
| | 22 | 3e-4 | 4.054 | 4.6% | 20.1% | 57.7 | 87.4% | Middle ground | | |
| | **23** | **1e-3** | **3.617** | **5.4%** | **22.6%** | **37.2** | **93.2%** | **Winner** | | |
| ## Phase 2: Full Training (200K steps, torch.compile ON) | |
| | Trial | Ablation | LR | Resumed From | GPU | Status | | |
| |-------|----------|-----|-------------|-----|--------| | |
| | 24 | mate-boost | 1e-3 | trial 17 @ 5K | 0 | RUNNING | | |
| | 25 | no-outcome | 3e-4 | trial 19 @ 5K | 1 | RUNNING | | |
| | 26 | discard-ply-limit | 1e-3 | trial 23 @ 5K | 2 | RUNNING | | |
| torch.compile will take 15-30 min to compile. After that, expect ~2.5 sps → ~22h for 195K steps. | |
| ETA for completion: ~2026-04-04 04:30 UTC. | |
| ## Known Issues | |
| - **pause_after_steps not working**: Trials 15-17 continued past 5K steps. Had to kill manually. Will need to kill batch 2 manually too after 5K eval. | |
| ## Log | |
| ### 2026-04-03 04:38 UTC | |
| - Disabled MPS (was funneling all trials to GPU 0) | |
| - Removed sdpa_math from configs (NVIDIA GPUs use flash attention) | |
| - Launched batch 1: mate-boost x3 LRs | |
| ### 2026-04-03 05:22 UTC | |
| - Batch 1 complete. lr=1e-3 dominates across all metrics. | |
| - pause_after_steps bug: trials continued past 5K, had to kill at ~5200. | |
| - Killed batch 1, launched batch 2: no-outcome x3 LRs. | |
| - Trial 19 (lr=3e-4) uses warmup_frac=0 to test no-warmup per user request. | |
| - ETA for batch 2 eval: ~05:57 UTC | |
| ### 2026-04-03 05:50 UTC | |
| - New session picked up. Batch 2 (no-outcome) running: trials 18/19/20 at step ~4000. | |
| - Trial 18 (lr=1e-4): step 4000, train_loss=4.876, still in warmup (lr=4e-5), val_loss@2.5K=5.455 | |
| - Trial 19 (lr=3e-4, no-warmup): step 3900, train_loss=3.514, lr=3e-4 (flat), val_loss@2.5K=3.710 | |
| - Trial 20 (lr=1e-3): step 3900, train_loss=3.700, still warming (lr=3.9e-4), val_loss@2.5K=4.102 | |
| - Set up 5-min cron to catch 5K eval, kill trials, launch batch 3. | |
| - Cost so far: $3.36 (1h27m elapsed) | |
| ### 2026-04-03 05:57 UTC | |
| - Batch 2 (no-outcome) complete at 5K steps. Results: | |
| - lr=1e-4: val_loss=4.585 (dominated, warmup too slow) | |
| - lr=3e-4 (no warmup): val_loss=3.444, acc=5.8% — WINNER | |
| - lr=1e-3: val_loss=3.529 — close second but still warming up | |
| - Killed all 3, launched batch 3 (discard-ply-limit): trials 21/22/23 | |
| - **Note**: discard-ply-limit discards ~60% of games. Watch step times for slowdown. | |
| - ETA for batch 3 eval: ~06:30 UTC | |
| ### 2026-04-03 06:39 UTC | |
| - Batch 3 (discard-ply-limit) complete at 5K steps. Results: | |
| - lr=1e-4: val_loss=4.743 (dominated) | |
| - lr=3e-4: val_loss=4.054 (middle) | |
| - lr=1e-3: val_loss=3.617, acc=5.4% — WINNER | |
| - **Phase 1 COMPLETE.** All 3 ablations: lr=1e-3 wins for mate-boost and discard-ply-limit, lr=3e-4 wins for no-outcome. | |
| - Interesting: no-outcome at 3e-4 (val_loss=3.444) beats both other ablations at 1e-3 at 5K steps. | |
| - Launched Phase 2: trials 24/25/26 resuming from 5K checkpoints, torch.compile ON. | |
| - Switched cron from 5-min to hourly (long runs, ~22h remaining). | |
| - Note: 1e-4 consistently dominated across ALL ablations. 5% warmup on 200K steps means lr only reaches 5e-5 by step 5K. Consider shorter warmup for future experiments. | |
| ## Phase 1 Winners | |
| | Ablation | Winner Trial | LR | val_loss@5K | acc@5K | | |
| |----------|-------------|-----|-------------|--------| | |
| | mate-boost | 17 | 1e-3 | 3.629 | 5.2% | | |
| | no-outcome | 19 | 3e-4 | 3.444 | 5.8% | | |
| | discard-ply-limit | 23 | 1e-3 | 3.617 | 5.4% | | |
| ### 2026-04-03 07:10 UTC — Phase 2 check-in (30 min in) | |
| All 3 trials stable, approaching 10K eval. Grad norms low (<1.2), no anomalies. | |
| | Trial | Ablation | Step | train_loss | acc | LR | g/s | | |
| |-------|----------|------|-----------|------|------|-----| | |
| | 24 | mate-boost | 9600 | 3.407 | 5.8% | 9.6e-4 | 640 | | |
| | 25 | no-outcome | 9400 | 3.276 | 6.3% | 3.0e-4 | 634 | | |
| | 26 | discard-ply-limit | 9500 | 3.399 | 6.1% | 9.5e-4 | 652 | | |
| - Trial 25 (no-outcome) leading — lowest loss, highest acc. Interesting given it uses 3x lower LR. | |
| - Trials 24 & 26 still in warmup (lr=9.5e-4, peak 1e-3 at step 10K). Should accelerate after warmup. | |
| - Cost: ~$6.30 (~2h45m elapsed). Budget on track. | |
| - Fixed `pause_after_steps` and `lab_resume` bugs in code. | |
| - HF bucket synced (metrics only, checkpoints excluded). | |
| ### 2026-04-03 08:10 UTC — Phase 2 check-in (~1.5h in, ~9% done) | |
| | Trial | Ablation | Step | train_loss | acc | LR | gn | | |
| |-------|----------|------|-----------|------|------|-----| | |
| | 24 | mate-boost | 18700 | 3.283 | 6.2% | 9.95e-4 | 0.22 | | |
| | 25 | no-outcome | 18300 | 3.160 | 6.7% | 2.94e-4 | 0.53 | | |
| | 26 | discard-ply-limit | 18700 | 3.254 | 6.9% | 9.95e-4 | 0.20 | | |
| - All past warmup (peaked at 1e-3), now in cosine decay. Very stable. | |
| - Trial 25 (no-outcome) still lowest loss. Trial 26 (discard-ply-limit) best accuracy. | |
| - ~2.5 sps sustained. ETA unchanged: ~2026-04-04 04:30 UTC. | |
| ### 2026-04-03 09:10 UTC — Phase 2 check-in (~2.5h in, ~14%) | |
| | Trial | Ablation | Step | train_loss | acc | LR | gn | | |
| |-------|----------|------|-----------|------|------|-----| | |
| | 24 | mate-boost | 27800 | 3.242 | 6.4% | 9.81e-4 | 0.13 | | |
| | 25 | no-outcome | 27200 | 3.123 | 7.1% | 2.88e-4 | 0.32 | | |
| | 26 | discard-ply-limit | 27800 | 3.248 | 6.8% | 9.81e-4 | 0.14 | | |
| - Steady progress. Trial 25 still leading on loss; trial 26 competitive on accuracy. | |
| - Grad norms extremely low (0.12-0.32) — stable regime, no risk of divergence. | |
| ### 2026-04-03 10:10 UTC — Phase 2 check-in (~3.5h in, ~18%) | |
| | Trial | Ablation | Step | train_loss | acc | LR | | |
| |-------|----------|------|-----------|------|------| | |
| | 24 | mate-boost | 37000 | 3.246 | 6.5% | 9.56e-4 | | |
| | 25 | no-outcome | 36100 | 3.133 | 6.7% | 2.79e-4 | | |
| | 26 | discard-ply-limit | 36900 | 3.191 | 7.4% | 9.56e-4 | | |
| - Trial 26 (discard-ply-limit) now best accuracy at 7.4%, overtaking trial 25. | |
| - Trial 25 still lowest loss but accuracy plateauing. Different distribution? | |
| - HF bucket synced (metrics + lab notes). | |
| ### 2026-04-03 11:10 UTC — Phase 2 check-in (~4.5h in, ~23%) | |
| | Trial | Ablation | Step | train_loss | acc | LR | | |
| |-------|----------|------|-----------|------|------| | |
| | 24 | mate-boost | 46100 | 3.213 | 6.5% | 9.22e-4 | | |
| | 25 | no-outcome | 44900 | 3.115 | 6.9% | 2.68e-4 | | |
| | 26 | discard-ply-limit | 46000 | 3.215 | 7.1% | 9.23e-4 | | |
| - Stable. Trial 26 still best accuracy, trial 25 still lowest loss. | |
| - All losses still gradually decreasing — no signs of plateau yet. | |
| ### 2026-04-03 20:10 UTC — Phase 2 check-in (~13.5h in, ~65%) | |
| | Trial | Ablation | Step | train_loss | acc | LR | | |
| |-------|----------|------|-----------|------|------| | |
| | 24 | mate-boost | 129300 | 3.200 | 6.5% | 3.74e-4 | | |
| | 25 | no-outcome | 125000 | 3.078 | 6.9% | 1.13e-4 | | |
| | 26 | discard-ply-limit | 128500 | 3.193 | 7.6% | 3.80e-4 | | |
| - Trial 25 val eval at 125K: val_loss=3.097, acc=6.8%, top5=27.8%, legal=99.6% | |
| - Trial 26 train acc now at 7.6-7.8% — clearly best accuracy. | |
| - Accuracy ceiling computation running concurrently on CPUs (~65% done). | |
| - ETA for training: ~04:30 UTC. ETA for ceiling: ~22:00 UTC. | |
| ### 2026-04-04 01:10 UTC — Phase 2 check-in (~18.5h in, ~87%) | |
| | Trial | Ablation | Step | train_loss | acc | LR | | |
| |-------|----------|------|-----------|------|------| | |
| | 24 | mate-boost | 174500 | 3.170 | 6.7% | 1.39e-4 | | |
| | 25 | no-outcome | 169800 | 3.082 | 7.2% | 4.49e-5 | | |
| | 26 | discard-ply-limit | 173600 | 3.171 | 7.6% | 1.42e-4 | | |
| - ~26K steps remaining, ETA ~04:00 UTC. | |
| - **Ceiling computation DONE** (1024 rollouts, 88K positions, 5.7h): | |
| - Unconditional: 6.43% [6.36, 6.50] | |
| - MC corrected: 6.68% [6.60, 6.75] | |
| - MC naive: 6.89% [6.81, 6.96] | |
| - Bracket: 0.21pp (was 0.66pp at 128 rollouts) | |
| - PAWN-Base (6.90%) at 103% of corrected ceiling — essentially at theoretical max. | |
| - Results saved to /workspace/data/theoretical_ceiling_1024.json and synced to HF. | |
| ### 2026-04-04 02:10 UTC — Trial 24 NaN! Probes started. | |
| - **Trial 24 (mate-boost) hit NaN** between step 175K-180K. Killed at 184K. | |
| - Last good checkpoint: step 175K, val_loss=3.1860, acc=6.59% | |
| - Loss was flat (3.189→3.186 over last 20K steps) — 175K is effectively final. | |
| - Possible cause: bfloat16 AMP instability. Mate-boost games are shorter (~134 ply), concentrating loss on fewer tokens per batch. | |
| - Trials 25/26 healthy at step ~179K/183K. | |
| - **Started post-training for trial 24:** | |
| - Uploading 175K checkpoint to HF bucket (background) | |
| - Running linear probes on GPU 0 (background) | |
| - GPU 0 now doing probes. GPUs 1/2 still training. | |
| ### 2026-04-04 04:10 UTC — Trial 26 COMPLETE. Trial 25 finishing. | |
| - **Trial 26 (discard-ply-limit) COMPLETED** at 200K steps: | |
| - val_loss=3.147, acc=7.85%, top5=27.8%, legal=99.4% | |
| - Best accuracy of all 3 ablations | |
| - Trial 25 (no-outcome) at step 196K — ~25 min remaining. | |
| - Started post-training for trial 26: checkpoint upload + probes on GPU 2. | |
| - Trial 24 probes still running on GPU 0. | |
| ## Final Results (as trials complete) | |
| | Trial | Ablation | Steps | Best val_loss | Best acc | Status | | |
| |-------|----------|-------|--------------|----------|--------| | |
| | 24 | mate-boost | 175K (NaN@177K) | 3.186 | 6.59% | done (NaN) | | |
| | 25 | no-outcome | 196K/200K | TBD | TBD | running | | |
| | 26 | discard-ply-limit | 200K | 3.147 | 7.85% | done | | |
| | 25 | no-outcome | 200K | 3.089 | 6.83% | done | | |
| | 26 | discard-ply-limit | 200K | 3.147 | 7.85% | done | | |
| | — | baseline (pawn-base) | 100K | ~3.06 | ~6.90% | reference | | |
| ### 2026-04-04 04:45 UTC — ALL TRAINING COMPLETE | |
| - Trial 25 (no-outcome) completed at 200K steps: val_loss=3.089, acc=6.83% | |
| - All GPUs idle. Uploading final checkpoints to HF. | |
| - Probes skipped (performance issues — runs on CPU, too slow on pod). | |
| - Drafting final report. | |
| ### 2026-04-03 12:10 UTC — Phase 2 check-in (~5.5h in, ~28%) | |
| | Trial | Ablation | Step | train_loss | acc | LR | | |
| |-------|----------|------|-----------|------|------| | |
| | 24 | mate-boost | 55300 | 3.237 | 6.2% | 8.80e-4 | | |
| | 25 | no-outcome | 53800 | 3.122 | 6.8% | 2.55e-4 | | |
| | 26 | discard-ply-limit | 55200 | 3.214 | 7.0% | 8.80e-4 | | |
| - Cruising. Loss improvement slowing (expected as cosine decay kicks in). | |
| - Relative rankings unchanged: T25 best loss, T26 best acc, T24 trailing slightly. | |
| ### 2026-04-03 14:10 UTC — Phase 2 check-in (~7.5h in, ~37%) | |
| | Trial | Ablation | Step | train_loss | acc | LR | | |
| |-------|----------|------|-----------|------|------| | |
| | 24 | mate-boost | 73600 | 3.204 | 6.6% | 7.73e-4 | | |
| | 25 | no-outcome | 71600 | 3.094 | 7.0% | 2.23e-4 | | |
| | 26 | discard-ply-limit | 73500 | 3.192 | 7.5% | 7.74e-4 | | |
| - Steady. Trial 26 acc now at 7.5%. All losses still slowly decreasing. | |
| - HF bucket synced (metrics + lab notes + transcript). | |
| ### 2026-04-03 17:10 UTC — Phase 2 HALFWAY (~10.5h in, ~50%) | |
| | Trial | Ablation | Step | train_loss | acc | LR | | |
| |-------|----------|------|-----------|------|------| | |
| | 24 | mate-boost | 101600 | 3.206 | 6.5% | 5.75e-4 | | |
| | 25 | no-outcome | 98300 | 3.090 | 6.9% | 1.69e-4 | | |
| | 26 | discard-ply-limit | 101000 | 3.179 | 7.4% | 5.80e-4 | | |
| - **100K milestone reached.** This is where baseline PAWN-Base stopped. | |
| - Baseline reference: PAWN-Base val_loss ~3.06 at 100K steps. | |
| - Trial 25 (no-outcome) at 3.09 loss — nearly matching baseline despite no outcome token. | |
| - Trial 26 (discard-ply-limit) best accuracy at 7.4%. | |
| - LR decaying steadily (cosine). Trials 24/26 at ~5.8e-4, trial 25 at ~1.7e-4. | |
| - ETA unchanged: ~2026-04-04 04:30 UTC. Cost so far: ~$26 (~11.3h × $2.31). | |
Xet Storage Details
- Size:
- 12.6 kB
- Xet hash:
- b1c447b86e7834339925d889251a0a3dedacab8f79b5d138635bea8afab1cb1a
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.