thomas-schweich's picture
|
download
raw
12.6 kB
# Pretraining Ablations Lab Notes
## Experiment: mate-boost, no-outcome, discard-ply-limit
Started: 2026-04-03
Pod: 3x RTX 6000 Ada (48GB each), $2.31/hr
## Phase 1: LR Exploration (5K steps each, no_compile)
### Batch 1: mate-boost (DONE)
| Trial | LR | val_loss | val_acc | top5 | ppl | legal% | Note |
|-------|----|---------|---------|------|-----|--------|------|
| 15 | 1e-4 | 4.712 | 3.6% | 15.5% | 111.3 | 73.7% | Dominated, still in warmup |
| 16 | 3e-4 | 4.019 | 4.4% | 19.3% | 55.6 | 89.1% | Solid |
| **17** | **1e-3** | **3.629** | **5.2%** | **21.6%** | **37.7** | **93.3%** | **Winner** |
### Batch 2: no-outcome (DONE)
| Trial | LR | Warmup | val_loss | val_acc | top5 | ppl | legal% | Note |
|-------|----|--------|---------|---------|------|-----|--------|------|
| 18 | 1e-4 | 5% | 4.585 | 4.3% | 18.4% | 98.0 | 76.6% | Dominated, still in warmup |
| **19** | **3e-4** | **0%** | **3.444** | **5.8%** | **24.3%** | **31.3** | **94.2%** | **Winner** (no-warmup) |
| 20 | 1e-3 | 5% | 3.529 | 5.6% | 24.3% | 34.1 | 94.5% | Close second, still warming |
### Batch 3: discard-ply-limit (DONE)
| Trial | LR | val_loss | val_acc | top5 | ppl | legal% | Note |
|-------|----|---------|---------|------|-----|--------|------|
| 21 | 1e-4 | 4.743 | 3.7% | 16.3% | 114.7 | 72.8% | Dominated, still in warmup |
| 22 | 3e-4 | 4.054 | 4.6% | 20.1% | 57.7 | 87.4% | Middle ground |
| **23** | **1e-3** | **3.617** | **5.4%** | **22.6%** | **37.2** | **93.2%** | **Winner** |
## Phase 2: Full Training (200K steps, torch.compile ON)
| Trial | Ablation | LR | Resumed From | GPU | Status |
|-------|----------|-----|-------------|-----|--------|
| 24 | mate-boost | 1e-3 | trial 17 @ 5K | 0 | RUNNING |
| 25 | no-outcome | 3e-4 | trial 19 @ 5K | 1 | RUNNING |
| 26 | discard-ply-limit | 1e-3 | trial 23 @ 5K | 2 | RUNNING |
torch.compile will take 15-30 min to compile. After that, expect ~2.5 sps → ~22h for 195K steps.
ETA for completion: ~2026-04-04 04:30 UTC.
## Known Issues
- **pause_after_steps not working**: Trials 15-17 continued past 5K steps. Had to kill manually. Will need to kill batch 2 manually too after 5K eval.
## Log
### 2026-04-03 04:38 UTC
- Disabled MPS (was funneling all trials to GPU 0)
- Removed sdpa_math from configs (NVIDIA GPUs use flash attention)
- Launched batch 1: mate-boost x3 LRs
### 2026-04-03 05:22 UTC
- Batch 1 complete. lr=1e-3 dominates across all metrics.
- pause_after_steps bug: trials continued past 5K, had to kill at ~5200.
- Killed batch 1, launched batch 2: no-outcome x3 LRs.
- Trial 19 (lr=3e-4) uses warmup_frac=0 to test no-warmup per user request.
- ETA for batch 2 eval: ~05:57 UTC
### 2026-04-03 05:50 UTC
- New session picked up. Batch 2 (no-outcome) running: trials 18/19/20 at step ~4000.
- Trial 18 (lr=1e-4): step 4000, train_loss=4.876, still in warmup (lr=4e-5), val_loss@2.5K=5.455
- Trial 19 (lr=3e-4, no-warmup): step 3900, train_loss=3.514, lr=3e-4 (flat), val_loss@2.5K=3.710
- Trial 20 (lr=1e-3): step 3900, train_loss=3.700, still warming (lr=3.9e-4), val_loss@2.5K=4.102
- Set up 5-min cron to catch 5K eval, kill trials, launch batch 3.
- Cost so far: $3.36 (1h27m elapsed)
### 2026-04-03 05:57 UTC
- Batch 2 (no-outcome) complete at 5K steps. Results:
- lr=1e-4: val_loss=4.585 (dominated, warmup too slow)
- lr=3e-4 (no warmup): val_loss=3.444, acc=5.8% — WINNER
- lr=1e-3: val_loss=3.529 — close second but still warming up
- Killed all 3, launched batch 3 (discard-ply-limit): trials 21/22/23
- **Note**: discard-ply-limit discards ~60% of games. Watch step times for slowdown.
- ETA for batch 3 eval: ~06:30 UTC
### 2026-04-03 06:39 UTC
- Batch 3 (discard-ply-limit) complete at 5K steps. Results:
- lr=1e-4: val_loss=4.743 (dominated)
- lr=3e-4: val_loss=4.054 (middle)
- lr=1e-3: val_loss=3.617, acc=5.4% — WINNER
- **Phase 1 COMPLETE.** All 3 ablations: lr=1e-3 wins for mate-boost and discard-ply-limit, lr=3e-4 wins for no-outcome.
- Interesting: no-outcome at 3e-4 (val_loss=3.444) beats both other ablations at 1e-3 at 5K steps.
- Launched Phase 2: trials 24/25/26 resuming from 5K checkpoints, torch.compile ON.
- Switched cron from 5-min to hourly (long runs, ~22h remaining).
- Note: 1e-4 consistently dominated across ALL ablations. 5% warmup on 200K steps means lr only reaches 5e-5 by step 5K. Consider shorter warmup for future experiments.
## Phase 1 Winners
| Ablation | Winner Trial | LR | val_loss@5K | acc@5K |
|----------|-------------|-----|-------------|--------|
| mate-boost | 17 | 1e-3 | 3.629 | 5.2% |
| no-outcome | 19 | 3e-4 | 3.444 | 5.8% |
| discard-ply-limit | 23 | 1e-3 | 3.617 | 5.4% |
### 2026-04-03 07:10 UTC — Phase 2 check-in (30 min in)
All 3 trials stable, approaching 10K eval. Grad norms low (<1.2), no anomalies.
| Trial | Ablation | Step | train_loss | acc | LR | g/s |
|-------|----------|------|-----------|------|------|-----|
| 24 | mate-boost | 9600 | 3.407 | 5.8% | 9.6e-4 | 640 |
| 25 | no-outcome | 9400 | 3.276 | 6.3% | 3.0e-4 | 634 |
| 26 | discard-ply-limit | 9500 | 3.399 | 6.1% | 9.5e-4 | 652 |
- Trial 25 (no-outcome) leading — lowest loss, highest acc. Interesting given it uses 3x lower LR.
- Trials 24 & 26 still in warmup (lr=9.5e-4, peak 1e-3 at step 10K). Should accelerate after warmup.
- Cost: ~$6.30 (~2h45m elapsed). Budget on track.
- Fixed `pause_after_steps` and `lab_resume` bugs in code.
- HF bucket synced (metrics only, checkpoints excluded).
### 2026-04-03 08:10 UTC — Phase 2 check-in (~1.5h in, ~9% done)
| Trial | Ablation | Step | train_loss | acc | LR | gn |
|-------|----------|------|-----------|------|------|-----|
| 24 | mate-boost | 18700 | 3.283 | 6.2% | 9.95e-4 | 0.22 |
| 25 | no-outcome | 18300 | 3.160 | 6.7% | 2.94e-4 | 0.53 |
| 26 | discard-ply-limit | 18700 | 3.254 | 6.9% | 9.95e-4 | 0.20 |
- All past warmup (peaked at 1e-3), now in cosine decay. Very stable.
- Trial 25 (no-outcome) still lowest loss. Trial 26 (discard-ply-limit) best accuracy.
- ~2.5 sps sustained. ETA unchanged: ~2026-04-04 04:30 UTC.
### 2026-04-03 09:10 UTC — Phase 2 check-in (~2.5h in, ~14%)
| Trial | Ablation | Step | train_loss | acc | LR | gn |
|-------|----------|------|-----------|------|------|-----|
| 24 | mate-boost | 27800 | 3.242 | 6.4% | 9.81e-4 | 0.13 |
| 25 | no-outcome | 27200 | 3.123 | 7.1% | 2.88e-4 | 0.32 |
| 26 | discard-ply-limit | 27800 | 3.248 | 6.8% | 9.81e-4 | 0.14 |
- Steady progress. Trial 25 still leading on loss; trial 26 competitive on accuracy.
- Grad norms extremely low (0.12-0.32) — stable regime, no risk of divergence.
### 2026-04-03 10:10 UTC — Phase 2 check-in (~3.5h in, ~18%)
| Trial | Ablation | Step | train_loss | acc | LR |
|-------|----------|------|-----------|------|------|
| 24 | mate-boost | 37000 | 3.246 | 6.5% | 9.56e-4 |
| 25 | no-outcome | 36100 | 3.133 | 6.7% | 2.79e-4 |
| 26 | discard-ply-limit | 36900 | 3.191 | 7.4% | 9.56e-4 |
- Trial 26 (discard-ply-limit) now best accuracy at 7.4%, overtaking trial 25.
- Trial 25 still lowest loss but accuracy plateauing. Different distribution?
- HF bucket synced (metrics + lab notes).
### 2026-04-03 11:10 UTC — Phase 2 check-in (~4.5h in, ~23%)
| Trial | Ablation | Step | train_loss | acc | LR |
|-------|----------|------|-----------|------|------|
| 24 | mate-boost | 46100 | 3.213 | 6.5% | 9.22e-4 |
| 25 | no-outcome | 44900 | 3.115 | 6.9% | 2.68e-4 |
| 26 | discard-ply-limit | 46000 | 3.215 | 7.1% | 9.23e-4 |
- Stable. Trial 26 still best accuracy, trial 25 still lowest loss.
- All losses still gradually decreasing — no signs of plateau yet.
### 2026-04-03 20:10 UTC — Phase 2 check-in (~13.5h in, ~65%)
| Trial | Ablation | Step | train_loss | acc | LR |
|-------|----------|------|-----------|------|------|
| 24 | mate-boost | 129300 | 3.200 | 6.5% | 3.74e-4 |
| 25 | no-outcome | 125000 | 3.078 | 6.9% | 1.13e-4 |
| 26 | discard-ply-limit | 128500 | 3.193 | 7.6% | 3.80e-4 |
- Trial 25 val eval at 125K: val_loss=3.097, acc=6.8%, top5=27.8%, legal=99.6%
- Trial 26 train acc now at 7.6-7.8% — clearly best accuracy.
- Accuracy ceiling computation running concurrently on CPUs (~65% done).
- ETA for training: ~04:30 UTC. ETA for ceiling: ~22:00 UTC.
### 2026-04-04 01:10 UTC — Phase 2 check-in (~18.5h in, ~87%)
| Trial | Ablation | Step | train_loss | acc | LR |
|-------|----------|------|-----------|------|------|
| 24 | mate-boost | 174500 | 3.170 | 6.7% | 1.39e-4 |
| 25 | no-outcome | 169800 | 3.082 | 7.2% | 4.49e-5 |
| 26 | discard-ply-limit | 173600 | 3.171 | 7.6% | 1.42e-4 |
- ~26K steps remaining, ETA ~04:00 UTC.
- **Ceiling computation DONE** (1024 rollouts, 88K positions, 5.7h):
- Unconditional: 6.43% [6.36, 6.50]
- MC corrected: 6.68% [6.60, 6.75]
- MC naive: 6.89% [6.81, 6.96]
- Bracket: 0.21pp (was 0.66pp at 128 rollouts)
- PAWN-Base (6.90%) at 103% of corrected ceiling — essentially at theoretical max.
- Results saved to /workspace/data/theoretical_ceiling_1024.json and synced to HF.
### 2026-04-04 02:10 UTC — Trial 24 NaN! Probes started.
- **Trial 24 (mate-boost) hit NaN** between step 175K-180K. Killed at 184K.
- Last good checkpoint: step 175K, val_loss=3.1860, acc=6.59%
- Loss was flat (3.189→3.186 over last 20K steps) — 175K is effectively final.
- Possible cause: bfloat16 AMP instability. Mate-boost games are shorter (~134 ply), concentrating loss on fewer tokens per batch.
- Trials 25/26 healthy at step ~179K/183K.
- **Started post-training for trial 24:**
- Uploading 175K checkpoint to HF bucket (background)
- Running linear probes on GPU 0 (background)
- GPU 0 now doing probes. GPUs 1/2 still training.
### 2026-04-04 04:10 UTC — Trial 26 COMPLETE. Trial 25 finishing.
- **Trial 26 (discard-ply-limit) COMPLETED** at 200K steps:
- val_loss=3.147, acc=7.85%, top5=27.8%, legal=99.4%
- Best accuracy of all 3 ablations
- Trial 25 (no-outcome) at step 196K — ~25 min remaining.
- Started post-training for trial 26: checkpoint upload + probes on GPU 2.
- Trial 24 probes still running on GPU 0.
## Final Results (as trials complete)
| Trial | Ablation | Steps | Best val_loss | Best acc | Status |
|-------|----------|-------|--------------|----------|--------|
| 24 | mate-boost | 175K (NaN@177K) | 3.186 | 6.59% | done (NaN) |
| 25 | no-outcome | 196K/200K | TBD | TBD | running |
| 26 | discard-ply-limit | 200K | 3.147 | 7.85% | done |
| 25 | no-outcome | 200K | 3.089 | 6.83% | done |
| 26 | discard-ply-limit | 200K | 3.147 | 7.85% | done |
| — | baseline (pawn-base) | 100K | ~3.06 | ~6.90% | reference |
### 2026-04-04 04:45 UTC — ALL TRAINING COMPLETE
- Trial 25 (no-outcome) completed at 200K steps: val_loss=3.089, acc=6.83%
- All GPUs idle. Uploading final checkpoints to HF.
- Probes skipped (performance issues — runs on CPU, too slow on pod).
- Drafting final report.
### 2026-04-03 12:10 UTC — Phase 2 check-in (~5.5h in, ~28%)
| Trial | Ablation | Step | train_loss | acc | LR |
|-------|----------|------|-----------|------|------|
| 24 | mate-boost | 55300 | 3.237 | 6.2% | 8.80e-4 |
| 25 | no-outcome | 53800 | 3.122 | 6.8% | 2.55e-4 |
| 26 | discard-ply-limit | 55200 | 3.214 | 7.0% | 8.80e-4 |
- Cruising. Loss improvement slowing (expected as cosine decay kicks in).
- Relative rankings unchanged: T25 best loss, T26 best acc, T24 trailing slightly.
### 2026-04-03 14:10 UTC — Phase 2 check-in (~7.5h in, ~37%)
| Trial | Ablation | Step | train_loss | acc | LR |
|-------|----------|------|-----------|------|------|
| 24 | mate-boost | 73600 | 3.204 | 6.6% | 7.73e-4 |
| 25 | no-outcome | 71600 | 3.094 | 7.0% | 2.23e-4 |
| 26 | discard-ply-limit | 73500 | 3.192 | 7.5% | 7.74e-4 |
- Steady. Trial 26 acc now at 7.5%. All losses still slowly decreasing.
- HF bucket synced (metrics + lab notes + transcript).
### 2026-04-03 17:10 UTC — Phase 2 HALFWAY (~10.5h in, ~50%)
| Trial | Ablation | Step | train_loss | acc | LR |
|-------|----------|------|-----------|------|------|
| 24 | mate-boost | 101600 | 3.206 | 6.5% | 5.75e-4 |
| 25 | no-outcome | 98300 | 3.090 | 6.9% | 1.69e-4 |
| 26 | discard-ply-limit | 101000 | 3.179 | 7.4% | 5.80e-4 |
- **100K milestone reached.** This is where baseline PAWN-Base stopped.
- Baseline reference: PAWN-Base val_loss ~3.06 at 100K steps.
- Trial 25 (no-outcome) at 3.09 loss — nearly matching baseline despite no outcome token.
- Trial 26 (discard-ply-limit) best accuracy at 7.4%.
- LR decaying steadily (cosine). Trials 24/26 at ~5.8e-4, trial 25 at ~1.7e-4.
- ETA unchanged: ~2026-04-04 04:30 UTC. Cost so far: ~$26 (~11.3h × $2.31).

Xet Storage Details

Size:
12.6 kB
·
Xet hash:
b1c447b86e7834339925d889251a0a3dedacab8f79b5d138635bea8afab1cb1a

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.