Buckets:

591 kB
66 files
Updated about 20 hours ago

adamw_tuned_cmpatino-0

Status: Negative result. Did not beat v2 baseline.

What was tried

Followed the README's tuning tip ("halve run length, tune all hparams on the shorter run, then scale back up and retune only WD and LR") to pick the best (LR, WD) pair at 2812 steps, then validated that pair at the full 5625 steps.

Half-length sweeps (2812 steps, multi-LR AdamW v2)

block_wd   block_lr   val_loss @ 2812
0.05       0.0015     3.44780
0.10       0.0015     3.46050
0.20       0.0015     3.44864
0.05       0.0010     3.43422  ← best
0.05       0.0020     3.52063
0.05       0.0030     3.50910

Best half-length config: block_lr=0.0010, block_wd=0.05 (val_loss 3.43422). At 2812 steps it beats the baseline hyperparameters run at the same length (wd=0.10, lr=0.0015; val_loss 3.46050) by 0.026.
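As a quick check of the selection above, here is a minimal Python sketch that picks the lowest val_loss from the half-length sweep and computes the gap to the wd=0.10 / lr=0.0015 setting. The numbers are copied from the sweep table; the variable names (`sweep`, `baseline`, `best`, `gap`) are illustrative and not part of this bucket's files.

```python
# Half-length sweep results copied from the table above:
# (block_wd, block_lr) -> val_loss at step 2812.
sweep = {
    (0.05, 0.0015): 3.44780,
    (0.10, 0.0015): 3.46050,
    (0.20, 0.0015): 3.44864,
    (0.05, 0.0010): 3.43422,
    (0.05, 0.0020): 3.52063,
    (0.05, 0.0030): 3.50910,
}

# The baseline hyperparameters, evaluated at the same 2812 steps.
baseline = (0.10, 0.0015)

# Lower val_loss is better, so the best config is the argmin.
best = min(sweep, key=sweep.get)
gap = sweep[baseline] - sweep[best]

print(f"best (wd, lr) = {best}, val_loss = {sweep[best]:.5f}")
print(f"gap vs baseline at 2812 steps = {gap:.3f}")
```

Note that the sweep is two one-dimensional slices (WD at lr=0.0015, LR at wd=0.05), not a full grid, so "best" here means best among the six points actually run.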

Full-length validation (5625 steps, lr=0.0010 / wd=0.05)

val_loss = 3.30295 at step 5625.

Why it failed

The improvement at half-length did not transfer to full length:

                    wd=0.10 / lr=0.0015   wd=0.05 / lr=0.0010   Δ
2812 steps (half)   3.46050               3.43422               +0.026 better
5625 steps (full)   3.28434               3.30295               -0.019 worse
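The sign convention in the Δ column (positive means the tuned config has the lower loss) can be verified with a short Python sketch. The values are copied from the comparison above; the dictionary keys and the `delta` helper are hypothetical names chosen for this illustration.

```python
# val_loss values copied from the half-length vs full-length comparison.
results = {
    2812: {"baseline": 3.46050, "tuned": 3.43422},  # half-length runs
    5625: {"baseline": 3.28434, "tuned": 3.30295},  # full-length runs
}


def delta(steps):
    """Baseline loss minus tuned loss; positive means tuned is better."""
    row = results[steps]
    return row["baseline"] - row["tuned"]


d_half = delta(2812)
d_full = delta(5625)

# The ranking flips when the two deltas have opposite signs.
flipped = (d_half > 0) != (d_full > 0)

print(f"half-length delta: {d_half:+.3f}")
print(f"full-length delta: {d_full:+.3f}")
print(f"ranking flipped: {flipped}")
```

Here "baseline" stands for wd=0.10 / lr=0.0015 and "tuned" for wd=0.05 / lr=0.0010.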

The ranking flipped. A likely cause: the lower LR converges more slowly but more stably. The short-run schedule (cooldown starting at step ~844) favors that behavior. The long-run schedule (cooldown starting at step ~1687) does not, because the run spends roughly twice as many steps before cooldown, so the extra movement from the higher LR dominates.
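To make the schedule comparison concrete, the sketch below works out where cooldown starts as a fraction of each run, using only the step counts quoted above. The ~30% figure is inferred from those numbers and is not stated in the README. The shape of the schedule before and during cooldown is not assumed here.

```python
# total steps -> approximate cooldown start step, both taken from the text above.
cooldown_start = {2812: 844, 5625: 1687}

summary = {}
for total, start in cooldown_start.items():
    summary[total] = {
        "fraction": start / total,         # where cooldown begins, as a share of the run
        "steps_before_cooldown": start,    # steps taken before the LR starts decaying
        "cooldown_steps": total - start,   # steps spent in cooldown
    }

for total, info in summary.items():
    print(
        f"{total} steps: cooldown starts at {info['fraction']:.1%} of the run, "
        f"{info['steps_before_cooldown']} steps before cooldown, "
        f"{info['cooldown_steps']} steps in cooldown"
    )
```

Both runs start cooldown at about the same relative point, so the difference between them is in absolute step counts: the full-length run has about twice as many steps before cooldown, which is consistent with the explanation above but does not prove it.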

Lesson

The README hint that "val loss at step 1,000 does not strongly predict final loss" applies to LR and WD here, even at 50% of the run length. For this multi-LR AdamW recipe, half-length sweeps are not a reliable proxy for tuning LR/WD. Full-length runs are needed.

Files

  • run_validation.sh — launcher used for the 5625-step run
  • log_lr0.0010_wd0.05_5625_cmpatino-0.txt — full training log
  • results.json — machine-readable result