| Name | Size | Uploaded | Xet hash |
|---|---|---|---|
| README.md | 1.57 kB xet | 474b30c1 | |
| results.json | 966 Bytes xet | e62e7dae | |
| train_gpt_adamw_cmpatino-0.py | 11.1 kB xet | c1b230ab | |
| train_log_cmpatino-0.txt | 304 kB xet | 91797cc6 | |
adamw_baseline_cmpatino-0
Status: Negative result. The run did not reach the 3.28 validation-loss target within 5,625 steps.
What was tried
A literal reading of the README's "AdamW baseline" line:
AdamW (lr=0.0015, wd=0.1, betas=0.9/0.95, warmup=250): 5,625 steps
implemented as a single AdamW group covering all parameters with lr=0.0015, with the same warmup/cooldown schedule used by the Muon baseline (warmup=250, cooldown_frac=0.7).
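As a rough illustration of that schedule, the sketch below computes an LR multiplier from `warmup=250`, `cooldown_frac=0.7`, and 5,625 total steps. The shape is an assumption, not taken from the training script: linear warmup, a flat plateau, then a linear decay to zero over the final 70% of steps. The function name `lr_multiplier` is hypothetical.

```python
def lr_multiplier(step, total_steps=5625, warmup=250, cooldown_frac=0.7):
    """Scale factor applied to the base LR at `step`.

    Assumed shape (not confirmed by the source): linear warmup,
    constant plateau, linear cooldown to 0 at the final step.
    """
    if not 0 <= step <= total_steps:
        raise ValueError(f"step {step} outside [0, {total_steps}]")
    # Step at which the cooldown phase begins (last cooldown_frac of training).
    cooldown_start = total_steps * (1.0 - cooldown_frac)
    if step < warmup:
        # Linear ramp that reaches 1.0 on the last warmup step.
        return (step + 1) / warmup
    if step < cooldown_start:
        return 1.0
    # Linear decay from 1.0 at cooldown_start down to 0.0 at total_steps.
    return (total_steps - step) / (total_steps - cooldown_start)


BASE_LR = 0.0015  # the single LR used in this run

for step in (0, 249, 1000, 3000, 5625):
    print(step, BASE_LR * lr_multiplier(step))
```

With a single parameter group, every parameter sees `BASE_LR * lr_multiplier(step)`. The actual script may decay to a nonzero floor or use a different warmup shape.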
Result
val_loss = 3.39869 at step 5625, about 0.12 above the 3.28 threshold.
Why it failed
Reading the upstream reference log (a63a68d1-...) shows the reference "AdamW baseline" is multi-LR, with two AdamW optimizers:
| Group | LR | wd | betas |
|---|---|---|---|
| embed.weight | 0.3 | 0 | (0.8, 0.95) |
| proj.weight | 1/320 ≈ 0.003125 | 0 | (0.8, 0.95) |
| params with ndim < 2 (biases, RMSNorm gains) | 0.01 | 0 | (0.8, 0.95) |
| blocks.* with ndim ≥ 2 (the "real" target) | 0.0015 | 0.1 | (0.9, 0.95) |
Init also differs: only proj is zeroed; everything else uses the default torch init.
A single LR of 0.0015 is far too small for the embed, proj, and scalar groups (200×, about 2×, and about 6.7× below the reference values, respectively), so those groups never train enough.
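The grouping in the reference log could be expressed roughly as follows. This is a minimal sketch, not the reference implementation: the rule order (exact name match first, then `ndim < 2`, then `blocks.*`), the helper names `assign_group` and `build_param_groups`, and the example parameter names are all assumptions. It works on `(name, ndim)` pairs so it runs without PyTorch.

```python
# Hyperparameters per group, copied from the reference-log table.
GROUP_HPARAMS = {
    "embed":   {"lr": 0.3,     "weight_decay": 0.0, "betas": (0.8, 0.95)},
    "proj":    {"lr": 1 / 320, "weight_decay": 0.0, "betas": (0.8, 0.95)},
    "scalars": {"lr": 0.01,    "weight_decay": 0.0, "betas": (0.8, 0.95)},
    "blocks":  {"lr": 0.0015,  "weight_decay": 0.1, "betas": (0.9, 0.95)},
}


def assign_group(name, ndim):
    """Map one parameter to a group label (rule order is an assumption)."""
    if name == "embed.weight":
        return "embed"
    if name == "proj.weight":
        return "proj"
    if ndim < 2:
        return "scalars"  # biases, RMSNorm gains
    if name.startswith("blocks."):
        return "blocks"
    raise ValueError(f"no group rule for parameter {name!r}")


def build_param_groups(named_ndims):
    """Turn (name, ndim) pairs into a list of group dicts with hyperparameters."""
    members = {label: [] for label in GROUP_HPARAMS}
    for name, ndim in named_ndims:
        members[assign_group(name, ndim)].append(name)
    return [
        {"name": label, "params": names, **GROUP_HPARAMS[label]}
        for label, names in members.items()
    ]


# Hypothetical parameter names, for illustration only.
example = [
    ("embed.weight", 2),
    ("proj.weight", 2),
    ("blocks.0.attn.qkv.weight", 2),
    ("blocks.0.norm.gain", 1),
]
for group in build_param_groups(example):
    print(group["name"], group["lr"], group["weight_decay"], group["params"])
```

In a PyTorch script, the same dictionaries (with tensors in `params` instead of names, and without the extra `name` key if the installed version rejects it) could be passed as parameter groups to `torch.optim.AdamW`, which accepts per-group `lr`, `betas`, and `weight_decay`. How the reference splits these groups across its two optimizers is not shown here.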
Files
- train_gpt_adamw_cmpatino-0.py: single-LR AdamW reproduction
- train_log_cmpatino-0.txt: full training log
- results.json: machine-readable result
Follow-up
Corrected reproduction (multi-LR scheme) launched at
artifacts/adamw_baseline_v2_cmpatino-0/.