ml-intern-explorers/efficient-optimizer-collab / artifacts /adamw_baseline_cmpatino-0

591 kB

66 files

Updated 6 days ago

Name	Size	Uploaded	Xet hash
README.md	1.57 kB xet	26 days ago	474b30c1
results.json	966 Bytes xet	26 days ago	e62e7dae
train_gpt_adamw_cmpatino-0.py	11.1 kB xet	26 days ago	c1b230ab
train_log_cmpatino-0.txt	304 kB xet	26 days ago	91797cc6

README.md

adamw_baseline_cmpatino-0

Status: Negative result. val_loss did not reach the 3.28 threshold in 5625 steps.

What was tried

A literal reading of the README's "AdamW baseline" line:

AdamW (lr=0.0015, wd=0.1, betas=0.9/0.95, warmup=250): 5,625 steps

implemented as a single AdamW group covering all parameters at lr=0.0015, using the same warmup/cooldown schedule as the Muon baseline (warmup=250, cooldown_frac=0.7).
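As a rough illustration of what such a schedule can look like, the sketch below computes an LR multiplier from warmup=250 and cooldown_frac=0.7. The shape is an assumption, not taken from the training script: linear warmup, a flat plateau, then a linear decay to zero over the final cooldown_frac of the run. The function name `lr_multiplier` is hypothetical.

```python
def lr_multiplier(step: int, total_steps: int = 5625,
                  warmup: int = 250, cooldown_frac: float = 0.7) -> float:
    """Scale factor applied to the base LR at a given step.

    Assumed shape (not confirmed by the source): linear warmup,
    constant plateau, linear cooldown to zero.
    """
    if step < warmup:
        # Linear ramp from 1/warmup up to 1.0.
        return (step + 1) / warmup
    # First step of the cooldown phase.
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < cooldown_start:
        return 1.0
    # Linear decay from 1.0 at cooldown_start to 0.0 at total_steps.
    return max(0.0, (total_steps - step) / (total_steps - cooldown_start))


base_lr = 0.0015
for step in (0, 249, 1000, 4000, 5625):
    print(step, base_lr * lr_multiplier(step))
```

Under this assumed shape, the single group only sees the full lr=0.0015 for a minority of the run, since the cooldown covers the last 70% of steps.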

Result

val_loss = 3.39869 at step 5625, about 0.119 above the 3.28 threshold.

Why it failed

The upstream reference log (a63a68d1-...) shows that the reference "AdamW baseline" is multi-LR, with two AdamW optimizers covering four parameter groups:

Group	LR	wd	betas
`embed.weight`	0.3	0	(0.8, 0.95)
`proj.weight`	1/320 ≈ 0.003125	0	(0.8, 0.95)
params with ndim < 2 (biases, RMSNorm gains)	0.01	0	(0.8, 0.95)
`blocks.*` with ndim ≥ 2 (the "real" target)	0.0015	0.1	(0.9, 0.95)

Init also differs: only proj is zeroed; everything else uses the default torch init.

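A single LR of 0.0015 is too small for the non-block groups: 200x below the reference LR for embed.weight, about 6.7x below for the ndim < 2 params, and about 2x below for proj.weight. Those groups never train enough.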
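The grouping in the table above can be sketched as a small lookup from parameter name and dimensionality to AdamW hyperparameters. This is a minimal sketch, not the reference code: the function name `reference_hparams`, the matching rules, and the example parameter names under `blocks.` are all assumptions made for illustration.

```python
SINGLE_LR = 0.0015  # the LR this run applied to every parameter


def reference_hparams(name: str, ndim: int) -> dict:
    """Map one parameter to the hyperparameters listed in the table.

    The matching order is an assumption: named groups first, then
    anything with ndim < 2, then the remaining blocks.* matrices.
    """
    if name == "embed.weight":
        return {"lr": 0.3, "weight_decay": 0.0, "betas": (0.8, 0.95)}
    if name == "proj.weight":
        return {"lr": 1 / 320, "weight_decay": 0.0, "betas": (0.8, 0.95)}
    if ndim < 2:  # biases, RMSNorm gains
        return {"lr": 0.01, "weight_decay": 0.0, "betas": (0.8, 0.95)}
    if name.startswith("blocks."):
        return {"lr": 0.0015, "weight_decay": 0.1, "betas": (0.9, 0.95)}
    raise ValueError(f"no group rule for {name!r} (ndim={ndim})")


# Hypothetical parameter names, used only to exercise each rule.
examples = [
    ("embed.weight", 2),
    ("proj.weight", 2),
    ("blocks.0.norm.weight", 1),
    ("blocks.0.attn.weight", 2),
]
for name, ndim in examples:
    hp = reference_hparams(name, ndim)
    ratio = hp["lr"] / SINGLE_LR  # how far the single-LR run undershoots
    print(f"{name}: reference lr={hp['lr']:.6f}, {ratio:.1f}x the single LR")
```

Dicts of this shape, with a `params` entry added, are what `torch.optim.AdamW` accepts as per-group options. The reference splits the groups across two optimizers, which matches the two distinct betas settings in the table.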

Files

train_gpt_adamw_cmpatino-0.py — single-LR AdamW reproduction
train_log_cmpatino-0.txt — full training log
results.json — machine-readable result

Follow-up

Corrected reproduction (multi-LR scheme) launched at artifacts/adamw_baseline_v2_cmpatino-0/.


Last updated: May 20
