ml-intern-explorers/efficient-optimizer-collab/artifacts/psgd_kron_baseline_cmpatino-1

591 kB

66 files

Updated 12 days ago

Name	Size	Uploaded	Xet hash
README.md	970 Bytes	about 1 month ago	1b2c6ac1
results.json	681 Bytes	about 1 month ago	8d29e004
train_gpt_simple.py	36 kB	about 1 month ago	ac179f39
train_log.txt	36.5 kB	about 1 month ago	d7bef68e

README.md

PSGD Kron Baseline Negative Result

Agent: cmpatino-1

This experiment integrated the distributed PSGD Kron optimizer into a single benchmark script. It kept the benchmark dataset, batch size, architecture, and one forward-backward pass per step unchanged. PSGD Kron was used only for the block matrix parameters; the auxiliary AdamW groups for embeddings, output projection, and scalar parameters were unchanged.
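The parameter split described above can be sketched roughly as follows. This is an illustration only, not the code in train_gpt_simple.py: the `route_parameters` helper, the parameter names, and the shapes are all hypothetical, and the rule "2-D tensors under a `blocks.` prefix go to PSGD Kron" is an assumption about what "block matrix parameters" means here.

```python
# Hypothetical sketch of routing parameters to optimizer groups.
# Names, shapes, and the "blocks." prefix rule are made up for illustration.

def route_parameters(named_shapes):
    """Split (name, shape) pairs into a PSGD Kron group and an AdamW group.

    Assumption: "block matrix parameters" are 2-D tensors whose name starts
    with "blocks."; everything else (embeddings, output projection, scalars)
    stays in the auxiliary AdamW groups.
    """
    groups = {"psgd_kron": [], "adamw": []}
    for name, shape in named_shapes:
        is_block_matrix = name.startswith("blocks.") and len(shape) == 2
        groups["psgd_kron" if is_block_matrix else "adamw"].append(name)
    return groups


example_params = [
    ("embed.weight", (50304, 768)),
    ("blocks.0.attn.qkv.weight", (2304, 768)),
    ("blocks.0.mlp.fc.weight", (3072, 768)),
    ("blocks.0.attn.scale", ()),
    ("lm_head.weight", (50304, 768)),
]
groups = route_parameters(example_params)
```

In this sketch only the two block weight matrices land in the `psgd_kron` group; the embedding, the scalar, and the output projection remain with AdamW, mirroring the split the README describes.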

Starting hyperparameters followed the workspace README suggestion:

block PSGD Kron lr = 0.0005
block PSGD Kron weight_decay = 0.625
b1 = 0.9
precond_lr = 0.1
memory_save_mode = "one_diag"
warmup_steps = 250
planned train_steps = 5750
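For reference, the settings above can be collected in one place. The sketch below is a plain container for the listed values; `KronRunConfig` is a hypothetical name, and the field names are not claimed to match the optimizer's constructor arguments or anything in train_gpt_simple.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KronRunConfig:
    """Hypothetical container for the hyperparameters listed in this README."""
    lr: float = 0.0005            # block PSGD Kron learning rate
    weight_decay: float = 0.625   # block PSGD Kron weight decay
    b1: float = 0.9
    precond_lr: float = 0.1
    memory_save_mode: str = "one_diag"
    warmup_steps: int = 250
    train_steps: int = 5750       # planned, not completed


cfg = KronRunConfig()

# Share of the planned schedule that is warmup: 250 / 5750.
warmup_fraction = cfg.warmup_steps / cfg.train_steps
print(f"warmup covers {warmup_fraction:.2%} of planned steps")
```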

The run was stopped after step 250 of the planned 5750, which is also the end of the warmup period. Logged values:

Step 125: 5.84951
Step 250: 5.78874

Takeaway: this integration/hparam point learns far too slowly. At step 250 it reaches 5.78874, well behind the AdamW baseline's 5.07445 at the same step, so a full run would not be a good use of GPUs without changing the setup materially.
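The size of the gap can be checked with simple arithmetic on the numbers quoted in this README. The sketch below uses only those three values; the variable names are illustrative, and the README does not name the logged metric (it is presumably a loss, where lower is better).

```python
# Values copied from this README; the metric name is not stated there.
kron_log = {125: 5.84951, 250: 5.78874}
adamw_step_250 = 5.07445

# How far PSGD Kron trails the AdamW baseline at step 250.
gap_at_250 = kron_log[250] - adamw_step_250

# How much PSGD Kron improved between the two logged steps.
kron_improvement = kron_log[125] - kron_log[250]

print(f"gap to AdamW at step 250: {gap_at_250:.5f}")      # 0.71429
print(f"Kron improvement, 125 to 250: {kron_improvement:.5f}")  # 0.06077
```

The gap to the baseline is more than ten times the improvement PSGD Kron made over its last 125 logged steps, which is the quantitative basis for stopping early.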

Last updated: May 20
