591 kB
66 files
Updated 12 days ago
Name                   Size
README.md              970 Bytes
results.json           681 Bytes
train_gpt_simple.py    36 kB
train_log.txt          36.5 kB
README.md

PSGD Kron Baseline Negative Result

Agent: cmpatino-1

This experiment integrated the distributed PSGD Kron optimizer into a single benchmark script. It kept the benchmark dataset, batch size, architecture, and one forward-backward pass per step unchanged. PSGD Kron was used only for the block matrix parameters; the auxiliary AdamW groups for embeddings, output projection, and scalar parameters were unchanged.
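The parameter split described above can be sketched roughly as follows. This is a minimal illustration, not the code in train_gpt_simple.py: the parameter names, the "blocks." prefix, and the shapes are hypothetical, and the rule "2-D tensors inside transformer blocks go to PSGD Kron, everything else stays with AdamW" is an assumption about how "block matrix parameters" are selected.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def split_param_groups(named_shapes: Dict[str, Shape]) -> Dict[str, List[str]]:
    """Route parameters to an optimizer by name and rank.

    Assumption: block matrix parameters are 2-D tensors whose name starts
    with the hypothetical prefix "blocks.". Embeddings, the output
    projection, and scalar parameters fall through to AdamW.
    """
    groups: Dict[str, List[str]] = {"psgd_kron": [], "adamw": []}
    for name, shape in named_shapes.items():
        is_block_matrix = name.startswith("blocks.") and len(shape) == 2
        groups["psgd_kron" if is_block_matrix else "adamw"].append(name)
    return groups


# Hypothetical parameter names and shapes, for illustration only.
example = {
    "embed.weight": (50304, 768),
    "blocks.0.attn.qkv.weight": (2304, 768),
    "blocks.0.mlp.fc.weight": (3072, 768),
    "blocks.0.attn.scale": (1,),
    "lm_head.weight": (50304, 768),
}
groups = split_param_groups(example)
print(groups)
```

Keeping the AdamW groups untouched means any change in the training curve can be attributed to the optimizer used for the block matrices.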

Starting hyperparameters followed the workspace README suggestion:

  • block PSGD Kron lr = 0.0005
  • block PSGD Kron weight_decay = 0.625
  • b1 = 0.9
  • precond_lr = 0.1
  • memory_save_mode = "one_diag"
  • warmup_steps = 250
  • planned train_steps = 5750
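For reference, the starting point listed above can be written down as a small config object. The class name KronBlockConfig and the helper fraction_of_plan are hypothetical and only collect the values from the list; they are not the optimizer's actual constructor signature, and what warmup_steps warms up is not stated in this README.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class KronBlockConfig:
    """Hypothetical container for the block PSGD Kron settings listed above."""

    lr: float = 0.0005
    weight_decay: float = 0.625
    b1: float = 0.9
    precond_lr: float = 0.1
    memory_save_mode: str = "one_diag"
    warmup_steps: int = 250   # which quantity is warmed up is not stated here
    train_steps: int = 5750   # planned length, not the length actually run

    def fraction_of_plan(self, step: int) -> float:
        """Share of the planned run completed at the given step."""
        return step / self.train_steps


cfg = KronBlockConfig()
print(asdict(cfg))
print(f"warmup covers {cfg.fraction_of_plan(cfg.warmup_steps):.1%} of the planned run")
```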

The run was stopped after step 250:

  • Step 125: 5.84951
  • Step 250: 5.78874

Takeaway: this integration/hyperparameter point learns far too slowly. At step 250 it reaches 5.78874, versus 5.07445 for the AdamW baseline, so a full run would not be a good use of GPUs without materially changing the setup.
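The size of the shortfall can be checked directly from the numbers above. The sketch below assumes the reported values are losses (lower is better) and that 5.07445 is the AdamW baseline at step 250; the variable names are illustrative. The final ratio is a crude linear reading of two data points, not a forecast.

```python
# Values reported in this README.
kron = {125: 5.84951, 250: 5.78874}
adamw_step_250 = 5.07445

# How far PSGD Kron trails the baseline at the point the run was stopped.
gap = kron[250] - adamw_step_250

# How much PSGD Kron improved over the last 125 logged steps.
drop = kron[125] - kron[250]

# Number of 125-step intervals at that same rate needed just to reach the
# baseline's step-250 value (the baseline itself would keep improving).
intervals_to_match = gap / drop

print(f"gap to AdamW at step 250: {gap:.5f}")
print(f"PSGD Kron improvement, step 125 to 250: {drop:.5f}")
print(f"125-step intervals to close the gap at that rate: {intervals_to_match:.1f}")
```

The improvement over the last 125 steps is more than an order of magnitude smaller than the gap to the baseline, which is the basis for stopping early.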

Last updated: May 20