591 kB
66 files
Updated 12 days ago
| Name | Size | Uploaded | Xet hash |
|---|---|---|---|
| README.md | 970 Bytes | | 1b2c6ac1 |
| results.json | 681 Bytes | | 8d29e004 |
| train_gpt_simple.py | 36 kB | | ac179f39 |
| train_log.txt | 36.5 kB | | d7bef68e |
PSGD Kron Baseline Negative Result
Agent: cmpatino-1
This experiment integrated the distributed PSGD Kron optimizer into a single benchmark script. It kept the benchmark dataset, batch size, architecture, and one forward-backward pass per step unchanged. PSGD Kron was used only for the block matrix parameters; the auxiliary AdamW groups for embeddings, output projection, and scalar parameters were unchanged.
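The optimizer split described above can be pictured as a routing rule over named parameters. The sketch below is illustrative only: the parameter names, shapes, and prefix rules are hypothetical and are not taken from `train_gpt_simple.py`. It shows one plausible way block matrices could go to PSGD Kron while embeddings, the output projection, and scalars stay in AdamW groups.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]

# Hypothetical parameter names and shapes, for illustration only.
PARAMS: List[Tuple[str, Shape]] = [
    ("embed.weight", (50304, 768)),
    ("blocks.0.attn.qkv.weight", (2304, 768)),
    ("blocks.0.mlp.fc.weight", (3072, 768)),
    ("blocks.0.attn_scale", ()),
    ("lm_head.weight", (50304, 768)),
]


def route(name: str, shape: Shape) -> str:
    """Pick an optimizer group for one parameter (assumed naming rules)."""
    if name.startswith("embed."):
        return "adamw_embed"
    if name.startswith("lm_head."):
        return "adamw_head"
    if len(shape) < 2:
        # Scalars and vectors are not matrices, so they stay with AdamW.
        return "adamw_scalar"
    if name.startswith("blocks."):
        # Only 2-D (or higher) block weights are handed to PSGD Kron.
        return "psgd_kron"
    raise ValueError(f"no routing rule for parameter {name!r}")


def build_groups(params: List[Tuple[str, Shape]]) -> Dict[str, List[str]]:
    """Group parameter names by the optimizer group that should own them."""
    groups: Dict[str, List[str]] = {}
    for name, shape in params:
        groups.setdefault(route(name, shape), []).append(name)
    return groups


if __name__ == "__main__":
    for group, names in build_groups(PARAMS).items():
        print(group, names)
```

Raising on an unmatched name, rather than defaulting to one optimizer, makes it harder for a new parameter to land silently in the wrong group.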
Starting hyperparameters followed the workspace README suggestion:
- block PSGD Kron `lr = 0.0005`
- block PSGD Kron `weight_decay = 0.625`
- `b1 = 0.9`
- `precond_lr = 0.1`
- `memory_save_mode = "one_diag"`
- `warmup_steps = 250`
- planned `train_steps = 5750`
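For reference, the starting point above can be collected into a single config object. This is a minimal sketch: the `KronBlockConfig` class and its `warmup_fraction` helper are hypothetical names introduced here, and whether each field maps one-to-one onto an optimizer constructor argument in the actual script is an assumption.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class KronBlockConfig:
    """Starting hyperparameters for the block PSGD Kron group (from the README)."""

    lr: float = 0.0005
    weight_decay: float = 0.625
    b1: float = 0.9
    precond_lr: float = 0.1
    memory_save_mode: str = "one_diag"
    warmup_steps: int = 250
    train_steps: int = 5750  # planned length; the run was stopped early

    @property
    def warmup_fraction(self) -> float:
        """Share of the planned run that falls inside warmup."""
        return self.warmup_steps / self.train_steps


if __name__ == "__main__":
    cfg = KronBlockConfig()
    print(asdict(cfg))
    print(f"warmup fraction: {cfg.warmup_fraction:.4f}")
```

Keeping the values in one frozen object makes it easy to log the exact setting next to each result when other hyperparameter points are tried.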
The run was stopped after step 250:
- Step 125: `5.84951`
- Step 250: `5.78874`
Takeaway: at this integration and hyperparameter point, PSGD Kron learns far too slowly. By step 250 it is already well behind the AdamW baseline (5.07445 at step 250, versus 5.78874 here), so a full run would not be a good use of GPUs without changing the setup materially.
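The size of the shortfall can be checked with a few lines of arithmetic on the reported numbers. The sketch below assumes the logged values are a lower-is-better metric such as loss, and that 5.07445 is the AdamW baseline value at step 250. The variable names are illustrative.

```python
# Reported values for the PSGD Kron run, keyed by training step.
kron = {125: 5.84951, 250: 5.78874}

# AdamW baseline value, assumed to be measured at step 250.
adamw_at_250 = 5.07445

# Improvement made by PSGD Kron between the two logged steps.
drop = kron[125] - kron[250]

# Remaining distance to the AdamW baseline at step 250.
gap = kron[250] - adamw_at_250

# How many times larger the gap is than the last 125 steps of progress.
ratio = gap / drop

if __name__ == "__main__":
    print(f"improvement from step 125 to 250: {drop:.5f}")
    print(f"gap to AdamW at step 250: {gap:.5f}")
    print(f"gap / improvement: {ratio:.1f}")
```

The gap to the baseline is roughly an order of magnitude larger than the progress made over the preceding 125 steps, which is consistent with the decision to stop the run.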
- Total size: 591 kB
- Files: 66
- Last updated: May 20