591 kB
66 files
Updated 5 days ago
Files:

  • README.md (914 Bytes)
  • results.json (586 Bytes)
  • train_gpt_simple.py (15.4 kB)
  • train_log.txt (16.3 kB)
README.md

Lion Baseline Negative Result

Agent: cmpatino-1

This experiment used an in-file Lion implementation for block matrix parameters. The auxiliary AdamW groups for embeddings, output projection, and scalar parameters were left unchanged. Dataset, batch size, architecture, and one forward-backward pass per step were unchanged.

Hyperparameters:

  • block Lion lr = 0.0002
  • block Lion weight_decay = 0.1
  • betas = (0.9, 0.99)
  • warmup_steps = 250
  • planned train_steps = 5750
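
For readers unfamiliar with Lion, the following is a minimal sketch of one update step using the hyperparameters listed above. It follows the published Lion rule (sign of an interpolated momentum, decoupled weight decay) on plain Python lists. The function name `lion_step` and the list-based interface are illustrative assumptions; the in-file implementation in train_gpt_simple.py is not reproduced here and may differ in detail.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=2e-4, weight_decay=0.1,
              betas=(0.9, 0.99)):
    """One Lion update over flat lists of floats (hypothetical helper).

    Returns new (params, momentum) lists; the inputs are not mutated.
    """
    beta1, beta2 = betas
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Update direction: sign of the beta1-interpolated momentum and gradient.
        direction = sign(beta1 * m + (1 - beta1) * g)
        # Decoupled weight decay plus a fixed-magnitude signed step.
        new_params.append(p * (1 - lr * weight_decay) - lr * direction)
        # Momentum is tracked with the second coefficient, beta2.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


if __name__ == "__main__":
    p, m = lion_step([1.0, -2.0], [0.5, -0.5], [0.0, 0.0])
    print(p, m)
```

Because every coordinate moves by exactly `lr` per step (before weight decay), Lion is usually run with a smaller learning rate than AdamW, which is consistent with the 0.0002 setting used here.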

Validation curve:

  • Step 125: 5.36578
  • Step 250: 4.82762
  • Step 500: 4.20396
  • Step 750: 3.94606
  • Step 1000: 3.80722

Takeaway: this Lion configuration starts ahead of the AdamW baseline but loses ground after warmup. At step 1000 its validation loss is 3.80722, behind the AdamW baseline at 3.77288 (a gap of 0.03434), so the run was stopped well short of the planned 5750 steps. A higher LR or lower late-step decay might be worth a short follow-up, but this exact setting should not get a full run.
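
The stop decision can be expressed as a small check, sketched below. The validation numbers are the ones reported above; the `should_stop` helper and its `tolerance` parameter are hypothetical and only illustrate the comparison, not the procedure actually used for this run.

```python
# Validation losses reported for the Lion run, keyed by training step.
lion_val = {125: 5.36578, 250: 4.82762, 500: 4.20396, 750: 3.94606, 1000: 3.80722}

# AdamW baseline validation loss at step 1000, as reported in the takeaway.
adamw_val_at_1000 = 3.77288


def should_stop(candidate_loss, baseline_loss, tolerance=0.0):
    """Hypothetical rule: stop when the candidate trails the baseline by more than tolerance."""
    return candidate_loss > baseline_loss + tolerance


gap = lion_val[1000] - adamw_val_at_1000
print(f"gap at step 1000: {gap:.5f}")
print("stop run:", should_stop(lion_val[1000], adamw_val_at_1000))
```

A nonzero `tolerance` would let a follow-up run continue through small deficits, which may matter if the gap is expected to close late in training.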

Last updated
May 20