See the blog post!
pythia-1b "reweighted" with a very short continued-pretraining run on a data mixture drawn from the crumbly Horizon dataset. The mixture was found with crumb's "Paying 'Attention' to your Dataset" method, bootstrapped through pythia-70m -> pythia-160m -> pythia-410m at successive stages.
hparams:
- LR: 1e-5
- schedule: cosine, 20% warmup from 0, cooldown to 0
- batch size: 64
- context length: 2048
- everything else you can think of is set to its default value in the HuggingFace Trainer
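The hparams above can be sketched as HF `TrainingArguments` kwargs. This is not the exact training script; the per-device batch size / gradient accumulation split is an assumption (only the effective batch size of 64 is given above):

```python
# Sketch of the continued-pretraining config described above. Everything not
# listed here stays at the HF Trainer defaults. The 8 x 8 split into
# per-device batch size and accumulation steps is an assumption.
training_kwargs = dict(
    learning_rate=1e-5,                 # LR: 1e-5
    lr_scheduler_type="cosine",         # cosine decay, cooldown to 0
    warmup_ratio=0.20,                  # 20% warmup from 0
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,      # 8 * 8 = effective batch size 64
)
# args = TrainingArguments(output_dir="out", **training_kwargs)
```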
validation loss (loss on data not used in training, lower is better)
| model | arxiv | github | books | wiki | webtext |
|---|---|---|---|---|---|
| horizon-pythia-ft-1b | 2.13 | 1.30 | 2.00 | 2.22 | 2.71 |
| pythia-1b | 2.21 | 1.30 | 2.02 | 2.29 | 2.72 |
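Since these are cross-entropy losses in nats per token, they convert to perplexity via `exp(loss)`; a quick sanity check on the arxiv column:

```python
import math

# Convert validation cross-entropy loss (nats/token) to perplexity.
def perplexity(loss: float) -> float:
    return math.exp(loss)

# arxiv losses from the table above
print(round(perplexity(2.13), 2))  # horizon-pythia-ft-1b -> 8.41
print(round(perplexity(2.21), 2))  # pythia-1b           -> 9.12
```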
optimized mixture after 12 training runs (starting from 100 samples per subset):
| subset | documents |
|---|---|
| arxiv | 608 |
| github | 226 |
| books | 613 |
| wiki | 1438 |
| webtext | 8516 |
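The table above can be read as sampling weights; a minimal sketch of drawing training documents proportionally to the optimized mixture (the counts are from the table, the sampling code itself is illustrative):

```python
import random

# Final per-subset document counts from the optimized mixture above.
mixture = {"arxiv": 608, "github": 226, "books": 613, "wiki": 1438, "webtext": 8516}
total = sum(mixture.values())                      # 11401 documents overall
weights = {k: v / total for k, v in mixture.items()}

# e.g. pick subsets for the next 10 documents, proportional to the mixture
subsets = random.choices(list(mixture), weights=list(mixture.values()), k=10)
```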
benchmarks that i can actually run in reasonable time:
| model | arc | truthfulqa | winogrande |
|---|---|---|---|
| horizon-pythia-ft-1b | 28.24 | 41.13 | 53.75 |
| pythia-1b deduped* | 29.1 | 38.94 | 53.59 |
\*the actual (non-deduped) pythia-1b isn't on the leaderboard, and rerunning the eval script would take too long, so the deduped variant stands in as the closest comparison