See the blog post!
pythia-1b "reweighted" with a very short continued-pretraining run on a data mixture drawn from the crumbly Horizon dataset. The mixture was found with crumb's "Paying 'Attention' to your Dataset" method, bootstrapped through pythia-70m -> pythia-160m -> pythia-410m at successive stages.
hparams:
- LR: 1e-5
- schedule: cosine, 20% warmup from 0, cooldown to 0
- batch size: 64
- context length: 2048
- everything else you can think of is set to its default value in the HuggingFace Trainer
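The hparams above can be sketched as HF `TrainingArguments` kwargs. This is not the exact training script; the per-device batch size / gradient accumulation split is an assumption (only the effective batch size of 64 is given above):

```python
# Sketch of the continued-pretraining config described above. Everything not
# listed here stays at the HF Trainer defaults. The 8 x 8 split into
# per-device batch size and accumulation steps is an assumption.
training_kwargs = dict(
    learning_rate=1e-5,                 # LR: 1e-5
    lr_scheduler_type="cosine",         # cosine decay, cooldown to 0
    warmup_ratio=0.20,                  # 20% warmup from 0
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,      # 8 * 8 = effective batch size 64
)
# args = TrainingArguments(output_dir="out", **training_kwargs)
```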
validation loss (loss on data not used in training, lower is better)
| model | arxiv | github | books | wiki | webtext |
|---|---|---|---|---|---|
| horizon-pythia-ft-1b | 2.13 | 1.30 | 2.00 | 2.22 | 2.71 |
| pythia-1b | 2.21 | 1.30 | 2.02 | 2.29 | 2.72 |
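Since these are cross-entropy losses in nats per token, they convert to perplexity via `exp(loss)`; a quick sanity check on the arxiv column:

```python
import math

# Convert validation cross-entropy loss (nats/token) to perplexity.
def perplexity(loss: float) -> float:
    return math.exp(loss)

# arxiv losses from the table above
print(round(perplexity(2.13), 2))  # horizon-pythia-ft-1b -> 8.41
print(round(perplexity(2.21), 2))  # pythia-1b           -> 9.12
```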
optimized mixture after 12 training runs (starting from 100 samples per subset):
| subset | documents |
|---|---|
| arxiv | 608 |
| github | 226 |
| books | 613 |
| wiki | 1438 |
| webtext | 8516 |
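The table above can be read as sampling weights; a minimal sketch of drawing training documents proportionally to the optimized mixture (the counts are from the table, the sampling code itself is illustrative):

```python
import random

# Final per-subset document counts from the optimized mixture above.
mixture = {"arxiv": 608, "github": 226, "books": 613, "wiki": 1438, "webtext": 8516}
total = sum(mixture.values())                      # 11401 documents overall
weights = {k: v / total for k, v in mixture.items()}

# e.g. pick subsets for the next 10 documents, proportional to the mixture
subsets = random.choices(list(mixture), weights=list(mixture.values()), k=10)
```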
benchmarks that i can actually run in reasonable time:
| model | arc | truthfulqa | winogrande |
|---|---|---|---|
| horizon-pythia-ft-1b | 28.24 | 41.13 | 53.75 |
| pythia-1b deduped* | 29.1 | 38.94 | 53.59 |
\*the actual (non-deduped) pythia-1b isn't on the leaderboard, and rerunning the eval script would take too long, so the deduped variant stands in as the closest comparison