See the blog post!

pythia-1b, "reweighted" with a very short continued-pretraining run on a data mix from the crumbly Horizon dataset. The mixture was found via crumb's "Paying 'Attention' to your Dataset" method, applied in stages to pythia-70m -> pythia-160m -> pythia-410m.

hparams:

LR: 1e-5
SCHEDULE: cosine with 20% warmup from 0, cooldown to 0
BS: 64
CTX: 2048
everything else you can think of is set to its default value in the Hugging Face Trainer
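
The schedule above (linear warmup from 0 over the first 20% of steps, then cosine cooldown to 0) can be sketched as a plain function. This is a sketch of the stated schedule, not the exact Trainer internals; `total_steps` and the function name are illustrative.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.2):
    """LR at a given step: linear warmup from 0 to peak_lr over the
    first warmup_frac of training, then cosine decay back to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # linear warmup from 0
        return peak_lr * step / warmup_steps
    # cosine cooldown to 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In the Hugging Face Trainer the equivalent is roughly `lr_scheduler_type="cosine"` with `warmup_ratio=0.2` in `TrainingArguments`.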

validation loss (loss on data not used in training, lower is better)

| model | arxiv | github | books | wiki | webtext |
|---|---|---|---|---|---|
| horizon-pythia-ft-1b | 2.13 | 1.30 | 2.00 | 2.22 | 2.71 |
| pythia-1b | 2.21 | 1.30 | 2.02 | 2.29 | 2.72 |

optimized mixture after 12 training runs, starting with 100 samples each:

| subset | documents |
|---|---|
| arxiv | 608 |
| github | 226 |
| books | 613 |
| wiki | 1438 |
| webtext | 8516 |
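
Sampling training documents proportionally to these counts can be sketched as below; the counts come straight from the table, while the function name and use of `random.choices` are my own illustration, not the method's actual implementation.

```python
import random

# Optimized mixture from the table above (documents per subset).
MIXTURE = {
    "arxiv": 608,
    "github": 226,
    "books": 613,
    "wiki": 1438,
    "webtext": 8516,
}

def sample_subsets(k, seed=0):
    """Draw k subset names, each with probability proportional
    to its document count in the optimized mixture."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    return rng.choices(names, weights=weights, k=k)
```

Note how heavily the optimization weights webtext: 8516 of the 11401 total documents, i.e. about 75% of the mix.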

benchmarks that i can actually run in reasonable time:

| model | arc | truthfulqa | winogrande |
|---|---|---|---|
| horizon-pythia-ft-1b | 28.24 | 41.13 | 53.75 |
| pythia-1b deduped* | 29.1 | 38.94 | 53.59 |
*the actual (non-deduped) pythia-1b isn't on the leaderboard, and I didn't want to rerun the eval script just for it, so the deduped variant's numbers are shown instead.