A modified GPT-2 architecture with 25M non-embedding parameters, no bias terms, layer normalization on the embeddings (embedding-ln), scaled sinusoidal position embeddings, and a modification that runs the model's transformer stack over the sequence four times before the language modelling head.
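The control flow described above can be sketched in plain Python. This is a minimal illustration, not the model's actual implementation: the function names (`scaled_sin_positions`, `layer_norm`, `forward`), the `scale` multiplier, the placement of the layer norm before the position embeddings are added, and the use of callables as stand-ins for transformer blocks are all assumptions made for the example.

```python
import math

def scaled_sin_positions(seq_len, d_model, scale=1.0):
    """Standard sinusoidal position table multiplied by `scale`.

    Assumption: "scaled" means a scalar multiplier on the sin/cos table.
    """
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            value = math.sin(angle) if i % 2 == 0 else math.cos(angle)
            row.append(scale * value)
        table.append(row)
    return table

def layer_norm(vec, eps=1e-5):
    """Layer norm without learned gain or bias, to keep the sketch small."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def forward(token_vectors, blocks, n_passes=4, pos_scale=1.0):
    """Normalize embeddings, add positions, then run the block stack n_passes times."""
    seq_len, d_model = len(token_vectors), len(token_vectors[0])
    positions = scaled_sin_positions(seq_len, d_model, pos_scale)
    # Assumption: layer norm is applied to token embeddings before adding positions.
    hidden = [
        [t + p for t, p in zip(layer_norm(tok), pos_row)]
        for tok, pos_row in zip(token_vectors, positions)
    ]
    # The same stack of blocks (same weights) is reused on every pass.
    for _ in range(n_passes):
        for block in blocks:
            hidden = block(hidden)
    return hidden  # in a real model this would feed the language modelling head
```

Reusing the same blocks on every pass gives four times the effective depth without adding parameters, which is consistent with the 25M non-embedding parameter count quoted above.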

model	avg	arc	hellaswag	mmlu	truthfulqa
horizon-25m-v0	30.625	20.22	26.23	25.9	50.15
cramp-25m	30.57	21.76	27.35	25.53	47.66
gpt2	30.06	22.1	31.6	25.86	40.67
pythia 70m deduped	30.25	21.08	27.17	25.26	47.51
pythia 70m	30.46	21.59	27.29	25.9	47.06
pythia 160m deduped	31.16	24.06	30.34	24.95	44.34
pythia 160m	30.58	22.78	30.34	24.95	44.26
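The avg column appears to be the unweighted mean of the four benchmark scores. The short check below recomputes it from the table, under that assumption; the `rows` dictionary and the 0.01 tolerance are choices made for this example. With these numbers, every row agrees with its reported average to within rounding except pythia 160m deduped, whose four listed scores average to about 30.92 rather than 31.16.

```python
# Scores copied from the table: (reported avg, arc, hellaswag, mmlu, truthfulqa)
rows = {
    "horizon-25m-v0": (30.625, 20.22, 26.23, 25.9, 50.15),
    "cramp-25m": (30.57, 21.76, 27.35, 25.53, 47.66),
    "gpt2": (30.06, 22.1, 31.6, 25.86, 40.67),
    "pythia 70m deduped": (30.25, 21.08, 27.17, 25.26, 47.51),
    "pythia 70m": (30.46, 21.59, 27.29, 25.9, 47.06),
    "pythia 160m deduped": (31.16, 24.06, 30.34, 24.95, 44.34),
    "pythia 160m": (30.58, 22.78, 30.34, 24.95, 44.26),
}

def recomputed_avg(scores):
    """Unweighted mean of the benchmark scores (assumed definition of avg)."""
    return sum(scores) / len(scores)

# Collect rows whose reported avg differs from the recomputed mean by more than 0.01.
mismatches = [
    name
    for name, (reported, *scores) in rows.items()
    if abs(recomputed_avg(scores) - reported) > 0.01
]
print(mismatches)
```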

Dataset (Horizon-v0)

Source	Documents
arxiv	8.78k
github	8.82k
books	10k
wiki	14.67k
openwebtext v2	30.73k
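For a sense of the mix, the per-source counts can be summed and converted to shares. This is a small sketch over the numbers in the table; the `documents_k` name is hypothetical, and the counts are taken at the two-decimal precision given above.

```python
# Document counts in thousands, copied from the table above.
documents_k = {
    "arxiv": 8.78,
    "github": 8.82,
    "books": 10.0,
    "wiki": 14.67,
    "openwebtext v2": 30.73,
}

total_k = sum(documents_k.values())  # about 73k documents in total
shares = {source: count / total_k for source, count in documents_k.items()}

for source, share in shares.items():
    print(f"{source}: {share:.1%}")
```

On these figures, openwebtext v2 is the largest source, at a little over two fifths of the documents.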
