[Nah] = can't fill that section in right now.

Dillionv2

Summary

Task: Text-Generation
Total training time: 35 hours
Inputs: text
Outputs: text
Params: ~1.3M
Final Loss: 3.078
Important Benchmark Scores:
   1. ARC Easy - 29.63%
   2. BLiMP - 64.96%
   3. HellaSwag - 27.37%
Framework: PyTorch, transformers
Author: Paul Courneya (Harley-ml)

Description

Dillionv2 is the second-generation model in our Dillion SLM family. It improves on v1 on every benchmark reported below except ARC Easy.

What changed

Dillion (v1)	Dillionv2	Why
9B token count	24B token count	More tokens allow the model to see more patterns, improving almost everything.
FineWeb-edu dataset	9-source dataset	FineWeb-edu is edu-filtered and pretty narrow in style. 9 sources allow the model to see more patterns, styles, and non-educational text, improving semantics.
72 hidden size	96 hidden size	72 was too narrow. 96 lets the model capture more complex patterns.
12 num layers	9 num layers	To stay in the parameter budget.
288 intermediate size	288 intermediate size	No change.
3 number of heads	3 number of heads	No change.
3076 vocab size	2564 vocab size	To free up parameters.
SGD optimizer	AdamW optimizer	AdamW's per-parameter adaptive step sizes and decoupled weight decay generally train transformers much better than plain SGD.
Cosine scheduler	WSD scheduler	WSD (warmup-stable-decay) gives a better final loss.
Qwen3.5 architecture	Qwen3.5 architecture	No change.
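
The trade between width, depth, and vocabulary in the table can be checked with quick arithmetic. The sketch below assumes a token embedding table of vocab_size x hidden_size and a gated (SwiGLU-style) MLP with three bias-free projections per layer. That layout is typical for Qwen-family models but is not confirmed by this card. Attention and normalization weights are left out, so these are partial counts, not the full ~1.3M.

```python
# Partial parameter counts for the two configs in the table above.
# ASSUMPTIONS (not stated in the card): one embedding table of
# vocab_size * hidden_size, and a gated MLP with three bias-free
# projections (gate, up, down) in every layer.

def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Size of a single token embedding table."""
    return vocab_size * hidden_size


def gated_mlp_params(hidden_size: int, intermediate_size: int, num_layers: int) -> int:
    """gate + up + down projections, no biases, summed over all layers."""
    return 3 * hidden_size * intermediate_size * num_layers


configs = {
    "Dillion (v1)": dict(vocab_size=3076, hidden_size=72, intermediate_size=288, num_layers=12),
    "Dillionv2": dict(vocab_size=2564, hidden_size=96, intermediate_size=288, num_layers=9),
}

for name, cfg in configs.items():
    emb = embedding_params(cfg["vocab_size"], cfg["hidden_size"])
    mlp = gated_mlp_params(cfg["hidden_size"], cfg["intermediate_size"], cfg["num_layers"])
    print(f"{name}: embedding={emb:,} mlp={mlp:,}")

# What the v2 embedding table would cost with the old v1 vocabulary.
print(f"v2 with 3076 vocab: embedding={embedding_params(3076, 96):,}")
```

Under these assumptions the MLP stacks of v1 and v2 come out the same size, because 72 x 12 = 96 x 9. The smaller vocabulary saves about 49K embedding parameters compared with keeping 3076 tokens at hidden size 96.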

Training

We trained Dillionv2 for one epoch over 24B tokens. Training took a combined 35 hours on an RTX 2060 and two Kaggle T4s, with a batch size of 384 and 2 gradient accumulation steps.
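
For readers unfamiliar with the WSD (warmup-stable-decay) scheduler used for this run, here is a minimal sketch of the schedule shape. The peak learning rate, step counts, and the linear decay are illustrative assumptions. The card does not state the real values, and this is not the actual training code.

```python
# Minimal warmup-stable-decay (WSD) learning-rate schedule.
# All numbers used with it below are HYPOTHETICAL; the card does not
# give the peak LR, warmup length, or decay length.

def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, decay_steps: int, min_lr: float = 0.0) -> float:
    """Return the learning rate for a given optimizer step."""
    # Phase 1: linear warmup from near zero up to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Phase 2: hold the learning rate constant at peak_lr.
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    # Phase 3: linear decay from peak_lr down to min_lr.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    # Hypothetical run: 1000 steps, 100 warmup steps, 200 decay steps.
    for s in (0, 50, 100, 500, 800, 900, 1000):
        print(s, wsd_lr(s, total_steps=1000, peak_lr=1e-3,
                        warmup_steps=100, decay_steps=200))
```

Unlike a cosine schedule, the rate stays flat for most of the run and only drops in the final phase.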

Dataset

The dataset is 34B tokens (we only use the first 24B) and 146GB in total:

FineWeb-edu (35GB): Educational-filtered Common Crawl
DCLM-Edu (20GB): Educational-filtered webtext
The Pile Deduped (20GB): Broad, diverse 22-source dataset
FineWeb-HQ (20GB): Knowledge-filtered webtext
FineMath (13GB): Math-filtered Common Crawl
Cosmopedia-v2 (7GB): Synthetic textbooks
Wikipedia (5GB): you better know what this is
NpSetPython-Edu (3.5GB): Normalized Python code
Misc (600MB): LessWrong + HF configs + HF dataset/model cards

Training results

The final loss was 3.078, which corresponds to a perplexity of about 21.71.
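
Perplexity is the exponential of the mean cross-entropy loss. The short check below assumes the reported loss is a per-token cross-entropy in nats (natural log), which is the PyTorch default but is not stated explicitly in this card.

```python
import math

# ASSUMPTION: the loss is mean per-token cross-entropy measured in nats.
def perplexity(loss_nats: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss_nats)


final_loss = 3.078  # final training loss reported in this card
print(f"perplexity = {perplexity(final_loss):.3f}")
```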

Benchmarks

Benchmark	Dillion	Dillionv2
BLiMP	62.94%	64.96%
ARC Easy (Norm)	31.36%	29.63%
PiQA (Norm)	53.10%	53.16%
SWAG (Norm)	30.36%	32.07%
HellaSwag (Norm)	26.65%	27.37%
ArithMark	24.80%	27.00%
AVG	38.20%	39.03%
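
The AVG row appears to be the unweighted mean of the six scores above it. That is an inference from the numbers, not something the card states. A quick check:

```python
# Recompute the AVG row, ASSUMING it is a plain unweighted mean
# of the six benchmark scores listed in the table.
scores = {
    "Dillion": [62.94, 31.36, 53.10, 30.36, 26.65, 24.80],
    "Dillionv2": [64.96, 29.63, 53.16, 32.07, 27.37, 27.00],
}


def mean_score(values: list[float]) -> float:
    """Unweighted mean, rounded to two decimals like the table."""
    return round(sum(values) / len(values), 2)


for model, values in scores.items():
    print(f"{model}: {mean_score(values):.2f}%")
```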

Dillionv2 shows stronger performance than v1 on every benchmark except ARC Easy. For a comprehensive comparison across many small models, including this one and my others, see AxiomicLab's Open SLM Leaderboard.

Generations

[Nah]

Use Cases

Educational research and learning
Fine-tuning for downstream use
Deployment on edge devices
Or just for fun

Limitations

Doesn't have any!! No!! It does not.. alright fine..

Cannot chat, code, or reason
Short context
Output is not factually reliable

Inference

[Nah]

License

MIT License. Read the license file here.

Citation

Safetensors

Model size: 1.29M params
Tensor type: F32
Repository: Harley-ml/Dillionv2-1.3M