[Nah] = can't fill that section in right now.
Dillionv2
Summary
Task: Text-Generation
Total training time: 35 hours
Inputs: text
Outputs: text
Params: ~1.3M
Final Loss: 3.078
Important Benchmark Scores:
1. ARC Easy - 29.63%
2. BLiMP - 64.96%
3. HellaSwag - 27.37%
Framework: PyTorch, transformers
Author: Paul Courneya (Harley-ml)
Description
Dillionv2 is the second-generation model in the Dillion SLM family. It improves on v1 in everything except ARC Easy.
What changed
| Dillion (v1) | Dillionv2 | Why |
|---|---|---|
| 9B token count | 24B token count | More tokens allow the model to see more patterns, improving almost everything. |
| FineWeb-edu dataset | 9-source dataset | FineWeb-edu is edu-filtered and pretty narrow in style. 9 sources allow the model to see more patterns, styles, and non-educational text, improving semantics. |
| 72 hidden size | 96 hidden size | 72 was too narrow. 96 lets the model capture more complex patterns. |
| 12 num layers | 9 num layers | To stay within the parameter budget. |
| 288 intermediate size | 288 intermediate size | No change. |
| 3 number of heads | 3 number of heads | No change. |
| 3076 vocab size | 2564 vocab size | To free up parameters. |
| SGD optimizer | AdamW optimizer | AdamW adapts the step size per parameter and typically trains transformers much better than SGD. |
| Cosine scheduler | WSD (warmup-stable-decay) scheduler | WSD gives a better final loss. |
| Qwen3.5 architecture | Qwen3.5 architecture | No change. |
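
For readers unfamiliar with WSD, here is a minimal sketch of a warmup-stable-decay learning-rate schedule: linear warmup, a constant plateau, then a linear decay at the end. The function name, the step counts, the peak learning rate, and the linear decay shape are all illustrative assumptions, not the values or exact shape used to train Dillionv2.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-stable-decay schedule (illustrative; not Dillionv2's actual settings)."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: linear ramp from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Hypothetical run: 1000 steps, 10 warmup steps, 100 decay steps.
schedule = [wsd_lr(s, 1000, 1e-3, 10, 100) for s in range(1001)]
```

Unlike a cosine schedule, the learning rate stays at its peak for most of the run and only drops in the final decay window.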
Training
We trained Dillionv2 for one epoch on 24B tokens. Training took a combined total of 35 hours on an RTX 2060 and two Kaggle T4s, with a batch size of 384 and 2 gradient accumulation steps.
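
As a rough sanity check on those numbers, the sketch below works out the effective batch per optimizer step and the average throughput. It assumes the batch size of 384 is the per-step batch before accumulation and treats the 35 hours as total run time; neither detail is spelled out above, so read the results as estimates.

```python
batch_size = 384          # from the training setup above
grad_accum_steps = 2      # gradient accumulation steps
total_tokens = 24e9       # tokens seen in one epoch
total_hours = 35          # combined training time

# Gradients from `grad_accum_steps` batches are summed before each optimizer step.
effective_batch = batch_size * grad_accum_steps

# Average throughput over the whole run (assumes 35 h is total run time).
tokens_per_second = total_tokens / (total_hours * 3600)

print(f"effective batch per optimizer step: {effective_batch}")
print(f"average throughput: {tokens_per_second:,.0f} tokens/s")
```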
Dataset
The full dataset is 34B tokens and 146GB in total; we train on only the first 24B tokens:
- FineWeb-edu (35GB): Educational-filtered Common Crawl
- DCLM-Edu (20GB): Educational-filtered webtext
- The Pile Deduped (20GB): Broad, diverse 22-source dataset
- FineWeb-HQ (20GB): Knowledge-filtered Webtext
- FineMath (13GB): Math-filtered Common Crawl
- Cosmopedia-v2 (7GB): Synthetic textbooks
- Wikipedia (5GB): you better know what this is
- NpSetPython-Edu (3.5GB): normalized Python code
- Misc (600MB): LessWrong + HF configs + HF dataset/model cards
Training results
The final loss ended at 3.078, which corresponds to a perplexity of about 21.71 (exp(3.078)).
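
Perplexity is the exponential of the mean cross-entropy loss. The short sketch below shows the conversion, assuming the loss is measured in nats per token (the default for PyTorch's cross-entropy).

```python
import math

final_loss = 3.078  # final training loss reported above, assumed to be in nats/token

# Perplexity = exp(mean cross-entropy loss).
perplexity = math.exp(final_loss)

print(f"perplexity: {perplexity:.2f}")
```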
Benchmarks
| Benchmark | Dillion | Dillionv2 |
|---|---|---|
| BLiMP | 62.94% | 64.96% |
| ARC Easy (Norm) | 31.36% | 29.63% |
| PiQA (Norm) | 53.10% | 53.16% |
| SWAG (Norm) | 30.36% | 32.07% |
| HellaSwag (Norm) | 26.65% | 27.37% |
| ArithMark | 24.80% | 27.00% |
| AVG | 38.20% | 39.03% |
Dillionv2 shows stronger performance than v1 on every benchmark except ARC Easy. For a comprehensive comparison among many small models, including this one and my others, go to AxiomicLab's Open SLM Leaderboard.
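
The AVG row is the plain unweighted mean of the six benchmark scores. A small sketch that recomputes it from the table values (the variable names are just for illustration):

```python
# Scores in percent, copied from the benchmark table: (Dillion v1, Dillionv2).
scores = {
    "BLiMP": (62.94, 64.96),
    "ARC Easy (Norm)": (31.36, 29.63),
    "PiQA (Norm)": (53.10, 53.16),
    "SWAG (Norm)": (30.36, 32.07),
    "HellaSwag (Norm)": (26.65, 27.37),
    "ArithMark": (24.80, 27.00),
}

# Unweighted mean across benchmarks for each model.
avg_v1 = sum(v1 for v1, _ in scores.values()) / len(scores)
avg_v2 = sum(v2 for _, v2 in scores.values()) / len(scores)

# Benchmarks where v2 does not beat v1.
regressions = [name for name, (v1, v2) in scores.items() if v2 < v1]

print(f"v1 avg: {avg_v1:.2f}%  v2 avg: {avg_v2:.2f}%  regressions: {regressions}")
```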
Generations
[Nah]
Use Cases
- educational research, learning, etc.
- fine-tuning for downstream use
- deployment on edge devices
- or for fun
Limitations
Doesn't have any!! No!! It does not.. alright fine..
- cannot chat, code, reason, or answer factually
- short context
- never factually reliable
Inference
[Nah]
License
MIT License. Read the license file here.
Citation