[Nah] = can't fill that section in right now.

Dillionv2

Summary

Task: Text-Generation
Total training time: 35 hours
Inputs: text
Outputs: text
Params: ~1.3M
Final Loss: 3.078
Important Benchmark Scores:
   1. ARC Easy - 29.63%
   2. BLiMP - 64.96%
   3. HellaSwag - 27.37%
Framework: PyTorch, transformers
Author: Paul Courneya (Harley-ml)

Description

Dillionv2 is the second-generation model in the Dillion SLM family. It improves on v1 on every benchmark we report except ARC Easy.

What changed

| Dillion (v1) | Dillionv2 | Why |
| --- | --- | --- |
| 9B token count | 24B token count | More tokens allow the model to see more patterns, improving almost everything. |
| FineWeb-edu dataset | 9-source dataset | FineWeb-edu is edu-filtered and pretty narrow in style. 9 sources allow the model to see more patterns, styles, and non-educational text, improving semantics. |
| 72 hidden size | 96 hidden size | 72 was too narrow. 96 would allow the model to capture more complex patterns. |
| 12 num layers | 9 num layers | To stay in the parameter budget. |
| 288 intermediate size | 288 intermediate size | No change. |
| 3 number of heads | 3 number of heads | No change. |
| 3076 vocab size | 2564 vocab size | To free up parameters. |
| SGD optimizer | AdamW optimizer | AdamW is the modern choice and much better than SGD. |
| Cosine scheduler | WSD scheduler | WSD gives a better final loss. |
| Qwen3.5 architecture | Qwen3.5 architecture | No change. |
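The vocab-size and hidden-size rows trade off against each other, and the effect on the parameter budget can be checked with a little arithmetic. The sketch below counts only the token-embedding matrix (vocab size × hidden size) and assumes the input and output embeddings are tied, which this card does not state. `embedding_params` is a hypothetical helper written for this example, not part of any library.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in one token-embedding matrix.

    Assumption: input and output embeddings are tied, so the matrix
    is counted once. The card does not confirm this.
    """
    return vocab_size * hidden_size


v1 = embedding_params(3076, 72)            # Dillion (v1)
v2 = embedding_params(2564, 96)            # Dillionv2
v2_old_vocab = embedding_params(3076, 96)  # hypothetical: v2 width, v1 vocab
saved = v2_old_vocab - v2                  # parameters freed by the smaller vocab

print(v1, v2, v2_old_vocab, saved)
```

Under that assumption, shrinking the vocabulary frees 49,152 parameters at hidden size 96, which is just under 4% of a 1.3M budget.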

Training

We trained Dillionv2 for one epoch on 24B tokens. Training took a combined total of 35 hours on an RTX 2060 and two Kaggle T4s, with a batch size of 384 and 2 gradient accumulation steps.
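For a rough sense of scale, the numbers above can be turned into an effective batch size and an average throughput. This is only a back-of-the-envelope sketch: it assumes the batch size counts sequences per micro-step and that the 35 hours can be treated as a single wall-clock total, neither of which is stated here.

```python
batch_size = 384                 # from the card
grad_accum_steps = 2             # from the card
total_tokens = 24_000_000_000    # tokens seen in the single epoch
total_hours = 35                 # combined training time

# Sequences contributing to each optimizer step.
effective_batch = batch_size * grad_accum_steps

# Average tokens processed per second over the whole run
# (assumption: 35 hours is treated as one wall-clock total).
tokens_per_second = total_tokens / (total_hours * 3600)

print(effective_batch)
print(round(tokens_per_second))
```

That works out to 768 sequences per optimizer step and roughly 190k tokens per second on average.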

Dataset

The dataset is 34B tokens (we only use the first 24B) and 146GB in total:

  1. FineWeb-edu (35GB): Educational-filtered Common Crawl
  2. DCLM-Edu (20GB): Educational-filtered webtext
  3. The Pile Deduped (20GB): Broad, diverse 23-source dataset
  4. FineWeb-HQ (20GB): Knowledge-filtered webtext
  5. FineMath (13GB): Math-filtered Common Crawl
  6. Cosmopedia-v2 (7GB): Synthetic textbooks
  7. Wikipedia (5GB): you better know what this is
  8. NpSetPython-Edu (3.5GB): normalized Python code
  9. Misc (600MB): LessWrong + HF configs + HF dataset/model cards

Training results

The final loss was 3.078, which corresponds to a perplexity of about 21.71 (exp(3.078)).
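Perplexity is the exponential of the mean per-token cross-entropy loss, so the conversion can be checked directly. A minimal sketch, assuming the reported loss is measured in nats (natural log), which the card does not state explicitly:

```python
import math

final_loss = 3.078  # final loss from the card, assumed to be nats per token

# Perplexity = exp(mean cross-entropy loss).
perplexity = math.exp(final_loss)

print(f"{perplexity:.2f}")
```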

Benchmarks

| Benchmark | Dillion | Dillionv2 |
| --- | --- | --- |
| BLiMP | 62.94% | 64.96% |
| ARC Easy (Norm) | 31.36% | 29.63% |
| PiQA (Norm) | 53.10% | 53.16% |
| SWAG (Norm) | 30.36% | 32.07% |
| HellaSwag (Norm) | 26.65% | 27.37% |
| ArithMark | 24.80% | 27.00% |
| AVG | 38.20% | 39.03% |
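The AVG row is the unweighted mean of the six benchmark scores. A short sketch that reproduces it and lists the per-benchmark change from v1 to v2 (the dictionary simply restates the table; the variable names are made up for this example):

```python
# (Dillion v1, Dillionv2) scores in percent, copied from the table.
scores = {
    "BLiMP": (62.94, 64.96),
    "ARC Easy (Norm)": (31.36, 29.63),
    "PiQA (Norm)": (53.10, 53.16),
    "SWAG (Norm)": (30.36, 32.07),
    "HellaSwag (Norm)": (26.65, 27.37),
    "ArithMark": (24.80, 27.00),
}

# Unweighted mean over the six benchmarks.
v1_avg = sum(v1 for v1, _ in scores.values()) / len(scores)
v2_avg = sum(v2 for _, v2 in scores.values()) / len(scores)

# Benchmarks where v2 scores lower than v1.
regressions = [name for name, (v1, v2) in scores.items() if v2 < v1]

for name, (v1, v2) in scores.items():
    print(f"{name}: {v2 - v1:+.2f}")
print(f"AVG: {v1_avg:.2f} -> {v2_avg:.2f}")
print("Regressions:", regressions)
```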

Dillionv2 shows stronger performance than v1 on every benchmark except ARC Easy. For a comprehensive comparison among many small models, including this one, see AxiomicLab's Open SLM Leaderboard.

Generations

[Nah]

Use Cases

  1. Educational research, learning, etc.
  2. Fine-tuning for downstream use
  3. Deployment on edge devices
  4. Or for fun

Limitations

Doesn't have any!! No!! It does not.. alright fine..

  1. Cannot chat, code, reason, or answer factually
  2. Short context
  3. Output should never be treated as factual

Inference

[Nah]

License

MIT License. Read the license file here.

Citation

