Dillion-1.2M

Summary

Task: Text-Generation
Total training time: ~2.5 days
Inputs: text
Outputs: text
Params: ~1.2M
Framework: PyTorch, transformers
Author: Paul Courneya (Harley-ml)

Description

Dillion is a 1.2M-parameter language model trained on ~9B tokens of FineWeb-edu. Our goal was to build one of the best sub-1.5M-parameter LMs through depth (12 layers) and heavy overtraining (about 8900 tokens per parameter). Dillion beats or ties much larger models such as SupraMini-v4-2M and Tenete-8M on several benchmarks.

Why "Dillion"?

I was scrolling through Hugging Face and saw GPT-2, the smallest variant. I looked at its download count and saw 16 million. My brain, for some random reason, hallucinated “Dillion.” So I decided to call my next model, no matter the task or size, Dillion.

I decided to dig a bit deeper, and after a quick Google Search, I found that “Dillion” is an alternate spelling of the Irish name Dillon, which translates to “loyal” or “faithful.” But let me tell you, this model ain’t loyal or faithful; actually, it probably doesn’t even know what those words mean.

Architecture

Dillion-1.2M uses the Qwen3.5 architecture.

Parameter Value
NUM_HIDDEN_LAYERS 12
HIDDEN_SIZE 72
NUM_ATTENTION_HEADS 3
NUM_KEY_VALUE_HEADS 3
VOCAB_SIZE 3076
INTERMEDIATE_SIZE 288
ROPE_THETA 10000.0
MAX_POSITION_EMBEDDINGS 384
LAYER_TYPES full_attention
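
As a rough cross-check, the table above can be turned into a parameter count. This sketch is not taken from the training code. It assumes tied input/output embeddings, no bias terms, a head dimension of HIDDEN_SIZE / NUM_ATTENTION_HEADS, a gated-attention query projection that is twice the usual width, per-head query/key norms, and a three-matrix gated MLP. Under those assumptions the total matches the 1,281,384 parameters reported in the Benchmarks section.

```python
# Values copied from the architecture table.
NUM_HIDDEN_LAYERS = 12
HIDDEN_SIZE = 72
NUM_ATTENTION_HEADS = 3
NUM_KEY_VALUE_HEADS = 3
VOCAB_SIZE = 3076
INTERMEDIATE_SIZE = 288

# Assumption: head_dim = hidden_size / num_heads (not stated in the card).
HEAD_DIM = HIDDEN_SIZE // NUM_ATTENTION_HEADS


def count_parameters() -> int:
    """Estimate the parameter count under the assumptions listed above."""
    q_width = NUM_ATTENTION_HEADS * HEAD_DIM
    kv_width = NUM_KEY_VALUE_HEADS * HEAD_DIM

    attention = (
        HIDDEN_SIZE * 2 * q_width      # q_proj, assumed 2x wide (query + gate)
        + 2 * HIDDEN_SIZE * kv_width   # k_proj and v_proj
        + q_width * HIDDEN_SIZE        # o_proj
        + 2 * HEAD_DIM                 # assumed q_norm and k_norm weights
    )
    mlp = 3 * HIDDEN_SIZE * INTERMEDIATE_SIZE  # gate, up, and down projections
    norms = 2 * HIDDEN_SIZE                    # two norms per layer
    per_layer = attention + mlp + norms

    embeddings = VOCAB_SIZE * HIDDEN_SIZE      # assumed tied with the LM head
    final_norm = HIDDEN_SIZE
    return NUM_HIDDEN_LAYERS * per_layer + embeddings + final_norm


total = count_parameters()
print(f"{total:,}")  # prints 1,281,384
```

Roughly 17% of the total (221,472 parameters) sits in the embedding table, which is why the vocabulary is kept so small.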

Training

Hardware

We trained Dillion on an RTX 2060 6GB with a batch size of 72 and 4 gradient accumulation steps. Training covered 0.71 epochs of a 14B-token FineWeb-edu subset, so the model only saw ~9B tokens.
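
To make the batch settings concrete, here is a small back-of-the-envelope sketch. It assumes that 72 is the per-step micro-batch size and that every sequence is packed to the full 384-token context. Neither detail is stated above, so treat the derived numbers as estimates.

```python
BATCH_SIZE = 72                # from the text above
GRAD_ACCUM_STEPS = 4           # from the text above
SEQ_LEN = 384                  # assumption: sequences fill MAX_POSITION_EMBEDDINGS
TOKENS_SEEN = 9_000_000_000    # the "~9B" figure, taken at face value

# Sequences that contribute to one optimizer update.
sequences_per_step = BATCH_SIZE * GRAD_ACCUM_STEPS

# Tokens per optimizer update under the full-context assumption.
tokens_per_step = sequences_per_step * SEQ_LEN

# Approximate number of optimizer updates needed to see ~9B tokens.
approx_steps = TOKENS_SEEN // tokens_per_step

print(f"sequences per step : {sequences_per_step}")
print(f"tokens per step    : {tokens_per_step:,}")
print(f"approx. steps      : {approx_steps:,}")
```

Under these assumptions each update covers 288 sequences, or about 110K tokens.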

Training Results

epoch train_loss train_ppl train_bpb eval_loss eval_ppl eval_bpb
0.02368 4.553 94.917 1.875 4.492 89.300 1.850
0.04736 3.958 52.353 1.630 3.943 51.573 1.624
0.07104 3.763 43.077 1.550 3.758 42.863 1.548
0.09472 3.672 39.330 1.512 3.670 39.252 1.511
0.11840 3.620 37.338 1.491 3.620 37.338 1.491
0.14210 3.584 36.017 1.476 3.586 36.089 1.477
0.16580 3.557 35.058 1.465 3.558 35.093 1.465
0.18940 3.538 34.398 1.457 3.536 34.329 1.456
0.21310 3.520 33.784 1.450 3.520 33.784 1.450
0.23680 3.504 33.248 1.443 3.507 33.348 1.444
0.26050 3.494 32.917 1.439 3.494 32.917 1.439
0.28420 3.483 32.557 1.434 3.484 32.590 1.435
0.30780 3.475 32.298 1.431 3.475 32.298 1.431
0.33150 3.465 31.976 1.427 3.468 32.073 1.428
0.35520 3.459 31.785 1.425 3.459 31.785 1.425
0.37890 3.452 31.563 1.422 3.454 31.627 1.423
0.40260 3.445 31.343 1.419 3.447 31.406 1.420
0.42620 3.441 31.218 1.417 3.441 31.218 1.417
0.44990 3.437 31.094 1.416 3.437 31.094 1.416
0.47360 3.431 30.908 1.413 3.433 30.969 1.414
0.49730 3.426 30.753 1.411 3.428 30.815 1.412
0.52100 3.423 30.661 1.410 3.424 30.692 1.410
0.54460 3.419 30.539 1.408 3.420 30.569 1.409
0.56830 3.417 30.478 1.407 3.416 30.447 1.407
0.59200 3.413 30.356 1.406 3.413 30.356 1.406
0.61570 3.409 30.235 1.404 3.410 30.265 1.404
0.63940 3.404 30.084 1.402 3.407 30.175 1.403
0.66300 3.403 30.054 1.402 3.403 30.054 1.402
0.68670 3.397 29.874 1.399 3.401 29.994 1.401
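
The three metrics in each row are different views of the same cross-entropy loss. The sketch below shows the usual conversions. The bytes-per-token ratio is not stated anywhere in this card: the value used here is an assumption, inferred as the ratio that makes the loss and bpb columns agree.

```python
import math

# Assumption: average UTF-8 bytes per token, inferred from the table above.
BYTES_PER_TOKEN = 3.503


def perplexity(loss: float) -> float:
    """Token-level perplexity from a cross-entropy loss in nats."""
    return math.exp(loss)


def bits_per_byte(loss: float, bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Convert nats per token to bits per byte."""
    bits_per_token = loss / math.log(2)
    return bits_per_token / bytes_per_token


# Final eval-free check against the last train row (loss 3.397).
final_loss = 3.397
print(f"ppl = {perplexity(final_loss):.3f}")
print(f"bpb = {bits_per_byte(final_loss):.3f}")
```

With the last row's train_loss of 3.397 this gives a perplexity of about 29.87 and about 1.399 bits per byte, in line with the table.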

Benchmarks

Model Parameters
Dillion 1,281,384
SupraMini-v4-2M 2,623,104
Tenete-8M 8,293,888

Task Metric Dillion SupraMini-v4-2M Tenete-8M
ARC Easy acc_norm 31.36% 0.3194
BLiMP acc 62.94% 60.70%
PiQA acc_norm 53.10% 51.90% 0.5571
SWAG acc_norm 30.36% 0.3297
WikiText bits_per_byte 1.6161
WikiText byte_perplexity 3.0655 3.1652

See the raw output from LM Harness for Dillion here
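
The two WikiText rows are related: byte perplexity is 2 raised to the bits-per-byte value. A minimal sketch of that conversion, using hypothetical helper names:

```python
import math


def byte_perplexity(bits_per_byte: float) -> float:
    """Byte-level perplexity from bits per byte."""
    return 2.0 ** bits_per_byte


def bits_per_byte_from_perplexity(byte_ppl: float) -> float:
    """Inverse conversion, useful when only byte perplexity is reported."""
    return math.log2(byte_ppl)


# Dillion's reported WikiText bits_per_byte.
print(f"{byte_perplexity(1.6161):.4f}")
```

Dillion's 1.6161 bits per byte corresponds to a byte perplexity of about 3.0655, matching the table.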

Generation Examples

Prompt: The

Output:

 Twitter and Freees of Brooklyn Press, Oxford University.
The Home Council of the Monthly Landing Foundation is a partner with the Great War in the South. The Eighteenth Century has been held on the River Battalion by the Vietnam War, which was laid down by the German Empire to the Nazis. Its first-year period was born on May 1, 1846.

Prompt: Artificial Intelligence is

Output:

a new technology that has been used to make the processes we use.
The Mexican War: Since Ireland, it’s not just one of the most important technologies in America, it can be found in Europe and Japan. The Economics Center for Natural Resources (EU), which was created by Berlin, has become an essential component of its development. Firstly, it will enable the Congress to have the opportunity to create such a system and to generate a great range of resources as well. It also uses a variety of methods to provide more detailed information.
Listen to our article on these tools:
- Published on 2017-2015

Prompt: I was walking down the street and saw a

Output:

 balloon on the ground.
Before you see that the floor, we started to build a large-scale planetary traffic, which makes it possible for people to move from a magma to a hospital when they were picked up in the shore of the first day. They had a small window on their nests and so much fine space into the roof. Then in the gap between them and the width of the tropical Solar Systems.
Many scientists have found that the densely grown snowflakes are being born with the mouth of their own. But there is no evidence of the difference in this condition. The findings are not necessarily an effective way to prevent the spread of the knees and its use as well as other conditions. It's a major issue about how many thousands of molecules will be released

Use Cases

  1. Educational work and research
  2. Fine-tuning for downstream use
  3. Deployment on edge devices
  4. Or for fun.

Limitations

Umm... What do you think? Yeah, everything. But... more speciifcally (yep I splet that wrong; what are you gonna do about it!?)

  1. Cannot chat, reason, code, or answer questions
  2. Always unfactual
  3. No long context handling
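
The context limit follows from MAX_POSITION_EMBEDDINGS being 384: prompt tokens plus generated tokens should stay within that window. One way to guard against overruns is sketched below. `clamp_new_tokens` is a hypothetical helper, not part of transformers or of this repository.

```python
MAX_POSITION_EMBEDDINGS = 384  # from the architecture table


def clamp_new_tokens(
    prompt_tokens: int,
    requested_new_tokens: int,
    max_positions: int = MAX_POSITION_EMBEDDINGS,
) -> int:
    """Return how many new tokens fit in the context window after the prompt."""
    if prompt_tokens >= max_positions:
        raise ValueError(
            f"Prompt uses {prompt_tokens} tokens, "
            f"but the context window is {max_positions}."
        )
    return min(requested_new_tokens, max_positions - prompt_tokens)


# A 2-token prompt leaves room for the full request.
print(clamp_new_tokens(prompt_tokens=2, requested_new_tokens=362))

# A 100-token prompt leaves room for only 284 new tokens.
print(clamp_new_tokens(prompt_tokens=100, requested_new_tokens=362))
```

The result could be passed as `max_new_tokens` when calling `generate`, with `prompt_tokens` taken from the length of the tokenized prompt.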

Inference

#!/usr/bin/env python3
# =============================================================================
# Inference
# =============================================================================

MODEL_DIR      = "Harley-ml/Dillion-1.2M"
TOKENIZER_PATH = MODEL_DIR

# --- Generation settings ---
PROMPT             = "The"
MAX_NEW_TOKENS     = 362
TEMPERATURE        = 0.6
TOP_P              = 0.95
TOP_K              = 30
REPETITION_PENALTY = 1.2
DO_SAMPLE          = True

# =============================================================================

import torch
from pathlib import Path
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedTokenizerFast,
)

# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------

device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")

# ---------------------------------------------------------------------------
# Tokenizer
# ---------------------------------------------------------------------------

def load_tokenizer(path_or_repo: str):
    p = Path(path_or_repo)

    # Case 1: explicit local tokenizer.json file
    if p.exists() and p.is_file() and p.suffix.lower() == ".json":
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p.resolve()))
    # Case 2: local directory or HF repo ID
    else:
        tok = AutoTokenizer.from_pretrained(path_or_repo, use_fast=True)

    # Ensure required special tokens exist
    if tok.bos_token is None:
        tok.add_special_tokens({"bos_token": "<|bos|>"})
    if tok.eos_token is None:
        tok.add_special_tokens({"eos_token": "<|eos|>"})
    if tok.unk_token is None:
        tok.add_special_tokens({"unk_token": "<|unk|>"})
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token if tok.eos_token is not None else "<|pad|>"

    tok.padding_side = "left"
    return tok

print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {len(tokenizer)}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r}  (id={tokenizer.pad_token_id})")

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------

print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)

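model.eval()
model.to(device)

# If load_tokenizer() had to add special tokens, grow the embedding table so
# the new token IDs do not index past the end of it.
if len(tokenizer) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))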

# Safer inference for cache-related issues
model.config.use_cache = False
if hasattr(model, "generation_config") and model.generation_config is not None:
    model.generation_config.use_cache = False

total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")

# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------

def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
) -> str:
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt

    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)

    inputs.pop("token_type_ids", None)

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=False,
    )

    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k

    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)

    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)

# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)

    output = generate(PROMPT)

    print("Generated:")
    print(output)