The Optimal Architecture for Small Language Models
We trained 19 model configurations across 12 architecture families on 1 billion tokens each. The result? A surprising discovery about what really matters for small language models and a new architecture that's 3.8x faster with better factuality.
TL;DR
In our previous article, we found that 50% FinePDFs + 30% DCLM + 20% FineWeb-Edu is the optimal dataset mix for training GPT-2, achieving 38.50% average accuracy. But that used the standard 12-layer architecture.
What if we could do even better by changing the model itself?
We ran 19 experiments to find out:
- 7 GPT-2 variants with radically different depth-width ratios (4→64 layers)
- 12 architecture families including LLaMA3, Gemma3, Qwen3, MoE, diffusion models, and novel hybrids
Here's what surprised us:
| Finding | Why It Matters |
|---|---|
| Models cluster into exactly two performance tiers | ~38% vs ~32%—with almost nothing in between |
| Hidden dimension ≥512 is a reliable threshold | Below it, only the 32- and 64-layer depths compensate |
| 32 layers beats 12 layers | 38.50% vs 38.15% at the same parameter count |
| All 12 architectures perform within ~2% | LLaMA3, Qwen3, GPT-2—they're all nearly identical at 70M |
| Diffusion models are 3.8x faster | 183 tok/s vs 48 tok/s with parallel token generation |
| Diffusion models have the best factuality | 49.27% TruthfulQA—highest of any architecture |
| AR→Diffusion conversion needs only 100M tokens | 10x more efficient than training from scratch |
The result: Dhara-70M, a diffusion model that sacrifices 1.33% accuracy for 3.8x throughput and superior factuality.
The Problem: What's the Optimal Architecture for Small Models?
Our previous work established that 50% FinePDFs + 30% DCLM + 20% FineWeb-Edu is optimal for training small models. With that dataset recipe fixed, we asked: Does model architecture matter as much as data composition?
The standard GPT-2 uses 12 layers with 768 hidden dimensions. But this was designed in 2019 for ~124M parameters. For a 70M model trained on 1B tokens, is this still optimal? And what about newer architectures like LLaMA, Gemma, MoE, or even diffusion language models?
We set out to systematically map the architecture design space.
Experimental Setup
To isolate the effect of architecture, we fixed everything except model design:
| Parameter | Value |
|---|---|
| Total Parameters | ~70M (range: 62-77M) |
| Training Tokens | 1 billion |
| Dataset | 50% FinePDFs + 30% DCLM + 20% FineWeb-Edu |
| Hardware | Single NVIDIA A40 GPU |
| Precision | BF16 |
| Optimizer | AdamW with cosine schedule |
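To make the fixed setup concrete, here is a minimal sketch of the optimizer and schedule; the learning rate, weight decay, and step count below are illustrative assumptions, since the table only fixes the optimizer family, schedule, precision, and hardware.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
# Illustrative hyperparameters: only "AdamW + cosine schedule, BF16, single A40"
# is fixed by the setup above; the exact values here are assumptions.
def make_optimizer(model: torch.nn.Module, total_steps: int = 20_000):
    optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler
# Forward/backward passes run in BF16, e.g.:
# with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
#     loss = model(batch).loss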
Part 1: The Depth-Width Trade-off
First, we explored how model shape affects performance by training 7 GPT-2 variants with the same ~70M parameters but radically different depth-width ratios:
| Configuration | Layers | Hidden | Params | Description |
|---|---|---|---|---|
| 4L Ultra-Wide | 4 | 768 | 68M | Maximum width, minimum depth |
| 12L Wide | 12 | 512 | 70M | Standard GPT-2 scaling |
| 16L Intermediate | 16 | 448 | 62M | Slightly deeper than standard |
| 24L Medium | 24 | 384 | 62M | Transitional depth |
| 32L Goldilocks | 32 | 384 | 77M | Deep with moderate width |
| 48L Deep | 48 | 320 | 76M | Very deep, narrow |
| 64L Deep-Narrow | 64 | 256 | 64M | Maximum depth, minimum width |
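These configurations are matched on total parameters, which are dominated by two terms: token embeddings (vocab × hidden) and the transformer blocks (roughly 12 × hidden² per layer for a standard GPT-2 block with 4x FFN expansion). A quick sanity check, assuming the GPT-2 vocabulary of 50,257 tokens; the exact counts in the table differ slightly because of positional embeddings, LayerNorm parameters, and rounding.
# Rough parameter estimate for a GPT-2-style model with tied input/output embeddings.
# Assumes the standard 4x FFN expansion; small terms (LayerNorm, biases) are ignored.
def approx_params(layers: int, hidden: int, vocab: int = 50257) -> float:
    embeddings = vocab * hidden          # token embedding matrix
    per_layer = 12 * hidden * hidden     # attention (4h^2) + 4x FFN (8h^2)
    return (embeddings + layers * per_layer) / 1e6
print(approx_params(32, 384))   # ~75.9M -> "32L Goldilocks" (listed as 77M)
print(approx_params(64, 256))   # ~63.2M -> "64L Deep-Narrow" (listed as 64M)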
Discovery #1: The Two-Tier Performance Pattern
Our first finding was completely unexpected. We expected a smooth trade-off curve—more layers for less width, or vice versa. Instead, we found a hard binary split:
| Configuration | Average Score | Tier | Gap from High |
|---|---|---|---|
| 4L Ultra-Wide | 31.98% | Low | -6.52% |
| 12L Wide | 38.15% | High | — |
| 16L Intermediate | 32.61% | Low | -5.89% |
| 24L Medium | 31.79% | Low | -6.71% |
| 32L Goldilocks | 38.50% | High | — |
| 48L Deep | 32.45% | Low | -6.05% |
| 64L Deep-Narrow | 38.21% | High | — |
The gap between tiers is substantial: more than 6 percentage points separate them, while the variance within each tier is only ~0.5%.
This bimodal distribution is notable: configurations either achieve the high tier (38%) or fall to the low tier (32%), with no intermediate performance levels observed.
Discovery #2: The Hidden Dimension Threshold
Why do some configurations succeed while others fail? We identified the critical factor: hidden_size >= 512.
| Config | Hidden | Score | Explanation |
|---|---|---|---|
| 12L | 512 | 38.15% | Meets threshold |
| 16L | 448 | 32.61% | Below threshold, depth doesn't compensate |
| 24L | 384 | 31.79% | Below threshold, depth doesn't compensate |
| 32L | 384 | 38.50% | Below threshold, but OPTIMAL depth compensates |
| 48L | 320 | 32.45% | Below threshold, suboptimal depth |
| 64L | 256 | 38.21% | Below threshold, but EXTREME depth compensates |
A clear rule emerges. Models need one of the following:
- hidden_size >= 512, OR
- Exactly 32 layers (the "Goldilocks" depth), OR
- Extreme depth (64+ layers) that compensates for the narrow width
The 16L, 24L, and 48L configurations fall into a "dead zone" - their hidden dimensions are too narrow, and their depths aren't at the sweet spots that can compensate.
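Written as code, the empirical rule from these seven runs looks like this; it is a description of our observations at ~70M parameters and 1B tokens, not a claim that it generalizes to other scales.
# Empirical rule extracted from the 7 depth-width runs above.
def predicted_tier(layers: int, hidden: int) -> str:
    if hidden >= 512:      # wide enough on its own (12L-512)
        return "high (~38%)"
    if layers == 32:       # the "Goldilocks" depth (32L-384)
        return "high (~38%)"
    if layers >= 64:       # extreme depth compensates (64L-256)
        return "high (~38%)"
    return "low (~32%)"    # the 16L/24L/48L dead zone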
Discovery #3: 32 Layers is the Goldilocks Depth
With hidden=384, the 32-layer configuration achieves the best overall score (38.50%), slightly beating even the standard 12-layer design.
| Benchmark | 12L Wide | 32L Goldilocks | Difference |
|---|---|---|---|
| MMLU | 24.11% | 25.77% | +1.66% |
| HellaSwag | 27.03% | 26.46% | -0.57% |
| ARC-Challenge | 21.67% | 22.27% | +0.60% |
| PIQA | 57.29% | 58.05% | +0.76% |
| WinoGrande | 51.46% | 52.64% | +1.18% |
| TruthfulQA | 47.31% | 45.83% | -1.48% |
| GSM8K | 0.99% | 1.21% | +0.22% |
| Average | 38.15% | 38.50% | +0.35% |
The 32-layer model wins on 5 out of 7 benchmarks, with particular strengths in:
- WinoGrande (+1.18%): Better pronoun resolution suggests deeper compositional reasoning
- MMLU (+1.66%): More layers help with academic knowledge retention
Part 2: Architecture Family Comparison
Armed with the optimal 32-layer depth, we compared 12 different architecture families:
Architectures Tested
| Architecture | Type | Parameters | Special Features |
|---|---|---|---|
| GPT-2 | Classic Transformer | 76.48M | Learned positional embeddings, LayerNorm |
| LLaMA3 | Modern Transformer | 71.25M | RoPE, RMSNorm, GQA, SiLU |
| Qwen3 | Modern Transformer | 71.25M | RoPE, RMSNorm, GQA, SiLU |
| Gemma3 | Modern Transformer | 71.27M | Sliding window attention (1024), logit capping |
| LFM2 | Hybrid Conv+Attn | ~80M | Conv-Conv-Attn pattern |
| dLLM | Diffusion LM | 71.25M | Bidirectional, masked diffusion (MDLM) |
| MoE | Mixture of Experts | 327M (67M active) | 16 experts, 2 active per token |
| Titans-MAC | Memory-Augmented | 67.76M | Neural memory modules at layers [0,7,14,21] |
| dLLM-Recursive | Diffusion LM | 76.11M | Recursive refinement module |
| LLaMA3-Canon | LLaMA3 + Canon | 71.34M | Depthwise causal convolutions |
| dLLM-Canon | Diffusion + Canon | 76.05M | Canon layers + bidirectional diffusion |
| Dhara | AR→Diffusion (WSD) | 71.34M | WSD-converted from LLaMA3-Canon |
Complete Benchmark Results
| Model | HellaSwag | PIQA | WinoGrande | ARC-C | MMLU | TruthfulQA | GSM8K | Avg |
|---|---|---|---|---|---|---|---|---|
| GPT-2 (32L) | 26.46 | 58.05 | 52.64 | 22.27 | 25.77 | 45.83 | 1.21 | 33.18 |
| LLaMA3 | 27.17 | 59.47 | 50.99 | 23.21 | 26.16 | 43.82 | 0.00 | 32.97 |
| Qwen3 | 26.85 | 59.41 | 50.91 | 18.26 | 26.62 | 44.35 | 0.15 | 32.36 |
| Gemma3 | 26.90 | 59.74 | 51.54 | 17.15 | 26.19 | 44.20 | 1.59 | 32.47 |
| LFM2 | 26.27 | 56.96 | 50.12 | 17.83 | 25.95 | 47.40 | 0.61 | 32.16 |
| LLaMA3-Canon | 26.72 | 58.81 | 51.46 | 22.27 | 26.79 | 44.82 | 1.67 | 33.22 |
| MoE | 27.30 | 59.74 | 50.20 | 19.62 | 25.69 | 47.51 | 1.06 | 33.02 |
| Titans-MAC | 26.18 | 57.02 | 48.78 | 17.24 | 25.67 | 46.26 | 1.36 | 31.79 |
| dLLM | 25.55 | 49.67 | 51.07 | 21.16 | 23.96 | 47.08 | 0.00 | 31.21 |
| dLLM-Recursive | 24.74 | 50.44 | 51.46 | 22.27 | 24.04 | 47.68 | 0.23 | 31.55 |
| dLLM-Canon | 24.67 | 50.16 | 51.46 | 22.70 | 24.02 | 49.27 | 0.38 | 31.81 |
| Dhara | 25.58 | 51.58 | 49.64 | 24.83 | 23.85 | 47.50 | 0.00 | 31.85 |
Discovery #4: Architecture Choice Has Minimal Impact at 70M Scale
Surprisingly, all 12 architecture families achieve similar benchmark accuracy:
- Autoregressive models: 32-33% average
- Diffusion models: 31-32% average
The differences are within noise at this scale. Modern architectural improvements (RMSNorm, RoPE, GQA) are designed for 7B+ models and don't provide measurable benefits at 70M parameters.
Winner: LLaMA3-Canon (33.22%) slightly edges out GPT-2 (33.18%), but the difference is not statistically significant.
Discovery #5: dLLMs Trade Accuracy for 3.8x Throughput
The real differentiation comes from inference characteristics, not accuracy:
| Model | Throughput | Accuracy | Memory | TTFT |
|---|---|---|---|---|
| LLaMA3 | 50 tok/s | 32.97% | 0.15 GB | 24 ms |
| GPT-2 (32L) | 48 tok/s | 33.18% | 0.15 GB | ~25 ms |
| MoE | 49 tok/s | 33.02% | 0.62 GB | 51 ms |
| dLLM | 289 tok/s | 31.21% | 0.31 GB | 34 ms |
| Dhara | 183 tok/s | 31.85% | 0.24 GB | 35 ms |
The trade-off is clear:
- -1.33% accuracy (31.85% vs 33.18% average)
- +3.8x throughput (183 vs 48 tok/s)
- +1.6x memory (bidirectional attention overhead)
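To make that concrete: at these measured rates, generating 1M tokens of batch output takes roughly 1,000,000 / 183 tok/s ≈ 1.5 GPU-hours with Dhara versus 1,000,000 / 48 tok/s ≈ 5.8 GPU-hours with the 32-layer GPT-2, in exchange for about 1.3 points of average accuracy.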
Discovery #6: dLLMs Excel at Factuality
One of our most surprising findings: diffusion models achieve the highest TruthfulQA scores among all architectures tested.
| Model | TruthfulQA | Rank |
|---|---|---|
| dLLM-Canon | 49.27% | #1 |
| dLLM-Recursive | 47.68% | #2 |
| MoE | 47.51% | #3 |
| Dhara | 47.50% | #4 |
| LFM2 | 47.40% | #5 |
| dLLM | 47.08% | #6 |
| GPT-2 (32L) | 45.83% | #7 |
Why might dLLMs excel at factuality? We hypothesize three contributing factors:
- Bidirectional attention allows the model to consider full context when making predictions
- Iterative refinement enables the model to "second-guess" its initial predictions across multiple denoising steps
- Non-autoregressive generation may reduce the snowball effect where early hallucinations compound into larger errors
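To make the iterative-refinement point concrete, here is a heavily simplified sketch of confidence-based masked-diffusion decoding in the spirit of MDLM-style samplers. It is not Dhara's actual generation code; the model call signature and the step count are assumptions.
import torch
# Simplified masked-diffusion decoding: start fully masked, then repeatedly
# predict all positions in parallel and commit the most confident ones.
def diffusion_decode(model, mask_id: int, length: int, steps: int = 8):
    tokens = torch.full((1, length), mask_id)          # all positions masked
    for step in range(steps):
        logits = model(tokens)                         # bidirectional pass over the full sequence
        probs = logits.softmax(dim=-1)
        confidence, candidates = probs.max(dim=-1)     # best token + confidence per position
        still_masked = tokens == mask_id
        # Unmask the most confident share of the remaining masked positions.
        k = max(1, int(still_masked.sum().item() / (steps - step)))
        conf_masked = torch.where(still_masked, confidence, torch.tensor(-1.0))
        _, top_positions = conf_masked.topk(k, dim=-1)
        tokens[0, top_positions[0]] = candidates[0, top_positions[0]]
    return tokens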
Discovery #7: Canon Layers Improve Factuality
The "Physics of Language Models" Canon layers (depthwise causal convolutions) show consistent benefits for factuality:
| Model | Without Canon | With Canon | Difference |
|---|---|---|---|
| LLaMA3 | 43.82% | 44.82% | +1.00% |
| dLLM | 47.08% | 49.27% | +2.19% |
Canon layers add only 0.13% parameter overhead but provide meaningful TruthfulQA improvements.
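For readers unfamiliar with Canon layers, the core operation is a depthwise causal 1D convolution mixed back into the residual stream. The sketch below shows that operation in PyTorch; the kernel size and exact placement are assumptions, see the Physics of Language Models paper for the precise formulation.
import torch
import torch.nn as nn
class DepthwiseCausalConv(nn.Module):
    """Minimal sketch of a Canon-style layer: a depthwise (per-channel) 1D
    convolution, left-padded so position t only sees positions <= t."""
    def __init__(self, hidden_size: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(
            hidden_size, hidden_size,
            kernel_size=kernel_size,
            groups=hidden_size,   # depthwise: one filter per channel -> tiny parameter cost
            bias=False,
        )
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden) -> conv expects (batch, hidden, seq_len)
        h = x.transpose(1, 2)
        h = nn.functional.pad(h, (self.kernel_size - 1, 0))   # causal left-padding
        return x + self.conv(h).transpose(1, 2)               # residual connection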
Discovery #8: WSD Enables 10x Training Efficiency
We find that existing autoregressive models can be converted to diffusion with 10x less training using the Warmup-Stable-Decay (WSD) method from the LLaDA 2.0 paper, which progressively trains an AR model to handle diffusion objectives:
| Phase | Description | Block Sizes | % of Training |
|---|---|---|---|
| Warmup | Progressive block size increase | 1 → 4 → 32 → 64 → 1024 | 20% |
| Stable | Full MDLM training objective | 1024 | 80% |
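A minimal sketch of the warmup curriculum implied by the table, mapping training progress to block size; splitting the 20% warmup evenly across the five block sizes is an assumption, since only the order of sizes is specified above.
# Block-size curriculum for the AR -> diffusion warmup phase.
WARMUP_BLOCK_SIZES = [1, 4, 32, 64, 1024]
WARMUP_FRACTION = 0.20
def block_size_at(progress: float) -> int:
    """progress in [0, 1] over the whole conversion run."""
    if progress >= WARMUP_FRACTION:
        return 1024    # stable phase: full MDLM objective
    stage = int(progress / WARMUP_FRACTION * len(WARMUP_BLOCK_SIZES))
    return WARMUP_BLOCK_SIZES[min(stage, len(WARMUP_BLOCK_SIZES) - 1)]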
The efficiency gains are substantial:
| Training Method | Tokens | GPU Time | Score | Cost (@ $2/hr) |
|---|---|---|---|---|
| From-scratch (dLLM-Canon) | 1B | 18h | 31.81% | ~$36 |
| WSD Conversion (Dhara) | 100M | 1.8h | 31.85% | ~$4 |
| Savings | 10x | 10x | +0.04% | $32 saved |
WSD requires 10x fewer tokens and 10x less GPU time while achieving equivalent or better results.
Notably, WSD conversion not only matches from-scratch training but outperforms it on several benchmarks:
| Benchmark | WSD | From Scratch | Improvement |
|---|---|---|---|
| ARC-Challenge | 24.83% | 22.70% | +2.13% |
| PIQA | 51.58% | 50.16% | +1.42% |
| HellaSwag | 25.58% | 24.67% | +0.91% |
The AR initialization provides learned representations that carry over to these reasoning and commonsense benchmarks, suggesting that WSD conversion preserves, and in places enhances, what the source model learned.
The Result: Dhara-70M
Based on all our discoveries, we introduce Dhara-70M, available on Hugging Face.
Dhara-70M is created by taking the best autoregressive architecture (LLaMA3-Canon) and converting it to a diffusion model using the WSD method. This gives us the best of both worlds: the strong initialization from AR pretraining, plus the throughput and factuality benefits of diffusion.
Architecture
| Specification | Value |
|---|---|
| Parameters | 71.34M |
| Layers | 32 (Goldilocks depth) |
| Hidden Size | 384 |
| FF Dimension | 1024 |
| Attention Heads | 8 |
| KV Heads | 4 (GQA) |
| Position Encoding | RoPE |
| Normalization | RMSNorm |
| Special Layers | Canon (depthwise causal convolutions) |
| Generation | Diffusion (parallel token generation) |
| Training | LLaMA3-Canon (1B tokens) → WSD conversion (100M tokens) |
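Dhara itself ships custom modeling code (hence trust_remote_code in the Quick Start below), but to put the dimensions in familiar terms, here is how the autoregressive backbone shape could be written with the standard transformers LlamaConfig. This only mirrors the table above; it is not Dhara's actual configuration file and omits the Canon layers and diffusion head.
from transformers import LlamaConfig
# Backbone dimensions from the table above, expressed as a standard LLaMA-style config.
backbone_config = LlamaConfig(
    hidden_size=384,
    intermediate_size=1024,
    num_hidden_layers=32,
    num_attention_heads=8,
    num_key_value_heads=4,    # GQA: 4 KV heads shared across 8 query heads
)
print(f"{backbone_config.num_hidden_layers} layers, "
      f"{backbone_config.hidden_size} hidden, {backbone_config.num_key_value_heads} KV heads")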
Benchmark Results
| Benchmark | Dhara-70M | GPT-2 (32L) | vs GPT-2 |
|---|---|---|---|
| HellaSwag | 25.58% | 26.46% | -0.88% |
| PIQA | 51.58% | 58.05% | -6.47% |
| WinoGrande | 49.64% | 52.64% | -3.00% |
| ARC-Challenge | 24.83% | 22.27% | +2.56% |
| MMLU | 23.85% | 25.77% | -1.92% |
| TruthfulQA | 47.50% | 45.83% | +1.67% |
| GSM8K | 0.00% | 1.21% | -1.21% |
| Average | 31.85% | 33.18% | -1.33% |
Inference Performance
| Metric | Dhara-70M | GPT-2 (32L) | Advantage |
|---|---|---|---|
| Time to First Token | 35.5 ms | ~25 ms | 1.4x slower |
| Throughput | 183.5 tok/s | ~48 tok/s | 3.8x faster |
| Peak Memory | 0.24 GB | 0.15 GB | 1.6x higher |
Quick Start
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-70m")
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-70m",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)
# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=50,
    temperature=0.1,
    top_p=0.5,
    top_k=5,
    repetition_penalty=1.8,
    do_sample=True,
    pad_token_id=0
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Example Output:
The future of artificial intelligence is a big challenge.
This world has the potential to improve, but this time we have no other than "the world."
The next generation will be more exciting and its very much important for our society's
ability to develop its
For high-throughput batch processing:
# Batch generation for maximum throughput
prompts = [
    "The future of artificial intelligence is",
    "The human brain is capable of",
    "Science has shown that",
    "Technology continues to evolve"
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=50,
    temperature=0.1,
    top_p=0.5,
    top_k=5,
    repetition_penalty=1.8,
    do_sample=True,
    pad_token_id=0
)
for i, output in enumerate(outputs):
    print(f"Output {i+1}: {tokenizer.decode(output, skip_special_tokens=True)}")
Key Takeaways
After training 19 model configurations across 12 architecture families, here are our core lessons:
Architecture matters less than you think at small scale: All modern architectures (LLaMA3, Qwen3, Gemma3) perform within ~1% of each other at 70M parameters.
Depth-width ratio matters more: The same architecture can yield 32% or 38% accuracy depending on layer/hidden ratio alone.
The two-tier phenomenon is real: Models don't degrade smoothly - they either work (~38%) or don't (~32%), with a hidden_size threshold of 512.
32 layers is the Goldilocks depth: For 70M parameters, 32 layers with 384 hidden beats the standard 12-layer design by 0.35%.
Diffusion models excel at factuality: Despite lower average scores, dLLMs achieve the highest TruthfulQA scores (49.27%), suggesting reduced hallucination.
3.8x throughput is achievable: Diffusion models offer dramatic throughput improvements for batch processing workloads.
WSD conversion is remarkably efficient: Converting AR to diffusion needs only 100M tokens (10x fewer than training from scratch).
Canon layers help factuality: Simple depthwise convolutions add 0.13% parameters but improve TruthfulQA by 1-2%.
For practitioners building small language models: start with the 50-30-20 data mix (from our previous work), use the 32-layer Goldilocks architecture, and consider diffusion for high-throughput applications where factuality matters.
Related Work
Our Previous Work
- The 1 Billion Token Challenge: Optimal Dataset Mixing - Finding the 50-30-20 dataset recipe
- GPT-2-70M - Our baseline GPT-2 model
Diffusion Language Models
- MDLM: Simple and Effective Masked Diffusion Language Models - The masked diffusion objective we use
- dLLM-2: Scaling Diffusion Language Models - Recent scaling study of diffusion LMs
- DREAM: Diffusion Rectification and Estimation-Adaptive Models - Training framework for diffusion models
Architecture References
- GPT-2 - Original GPT-2 architecture
- LLaMA - LLaMA architecture with RoPE and RMSNorm
- Physics of Language Models: Part 4.1 - Canon layers (depthwise causal convolutions)
- Titans - Memory-as-Context architecture
Training Efficiency
- LLaDA2.0: Scaling Up Diffusion Language Models to 100B - The WSD method for AR→Diffusion conversion
Datasets
- Pre-training Dataset Samples - 1B token dataset samples used in this work