The Optimal Architecture for Small Language Models

Community Article · Published December 26, 2025

We trained 19 model configurations across 12 architecture families on 1 billion tokens each. The result? A surprising discovery about what really matters for small language models and a new architecture that's 3.8x faster with better factuality.

TL;DR

In our previous article, we found that 50% FinePDFs + 30% DCLM + 20% FineWeb-Edu is the optimal dataset mix for training GPT-2, achieving 38.50% average accuracy. But that used the standard 12-layer architecture.

What if we could do even better by changing the model itself?

We ran 19 experiments to find out:

  • 7 GPT-2 variants with radically different depth-width ratios (4→64 layers)
  • 12 architecture families including LLaMA3, Gemma3, Qwen3, MoE, diffusion models, and novel hybrids

Here's what surprised us:

| Finding | Why It Matters |
|---|---|
| Models cluster into exactly two performance tiers | ~38% vs ~32%, with almost nothing in between |
| Hidden dimension ≥ 512 is a hard threshold | Below it, only 32 or 64 layers can compensate |
| 32 layers beats 12 layers | 38.50% vs 38.15% at the same parameter count |
| All 12 architectures perform within ~2% | LLaMA3, Qwen3, GPT-2: all nearly identical at 70M |
| Diffusion models are 3.8x faster | 183 tok/s vs 48 tok/s, thanks to parallel token generation |
| Diffusion models have the best factuality | 49.27% TruthfulQA, the highest of any architecture tested |
| AR→Diffusion conversion needs only 100M tokens | 10x more efficient than training from scratch |

The result: Dhara-70M, a diffusion model that sacrifices 1.33% accuracy for 3.8x throughput and superior factuality.


The Problem: What's the Optimal Architecture for Small Models?

Our previous work established that 50% FinePDFs + 30% DCLM + 20% FineWeb-Edu is optimal for training small models. With that dataset recipe fixed, we asked: Does model architecture matter as much as data composition?

The standard GPT-2 uses 12 layers with a hidden dimension of 768. But that shape was designed in 2019 for a ~124M-parameter model. For a 70M model trained on 1B tokens, is it still optimal? And what about newer architectures like LLaMA, Gemma, MoE, or even diffusion language models?

We set out to systematically map the architecture design space.


Experimental Setup

To isolate the effect of architecture, we fixed everything except model design:

| Parameter | Value |
|---|---|
| Total parameters | ~70M (range: 62-77M) |
| Training tokens | 1 billion |
| Dataset | 50% FinePDFs + 30% DCLM + 20% FineWeb-Edu |
| Hardware | Single NVIDIA A40 GPU |
| Precision | BF16 |
| Optimizer | AdamW with cosine schedule |

(Figure: dataset composition)
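
For concreteness, the loop below sketches this training setup. The learning rate, weight decay, and step counts are illustrative placeholders, since the exact values aren't listed above; only AdamW, the cosine schedule, and BF16 come from the table.

```python
import torch
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

def train(model, dataloader, total_steps=20_000, warmup_steps=1_000):
    """Minimal AdamW + cosine-decay loop in BF16 (hyperparameters are assumed)."""
    optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # assumed values
    scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
    model.train().cuda()
    for step, batch in enumerate(dataloader):
        if step == total_steps:
            break
        batch = {k: v.cuda() for k, v in batch.items()}
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # BF16 training
            loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```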


Part 1: The Depth-Width Trade-off

First, we explored how model shape affects performance by training 7 GPT-2 variants with the same ~70M parameters but radically different depth-width ratios:

| Configuration | Layers | Hidden | Params | Description |
|---|---|---|---|---|
| 4L Ultra-Wide | 4 | 768 | 68M | Maximum width, minimum depth |
| 12L Wide | 12 | 512 | 70M | Standard GPT-2 scaling |
| 16L Intermediate | 16 | 448 | 62M | Slightly deeper than standard |
| 24L Medium | 24 | 384 | 62M | Transitional depth |
| 32L Goldilocks | 32 | 384 | 77M | Deep with moderate width |
| 48L Deep | 48 | 320 | 76M | Very deep, narrow |
| 64L Deep-Narrow | 64 | 256 | 64M | Maximum depth, minimum width |
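
To see how such different shapes all land near 70M parameters, a back-of-the-envelope count helps: a GPT-2-style block with a 4x feed-forward carries about 12·hidden² weights per layer, plus vocab·hidden for a tied embedding. The rough estimator below (an approximation that ignores biases, norms, and positional embeddings, so it only loosely reproduces the Params column) makes the trade-off visible:

```python
def approx_params(layers: int, hidden: int, vocab: int = 50257) -> float:
    """Very rough GPT-2-style parameter count, in millions.

    Per layer: attention (4 * hidden^2) + 4x-expansion MLP (8 * hidden^2),
    plus a tied token embedding of vocab * hidden. Biases and norms ignored.
    """
    return (layers * 12 * hidden**2 + vocab * hidden) / 1e6

for layers, hidden in [(4, 768), (12, 512), (16, 448), (24, 384),
                       (32, 384), (48, 320), (64, 256)]:
    print(f"{layers:>2}L x {hidden}: ~{approx_params(layers, hidden):.0f}M")
```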

Discovery #1: The Two-Tier Performance Pattern

Our first finding was completely unexpected. We expected a smooth trade-off curve—more layers for less width, or vice versa. Instead, we found a hard binary split:

(Figure: the two-tier discovery)

| Configuration | Average Score | Tier | Gap from High Tier |
|---|---|---|---|
| 4L Ultra-Wide | 31.98% | Low | -6.52% |
| 12L Wide | 38.15% | High | |
| 16L Intermediate | 32.61% | Low | -5.89% |
| 24L Medium | 31.79% | Low | -6.71% |
| 32L Goldilocks | 38.50% | High | |
| 48L Deep | 32.45% | Low | -6.05% |
| 64L Deep-Narrow | 38.21% | High | |

The gap between tiers is substantial: more than 6 percentage points separate them, while variance within each tier is only ~0.5%.

This bimodal distribution is notable: configurations either achieve the high tier (38%) or fall to the low tier (32%), with no intermediate performance levels observed.

Discovery #2: The Hidden Dimension Threshold

Why do some configurations succeed while others fail? We identified the critical factor: hidden_size >= 512.

(Figure: hidden dimension threshold)

| Config | Hidden | Score | Explanation |
|---|---|---|---|
| 12L | 512 | 38.15% | Meets threshold |
| 16L | 448 | 32.61% | Below threshold; depth doesn't compensate |
| 24L | 384 | 31.79% | Below threshold; depth doesn't compensate |
| 32L | 384 | 38.50% | Below threshold, but optimal depth compensates |
| 48L | 320 | 32.45% | Below threshold; suboptimal depth |
| 64L | 256 | 38.21% | Below threshold, but extreme depth compensates |

The rule that emerges: models need one of the following:

  1. hidden_size >= 512, OR
  2. exactly 32 layers (the "Goldilocks" depth), OR
  3. extreme depth (64+ layers) to compensate

The 16L, 24L, and 48L configurations fall into a "dead zone": their hidden dimensions are too narrow, and their depths miss the sweet spots that can compensate.
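
For quick reference, the whole rule compresses to a few lines. This is a heuristic fit to our seven runs at ~70M parameters, not a general law:

```python
def predicted_tier(layers: int, hidden: int) -> str:
    """Heuristic summary of the 7 depth-width runs at ~70M parameters.

    The high tier (~38%) requires hidden >= 512, or one of the two depths
    (32 layers, or 64+) that we observed compensating for a narrow width.
    """
    if hidden >= 512 or layers == 32 or layers >= 64:
        return "high (~38%)"
    return "low (~32%)"
```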

Discovery #3: 32 Layers is the Goldilocks Depth

With hidden=384, the 32-layer configuration achieves the best overall score (38.50%), slightly beating even the standard 12-layer design.

(Figure: depth vs. performance)

| Benchmark | 12L Wide | 32L Goldilocks | Difference |
|---|---|---|---|
| MMLU | 24.11% | 25.77% | +1.66% |
| HellaSwag | 27.03% | 26.46% | -0.57% |
| ARC-Challenge | 21.67% | 22.27% | +0.60% |
| PIQA | 57.29% | 58.05% | +0.76% |
| WinoGrande | 51.46% | 52.64% | +1.18% |
| TruthfulQA | 47.31% | 45.83% | -1.48% |
| GSM8K | 0.99% | 1.21% | +0.22% |
| Average | 38.15% | 38.50% | +0.35% |

The 32-layer model wins on 5 out of 7 benchmarks, with particular strengths in:

  • WinoGrande (+1.18%): Better pronoun resolution suggests deeper compositional reasoning
  • MMLU (+1.66%): More layers help with academic knowledge retention

Part 2: Architecture Family Comparison

Armed with the optimal 32-layer depth, we compared 12 different architecture families:

Architectures Tested

| Architecture | Type | Parameters | Special Features |
|---|---|---|---|
| GPT-2 | Classic Transformer | 76.48M | Learned positional embeddings, LayerNorm |
| LLaMA3 | Modern Transformer | 71.25M | RoPE, RMSNorm, GQA, SiLU |
| Qwen3 | Modern Transformer | 71.25M | RoPE, RMSNorm, GQA, SiLU |
| Gemma3 | Modern Transformer | 71.27M | Sliding window attention (1024), logit capping |
| LFM2 | Hybrid Conv+Attn | ~80M | Conv-Conv-Attn pattern |
| dLLM | Diffusion LM | 71.25M | Bidirectional, masked diffusion (MDLM) |
| MoE | Mixture of Experts | 327M (67M active) | 16 experts, 2 active per token |
| Titans-MAC | Memory-Augmented | 67.76M | Neural memory modules at layers [0, 7, 14, 21] |
| dLLM-Recursive | Diffusion LM | 76.11M | Recursive refinement module |
| LLaMA3-Canon | LLaMA3 + Canon | 71.34M | Depthwise causal convolutions |
| dLLM-Canon | Diffusion + Canon | 76.05M | Canon layers + bidirectional diffusion |
| Dhara | AR→Diffusion (WSD) | 71.34M | WSD-converted from LLaMA3-Canon |
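
The MoE row deserves a note: 327M total parameters but only ~67M active per token, because a router sends each token through just 2 of the 16 experts. A minimal sketch of top-k routing (the names and shapes here are illustrative, not our exact implementation):

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, k=2):
    """Top-k expert routing: each token runs through only k of the experts.

    x: (tokens, dim); router: nn.Linear(dim, num_experts); experts: list of FFNs.
    Total parameters scale with len(experts); active compute scales with k.
    """
    gate_probs = F.softmax(router(x), dim=-1)          # (tokens, num_experts)
    weights, chosen = gate_probs.topk(k, dim=-1)       # pick k experts per token
    weights = weights / weights.sum(-1, keepdim=True)  # renormalize the k gates
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = chosen[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out
```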

Complete Benchmark Results

| Model | HellaSwag | PIQA | WinoGrande | ARC-C | MMLU | TruthfulQA | GSM8K | Avg |
|---|---|---|---|---|---|---|---|---|
| GPT-2 (32L) | 26.46 | 58.05 | 52.64 | 22.27 | 25.77 | 45.83 | 1.21 | 33.18 |
| LLaMA3 | 27.17 | 59.47 | 50.99 | 23.21 | 26.16 | 43.82 | 0.00 | 32.97 |
| Qwen3 | 26.85 | 59.41 | 50.91 | 18.26 | 26.62 | 44.35 | 0.15 | 32.36 |
| Gemma3 | 26.90 | 59.74 | 51.54 | 17.15 | 26.19 | 44.20 | 1.59 | 32.47 |
| LFM2 | 26.27 | 56.96 | 50.12 | 17.83 | 25.95 | 47.40 | 0.61 | 32.16 |
| LLaMA3-Canon | 26.72 | 58.81 | 51.46 | 22.27 | 26.79 | 44.82 | 1.67 | 33.22 |
| MoE | 27.30 | 59.74 | 50.20 | 19.62 | 25.69 | 47.51 | 1.06 | 33.02 |
| Titans-MAC | 26.18 | 57.02 | 48.78 | 17.24 | 25.67 | 46.26 | 1.36 | 31.79 |
| dLLM | 25.55 | 49.67 | 51.07 | 21.16 | 23.96 | 47.08 | 0.00 | 31.21 |
| dLLM-Recursive | 24.74 | 50.44 | 51.46 | 22.27 | 24.04 | 47.68 | 0.23 | 31.55 |
| dLLM-Canon | 24.67 | 50.16 | 51.46 | 22.70 | 24.02 | 49.27 | 0.38 | 31.81 |
| Dhara | 25.58 | 51.58 | 49.64 | 24.83 | 23.85 | 47.50 | 0.00 | 31.85 |

Discovery #4: Architecture Choice Has Minimal Impact at 70M Scale

Surprisingly, all 12 architecture families achieve similar benchmark accuracy:

  • High tier (AR models): 32-33% average
  • Low tier (Diffusion models): 31-32% average

The differences are within noise at this scale. Modern architectural improvements (RMSNorm, RoPE, GQA) are designed for 7B+ models and don't provide measurable benefits at 70M parameters.

Winner: LLaMA3-Canon (33.22%) slightly edges out GPT-2 (33.18%), but the difference is not statistically significant.
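
To illustrate how small these "modern" deltas are: RMSNorm, used by LLaMA3, Qwen3, and Gemma3, differs from GPT-2's LayerNorm only by dropping the mean-centering and bias term. A minimal sketch:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMS normalization: scale by the root-mean-square, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```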

Discovery #5: dLLMs Trade Accuracy for 3.8x Throughput

The real differentiation comes from inference characteristics, not accuracy:

(Figure: throughput vs. accuracy)

| Model | Throughput | Accuracy | Peak Memory | TTFT |
|---|---|---|---|---|
| LLaMA3 | 50 tok/s | 32.97% | 0.15 GB | 24 ms |
| GPT-2 (32L) | 48 tok/s | 33.18% | 0.15 GB | ~25 ms |
| MoE | 49 tok/s | 33.02% | 0.62 GB | 51 ms |
| dLLM | 289 tok/s | 31.21% | 0.31 GB | 34 ms |
| Dhara | 183 tok/s | 31.85% | 0.24 GB | 35 ms |

The trade-off is clear:

  • -1.33% accuracy (31.85% vs 33.18% average)
  • 3.8x higher throughput (183 vs 48 tok/s)
  • 1.6x higher memory (bidirectional attention overhead)
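
Where does the speedup come from? An AR model emits one token per forward pass, while a masked-diffusion model predicts every masked position at once and unmasks the most confident ones over a handful of steps. A minimal sketch of this style of sampler (schematic only: Dhara's actual mask token, step count, and confidence rule may differ):

```python
import torch

@torch.no_grad()
def diffusion_generate(model, prompt_ids, gen_len=50, steps=8, mask_id=0):
    """Iteratively fill a masked canvas, unmasking the most confident tokens.

    prompt_ids: (1, prompt_len) tensor. Returns (1, prompt_len + gen_len).
    Assumes an HF-style bidirectional model whose output exposes .logits.
    """
    device = prompt_ids.device
    canvas = torch.full((1, gen_len), mask_id, dtype=torch.long, device=device)
    seq = torch.cat([prompt_ids, canvas], dim=1)
    still_masked = torch.zeros(seq.shape[1], dtype=torch.bool, device=device)
    still_masked[prompt_ids.shape[1]:] = True

    for step in range(steps):
        logits = model(seq).logits            # ONE pass predicts every position
        conf, pred = logits.softmax(-1).max(-1)
        # Unmask a fraction of the remaining slots, most confident first.
        n_remaining = int(still_masked.sum())
        k = max(1, n_remaining // (steps - step))
        conf_masked = conf[0].masked_fill(~still_masked, float("-inf"))
        idx = conf_masked.topk(k).indices
        seq[0, idx] = pred[0, idx]
        still_masked[idx] = False
        if not still_masked.any():
            break
    return seq
```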

Discovery #6: dLLMs Excel at Factuality

One of our most surprising findings: diffusion models achieve the highest TruthfulQA scores among all architectures tested.

| Model | TruthfulQA | Rank |
|---|---|---|
| dLLM-Canon | 49.27% | #1 |
| dLLM-Recursive | 47.68% | #2 |
| MoE | 47.51% | #3 |
| Dhara | 47.50% | #4 |
| LFM2 | 47.40% | #5 |
| dLLM | 47.08% | #6 |
| GPT-2 (32L) | 45.83% | #7 |

(Figure: per-task breakdown)

Why might dLLMs excel at factuality? We hypothesize three contributing factors:

  1. Bidirectional attention allows the model to consider full context when making predictions
  2. Iterative refinement enables the model to "second-guess" its initial predictions across multiple denoising steps
  3. Non-autoregressive generation may reduce the snowball effect where early hallucinations compound into larger errors

Discovery #7: Canon Layers Improve Factuality

The "Physics of Language Models" Canon layers (depthwise causal convolutions) show consistent benefits for factuality:

| Model | Without Canon | With Canon | Difference |
|---|---|---|---|
| LLaMA3 | 43.82% | 44.82% | +1.00% |
| dLLM | 47.08% | 49.27% | +2.19% |

Canon layers add only 0.13% parameter overhead but provide meaningful TruthfulQA improvements.
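
Concretely, a Canon layer is a depthwise causal 1-D convolution that lets each position mix in a few immediately preceding positions, added residually. A minimal sketch, where the kernel size and residual placement are our assumptions rather than the exact recipe:

```python
import torch
import torch.nn as nn

class CanonLayer(nn.Module):
    """Depthwise causal 1-D convolution over the sequence dimension."""
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        # groups=dim -> one small filter per channel: tiny parameter overhead
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim)
        self.kernel_size = kernel_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); pad on the left so position t only sees <= t
        h = x.transpose(1, 2)                                  # (batch, dim, seq)
        h = nn.functional.pad(h, (self.kernel_size - 1, 0))
        h = self.conv(h).transpose(1, 2)                       # (batch, seq, dim)
        return x + h                                           # residual connection
```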

Discovery #8: WSD Enables 10x Training Efficiency

We find that an existing autoregressive model can be converted to a diffusion model with 10x less training, using the Warmup-Stable-Decay (WSD) method from the LLaDA 2.0 paper, which progressively adapts an AR model to the diffusion objective:

| Phase | Description | Block Sizes | % of Training |
|---|---|---|---|
| Warmup | Progressive block size increase | 1 → 4 → 32 → 64 → 1024 | 20% |
| Stable | Full MDLM training objective | 1024 | 80% |
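
In code, this curriculum can be driven by a simple schedule function like the sketch below. Only the block sizes and the 20/80 split come from the table above; the even spacing of the warmup stages is an assumption:

```python
def wsd_block_size(progress: float, warmup_frac: float = 0.20,
                   blocks=(1, 4, 32, 64, 1024)) -> int:
    """Block size for a given training progress in [0, 1].

    During the warmup fraction, step through the block sizes in order
    (evenly spaced here, which is an assumption); afterwards stay at the
    full 1024-token block for stable MDLM training.
    """
    if progress >= warmup_frac:
        return blocks[-1]
    stage = int(progress / warmup_frac * len(blocks))
    return blocks[min(stage, len(blocks) - 1)]

# e.g. at 5% of training: wsd_block_size(0.05) -> 4
```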

(Figure: WSD efficiency)

The efficiency gains are substantial:

| Training Method | Tokens | GPU Time | Score | Cost (@ $2/hr) |
|---|---|---|---|---|
| From scratch (dLLM-Canon) | 1B | 18h | 31.81% | ~$36 |
| WSD conversion (Dhara) | 100M | 1.8h | 31.85% | ~$4 |
| Savings | 10x | 10x | +0.04% | $32 saved |

WSD requires 10x fewer tokens and 10x less GPU time while achieving equivalent or better results.

Notably, WSD conversion not only matches from-scratch training but outperforms it on several benchmarks:

| Benchmark | WSD | From Scratch | Improvement |
|---|---|---|---|
| ARC-Challenge | 24.83% | 22.70% | +2.13% |
| PIQA | 51.58% | 50.16% | +1.42% |
| HellaSwag | 25.58% | 24.67% | +0.91% |

The AR initialization provides learned representations that benefit factual knowledge tasks, suggesting that WSD conversion preserves and potentially enhances knowledge from the source model.


The Result: Dhara-70M

Based on all our discoveries, we introduce Dhara-70M, available on Hugging Face.

Dhara-70M is created by taking the best autoregressive architecture (LLaMA3-Canon) and converting it to a diffusion model using the WSD method. This gives us the best of both worlds: the strong initialization from AR pretraining, plus the throughput and factuality benefits of diffusion.

Architecture

| Specification | Value |
|---|---|
| Parameters | 71.34M |
| Layers | 32 (Goldilocks depth) |
| Hidden size | 384 |
| FF dimension | 1024 |
| Attention heads | 8 |
| KV heads | 4 (GQA) |
| Position encoding | RoPE |
| Normalization | RMSNorm |
| Special layers | Canon (depthwise causal convolutions) |
| Generation | Diffusion (parallel token generation) |
| Training | LLaMA3-Canon (1B tokens) → WSD conversion (100M tokens) |

Benchmark Results

| Benchmark | Dhara-70M | GPT-2 (32L) | vs GPT-2 |
|---|---|---|---|
| HellaSwag | 25.58% | 26.46% | -0.88% |
| PIQA | 51.58% | 58.05% | -6.47% |
| WinoGrande | 49.64% | 52.64% | -3.00% |
| ARC-Challenge | 24.83% | 22.27% | +2.56% |
| MMLU | 23.85% | 25.77% | -1.92% |
| TruthfulQA | 47.50% | 45.83% | +1.67% |
| GSM8K | 0.00% | 1.21% | -1.21% |
| Average | 31.85% | 33.18% | -1.33% |

Inference Performance

| Metric | Dhara-70M | GPT-2 (32L) | Ratio |
|---|---|---|---|
| Time to first token | 35.5 ms | ~25 ms | 1.4x slower |
| Throughput | 183.5 tok/s | ~48 tok/s | 3.8x faster |
| Peak memory | 0.24 GB | 0.15 GB | 1.6x higher |

Quick Start

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-70m")
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-70m",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=50,
    temperature=0.1,
    top_p=0.5,
    top_k=5,
    repetition_penalty=1.8,
    do_sample=True,
    pad_token_id=0
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Example Output:

```
The future of artificial intelligence is a big challenge.
This world has the potential to improve, but this time we have no other than "theworld."
The next generation will be more exciting and its very much important for our society's
abilityto develop its
```

For high-throughput batch processing:

```python
# Batch generation for maximum throughput
prompts = [
    "The future of artificial intelligence is",
    "The human brain is capable of",
    "Science has shown that",
    "Technology continues to evolve"
]

# Batched inputs need padding; if the tokenizer lacks a pad token, reuse EOS
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=50,
    temperature=0.1,
    top_p=0.5,
    top_k=5,
    repetition_penalty=1.8,
    do_sample=True,
    pad_token_id=0
)

for i, output in enumerate(outputs):
    print(f"Output {i+1}: {tokenizer.decode(output, skip_special_tokens=True)}")
```

Key Takeaways

After training 19 model configurations across 12 architecture families, here are our core lessons:

  1. Architecture matters less than you think at small scale: All modern architectures (LLaMA3, Qwen3, Gemma3) perform within ~1% of each other at 70M parameters.

  2. Depth-width ratio matters more: The same architecture can yield 32% or 38% accuracy depending on layer/hidden ratio alone.

  3. The two-tier phenomenon is real: Models don't degrade smoothly; they either work (~38%) or don't (~32%), with a hidden_size threshold of 512.

  4. 32 layers is the Goldilocks depth: For 70M parameters, 32 layers with 384 hidden beats the standard 12-layer design by 0.35%.

  5. Diffusion models excel at factuality: Despite lower average scores, dLLMs achieve the highest TruthfulQA scores (49.27%), suggesting reduced hallucination.

  6. 3.8x throughput is achievable: Diffusion models offer dramatic throughput improvements for batch processing workloads.

  7. WSD conversion is remarkably efficient: Converting AR to diffusion needs only 100M tokens (10x fewer than training from scratch).

  8. Canon layers help factuality: Simple depthwise convolutions add 0.13% parameters but improve TruthfulQA by 1-2%.

For practitioners building small language models: start with the 50-30-20 data mix (from our previous work), use the 32-layer Goldilocks architecture, and consider diffusion for high-throughput applications where factuality matters.


