The Optimal Architecture for Small Language Models
We trained 19 model configurations across 12 architecture families on 1 billion tokens each. The result? A surprising discovery about what really matters for small language models and a new architecture that's 3.8x faster with better factuality.
TL;DR
In our previous article, we found that 50% FinePDFs + 30% DCLM + 20% FineWeb-Edu is the optimal dataset mix for training GPT-2, achieving 38.50% average accuracy. But that used the standard 12-layer architecture.
What if we could do even better by changing the model itself?
We ran 19 experiments to find out:
- 7 GPT-2 variants with radically different depth-width ratios (4→64 layers)
- 12 architecture families including LLaMA3, Gemma3, Qwen3, MoE, diffusion models, and novel hybrids
Here's what surprised us:
| Finding | Why It Matters |
|---|---|
| Models cluster into exactly two performance tiers | ~38% vs ~32%—with almost nothing in between |
| Hidden dimension ≥512 is a reliable threshold | Below it, only the 32- and 64-layer depths compensate |
| 32 layers beats 12 layers | 38.50% vs 38.15% at the same parameter count |
| All 12 architectures perform within ~2% | LLaMA3, Qwen3, GPT-2—they're all nearly identical at 70M |
| Diffusion models are 3.8x faster | 183 tok/s vs 48 tok/s with parallel token generation |
| Diffusion models have the best factuality | 49.27% TruthfulQA—highest of any architecture |
| AR→Diffusion conversion needs only 100M tokens | 10x more efficient than training from scratch |
The result: Dhara-70M, a diffusion model that sacrifices 1.33% accuracy for 3.8x throughput and superior factuality.
The Problem: What's the Optimal Architecture for Small Models?
Our previous work established that 50% FinePDFs + 30% DCLM + 20% FineWeb-Edu is optimal for training small models. With that dataset recipe fixed, we asked: Does model architecture matter as much as data composition?
The standard GPT-2 uses 12 layers with 768 hidden dimensions. But this was designed in 2019 for ~124M parameters. For a 70M model trained on 1B tokens, is this still optimal? And what about newer architectures like LLaMA, Gemma, MoE, or even diffusion language models?
We set out to systematically map the architecture design space.
Experimental Setup
To isolate the effect of architecture, we fixed everything except model design:
| Parameter | Value |
|---|---|
| Total Parameters | ~70M (range: 62-77M) |
| Training Tokens | 1 billion |
| Dataset | 50% FinePDFs + 30% DCLM + 20% FineWeb-Edu |
| Hardware | Single NVIDIA A40 GPU |
| Precision | BF16 |
| Optimizer | AdamW with cosine schedule |
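To make the fixed setup concrete, here is a minimal sketch of the optimizer and schedule; the learning rate, weight decay, and step count below are illustrative assumptions, since the table only fixes the optimizer family, schedule, precision, and hardware.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
# Illustrative hyperparameters: only "AdamW + cosine schedule, BF16, single A40"
# is fixed by the setup above; the exact values here are assumptions.
def make_optimizer(model: torch.nn.Module, total_steps: int = 20_000):
    optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler
# Forward/backward passes run in BF16, e.g.:
# with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
#     loss = model(batch).loss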
Part 1: The Depth-Width Trade-off
First, we explored how model shape affects performance by training 7 GPT-2 variants with the same ~70M parameters but radically different depth-width ratios:
| Configuration | Layers | Hidden | Params | Description |
|---|---|---|---|---|
| 4L Ultra-Wide | 4 | 768 | 68M | Maximum width, minimum depth |
| 12L Wide | 12 | 512 | 70M | Standard GPT-2 scaling |
| 16L Intermediate | 16 | 448 | 62M | Slightly deeper than standard |
| 24L Medium | 24 | 384 | 62M | Transitional depth |
| 32L Goldilocks | 32 | 384 | 77M | Deep with moderate width |
| 48L Deep | 48 | 320 | 76M | Very deep, narrow |
| 64L Deep-Narrow | 64 | 256 | 64M | Maximum depth, minimum width |
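These configurations are matched on total parameters, which are dominated by two terms: token embeddings (vocab × hidden) and the transformer blocks (roughly 12 × hidden² per layer for a standard GPT-2 block with 4x FFN expansion). A quick sanity check, assuming the GPT-2 vocabulary of 50,257 tokens; the exact counts in the table differ slightly because of positional embeddings, LayerNorm parameters, and rounding.
# Rough parameter estimate for a GPT-2-style model with tied input/output embeddings.
# Assumes the standard 4x FFN expansion; small terms (LayerNorm, biases) are ignored.
def approx_params(layers: int, hidden: int, vocab: int = 50257) -> float:
    embeddings = vocab * hidden          # token embedding matrix
    per_layer = 12 * hidden * hidden     # attention (4h^2) + 4x FFN (8h^2)
    return (embeddings + layers * per_layer) / 1e6
print(approx_params(32, 384))   # ~75.9M -> "32L Goldilocks" (listed as 77M)
print(approx_params(64, 256))   # ~63.2M -> "64L Deep-Narrow" (listed as 64M)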
Discovery #1: The Two-Tier Performance Pattern
Our first finding was completely unexpected. We expected a smooth trade-off curve—more layers for less width, or vice versa. Instead, we found a hard binary split:
| Configuration | Average Score | Tier | Gap from High |
|---|---|---|---|
| 4L Ultra-Wide | 31.98% | Low | -6.52% |
| 12L Wide | 38.15% | High | — |
| 16L Intermediate | 32.61% | Low | -5.89% |
| 24L Medium | 31.79% | Low | -6.71% |
| 32L Goldilocks | 38.50% | High | — |
| 48L Deep | 32.45% | Low | -6.05% |
| 64L Deep-Narrow | 38.21% | High | — |
The gap between tiers is substantial: more than 6 percentage points separate them, while the variance within each tier is only ~0.5%.
This bimodal distribution is notable: configurations either achieve the high tier (38%) or fall to the low tier (32%), with no intermediate performance levels observed.
Discovery #2: The Hidden Dimension Threshold
Why do some configurations succeed while others fail? We identified the critical factor: hidden_size >= 512.
| Config | Hidden | Score | Explanation |
|---|---|---|---|
| 12L | 512 | 38.15% | Meets threshold |
| 16L | 448 | 32.61% | Below threshold, depth doesn't compensate |
| 24L | 384 | 31.79% | Below threshold, depth doesn't compensate |
| 32L | 384 | 38.50% | Below threshold, but OPTIMAL depth compensates |
| 48L | 320 | 32.45% | Below threshold, suboptimal depth |
| 64L | 256 | 38.21% | Below threshold, but EXTREME depth compensates |
A clear rule emerges. Models need one of the following:
- hidden_size >= 512, OR
- Exactly 32 layers (the "Goldilocks" depth), OR
- Extreme depth (64+ layers) that compensates for the narrow width
The 16L, 24L, and 48L configurations fall into a "dead zone" - their hidden dimensions are too narrow, and their depths aren't at the sweet spots that can compensate.
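Written as code, the empirical rule from these seven runs looks like this; it is a description of our observations at ~70M parameters and 1B tokens, not a claim that it generalizes to other scales.
# Empirical rule extracted from the 7 depth-width runs above.
def predicted_tier(layers: int, hidden: int) -> str:
    if hidden >= 512:      # wide enough on its own (12L-512)
        return "high (~38%)"
    if layers == 32:       # the "Goldilocks" depth (32L-384)
        return "high (~38%)"
    if layers >= 64:       # extreme depth compensates (64L-256)
        return "high (~38%)"
    return "low (~32%)"    # the 16L/24L/48L dead zone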
Discovery #3: 32 Layers is the Goldilocks Depth
With hidden=384, the 32-layer configuration achieves the best overall score (38.50%), slightly beating even the standard 12-layer design.
| Benchmark | 12L Wide | 32L Goldilocks | Difference |
|---|---|---|---|
| MMLU | 24.11% | 25.77% | +1.66% |
| HellaSwag | 27.03% | 26.46% | -0.57% |
| ARC-Challenge | 21.67% | 22.27% | +0.60% |
| PIQA | 57.29% | 58.05% | +0.76% |
| WinoGrande | 51.46% | 52.64% | +1.18% |
| TruthfulQA | 47.31% | 45.83% | -1.48% |
| GSM8K | 0.99% | 1.21% | +0.22% |
| Average | 38.15% | 38.50% | +0.35% |
The 32-layer model wins on 5 out of 7 benchmarks, with particular strengths in:
- WinoGrande (+1.18%): Better pronoun resolution suggests deeper compositional reasoning
- MMLU (+1.66%): More layers help with academic knowledge retention
Part 2: Architecture Family Comparison
Armed with the optimal 32-layer depth, we compared 12 different architecture families:
Architectures Tested
| Architecture | Type | Parameters | Special Features |
|---|---|---|---|
| GPT-2 | Classic Transformer | 76.48M | Learned positional embeddings, LayerNorm |
| LLaMA3 | Modern Transformer | 71.25M | RoPE, RMSNorm, GQA, SiLU |
| Qwen3 | Modern Transformer | 71.25M | RoPE, RMSNorm, GQA, SiLU |
| Gemma3 | Modern Transformer | 71.27M | Sliding window attention (1024), logit capping |
| LFM2 | Hybrid Conv+Attn | ~80M | Conv-Conv-Attn pattern |
| dLLM | Diffusion LM | 71.25M | Bidirectional, masked diffusion (MDLM) |
| MoE | Mixture of Experts | 327M (67M active) | 16 experts, 2 active per token |
| Titans-MAC | Memory-Augmented | 67.76M | Neural memory modules at layers [0,7,14,21] |
| dLLM-Recursive | Diffusion LM | 76.11M | Recursive refinement module |
| LLaMA3-Canon | LLaMA3 + Canon | 71.34M | Depthwise causal convolutions |
| dLLM-Canon | Diffusion + Canon | 76.05M | Canon layers + bidirectional diffusion |
| Dhara | AR→Diffusion (WSD) | 71.34M | WSD-converted from LLaMA3-Canon |
Complete Benchmark Results
| Model | HellaSwag | PIQA | WinoGrande | ARC-C | MMLU | TruthfulQA | GSM8K | Avg |
|---|---|---|---|---|---|---|---|---|
| GPT-2 (32L) | 26.46 | 58.05 | 52.64 | 22.27 | 25.77 | 45.83 | 1.21 | 33.18 |
| LLaMA3 | 27.17 | 59.47 | 50.99 | 23.21 | 26.16 | 43.82 | 0.00 | 32.97 |
| Qwen3 | 26.85 | 59.41 | 50.91 | 18.26 | 26.62 | 44.35 | 0.15 | 32.36 |
| Gemma3 | 26.90 | 59.74 | 51.54 | 17.15 | 26.19 | 44.20 | 1.59 | 32.47 |
| LFM2 | 26.27 | 56.96 | 50.12 | 17.83 | 25.95 | 47.40 | 0.61 | 32.16 |
| LLaMA3-Canon | 26.72 | 58.81 | 51.46 | 22.27 | 26.79 | 44.82 | 1.67 | 33.22 |
| MoE | 27.30 | 59.74 | 50.20 | 19.62 | 25.69 | 47.51 | 1.06 | 33.02 |
| Titans-MAC | 26.18 | 57.02 | 48.78 | 17.24 | 25.67 | 46.26 | 1.36 | 31.79 |
| dLLM | 25.55 | 49.67 | 51.07 | 21.16 | 23.96 | 47.08 | 0.00 | 31.21 |
| dLLM-Recursive | 24.74 | 50.44 | 51.46 | 22.27 | 24.04 | 47.68 | 0.23 | 31.55 |
| dLLM-Canon | 24.67 | 50.16 | 51.46 | 22.70 | 24.02 | 49.27 | 0.38 | 31.81 |
| Dhara | 25.58 | 51.58 | 49.64 | 24.83 | 23.85 | 47.50 | 0.00 | 31.85 |
Discovery #4: Architecture Choice Has Minimal Impact at 70M Scale
Surprisingly, all 12 architecture families achieve similar benchmark accuracy:
- Autoregressive models: 32-33% average
- Diffusion models: 31-32% average
The differences are within noise at this scale. Modern architectural improvements (RMSNorm, RoPE, GQA) are designed for 7B+ models and don't provide measurable benefits at 70M parameters.
Winner: LLaMA3-Canon (33.22%) slightly edges out GPT-2 (33.18%), but the difference is not statistically significant.
Discovery #5: dLLMs Trade Accuracy for 3.8x Throughput
The real differentiation comes from inference characteristics, not accuracy:
| Model | Throughput | Accuracy | Memory | TTFT |
|---|---|---|---|---|
| LLaMA3 | 50 tok/s | 32.97% | 0.15 GB | 24 ms |
| GPT-2 (32L) | 48 tok/s | 33.18% | 0.15 GB | ~25 ms |
| MoE | 49 tok/s | 33.02% | 0.62 GB | 51 ms |
| dLLM | 289 tok/s | 31.21% | 0.31 GB | 34 ms |
| Dhara | 183 tok/s | 31.85% | 0.24 GB | 35 ms |
The trade-off is clear:
- -1.33% accuracy (31.85% vs 33.18% average)
- +3.8x throughput (183 vs 48 tok/s)
- +1.6x memory (bidirectional attention overhead)
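To make that concrete: at these measured rates, generating 1M tokens of batch output takes roughly 1,000,000 / 183 tok/s ≈ 1.5 GPU-hours with Dhara versus 1,000,000 / 48 tok/s ≈ 5.8 GPU-hours with the 32-layer GPT-2, in exchange for about 1.3 points of average accuracy.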
Discovery #6: dLLMs Excel at Factuality
One of our most surprising findings: diffusion models achieve the highest TruthfulQA scores among all architectures tested.
| Model | TruthfulQA | Rank |
|---|---|---|
| dLLM-Canon | 49.27% | #1 |
| dLLM-Recursive | 47.68% | #2 |
| MoE | 47.51% | #3 |
| Dhara | 47.50% | #4 |
| LFM2 | 47.40% | #5 |
| dLLM | 47.08% | #6 |
| GPT-2 (32L) | 45.83% | #7 |
Why might dLLMs excel at factuality? We hypothesize three contributing factors:
- Bidirectional attention allows the model to consider full context when making predictions
- Iterative refinement enables the model to "second-guess" its initial predictions across multiple denoising steps
- Non-autoregressive generation may reduce the snowball effect where early hallucinations compound into larger errors
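To make the iterative-refinement point concrete, here is a heavily simplified sketch of confidence-based masked-diffusion decoding in the spirit of MDLM-style samplers. It is not Dhara's actual generation code; the model call signature and the step count are assumptions.
import torch
# Simplified masked-diffusion decoding: start fully masked, then repeatedly
# predict all positions in parallel and commit the most confident ones.
def diffusion_decode(model, mask_id: int, length: int, steps: int = 8):
    tokens = torch.full((1, length), mask_id)          # all positions masked
    for step in range(steps):
        logits = model(tokens)                         # bidirectional pass over the full sequence
        probs = logits.softmax(dim=-1)
        confidence, candidates = probs.max(dim=-1)     # best token + confidence per position
        still_masked = tokens == mask_id
        # Unmask the most confident share of the remaining masked positions.
        k = max(1, int(still_masked.sum().item() / (steps - step)))
        conf_masked = torch.where(still_masked, confidence, torch.tensor(-1.0))
        _, top_positions = conf_masked.topk(k, dim=-1)
        tokens[0, top_positions[0]] = candidates[0, top_positions[0]]
    return tokens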
Discovery #7: Canon Layers Improve Factuality
The "Physics of Language Models" Canon layers (depthwise causal convolutions) show consistent benefits for factuality:
| Model | Without Canon | With Canon | Difference |
|---|---|---|---|
| LLaMA3 | 43.82% | 44.82% | +1.00% |
| dLLM | 47.08% | 49.27% | +2.19% |
Canon layers add only 0.13% parameter overhead but provide meaningful TruthfulQA improvements.
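For readers unfamiliar with Canon layers, the core operation is a depthwise causal 1D convolution mixed back into the residual stream. The sketch below shows that operation in PyTorch; the kernel size and exact placement are assumptions, see the Physics of Language Models paper for the precise formulation.
import torch
import torch.nn as nn
class DepthwiseCausalConv(nn.Module):
    """Minimal sketch of a Canon-style layer: a depthwise (per-channel) 1D
    convolution, left-padded so position t only sees positions <= t."""
    def __init__(self, hidden_size: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(
            hidden_size, hidden_size,
            kernel_size=kernel_size,
            groups=hidden_size,   # depthwise: one filter per channel -> tiny parameter cost
            bias=False,
        )
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden) -> conv expects (batch, hidden, seq_len)
        h = x.transpose(1, 2)
        h = nn.functional.pad(h, (self.kernel_size - 1, 0))   # causal left-padding
        return x + self.conv(h).transpose(1, 2)               # residual connection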
Discovery #8: WSD Enables 10x Training Efficiency
We find that existing autoregressive models can be converted to diffusion with 10x less training using the Warmup-Stable-Decay (WSD) method from the LLaDA 2.0 paper, which progressively trains an AR model to handle diffusion objectives:
| Phase | Description | Block Sizes | % of Training |
|---|---|---|---|
| Warmup | Progressive block size increase | 1 → 4 → 32 → 64 → 1024 | 20% |
| Stable | Full MDLM training objective | 1024 | 80% |
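A minimal sketch of the warmup curriculum implied by the table, mapping training progress to block size; splitting the 20% warmup evenly across the five block sizes is an assumption, since only the order of sizes is specified above.
# Block-size curriculum for the AR -> diffusion warmup phase.
WARMUP_BLOCK_SIZES = [1, 4, 32, 64, 1024]
WARMUP_FRACTION = 0.20
def block_size_at(progress: float) -> int:
    """progress in [0, 1] over the whole conversion run."""
    if progress >= WARMUP_FRACTION:
        return 1024    # stable phase: full MDLM objective
    stage = int(progress / WARMUP_FRACTION * len(WARMUP_BLOCK_SIZES))
    return WARMUP_BLOCK_SIZES[min(stage, len(WARMUP_BLOCK_SIZES) - 1)]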
The efficiency gains are substantial:
| Training Method | Tokens | GPU Time | Score | Cost (@ $2/hr) |
|---|---|---|---|---|
| From-scratch (dLLM-Canon) | 1B | 18h | 31.81% | ~$36 |
| WSD Conversion (Dhara) | 100M | 1.8h | 31.85% | ~$4 |
| Savings | 10x | 10x | +0.04% | $32 saved |
WSD requires 10x fewer tokens and 10x less GPU time while achieving equivalent or better results.
Notably, WSD conversion not only matches from-scratch training but outperforms it on several benchmarks:
| Benchmark | WSD | From Scratch | Improvement |
|---|---|---|---|
| ARC-Challenge | 24.83% | 22.70% | +2.13% |
| PIQA | 51.58% | 50.16% | +1.42% |
| HellaSwag | 25.58% | 24.67% | +0.91% |
The AR initialization provides learned representations that carry over to these reasoning and commonsense benchmarks, suggesting that WSD conversion preserves, and in places enhances, what the source model learned.
The Result: Dhara-70M
Based on all our discoveries, we introduce Dhara-70M, available on Hugging Face.
Dhara-70M is created by taking the best autoregressive architecture (LLaMA3-Canon) and converting it to a diffusion model using the WSD method. This gives us the best of both worlds: the strong initialization from AR pretraining, plus the throughput and factuality benefits of diffusion.
Architecture
| Specification | Value |
|---|---|
| Parameters | 71.34M |
| Layers | 32 (Goldilocks depth) |
| Hidden Size | 384 |
| FF Dimension | 1024 |
| Attention Heads | 8 |
| KV Heads | 4 (GQA) |
| Position Encoding | RoPE |
| Normalization | RMSNorm |
| Special Layers | Canon (depthwise causal convolutions) |
| Generation | Diffusion (parallel token generation) |
| Training | LLaMA3-Canon (1B tokens) → WSD conversion (100M tokens) |
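Dhara itself ships custom modeling code (hence trust_remote_code in the Quick Start below), but to put the dimensions in familiar terms, here is how the autoregressive backbone shape could be written with the standard transformers LlamaConfig. This only mirrors the table above; it is not Dhara's actual configuration file and omits the Canon layers and diffusion head.
from transformers import LlamaConfig
# Backbone dimensions from the table above, expressed as a standard LLaMA-style config.
backbone_config = LlamaConfig(
    hidden_size=384,
    intermediate_size=1024,
    num_hidden_layers=32,
    num_attention_heads=8,
    num_key_value_heads=4,    # GQA: 4 KV heads shared across 8 query heads
)
print(f"{backbone_config.num_hidden_layers} layers, "
      f"{backbone_config.hidden_size} hidden, {backbone_config.num_key_value_heads} KV heads")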
Benchmark Results
| Benchmark | Dhara-70M | GPT-2 (32L) | vs GPT-2 |
|---|---|---|---|
| HellaSwag | 25.58% | 26.46% | -0.88% |
| PIQA | 51.58% | 58.05% | -6.47% |
| WinoGrande | 49.64% | 52.64% | -3.00% |
| ARC-Challenge | 24.83% | 22.27% | +2.56% |
| MMLU | 23.85% | 25.77% | -1.92% |
| TruthfulQA | 47.50% | 45.83% | +1.67% |
| GSM8K | 0.00% | 1.21% | -1.21% |
| Average | 31.85% | 33.18% | -1.33% |
Inference Performance
| Metric | Dhara-70M | GPT-2 (32L) | Advantage |
|---|---|---|---|
| Time to First Token | 35.5 ms | ~25 ms | 1.4x slower |
| Throughput | 183.5 tok/s | ~48 tok/s | 3.8x faster |
| Peak Memory | 0.24 GB | 0.15 GB | 1.6x higher |
Quick Start
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-70m")
model = AutoModelForCausalLM.from_pretrained(
    "codelion/dhara-70m",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)
# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=50,
    temperature=0.1,
    top_p=0.5,
    top_k=5,
    repetition_penalty=1.8,
    do_sample=True,
    pad_token_id=0
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Example Output:
The future of artificial intelligence is a big challenge.
This world has the potential to improve, but this time we have no other than "the world."
The next generation will be more exciting and its very much important for our society's
ability to develop its
For high-throughput batch processing:
# Batch generation for maximum throughput
prompts = [
    "The future of artificial intelligence is",
    "The human brain is capable of",
    "Science has shown that",
    "Technology continues to evolve"
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=50,
    temperature=0.1,
    top_p=0.5,
    top_k=5,
    repetition_penalty=1.8,
    do_sample=True,
    pad_token_id=0
)
for i, output in enumerate(outputs):
    print(f"Output {i+1}: {tokenizer.decode(output, skip_special_tokens=True)}")
Key Takeaways
After training 19 model configurations across 12 architecture families, here are our core lessons:
Architecture matters less than you think at small scale: All modern architectures (LLaMA3, Qwen3, Gemma3) perform within ~1% of each other at 70M parameters.
Depth-width ratio matters more: The same architecture can yield 32% or 38% accuracy depending on layer/hidden ratio alone.
The two-tier phenomenon is real: Models don't degrade smoothly - they either work (~38%) or don't (~32%), with a hidden_size threshold of 512.
32 layers is the Goldilocks depth: For 70M parameters, 32 layers with 384 hidden beats the standard 12-layer design by 0.35%.
Diffusion models excel at factuality: Despite lower average scores, dLLMs achieve the highest TruthfulQA scores (49.27%), suggesting reduced hallucination.
3.8x throughput is achievable: Diffusion models offer dramatic throughput improvements for batch processing workloads.
WSD conversion is remarkably efficient: Converting AR to diffusion needs only 100M tokens (10x fewer than training from scratch).
Canon layers help factuality: Simple depthwise convolutions add 0.13% parameters but improve TruthfulQA by 1-2%.
For practitioners building small language models: start with the 50-30-20 data mix (from our previous work), use the 32-layer Goldilocks architecture, and consider diffusion for high-throughput applications where factuality matters.
Related Work
Our Previous Work
- The 1 Billion Token Challenge: Optimal Dataset Mixing - Finding the 50-30-20 dataset recipe
- GPT-2-70M - Our baseline GPT-2 model
Diffusion Language Models
- MDLM: Simple and Effective Masked Diffusion Language Models - The masked diffusion objective we use
- dLLM-2: Scaling Diffusion Language Models - Recent scaling study of diffusion LMs
- DREAM: Diffusion Rectification and Estimation-Adaptive Models - Training framework for diffusion models
Architecture References
- GPT-2 - Original GPT-2 architecture
- LLaMA - LLaMA architecture with RoPE and RMSNorm
- Physics of Language Models: Part 4.1 - Canon layers (depthwise causal convolutions)
- Titans - Memory-as-Context architecture
Training Efficiency
- LLaDA2.0: Scaling Up Diffusion Language Models to 100B - The WSD method for AR→Diffusion conversion
Datasets
- Pre-training Dataset Samples - 1B token dataset samples used in this work