SmolDLM-144M
SmolDLM is a 144M parameter block diffusion language model — a new paradigm where text is generated by iteratively denoising masked token blocks, rather than predicting one token at a time like autoregressive models.
This is an early research checkpoint from the SmolDLM project — an open-source effort building diffusion language models from first principles, progressing from character-level toys to modern architectures.
This is a base model trained for 25K steps (~26B tokens of 100B available). See Limitations.
Try the model here: Google Colab
Architecture
| Hyperparameter | Value |
|---|---|
| Parameters | 144.47M (tied embeddings) |
| Layers | 30 |
| Hidden dim | 576 |
| Query heads | 9 |
| KV heads | 3 (GQA 3:1) |
| Head dim | 64 |
| MLP hidden | 1,536 |
| Context length | 2,048 tokens |
| Block size | 32 tokens |
| Vocab size | 49,152 |
| Activation | SwiGLU |
| Normalization | RMSNorm (eps=1e-5) |
| Position encoding | RoPE (theta=10,000) |
| Attention | Gated Query Attention + QK-norm |
What makes this architecture different?
- Block Diffusion — Text is generated in blocks of 32 tokens using BD3-LMs staircase attention masks. Each block is denoised over multiple steps, allowing the model to self-correct within a block.
- Gated Query Attention (arXiv:2505.06708) — A sigmoid gate on attention output that eliminates attention sinks, zero-initialized for stable training.
- QK-norm — Per-head RMSNorm on queries and keys for training stability.
- MuonClip Optimizer — Newton-Schulz momentum for 2D weights + AdamW for embeddings, with QK-Clip (tau=100) for attention logit scaling.
- Linear Noise Schedule — mask_prob = t, ELBO weight = 1/t. Mathematically simpler than cosine schedules and proven at scale by LLaDA-8B.
- Document Packing — No right-padding; multiple documents packed per sequence with doc-boundary-aware attention and per-document RoPE reset.
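The gating idea above can be illustrated in a few lines. This is a toy, scalar sketch — the actual module applies a learned per-head gate inside attention — but it shows the key property: with the gate logit zero-initialized, training starts from a neutral gate of sigmoid(0) = 0.5, and the model can learn to drive the gate toward 0 to suppress a head's output instead of forming attention sinks.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_attention_output(attn_out, gate_logit):
    """Scale the attention output by a learned sigmoid gate.

    gate_logit is zero-initialized, so training starts with a gate of
    sigmoid(0) = 0.5; a gate near 0 lets the model suppress a head's
    output entirely, removing the need for attention sinks.
    """
    g = sigmoid(gate_logit)
    return [g * v for v in attn_out]

# At initialization (gate_logit = 0), the output is simply halved.
out = gated_attention_output([0.4, -1.2, 2.0], gate_logit=0.0)
```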
Quickstart
```shell
pip install torch safetensors tokenizers huggingface-hub
wget https://huggingface.co/HoangHa/smoldlm-144m/resolve/main/modeling_smoldlm.py
```

```python
from modeling_smoldlm import SmolDLM, generate

model = SmolDLM.from_pretrained("HoangHa/smoldlm-144m")
text = generate(model, prompt="The meaning of life is")
print(text)
```
Or use the CLI script directly:
```shell
wget https://huggingface.co/HoangHa/smoldlm-144m/resolve/main/generate.py
python generate.py "The meaning of life is"
python generate.py --steps 20 --temperature 0.5 "Once upon a time"
```
Generation parameters
| Parameter | Default | Description |
|---|---|---|
| `steps` | 10 | Denoising steps per block. More steps = higher quality, slower generation. |
| `temperature` | 0.7 | Sampling temperature. 0 = greedy (not recommended for dLLMs), higher = more random. |
| `max_new_tokens` | 256 | Maximum tokens to generate. |
Tips:
- Use `temperature >= 0.5` — greedy decoding causes mode collapse in diffusion LMs (known artifact).
- More `steps` improves coherence, especially for longer generations. Try 15-20 for best quality.
- The model generates in blocks of 32 tokens, so generation is naturally "bursty."
Load a specific checkpoint
102 checkpoints are available at 250-step intervals from step 250 to step 25,500:
```python
model = SmolDLM.from_pretrained(
    "HoangHa/smoldlm-144m",
    checkpoint="checkpoints/model_step_025000.safetensors",
)
```
How Diffusion LMs Work
Traditional language models (GPT, LLaMA) generate text one token at a time, left to right. Diffusion language models take a fundamentally different approach:
```text
Step 0: [MASK] [MASK] [MASK]  [MASK] [MASK] [MASK] [MASK] [MASK]
Step 1: [MASK] [MASK] The     [MASK] [MASK] [MASK] was    [MASK]
Step 2: [MASK] In     The     [MASK] year   [MASK] was    great
Step 3: And    In     The     first  year   2024   was    great
```
1. Start with noise — Fill a block of 32 positions with `<|mask|>` tokens.
2. Predict all positions — The model predicts what token belongs at every masked position simultaneously.
3. Unmask the most confident — Reveal positions where the model is most certain.
4. Iterate — Re-predict remaining positions with newly revealed context.
5. Next block — Cache the completed block and move to the next 32 positions.
This parallel-within-block generation enables:
- Self-correction — The model can revise earlier predictions within a block as context builds
- Parallel decoding — Multiple tokens generated per forward pass
- Non-autoregressive flexibility — No strict left-to-right constraint within blocks
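The unmask-the-most-confident loop above can be sketched as follows. This is a toy version: `probs_fn` is a hypothetical stand-in for the model's forward pass (not part of the released API), returning a `(predicted_token, confidence)` pair per position.

```python
def denoise_block(block, probs_fn, mask_id, steps):
    """Toy confidence-based denoising of one block.

    probs_fn(block, i) stands in for the model's forward pass and
    returns a (predicted_token, confidence) pair for position i.
    """
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok == mask_id]
        if not masked:
            break
        preds = {i: probs_fn(block, i) for i in masked}
        # Reveal roughly 1/steps of the block per iteration,
        # highest-confidence positions first.
        k = max(1, len(masked) // (steps - step))
        for i in sorted(masked, key=lambda i: -preds[i][1])[:k]:
            block[i] = preds[i][0]
    return block

# Dummy predictor: always proposes token i + 1, more confident at later positions.
result = denoise_block([0] * 8, lambda b, i: (i + 1, float(i)), mask_id=0, steps=4)
```

Because already-revealed tokens stay fixed and re-prediction sees the updated block, later steps condition on earlier reveals — the self-correction described above.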
Training
| Setting | Value |
|---|---|
| Dataset | finepdfs_50BT + dclm_30BT + fineweb_edu_20BT (100B tokens) |
| Steps | 25,500 |
| Tokens seen | ~26B |
| Hardware | H100-80GB GPUs |
| Precision | bfloat16 (AMP) |
| Optimizer | MuonClip (lr=0.02) + AdamW (lr=3e-3) |
| Schedule | WSD (warmup 7%, stable, linear decay at 80%) |
| Gradient clipping | 1.0 |
| Noise schedule | Linear (t ~ U[0.1, 1.0], ELBO weight = 1/t) |
| Framework | PyTorch + DDP |
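The linear noise schedule in the table can be sketched like this — a minimal per-sequence illustration (the actual training code masks batched tensors):

```python
import random

def corrupt(tokens, mask_id, rng):
    """Linear noise schedule: sample t ~ U[0.1, 1.0], mask each token
    independently with probability t, and weight the loss by 1/t (ELBO)."""
    t = rng.uniform(0.1, 1.0)
    noisy = [mask_id if rng.random() < t else tok for tok in tokens]
    return noisy, 1.0 / t  # per-sample ELBO loss weight

rng = random.Random(0)
noisy, weight = corrupt(list(range(1, 9)), mask_id=0, rng=rng)
```

The 1/t weight up-weights low-noise samples, where few positions are masked but each prediction is harder to get credit for; clipping t at 0.1 keeps that weight bounded at 10.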
Training Curves
Loss, gradient norm, and throughput tracked via Trackio.
Training features
- FlexAttention — Compiled block-sparse staircase masks (PyTorch 2.5+)
- Liger Kernels — Fused RMSNorm + SwiGLU
- FP8 Training — Optional tensorwise dynamic scaling on H100+ (`--fp8`)
- torch.compile — Per-block compilation with Inductor
- Document Packing — No padding waste, attention masked at document boundaries
- Selective Activation Checkpointing — Save matmul/attention outputs, recompute cheap ops
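The staircase mask compiled by FlexAttention reduces to a simple predicate over query/key positions. A minimal sketch, ignoring the document-boundary masking the training code also applies:

```python
def staircase_allowed(q, k, block_size):
    """Block-diffusion staircase mask: a query may attend to any key whose
    block index is <= its own, so attention is bidirectional within a
    block and causal at block granularity across blocks."""
    return k // block_size <= q // block_size

# Dense view of the mask for a 6-token sequence with 2-token blocks.
mask = [[staircase_allowed(q, k, 2) for k in range(6)] for q in range(6)]
```

A predicate of this shape is what FlexAttention compiles into a block-sparse kernel, so the masked-out upper blocks are never computed.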
Tokenizer
SmolDLM uses a hybrid BPE tokenizer:
- Merges from SmolLM2 cosmo2-tokenizer (49,138 tokens)
- Pre-tokenizer from Qwen3 (NFC normalization + GPT-4 regex + ByteLevel)
- 14 special tokens at IDs 0-13:
| ID | Token | Purpose |
|---|---|---|
| 0 | `<\|mask\|>` | Diffusion noise token |
| 1 | `<\|endoftext\|>` | Document boundary / EOS |
| 2 | `<\|padding\|>` | Right-padding (unused with doc packing) |
| 3-7 | Chat markers | `<\|im_start\|>`, `<\|im_end\|>`, `<\|system\|>`, `<\|user\|>`, `<\|assistant\|>` |
| 8-13 | Reasoning & tools | `<think>`, `</think>`, `<tool_call>`, `</tool_call>`, `<tool_response>`, `</tool_response>` |
Limitations
- Early checkpoint — Trained on ~26B of 100B available tokens (26%). The model has learned language structure but generation quality is limited. Coherent multi-sentence output requires further training.
- Repetition — Like other early-stage diffusion LMs, the model can produce repetitive patterns, especially at low temperatures.
- Greedy decoding degenerates — Temperature=0 causes mode collapse (known dLLM artifact). Always use temperature >= 0.5.
- English only — Trained on English web text.
- Not instruction-tuned — Base model, no chat or instruction-following capability.
- Custom architecture — Not compatible with HuggingFace `transformers` `AutoModel`; use the provided `modeling_smoldlm.py`.
- No safety training — May generate harmful, biased, or factually incorrect content.
Related Work
SmolDLM builds on ideas from:
- LLaDA (GSAI-ML) — Large-scale diffusion LM proving the paradigm works at 8B
- BD3-LMs (Arriola et al.) — Block diffusion with staircase attention
- MDLM (Sahoo et al.) — Masked diffusion framework
- SmolLM2 (HuggingFace) — Architecture and tokenizer foundation
- Fast-dLLM (NVIDIA) — Inference acceleration techniques
- Gated Query Attention — Attention sink elimination
License
Apache 2.0
Citation
```bibtex
@misc{smoldlm2026,
  title  = {SmolDLM: A Small Block Diffusion Language Model},
  author = {Hoang Ha},
  year   = {2026},
  url    = {https://huggingface.co/HoangHa/smoldlm-144m}
}
```