SmolDLM-144M
SmolDLM is a 144M parameter block diffusion language model — a new paradigm where text is generated by iteratively denoising masked token blocks, rather than predicting one token at a time like autoregressive models.
This is an early research checkpoint from the SmolDLM project — an open-source effort building diffusion language models from first principles, progressing from character-level toys to modern architectures.
This is a base model trained for 25K steps (~26B tokens of 100B available). See Limitations.
Try the model here: Google Colab
Architecture
| Hyperparameter | Value |
|---|---|
| Parameters | 144.47M (tied embeddings) |
| Layers | 30 |
| Hidden dim | 576 |
| Query heads | 9 |
| KV heads | 3 (GQA 3:1) |
| Head dim | 64 |
| MLP hidden | 1,536 |
| Context length | 2,048 tokens |
| Block size | 32 tokens |
| Vocab size | 49,152 |
| Activation | SwiGLU |
| Normalization | RMSNorm (eps=1e-5) |
| Position encoding | RoPE (theta=10,000) |
| Attention | Gated Query Attention + QK-norm |
What makes this architecture different?
- Block Diffusion — Text is generated in blocks of 32 tokens using BD3-LMs staircase attention masks. Each block is denoised over multiple steps, allowing the model to self-correct within a block.
- Gated Query Attention (arXiv:2505.06708) — A sigmoid gate on attention output that eliminates attention sinks, zero-initialized for stable training.
- QK-norm — Per-head RMSNorm on queries and keys for training stability.
- MuonClip Optimizer — Newton-Schulz momentum for 2D weights + AdamW for embeddings, with QK-Clip (tau=100) for attention logit scaling.
- Linear Noise Schedule — mask_prob = t, ELBO weight = 1/t. Mathematically simpler than cosine schedules and proven at scale by LLaDA-8B.
- Document Packing — No right-padding; multiple documents packed per sequence with doc-boundary-aware attention and per-document RoPE reset.
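The gating idea above can be illustrated in a few lines. This is a toy, scalar sketch — the actual module applies a learned per-head gate inside attention — but it shows the key property: with the gate logit zero-initialized, training starts from a neutral gate of sigmoid(0) = 0.5, and the model can learn to drive the gate toward 0 to suppress a head's output instead of forming attention sinks.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_attention_output(attn_out, gate_logit):
    """Scale the attention output by a learned sigmoid gate.

    gate_logit is zero-initialized, so training starts with a gate of
    sigmoid(0) = 0.5; a gate near 0 lets the model suppress a head's
    output entirely, removing the need for attention sinks.
    """
    g = sigmoid(gate_logit)
    return [g * v for v in attn_out]

# At initialization (gate_logit = 0), the output is simply halved.
out = gated_attention_output([0.4, -1.2, 2.0], gate_logit=0.0)
```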
Quickstart
```shell
pip install torch safetensors tokenizers huggingface-hub
wget https://huggingface.co/HoangHa/smoldlm-144m/resolve/main/modeling_smoldlm.py
```

```python
from modeling_smoldlm import SmolDLM, generate

model = SmolDLM.from_pretrained("HoangHa/smoldlm-144m")
text = generate(model, prompt="The meaning of life is")
print(text)
```
Or use the CLI script directly:
```shell
wget https://huggingface.co/HoangHa/smoldlm-144m/resolve/main/generate.py
python generate.py "The meaning of life is"
python generate.py --steps 20 --temperature 0.5 "Once upon a time"
```
Generation parameters
| Parameter | Default | Description |
|---|---|---|
| `steps` | 10 | Denoising steps per block. More steps = higher quality, slower generation. |
| `temperature` | 0.7 | Sampling temperature. 0 = greedy (not recommended for dLLMs), higher = more random. |
| `max_new_tokens` | 256 | Maximum tokens to generate. |
Tips:
- Use `temperature >= 0.5` — greedy decoding causes mode collapse in diffusion LMs (known artifact).
- More `steps` improves coherence, especially for longer generations. Try 15-20 for best quality.
- The model generates in blocks of 32 tokens, so generation is naturally "bursty."
Load a specific checkpoint
102 checkpoints are available at 250-step intervals from step 250 to step 25,500:
```python
model = SmolDLM.from_pretrained(
    "HoangHa/smoldlm-144m",
    checkpoint="checkpoints/model_step_025000.safetensors",
)
```
How Diffusion LMs Work
Traditional language models (GPT, LLaMA) generate text one token at a time, left to right. Diffusion language models take a fundamentally different approach:
```text
Step 0: [MASK] [MASK] [MASK]  [MASK] [MASK] [MASK] [MASK] [MASK]
Step 1: [MASK] [MASK] The     [MASK] [MASK] [MASK] was    [MASK]
Step 2: [MASK] In     The     [MASK] year   [MASK] was    great
Step 3: And    In     The     first  year   2024   was    great
```
1. Start with noise — Fill a block of 32 positions with `<|mask|>` tokens.
2. Predict all positions — The model predicts what token belongs at every masked position simultaneously.
3. Unmask the most confident — Reveal positions where the model is most certain.
4. Iterate — Re-predict remaining positions with newly revealed context.
5. Next block — Cache the completed block and move to the next 32 positions.
This parallel-within-block generation enables:
- Self-correction — The model can revise earlier predictions within a block as context builds
- Parallel decoding — Multiple tokens generated per forward pass
- Non-autoregressive flexibility — No strict left-to-right constraint within blocks
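The unmask-the-most-confident loop above can be sketched as follows. This is a toy version: `probs_fn` is a hypothetical stand-in for the model's forward pass (not part of the released API), returning a `(predicted_token, confidence)` pair per position.

```python
def denoise_block(block, probs_fn, mask_id, steps):
    """Toy confidence-based denoising of one block.

    probs_fn(block, i) stands in for the model's forward pass and
    returns a (predicted_token, confidence) pair for position i.
    """
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok == mask_id]
        if not masked:
            break
        preds = {i: probs_fn(block, i) for i in masked}
        # Reveal roughly 1/steps of the block per iteration,
        # highest-confidence positions first.
        k = max(1, len(masked) // (steps - step))
        for i in sorted(masked, key=lambda i: -preds[i][1])[:k]:
            block[i] = preds[i][0]
    return block

# Dummy predictor: always proposes token i + 1, more confident at later positions.
result = denoise_block([0] * 8, lambda b, i: (i + 1, float(i)), mask_id=0, steps=4)
```

Because already-revealed tokens stay fixed and re-prediction sees the updated block, later steps condition on earlier reveals — the self-correction described above.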
Training
| Setting | Value |
|---|---|
| Dataset | finepdfs_50BT + dclm_30BT + fineweb_edu_20BT (100B tokens) |
| Steps | 25,500 |
| Tokens seen | ~26B |
| Hardware | H100-80GB GPUs |
| Precision | bfloat16 (AMP) |
| Optimizer | MuonClip (lr=0.02) + AdamW (lr=3e-3) |
| Schedule | WSD (warmup 7%, stable, linear decay at 80%) |
| Gradient clipping | 1.0 |
| Noise schedule | Linear (t ~ U[0.1, 1.0], ELBO weight = 1/t) |
| Framework | PyTorch + DDP |
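The linear noise schedule in the table can be sketched like this — a minimal per-sequence illustration (the actual training code masks batched tensors):

```python
import random

def corrupt(tokens, mask_id, rng):
    """Linear noise schedule: sample t ~ U[0.1, 1.0], mask each token
    independently with probability t, and weight the loss by 1/t (ELBO)."""
    t = rng.uniform(0.1, 1.0)
    noisy = [mask_id if rng.random() < t else tok for tok in tokens]
    return noisy, 1.0 / t  # per-sample ELBO loss weight

rng = random.Random(0)
noisy, weight = corrupt(list(range(1, 9)), mask_id=0, rng=rng)
```

The 1/t weight up-weights low-noise samples, where few positions are masked but each prediction is harder to get credit for; clipping t at 0.1 keeps that weight bounded at 10.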
Training Curves
Loss, gradient norm, and throughput tracked via Trackio.
Training features
- FlexAttention — Compiled block-sparse staircase masks (PyTorch 2.5+)
- Liger Kernels — Fused RMSNorm + SwiGLU
- FP8 Training — Optional tensorwise dynamic scaling on H100+ (`--fp8`)
- torch.compile — Per-block compilation with Inductor
- Document Packing — No padding waste, attention masked at document boundaries
- Selective Activation Checkpointing — Save matmul/attention outputs, recompute cheap ops
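The staircase mask compiled by FlexAttention reduces to a simple predicate over query/key positions. A minimal sketch, ignoring the document-boundary masking the training code also applies:

```python
def staircase_allowed(q, k, block_size):
    """Block-diffusion staircase mask: a query may attend to any key whose
    block index is <= its own, so attention is bidirectional within a
    block and causal at block granularity across blocks."""
    return k // block_size <= q // block_size

# Dense view of the mask for a 6-token sequence with 2-token blocks.
mask = [[staircase_allowed(q, k, 2) for k in range(6)] for q in range(6)]
```

A predicate of this shape is what FlexAttention compiles into a block-sparse kernel, so the masked-out upper blocks are never computed.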
Tokenizer
SmolDLM uses a hybrid BPE tokenizer:
- Merges from SmolLM2 cosmo2-tokenizer (49,138 tokens)
- Pre-tokenizer from Qwen3 (NFC normalization + GPT-4 regex + ByteLevel)
- 14 special tokens at IDs 0-13:
| ID | Token | Purpose |
|---|---|---|
| 0 | `<\|mask\|>` | Diffusion noise token |
| 1 | `<\|endoftext\|>` | Document boundary / EOS |
| 2 | `<\|padding\|>` | Right-padding (unused with doc packing) |
| 3-7 | Chat markers | `<\|im_start\|>`, `<\|im_end\|>`, `<\|system\|>`, `<\|user\|>`, `<\|assistant\|>` |
| 8-13 | Reasoning & tools | `<think>`, `</think>`, `<tool_call>`, `</tool_call>`, `<tool_response>`, `</tool_response>` |
Limitations
- Early checkpoint — Trained on ~26B of 100B available tokens (26%). The model has learned language structure but generation quality is limited. Coherent multi-sentence output requires further training.
- Repetition — Like other early-stage diffusion LMs, the model can produce repetitive patterns, especially at low temperatures.
- Greedy decoding degenerates — Temperature=0 causes mode collapse (known dLLM artifact). Always use temperature >= 0.5.
- English only — Trained on English web text.
- Not instruction-tuned — Base model, no chat or instruction-following capability.
- Custom architecture — Not compatible with HuggingFace `transformers` `AutoModel`; use the provided `modeling_smoldlm.py`.
- No safety training — May generate harmful, biased, or factually incorrect content.
Related Work
SmolDLM builds on ideas from:
- LLaDA (GSAI-ML) — Large-scale diffusion LM proving the paradigm works at 8B
- BD3-LMs (Arriola et al.) — Block diffusion with staircase attention
- MDLM (Sahoo et al.) — Masked diffusion framework
- SmolLM2 (HuggingFace) — Architecture and tokenizer foundation
- Fast-dLLM (NVIDIA) — Inference acceleration techniques
- Gated Query Attention — Attention sink elimination
License
Apache 2.0
Citation
```bibtex
@misc{smoldlm2026,
  title  = {SmolDLM: A Small Block Diffusion Language Model},
  author = {Hoang Ha},
  year   = {2026},
  url    = {https://huggingface.co/HoangHa/smoldlm-144m}
}
```