SmolDLM-144M

SmolDLM is a 144M parameter block diffusion language model — a paradigm in which text is generated by iteratively denoising blocks of masked tokens, rather than predicted one token at a time as in autoregressive models.

This is an early research checkpoint from the SmolDLM project — an open-source effort building diffusion language models from first principles, progressing from character-level toys to modern architectures.

This is a base model trained for 25K steps (~26B tokens of 100B available). See Limitations.

Try the model here: Google Colab

Architecture

| Parameter | Value |
|---|---|
| Parameters | 144.47M (tied embeddings) |
| Layers | 30 |
| Hidden dim | 576 |
| Query heads | 9 |
| KV heads | 3 (GQA 3:1) |
| Head dim | 64 |
| MLP hidden | 1,536 |
| Context length | 2,048 tokens |
| Block size | 32 tokens |
| Vocab size | 49,152 |
| Activation | SwiGLU |
| Normalization | RMSNorm (eps=1e-5) |
| Position encoding | RoPE (theta=10,000) |
| Attention | Gated Query Attention + QK-norm |
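
The output gating mentioned above can be sketched as follows. This is an illustrative reconstruction, not the model's actual code — the class name is hypothetical and `modeling_smoldlm.py` may compute the gate differently:

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Sketch of an output gate in the spirit of Gated Query Attention
    (arXiv:2505.06708): an elementwise sigmoid gate on the attention
    output, zero-initialized so every gate starts at sigmoid(0) = 0.5."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.gate_proj.weight)  # zero-init for stable training
        nn.init.zeros_(self.gate_proj.bias)

    def forward(self, attn_out: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # Gate computed from the layer input, applied to the attention output.
        return attn_out * torch.sigmoid(self.gate_proj(hidden))
```

At initialization the gate scales every attention output by exactly 0.5, keeping early training dynamics close to an ungated baseline.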
What makes this architecture different?
  • Block Diffusion — Text is generated in blocks of 32 tokens using BD3-LMs staircase attention masks. Each block is denoised over multiple steps, allowing the model to self-correct within a block.
  • Gated Query Attention (arXiv:2505.06708) — A sigmoid gate on attention output that eliminates attention sinks, zero-initialized for stable training.
  • QK-norm — Per-head RMSNorm on queries and keys for training stability.
  • MuonClip Optimizer — Newton-Schulz momentum for 2D weights + AdamW for embeddings, with QK-Clip (tau=100) for attention logit scaling.
  • Linear Noise Schedule — mask_prob = t, ELBO weight = 1/t. Mathematically simpler than cosine schedules and proven at scale by LLaDA-8B.
  • Document Packing — No right-padding; multiple documents packed per sequence with doc-boundary-aware attention and per-document RoPE reset.
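
As a rough illustration of the staircase pattern, a block-causal attention mask can be sketched as below. This is a simplification: the actual training mask (built with FlexAttention) also handles the noised copy of the current block and document boundaries.

```python
import torch

def staircase_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Block-causal ("staircase") attention: each token attends
    bidirectionally within its own block and to all earlier blocks."""
    block_idx = torch.arange(seq_len) // block_size
    # allowed[q, k] is True when query position q may attend to key position k
    return block_idx.unsqueeze(1) >= block_idx.unsqueeze(0)

mask = staircase_mask(seq_len=6, block_size=3)
# Position 0 sees all of block 0 (keys 0-2) but nothing in block 1 (keys 3-5).
```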

Quickstart

```bash
pip install torch safetensors tokenizers huggingface-hub
wget https://huggingface.co/HoangHa/smoldlm-144m/resolve/main/modeling_smoldlm.py
```

```python
from modeling_smoldlm import SmolDLM, generate

model = SmolDLM.from_pretrained("HoangHa/smoldlm-144m")
text = generate(model, prompt="The meaning of life is")
print(text)
```

Or use the CLI script directly:

```bash
wget https://huggingface.co/HoangHa/smoldlm-144m/resolve/main/generate.py

python generate.py "The meaning of life is"
python generate.py --steps 20 --temperature 0.5 "Once upon a time"
```

Generation parameters
| Parameter | Default | Description |
|---|---|---|
| steps | 10 | Denoising steps per block. More steps = higher quality, slower generation. |
| temperature | 0.7 | Sampling temperature. 0 = greedy (not recommended for dLLMs); higher = more random. |
| max_new_tokens | 256 | Maximum tokens to generate. |

Tips:

  • Use temperature >= 0.5 — greedy decoding causes mode collapse in diffusion LMs (a known artifact).
  • More steps improve coherence, especially for longer generations; try 15-20 for best quality.
  • The model generates in blocks of 32 tokens, so generation is naturally "bursty."

Load a specific checkpoint

102 checkpoints are available at 250-step intervals from step 250 to step 25,500:

```python
model = SmolDLM.from_pretrained(
    "HoangHa/smoldlm-144m",
    checkpoint="checkpoints/model_step_025000.safetensors"
)
```

How Diffusion LMs Work

Traditional language models (GPT, LLaMA) generate text one token at a time, left to right. Diffusion language models take a fundamentally different approach:

```text
Step 0:  [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]
Step 1:  [MASK] [MASK]  The   [MASK] [MASK] [MASK]  was   [MASK]
Step 2:  [MASK]  In     The   [MASK]  year  [MASK]  was   great
Step 3:   And    In     The   first   year  2024    was   great
```
  1. Start with noise — Fill a block of 32 positions with <|mask|> tokens
  2. Predict all positions — The model predicts what token belongs at every masked position simultaneously
  3. Unmask the most confident — Reveal positions where the model is most certain
  4. Iterate — Re-predict remaining positions with newly revealed context
  5. Next block — Cache the completed block and move to the next 32 positions

This parallel-within-block generation enables:

  • Self-correction — The model can revise earlier predictions within a block as context builds
  • Parallel decoding — Multiple tokens generated per forward pass
  • Non-autoregressive flexibility — No strict left-to-right constraint within blocks
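
The loop above can be sketched as a toy confidence-based unmasking routine. This is a hypothetical helper, not the repository's `generate`; `logits_fn` stands in for a model forward pass over a single block:

```python
import torch

def denoise_block(logits_fn, block_len=32, steps=10, mask_id=0):
    """Toy confidence-based unmasking loop for one block (illustrative only).
    `logits_fn` maps the current token block -> per-position logits."""
    tokens = torch.full((block_len,), mask_id)          # step 0: all <|mask|>
    masked = torch.ones(block_len, dtype=torch.bool)
    per_step = max(1, block_len // steps)               # positions revealed per step
    while masked.any():
        probs = logits_fn(tokens).softmax(dim=-1)       # predict every position
        conf, pred = probs.max(dim=-1)                  # best token + its confidence
        conf[~masked] = -1.0                            # never re-reveal finished slots
        k = min(per_step, int(masked.sum()))
        reveal = conf.topk(k).indices                   # unmask the most confident
        tokens[reveal] = pred[reveal]
        masked[reveal] = False
    return tokens
```

A real implementation would run this once per 32-token block, caching each completed block before moving to the next.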

Training

| Setting | Value |
|---|---|
| Dataset | finepdfs_50BT + dclm_30BT + fineweb_edu_20BT (100B tokens) |
| Steps | 25,500 |
| Tokens seen | ~26B |
| Hardware | H100-80GB GPUs |
| Precision | bfloat16 (AMP) |
| Optimizer | MuonClip (lr=0.02) + AdamW (lr=3e-3) |
| Schedule | WSD (warmup 7%, stable, linear decay at 80%) |
| Gradient clipping | 1.0 |
| Noise schedule | Linear (t ~ U[0.1, 1.0], ELBO weight = 1/t) |
| Framework | PyTorch + DDP |
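
Under the linear noise schedule above, a training objective looks roughly like the sketch below. This is an illustration under stated assumptions (`model` returns per-position logits; `mask_id` is token 0), not the repository's training code:

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id=0, t_min=0.1):
    """Linear-schedule masked diffusion loss (sketch): sample t ~ U[0.1, 1],
    mask each token with probability t, weight the CE loss by 1/t."""
    B, L = tokens.shape
    t = torch.rand(B, 1) * (1 - t_min) + t_min               # t ~ U[0.1, 1.0]
    mask = torch.rand(B, L) < t                              # mask_prob = t
    noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                                    # (B, L, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    return (ce * mask / t).sum() / mask.sum().clamp(min=1)   # ELBO weight 1/t
```

The loss is computed only at masked positions; the 1/t weight upweights lightly-noised samples, matching the linear schedule's ELBO.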

Training Curves


Loss, gradient norm, and throughput tracked via Trackio.

Training features
  • FlexAttention — Compiled block-sparse staircase masks (PyTorch 2.5+)
  • Liger Kernels — Fused RMSNorm + SwiGLU
  • FP8 Training — Optional tensorwise dynamic scaling on H100+ (--fp8)
  • torch.compile — Per-block compilation with Inductor
  • Document Packing — No padding waste, attention masked at document boundaries
  • Selective Activation Checkpointing — Save matmul/attention outputs, recompute cheap ops
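
Document packing as described above can be sketched with a per-position document id: attention is restricted to tokens of the same document, and positions restart at 0 at each boundary for the RoPE reset. This is a hypothetical helper; the training code builds the equivalent mask inside FlexAttention:

```python
import torch

def packed_doc_info(doc_ids: torch.Tensor):
    """For a packed sequence, return (attend, pos):
    attend[q, k] allows attention only within the same document,
    pos restarts at 0 at every document boundary (per-document RoPE)."""
    attend = doc_ids.unsqueeze(1) == doc_ids.unsqueeze(0)
    pos = torch.zeros_like(doc_ids)
    for i in range(1, doc_ids.numel()):
        pos[i] = pos[i - 1] + 1 if doc_ids[i] == doc_ids[i - 1] else 0
    return attend, pos

doc_ids = torch.tensor([0, 0, 0, 1, 1, 2])   # three packed documents
attend, pos = packed_doc_info(doc_ids)
```

In training, this per-document mask would be combined (logical AND) with the staircase attention mask.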

Tokenizer

SmolDLM uses a hybrid BPE tokenizer:

  • Merges from SmolLM2 cosmo2-tokenizer (49,138 tokens)
  • Pre-tokenizer from Qwen3 (NFC normalization + GPT-4 regex + ByteLevel)
  • 14 special tokens at IDs 0-13:
| ID | Token(s) | Purpose |
|---|---|---|
| 0 | `<\|mask\|>` | Diffusion noise token |
| 1 | `<\|endoftext\|>` | Document boundary / EOS |
| 2 | `<\|padding\|>` | Right-padding (unused with doc packing) |
| 3-7 | `<\|im_start\|>`, `<\|im_end\|>`, `<\|system\|>`, `<\|user\|>`, `<\|assistant\|>` | Chat markers |
| 8-13 | `<think>`, `</think>`, `<tool_call>`, `</tool_call>`, `<tool_response>`, `</tool_response>` | Reasoning & tools |

Limitations

  • Early checkpoint — Trained on ~26B of 100B available tokens (26%). The model has learned language structure but generation quality is limited. Coherent multi-sentence output requires further training.
  • Repetition — Like other early-stage diffusion LMs, the model can produce repetitive patterns, especially at low temperatures.
  • Greedy decoding degenerates — Temperature=0 causes mode collapse (known dLLM artifact). Always use temperature >= 0.5.
  • English only — Trained on English web text.
  • Not instruction-tuned — Base model, no chat or instruction-following capability.
  • Custom architecture — Not compatible with HuggingFace transformers AutoModel. Use the provided modeling_smoldlm.py.
  • No safety training — May generate harmful, biased, or factually incorrect content.

Related Work

SmolDLM builds on ideas from:

  • LLaDA (GSAI-ML) — Large-scale diffusion LM proving the paradigm works at 8B
  • BD3-LMs (Arriola et al.) — Block diffusion with staircase attention
  • MDLM (Sahoo et al.) — Masked diffusion framework
  • SmolLM2 (HuggingFace) — Architecture and tokenizer foundation
  • Fast-dLLM (NVIDIA) — Inference acceleration techniques
  • Gated Query Attention — Attention sink elimination

License

Apache 2.0

Citation

```bibtex
@misc{smoldlm2026,
  title   = {SmolDLM: A Small Block Diffusion Language Model},
  author  = {Hoang Ha},
  year    = {2026},
  url     = {https://huggingface.co/HoangHa/smoldlm-144m}
}
```