Simple and Effective Masked Diffusion Language Models
Paper: [arXiv:2406.07524](https://arxiv.org/abs/2406.07524)
A language diffusion model built on ModernBERT-base, pretrained on Project Gutenberg using a masked diffusion objective.
This is the base pretrained checkpoint, before supervised fine-tuning (SFT) for instruction following. For an instruction-following variant, see JaydeepR/ldm-modernbert-base-sft.
| Property | Value |
|---|---|
| Base model | ModernBERT-base |
| Parameters | ~150M |
| Architecture | Masked Language Model (diffusion objective) |
| Pretrain data | Project Gutenberg (6,400,553 train chunks, seq_len=1024) |
| Pretrain steps | 30,000 |
| Effective batch size | 128 |
| Learning rate | 5e-5 (cosine, 1500 warmup steps) |
| Hardware | RTX 4090 24GB |
| Training time | ~20 hours |
| Initial train loss | 3.887 |
| Initial val loss | 3.922 |
| Final train loss | 2.917 |
| Final val loss | 2.962 |
The model is pretrained with a masked diffusion objective: at each step, a random fraction t of the tokens is masked, and the model learns to predict the original tokens at the masked positions. The loss is scaled by 1/t to account for the difficulty of predicting heavily masked sequences.
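A minimal PyTorch sketch of this objective (illustrative only; the function name and details are assumptions, not the actual training code, and edge cases such as an empty mask are omitted):

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, input_ids, mask_token_id):
    # Sample a masking fraction t, bounded away from 0 for stability
    t = torch.empty(1).uniform_(1e-3, 1.0).item()
    # Mask each token independently with probability t
    mask = torch.rand(input_ids.shape, device=input_ids.device) < t
    corrupted = input_ids.masked_fill(mask, mask_token_id)
    # Predict the original tokens at the masked positions
    logits = model(input_ids=corrupted).logits
    ce = F.cross_entropy(logits[mask], input_ids[mask])
    # Scale the loss by 1/t, as described above
    return ce / t
```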
To load the checkpoint and prepare an all-mask input:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
from safetensors.torch import load_file
import torch

# Load the ModernBERT-base architecture, then apply this checkpoint's weights
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()

# Unconditional generation: start from a sequence of all mask tokens
seq_len = 128
input_tokens = torch.full((1, seq_len), tokenizer.mask_token_id, dtype=torch.long)
```
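From this all-mask state, generation proceeds by iteratively unmasking positions over a number of denoising steps. Below is a minimal confidence-based unmasking loop as a sketch; the step count and the unmasking rule are illustrative assumptions, not the repo's exact sampler:

```python
num_steps = 32                        # denoising steps (illustrative choice)
tokens_per_step = seq_len // num_steps

with torch.no_grad():
    for _ in range(num_steps):
        logits = model(input_ids=input_tokens).logits
        conf, preds = logits.softmax(dim=-1).max(dim=-1)
        # Only positions that are still masked are candidates
        still_masked = input_tokens == tokenizer.mask_token_id
        conf = conf.masked_fill(~still_masked, -1.0)
        k = min(tokens_per_step, int(still_masked.sum()))
        if k == 0:
            break
        # Commit the k highest-confidence predictions this step
        idx = conf.topk(k, dim=-1).indices
        input_tokens.scatter_(1, idx, preds.gather(1, idx))

print(tokenizer.decode(input_tokens[0], skip_special_tokens=True))
```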
Or use the provided scripts from the GitHub repo:
```bash
# Generate a GIF of the unconditional denoising process
bash create_gif.sh
```
Built following the approach from *Simple and Effective Masked Diffusion Language Models* (arXiv:2406.07524).