Add model card with dataset and training details
README.md (added)
---
language: en
license: mit
library_name: transformers
tags:
- bert
- diffusion-language-model
- dlm
- masked-language-modeling
datasets:
- lucadiliello/bookcorpusopen
- wikimedia/wikipedia
---

# BERT-DLM

BERT-base (110M params) trained from scratch with a **modern diffusion language model (DLM)** objective using absorbing-state diffusion with a uniform noise schedule.

This model is part of a paired experiment comparing classic BERT MLM training against modern DLM training. See [AntonXue/BERT-MLM](https://huggingface.co/AntonXue/BERT-MLM) for the counterpart.

## Training Objective

Absorbing-state diffusion with uniform schedule: sample t ~ U(0,1), mask each token independently with probability t (replacing with [MASK]), then predict original tokens at masked positions. Cross-entropy loss on masked positions with uniform time weighting (time_weight = 1).

Key differences from classic BERT MLM:
- **Variable mask rate** (0-100%) vs fixed 15% — model sees the full spectrum from nearly clean to nearly destroyed
- **Always [MASK] replacement** (absorbing state) vs 80/10/10 corruption scheme
- **Uniform noise schedule** — no cosine time weighting
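
The corruption-and-loss step described above can be written compactly. The sketch below is illustrative rather than the reference implementation: the per-sequence noise draw, the `special_tokens_mask` argument, and the function name `dlm_loss` are assumptions; see the training repo linked at the bottom for the authoritative version.

```python
import torch
import torch.nn.functional as F

def dlm_loss(model, input_ids, mask_token_id, special_tokens_mask):
    """One DLM training step: sample t ~ U(0,1), mask each (non-special) token
    independently with probability t, and take cross-entropy only on the masked
    positions with uniform time weighting (time_weight = 1)."""
    batch_size, seq_len = input_ids.shape
    device = input_ids.device

    # One noise level per sequence, drawn uniformly from (0, 1).
    t = torch.rand(batch_size, 1, device=device)

    # Absorbing-state corruption: every corrupted position becomes [MASK].
    masked = (torch.rand(batch_size, seq_len, device=device) < t) & ~special_tokens_mask
    corrupted = torch.where(masked, torch.full_like(input_ids, mask_token_id), input_ids)

    logits = model(input_ids=corrupted).logits  # [batch, seq, vocab]

    # Cross-entropy on masked positions only (-100 is ignored by F.cross_entropy).
    labels = torch.where(masked, input_ids, torch.full_like(input_ids, -100))
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), ignore_index=-100
    )
```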

## Dataset

- **BookCorpusOpen** (`lucadiliello/bookcorpusopen`) — ~17K books
- **English Wikipedia** (`wikimedia/wikipedia`, 20231101.en) — ~6.4M articles
- **Split:** 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
- **Train sequences:** 10,784,085
- **Total train tokens:** 5.52B
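
A rough sketch of the packing step, assuming the standard concatenate-and-chunk pattern with `datasets.map` (the exact pipeline here may differ):

```python
from itertools import chain

BLOCK_SIZE = 512

def pack(batch):
    # Concatenate all tokenized documents in the batch, then cut into fixed
    # 512-token blocks so no padding is needed; the trailing partial block is dropped.
    flat = list(chain.from_iterable(batch["input_ids"]))
    total = (len(flat) // BLOCK_SIZE) * BLOCK_SIZE
    return {"input_ids": [flat[i : i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]}

# packed = tokenized.map(pack, batched=True, remove_columns=tokenized.column_names)
```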

## Training Configuration

| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU x 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |
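
These hyperparameters map onto standard PyTorch / `transformers` utilities. A minimal sketch under that assumption (bf16 autocast, data loading, and multi-GPU setup omitted; `BertConfig()` defaults to the BERT-base shape):

```python
import torch
from transformers import BertConfig, BertForMaskedLM, get_constant_schedule_with_warmup

model = BertForMaskedLM(BertConfig())  # fresh random init, ~110M params

optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01
)
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=500)

# Inside each of the 100,000 training steps:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```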

## Usage
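
A minimal inference sketch, assuming this checkpoint is published as `AntonXue/BERT-DLM` and loads with the standard `transformers` masked-LM head:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "AntonXue/BERT-DLM"  # assumed repo id; adjust if the checkpoint lives elsewhere

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Predict the most likely token at each [MASK] position.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_ids = logits[mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```

Because training covered mask rates from 0% to 100%, the same model can in principle also be applied iteratively, re-masking and re-predicting a few tokens at a time in the spirit of diffusion-style decoding.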

## Code

Training code: [github.com/AntonXue/dBERT](https://github.com/AntonXue/dBERT)