---
language: en
license: mit
library_name: transformers
tags:
- bert
- masked-language-modeling
- mlm
datasets:
- lucadiliello/bookcorpusopen
- wikimedia/wikipedia
---
# BERT-MLM
BERT-base (110M params) trained from scratch with the classic masked language modeling (MLM) objective from Devlin et al., 2018.
This model is part of a paired experiment comparing classic BERT MLM training against modern diffusion language model (DLM) training. See [AntonXue/BERT-DLM](https://huggingface.co/AntonXue/BERT-DLM) for the counterpart.
## Training Objective
Standard BERT MLM: 15% of tokens are selected as prediction targets, with 80/10/10 corruption (80% replaced with `[MASK]`, 10% replaced with a random token, 10% left unchanged). Cross-entropy loss is computed on target positions only.
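The 80/10/10 corruption can be sketched as follows. This is a minimal illustration of the masking scheme described above, not the exact training code (the repo linked below contains the real implementation); `mask_token_id` and `vocab_size` come from the tokenizer.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style MLM corruption: pick 15% of positions as targets,
    then replace 80% of targets with [MASK], 10% with a random token,
    and leave 10% unchanged. Modifies input_ids in place."""
    labels = input_ids.clone()
    # Select target positions (15% of tokens).
    target_mask = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~target_mask] = -100  # -100 is ignored by cross-entropy loss

    # 80% of targets -> [MASK]
    mask_replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & target_mask
    input_ids[mask_replace] = mask_token_id

    # 10% of targets -> random token (half of the remaining 20%)
    random_replace = (
        torch.bernoulli(torch.full(labels.shape, 0.5)).bool()
        & target_mask
        & ~mask_replace
    )
    random_tokens = torch.randint(vocab_size, labels.shape)
    input_ids[random_replace] = random_tokens[random_replace]

    # The remaining 10% of targets keep their original token.
    return input_ids, labels
```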
## Dataset
- BookCorpusOpen (`lucadiliello/bookcorpusopen`) — ~17K books
- English Wikipedia (`wikimedia/wikipedia`, `20231101.en`) — ~6.4M articles
- Split: 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
- Train sequences: 10,784,085
- Total train tokens: 5.52B
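The pack-without-padding step can be sketched as below: tokenized documents are concatenated into one stream and sliced into fixed 512-token chunks. This is an assumed minimal version of the packing described above; the actual preprocessing lives in the training repo.

```python
def pack_sequences(token_streams, seq_len=512):
    """Concatenate tokenized documents and slice the stream into
    fixed-length sequences, so no padding tokens are needed.
    A final tail shorter than seq_len is dropped."""
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```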
## Training Configuration
| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU x 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |
## Usage
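Assuming the checkpoint loads as a standard `BertForMaskedLM`, the model can be used with the `fill-mask` pipeline. The repo id below is inferred from this card's title and its DLM counterpart; adjust it if the hosted name differs.

```python
from transformers import pipeline

# Repo id assumed from the card title (AntonXue/BERT-MLM).
fill = pipeline("fill-mask", model="AntonXue/BERT-MLM")
for pred in fill("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```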
## Code
Training code: [github.com/AntonXue/dBERT](https://github.com/AntonXue/dBERT)