---
language: en
license: mit
library_name: transformers
tags:
- bert
- masked-language-modeling
- mlm
datasets:
- lucadiliello/bookcorpusopen
- wikimedia/wikipedia
---
# BERT-MLM
BERT-base (110M params) trained from scratch with the **classic masked language modeling (MLM)** objective from [Devlin et al., 2018](https://arxiv.org/abs/1810.04805).
This model is part of a paired experiment comparing classic BERT MLM training against modern diffusion language model (DLM) training. See [AntonXue/BERT-DLM](https://huggingface.co/AntonXue/BERT-DLM) for the counterpart.
## Training Objective
Standard BERT MLM: 15% of tokens are selected as prediction targets, with 80/10/10 corruption (80% replaced with `[MASK]`, 10% replaced with a random token, 10% left unchanged). Cross-entropy loss is computed on target positions only.
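The 80/10/10 corruption above can be sketched as follows. This is a minimal illustration of the standard scheme, not the exact training code (the function name and arguments are ours); `-100` is the conventional ignore index for the cross-entropy loss.

```python
import torch

def mlm_corrupt(input_ids, mask_token_id, vocab_size,
                mlm_prob=0.15, special_ids=()):
    """Classic BERT MLM corruption: pick ~15% of positions as targets,
    then 80% -> [MASK], 10% -> random token, 10% left unchanged."""
    labels = input_ids.clone()

    # Select target positions, excluding special tokens.
    prob = torch.full(input_ids.shape, mlm_prob)
    for sid in special_ids:
        prob[input_ids == sid] = 0.0
    targets = torch.bernoulli(prob).bool()
    labels[~targets] = -100  # loss computed on target positions only

    corrupted = input_ids.clone()
    # 80% of targets -> [MASK]
    mask80 = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & targets
    corrupted[mask80] = mask_token_id
    # 10% of targets -> random token (half of the remaining 20%)
    rand10 = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & targets & ~mask80
    corrupted[rand10] = torch.randint(vocab_size, input_ids.shape)[rand10]
    # remaining ~10% of targets stay unchanged
    return corrupted, labels
```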
## Dataset
- **BookCorpusOpen** ([lucadiliello/bookcorpusopen](https://huggingface.co/datasets/lucadiliello/bookcorpusopen)) — ~17K books
- **English Wikipedia** ([wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia), 20231101.en) — ~6.4M articles
- **Split:** 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
- **Train sequences:** 10,784,085
- **Total train tokens:** 5.52B
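The packing step can be sketched as follows: tokenized documents are concatenated and sliced into fixed-length sequences, so no padding tokens are needed. This is an illustrative sketch (the function name is ours), not the exact preprocessing code; here a trailing partial chunk is dropped.

```python
def pack_sequences(token_streams, seq_len=512):
    """Concatenate tokenized documents and slice into fixed-length
    sequences with no padding; a trailing partial chunk is dropped."""
    buffer = []
    for toks in token_streams:
        buffer.extend(toks)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```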
## Training Configuration
| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU x 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |
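The tokens-seen figure follows directly from the table: steps × global batch size × sequence length. Against the 5.52B packed train tokens, this works out to roughly 2.4 passes over the training set (the epoch count is inferred, not stated in the training logs).

```python
# Sanity check for the "Tokens seen" row: steps x global batch x sequence length.
steps, global_batch, seq_len = 100_000, 256, 512
tokens_seen = steps * global_batch * seq_len
train_tokens = 10_784_085 * seq_len  # "Train sequences" x 512, ~5.52B

print(f"{tokens_seen:,} tokens (~{tokens_seen / 1e9:.1f}B)")
epochs = tokens_seen / train_tokens  # roughly 2.4 epochs
```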
## Usage
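Since this is a standard BERT-style masked LM, it should load with the usual `transformers` fill-mask pipeline. A minimal sketch, assuming the hub repo id matches this card's title (`AntonXue/BERT-MLM`):

```python
from transformers import pipeline

# Repo id assumed from this card's title; adjust if it differs.
fill = pipeline("fill-mask", model="AntonXue/BERT-MLM")
print(fill("The capital of France is [MASK]."))
```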
## Code
Training code: [github.com/AntonXue/dBERT](https://github.com/AntonXue/dBERT)