---
language: en
license: mit
library_name: transformers
tags:
- bert
- masked-language-modeling
- mlm
datasets:
- lucadiliello/bookcorpusopen
- wikimedia/wikipedia
---
# BERT-MLM
BERT-base (110M params) trained from scratch with the **classic masked language modeling (MLM)** objective from [Devlin et al., 2018](https://arxiv.org/abs/1810.04805).
This model is part of a paired experiment comparing classic BERT MLM training against modern diffusion language model (DLM) training. See [AntonXue/BERT-DLM](https://huggingface.co/AntonXue/BERT-DLM) for the counterpart.
## Training Objective
Standard BERT MLM: 15% of tokens selected as targets, with 80/10/10 corruption (80% replaced with [MASK], 10% random token, 10% unchanged). Cross-entropy loss on target positions only.
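The 15% target selection with 80/10/10 corruption can be sketched in PyTorch. This is an illustrative implementation of the scheme described above, not the card's actual training code; `mlm_mask` and its arguments are hypothetical names.

```python
import torch

def mlm_mask(input_ids, mask_token_id, vocab_size, special_mask, target_prob=0.15):
    """BERT-style MLM corruption: select 15% of positions as targets, then
    replace 80% of them with [MASK], 10% with a random token, and leave 10%
    unchanged. `special_mask` is True at positions that must never be targets
    (e.g. [CLS], [SEP], padding)."""
    labels = input_ids.clone()
    # Select target positions (15%), excluding special tokens.
    probs = torch.full(input_ids.shape, target_prob)
    probs.masked_fill_(special_mask, 0.0)
    targets = torch.bernoulli(probs).bool()
    labels[~targets] = -100  # ignored by cross-entropy

    corrupted = input_ids.clone()
    # 80% of targets -> [MASK]
    mask_sel = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & targets
    corrupted[mask_sel] = mask_token_id
    # Half of the remaining targets (10% overall) -> random token
    rand_sel = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & targets & ~mask_sel
    corrupted[rand_sel] = torch.randint(vocab_size, input_ids.shape)[rand_sel]
    # The last 10% of targets stay unchanged but still receive a label.
    return corrupted, labels
```

The loss is then computed only where `labels != -100`, matching the "cross-entropy on target positions only" description.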
## Dataset
- **BookCorpusOpen** (`lucadiliello/bookcorpusopen`) — ~17K books
- **English Wikipedia** (`wikimedia/wikipedia`, 20231101.en) — ~6.4M articles
- **Split:** 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
- **Train sequences:** 10,784,085
- **Total train tokens:** 5.52B
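The tokenize-and-pack step above can be sketched as follows. This is a minimal, hypothetical version of the preprocessing: tokenized documents are concatenated into one stream and sliced into fixed 512-token chunks, so no padding is needed.

```python
def pack_sequences(token_streams, seq_len=512):
    """Concatenate tokenized documents into a single stream and yield
    fixed-length sequences of `seq_len` tokens with no padding.
    Any trailing partial chunk is dropped."""
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```

With 5.52B train tokens this yields 10,784,085 sequences (5.52B / 512), matching the counts above.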
## Training Configuration
| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU x 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |
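The optimizer and schedule rows above correspond to a setup like the following sketch, expressed with PyTorch's built-in `AdamW` and `LambdaLR` (illustrative only; the actual training code may differ).

```python
import torch

# Toy parameter standing in for the 109.5M model parameters.
params = [torch.nn.Parameter(torch.zeros(1))]

# Settings from the table: lr 1e-4, betas (0.9, 0.999), weight decay 0.01.
opt = torch.optim.AdamW(params, lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)

# Constant LR after a 500-step linear warmup.
warmup_steps = 500
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: min(1.0, (step + 1) / warmup_steps))
```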
## Usage
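Assuming the model is published under a repo id mirroring its DLM counterpart (`AntonXue/BERT-MLM` is an assumption, not confirmed by this card), it can be loaded with the standard masked-LM classes:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "AntonXue/BERT-MLM"  # assumed repo id, mirroring AntonXue/BERT-DLM

tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo)

text = "The capital of France is [MASK]."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the top prediction at the [MASK] position.
mask_pos = (inputs.input_ids[0] == tok.mask_token_id).nonzero().item()
print(tok.decode(logits[0, mask_pos].argmax()))
```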
## Code
Training code: [github.com/AntonXue/dBERT](https://github.com/AntonXue/dBERT)