---
language: en
license: mit
library_name: transformers
tags:
- bert
- masked-language-modeling
- mlm
datasets:
- lucadiliello/bookcorpusopen
- wikimedia/wikipedia
---

# BERT-MLM

BERT-base (110M params) trained from scratch with the **classic masked language modeling (MLM)** objective from [Devlin et al., 2018](https://arxiv.org/abs/1810.04805).

This model is part of a paired experiment comparing classic BERT MLM training against modern diffusion language model (DLM) training. See [AntonXue/BERT-DLM](https://huggingface.co/AntonXue/BERT-DLM) for the counterpart.

## Training Objective

Standard BERT MLM: 15% of tokens are selected as prediction targets, with 80/10/10 corruption (80% replaced with `[MASK]`, 10% replaced with a random token, 10% left unchanged). Cross-entropy loss is computed on target positions only.
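The corruption step can be sketched as below; `mask_tokens` is a hypothetical helper for illustration, not the actual training code.

```python
import random

def mask_tokens(tokens, vocab_size, mask_id, mlm_prob=0.15, rng=None):
    """BERT-style MLM corruption: select ~mlm_prob of positions as targets,
    then replace 80% of targets with [MASK], 10% with a random token, and
    leave 10% unchanged. Labels are -100 at non-target positions so the
    cross-entropy loss ignores them."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [-100] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mlm_prob:
            labels[i] = tok                               # loss is computed here
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = mask_id                    # 80%: [MASK]
            elif roll < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% left unchanged
    return corrupted, labels
```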

## Dataset

- **BookCorpusOpen** (`lucadiliello/bookcorpusopen`) — ~17K books
- **English Wikipedia** (`wikimedia/wikipedia`, 20231101.en) — ~6.4M articles
- **Split:** 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
- **Train sequences:** 10,784,085
- **Total train tokens:** 5.52B

## Training Configuration

| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU x 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |
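As a quick sanity check on the table above, the tokens-seen figure is just steps x global batch x sequence length:

```python
steps = 100_000
global_batch = 256
seq_len = 512

tokens_seen = steps * global_batch * seq_len
print(f"{tokens_seen:,}")  # 13,107,200,000, i.e. ~13.1B

# Against the 5.52B unique training tokens, this is roughly 2.4 epochs.
epochs = tokens_seen / 5.52e9
print(f"{epochs:.2f}")
```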

## Usage
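A minimal fill-mask example. The repo id `AntonXue/BERT-MLM` is assumed from this card's title and the counterpart link; adjust it if the hub path differs.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "AntonXue/BERT-MLM"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo)

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top prediction at the [MASK] position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
pred_id = logits[0, mask_pos].argmax(-1).item()
print(tokenizer.decode([pred_id]))
```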
## Code

Training code: [github.com/AntonXue/dBERT](https://github.com/AntonXue/dBERT)