---
language: en
license: mit
library_name: transformers
tags:
- bert
- masked-language-modeling
- mlm
datasets:
- lucadiliello/bookcorpusopen
- wikimedia/wikipedia
---

# BERT-MLM

BERT-base (110M params) trained from scratch with the **classic masked language modeling (MLM)** objective from [Devlin et al., 2018](https://arxiv.org/abs/1810.04805).

This model is part of a paired experiment comparing classic BERT MLM training against modern diffusion language model (DLM) training. See [AntonXue/BERT-DLM](https://huggingface.co/AntonXue/BERT-DLM) for the counterpart.

## Training Objective

Standard BERT MLM: 15% of tokens are selected as targets, with 80/10/10 corruption (80% replaced with `[MASK]`, 10% replaced with a random token, 10% left unchanged). Cross-entropy loss is computed on target positions only.

## Dataset

- **BookCorpusOpen** (`lucadiliello/bookcorpusopen`) — ~17K books
- **English Wikipedia** (`wikimedia/wikipedia`, 20231101.en) — ~6.4M articles
- **Split:** 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
- **Train sequences:** 10,784,085
- **Total train tokens:** 5.52B

## Training Configuration

| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU x 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |

## Usage

## Code

Training code: [github.com/AntonXue/dBERT](https://github.com/AntonXue/dBERT)
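The 80/10/10 corruption described under Training Objective can be sketched as follows. This is a minimal PyTorch sketch of the standard BERT masking recipe; the function and argument names are illustrative and not taken from the training code linked above.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_mask, mlm_prob=0.15):
    """Standard BERT MLM corruption: select ~15% of tokens as targets,
    then 80% -> [MASK], 10% -> random token, 10% left unchanged."""
    labels = input_ids.clone()
    prob = torch.full(labels.shape, mlm_prob)
    prob.masked_fill_(special_mask, 0.0)          # never select special tokens
    targets = torch.bernoulli(prob).bool()
    labels[~targets] = -100                       # loss on target positions only

    corrupted = input_ids.clone()

    # 80% of targets -> [MASK]
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & targets
    corrupted[masked] = mask_token_id

    # 10% of targets -> random token (half of the remaining 20%)
    random_tok = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & targets & ~masked
    corrupted[random_tok] = torch.randint(vocab_size, labels.shape)[random_tok]

    # remaining 10% of targets are left unchanged
    return corrupted, labels
```

Labeling non-targets with `-100` makes them ignored by `torch.nn.CrossEntropyLoss` by default, which is how the loss is restricted to target positions.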
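A minimal usage sketch for the empty Usage section above, using the standard `transformers` fill-mask pipeline. The checkpoint id `AntonXue/BERT-MLM` is an assumption inferred from the model card title and the linked BERT-DLM counterpart.

```python
from transformers import pipeline

# Checkpoint id assumed from the model card title; adjust if it differs.
fill = pipeline("fill-mask", model="AntonXue/BERT-MLM")

for pred in fill("The capital of France is [MASK]."):
    print(pred["token_str"], pred["score"])
```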