---
language: en
license: mit
library_name: transformers
tags:
  - bert
  - masked-language-modeling
  - mlm
datasets:
  - lucadiliello/bookcorpusopen
  - wikimedia/wikipedia
---

# BERT-MLM

BERT-base (110M params) trained from scratch with the **classic masked language modeling (MLM)** objective from [Devlin et al., 2018](https://arxiv.org/abs/1810.04805).

This model is part of a paired experiment comparing classic BERT MLM training against modern diffusion language model (DLM) training. See [AntonXue/BERT-DLM](https://huggingface.co/AntonXue/BERT-DLM) for the counterpart.

## Training Objective

Standard BERT MLM: 15% of tokens are selected as prediction targets, with 80/10/10 corruption (80% replaced with `[MASK]`, 10% replaced with a random token, 10% left unchanged). Cross-entropy loss is computed on target positions only.
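
The 15% / 80-10-10 scheme above can be sketched as follows. This is a minimal illustration of the masking logic, not the training code from this repo; the function name and `-100` ignore-index convention are the usual Hugging Face/PyTorch ones.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Apply BERT-style MLM corruption: 15% targets, 80/10/10 split."""
    labels = input_ids.clone()
    # Select ~15% of positions as prediction targets.
    target = torch.rand(input_ids.shape) < mlm_prob
    labels[~target] = -100  # non-targets are ignored by cross-entropy

    corrupted = input_ids.clone()
    # 80% of targets -> [MASK]
    masked = target & (torch.rand(input_ids.shape) < 0.8)
    corrupted[masked] = mask_token_id
    # 10% of targets -> random token (half of the remaining 20%)
    random_pos = target & ~masked & (torch.rand(input_ids.shape) < 0.5)
    corrupted[random_pos] = torch.randint(vocab_size, input_ids.shape)[random_pos]
    # remaining 10% of targets are left unchanged
    return corrupted, labels
```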

## Dataset

- **BookCorpusOpen** (`lucadiliello/bookcorpusopen`) — ~17K books
- **English Wikipedia** (`wikimedia/wikipedia`, 20231101.en) — ~6.4M articles
- **Split:** 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
- **Train sequences:** 10,784,085
- **Total train tokens:** 5.52B

## Training Configuration

| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU x 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |

## Usage


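A minimal masked-token-prediction example with the `transformers` API. The repo id `AntonXue/BERT-MLM` is inferred from the counterpart link above; adjust it if the model lives elsewhere.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "AntonXue/BERT-MLM"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and take its top-5 predicted tokens.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids))
```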

## Code

Training code: [github.com/AntonXue/dBERT](https://github.com/AntonXue/dBERT)