---
language: en
license: mit
library_name: transformers
tags:
- bert
- masked-language-modeling
- mlm
datasets:
- lucadiliello/bookcorpusopen
- wikimedia/wikipedia
---
# BERT-MLM
BERT-base (110M params) trained from scratch with the classic masked language modeling (MLM) objective from Devlin et al., 2018.
This model is part of a paired experiment comparing classic BERT MLM training against modern diffusion language model (DLM) training. See [AntonXue/BERT-DLM](https://huggingface.co/AntonXue/BERT-DLM) for the counterpart.
## Training Objective
Standard BERT MLM: 15% of tokens are selected as prediction targets, with 80/10/10 corruption (80% replaced with `[MASK]`, 10% replaced with a random token, 10% left unchanged). Cross-entropy loss is computed on target positions only.
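The 80/10/10 corruption can be sketched as follows. This is a minimal illustration of the masking scheme described above, not the exact training code (the repo linked below contains the real implementation); `mask_token_id` and `vocab_size` come from the tokenizer.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style MLM corruption: pick 15% of positions as targets,
    then replace 80% of targets with [MASK], 10% with a random token,
    and leave 10% unchanged. Modifies input_ids in place."""
    labels = input_ids.clone()
    # Select target positions (15% of tokens).
    target_mask = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~target_mask] = -100  # -100 is ignored by cross-entropy loss

    # 80% of targets -> [MASK]
    mask_replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & target_mask
    input_ids[mask_replace] = mask_token_id

    # 10% of targets -> random token (half of the remaining 20%)
    random_replace = (
        torch.bernoulli(torch.full(labels.shape, 0.5)).bool()
        & target_mask
        & ~mask_replace
    )
    random_tokens = torch.randint(vocab_size, labels.shape)
    input_ids[random_replace] = random_tokens[random_replace]

    # The remaining 10% of targets keep their original token.
    return input_ids, labels
```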
## Dataset
- BookCorpusOpen (`lucadiliello/bookcorpusopen`) — ~17K books
- English Wikipedia (`wikimedia/wikipedia`, `20231101.en`) — ~6.4M articles
- Split: 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
- Train sequences: 10,784,085
- Total train tokens: 5.52B
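The pack-without-padding step can be sketched as below: tokenized documents are concatenated into one stream and sliced into fixed 512-token chunks. This is an assumed minimal version of the packing described above; the actual preprocessing lives in the training repo.

```python
def pack_sequences(token_streams, seq_len=512):
    """Concatenate tokenized documents and slice the stream into
    fixed-length sequences, so no padding tokens are needed.
    A final tail shorter than seq_len is dropped."""
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```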
## Training Configuration
| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU x 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |
## Usage
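Assuming the checkpoint loads as a standard `BertForMaskedLM`, the model can be used with the `fill-mask` pipeline. The repo id below is inferred from this card's title and its DLM counterpart; adjust it if the hosted name differs.

```python
from transformers import pipeline

# Repo id assumed from the card title (AntonXue/BERT-MLM).
fill = pipeline("fill-mask", model="AntonXue/BERT-MLM")
for pred in fill("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```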
## Code
Training code: [github.com/AntonXue/dBERT](https://github.com/AntonXue/dBERT)