---
language: en
license: mit
library_name: transformers
tags:
- bert
- masked-language-modeling
- mlm
datasets:
- lucadiliello/bookcorpusopen
- wikimedia/wikipedia
---

# BERT-MLM

BERT-base (110M params) trained from scratch with the **classic masked language modeling (MLM)** objective from [Devlin et al., 2018](https://arxiv.org/abs/1810.04805).

This model is part of a paired experiment comparing classic BERT MLM training against modern diffusion language model (DLM) training. See [AntonXue/BERT-DLM](https://huggingface.co/AntonXue/BERT-DLM) for the counterpart.

## Training Objective

Standard BERT MLM: 15% of tokens are selected as prediction targets, with 80/10/10 corruption (80% replaced with `[MASK]`, 10% replaced with a random token, 10% left unchanged). Cross-entropy loss is computed on target positions only.
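The corruption step can be sketched as below; `mask_tokens` is a hypothetical helper for illustration, not the actual training code.

```python
import random

def mask_tokens(tokens, vocab_size, mask_id, mlm_prob=0.15, rng=None):
    """BERT-style MLM corruption: select ~mlm_prob of positions as targets,
    then replace 80% of targets with [MASK], 10% with a random token, and
    leave 10% unchanged. Labels are -100 at non-target positions so the
    cross-entropy loss ignores them."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [-100] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mlm_prob:
            labels[i] = tok                               # loss is computed here
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = mask_id                    # 80%: [MASK]
            elif roll < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% left unchanged
    return corrupted, labels
```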

## Dataset

- **BookCorpusOpen** (`lucadiliello/bookcorpusopen`) — ~17K books
- **English Wikipedia** (`wikimedia/wikipedia`, 20231101.en) — ~6.4M articles
- **Split:** 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences (no padding)
- **Train sequences:** 10,784,085
- **Total train tokens:** 5.52B

## Training Configuration

| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU x 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2x NVIDIA H100 NVL |
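As a quick sanity check on the table above, the tokens-seen figure is just steps x global batch x sequence length:

```python
steps = 100_000
global_batch = 256
seq_len = 512

tokens_seen = steps * global_batch * seq_len
print(f"{tokens_seen:,}")  # 13,107,200,000, i.e. ~13.1B

# Against the 5.52B unique training tokens, this is roughly 2.4 epochs.
epochs = tokens_seen / 5.52e9
print(f"{epochs:.2f}")
```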

## Usage
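A minimal fill-mask example. The repo id `AntonXue/BERT-MLM` is assumed from this card's title and the counterpart link; adjust it if the hub path differs.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "AntonXue/BERT-MLM"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo)

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top prediction at the [MASK] position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
pred_id = logits[0, mask_pos].argmax(-1).item()
print(tokenizer.decode([pred_id]))
```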
## Code

Training code: [github.com/AntonXue/dBERT](https://github.com/AntonXue/dBERT)