AntonXue committed (verified)
Commit dd241a8 · 1 Parent(s): 74c7eac

Add model card with dataset and training details

Files changed (1): README.md +64 -0
README.md ADDED
@@ -0,0 +1,64 @@
---
language: en
license: mit
library_name: transformers
tags:
- bert
- diffusion-language-model
- dlm
- masked-language-modeling
datasets:
- lucadiliello/bookcorpusopen
- wikimedia/wikipedia
---

# BERT-DLM

BERT-base (110M params) trained from scratch with a **modern diffusion language model (DLM)** objective: absorbing-state diffusion with a uniform noise schedule.

This model is part of a paired experiment comparing classic BERT MLM training against modern DLM training. See [AntonXue/BERT-MLM](https://huggingface.co/AntonXue/BERT-MLM) for the counterpart.

## Training Objective

Absorbing-state diffusion with a uniform schedule: sample `t ~ U(0, 1)`, mask each token independently with probability `t` (replacing it with `[MASK]`), then predict the original tokens at the masked positions. The loss is cross-entropy on masked positions with uniform time weighting (`time_weight = 1`); a sketch follows the list below.

Key differences from classic BERT MLM:
- **Variable mask rate** (0-100%) vs. a fixed 15%: the model sees the full spectrum from nearly clean to nearly destroyed
- **Always `[MASK]` replacement** (absorbing state) vs. the 80/10/10 corruption scheme
- **Uniform noise schedule**: no cosine time weighting

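A minimal PyTorch sketch of this objective (the function name, the per-sequence sampling of `t`, and the `-100` ignore index are illustrative assumptions, not necessarily the repo's exact implementation):

```python
import torch
import torch.nn.functional as F

def dlm_loss(model, input_ids, mask_token_id):
    """One absorbing-state diffusion training step with a uniform noise schedule."""
    batch_size, seq_len = input_ids.shape
    # Sample a noise level t ~ U(0, 1) -- one per sequence in this sketch.
    t = torch.rand(batch_size, 1, device=input_ids.device)
    # Mask each token independently with probability t; absorbing state = always [MASK].
    is_masked = torch.rand(batch_size, seq_len, device=input_ids.device) < t
    noisy_ids = torch.where(is_masked, torch.full_like(input_ids, mask_token_id), input_ids)
    logits = model(input_ids=noisy_ids).logits
    # Cross-entropy on masked positions only; uniform time weighting (time_weight = 1).
    labels = input_ids.masked_fill(~is_masked, -100)  # -100 is ignored by cross_entropy
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
```
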
## Dataset

- **BookCorpusOpen** (`lucadiliello/bookcorpusopen`): ~17K books
- **English Wikipedia** (`wikimedia/wikipedia`, 20231101.en): ~6.4M articles
- **Split:** 95/5 train/eval on raw documents, then tokenized and packed into 512-token sequences with no padding (see the sketch after this list)
- **Train sequences:** 10,784,085
- **Total train tokens:** 5.52B

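A rough sketch of the tokenize-and-pack step, assuming a standard BERT tokenizer, a `"text"` column, and `datasets`-style batched mapping (the actual pipeline may differ, e.g. in how it handles special tokens):

```python
from itertools import chain
from transformers import AutoTokenizer

SEQ_LEN = 512
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed tokenizer

def tokenize_fn(batch):
    # No truncation or padding; documents are concatenated and re-chunked below.
    return tokenizer(batch["text"], add_special_tokens=False)

def pack_fn(batch):
    # Concatenate all token ids, then cut into full 512-token sequences (drop remainder).
    ids = list(chain.from_iterable(batch["input_ids"]))
    n_full = len(ids) // SEQ_LEN
    return {"input_ids": [ids[i * SEQ_LEN:(i + 1) * SEQ_LEN] for i in range(n_full)]}

# e.g. dataset.map(tokenize_fn, batched=True).map(pack_fn, batched=True)
```
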
## Training Configuration

| Parameter | Value |
|---|---|
| Architecture | BERT-base (fresh random init) |
| Parameters | 109.5M |
| Sequence length | 512 |
| Global batch size | 256 (128 per GPU × 2 GPUs) |
| Training steps | 100,000 |
| Tokens seen | ~13.1B |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Constant with warmup |
| Warmup steps | 500 |
| Adam betas | (0.9, 0.999) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 2× NVIDIA H100 NVL |

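These settings translate roughly into the setup below; `train_loader` is an assumed iterator over the packed sequences, `dlm_loss` is the helper sketched above, `103` is the `[MASK]` id in the standard BERT vocabulary, and the 2-GPU data parallelism is omitted for brevity:

```python
import torch
from transformers import BertConfig, BertForMaskedLM, get_constant_schedule_with_warmup

# Fresh random init of a BERT-base-sized model (~110M parameters).
model = BertForMaskedLM(BertConfig()).cuda()
mask_token_id = 103  # [MASK] in the standard BERT vocab (assumption)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01
)
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=500)

for step, batch in zip(range(100_000), train_loader):
    with torch.autocast("cuda", dtype=torch.bfloat16):  # bf16 precision
        loss = dlm_loss(model, batch["input_ids"].cuda(), mask_token_id)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```
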
## Usage

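A minimal mask-infilling example, assuming the checkpoint loads with the standard `transformers` masked-LM classes and that the repo id matches this card's title (`AntonXue/BERT-DLM`):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "AntonXue/BERT-DLM"  # repo id inferred from the card title
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo)

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedily fill each [MASK] position with the top-1 prediction.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
print(tokenizer.decode(logits[mask_pos].argmax(dim=-1)))
```

Since training covered mask rates from 0-100%, heavily masked inputs can in principle be denoised over several such prediction rounds.
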
## Code

Training code: [github.com/AntonXue/dBERT](https://github.com/AntonXue/dBERT)