kd13 committed on
Commit 3aacba5 · verified · 1 Parent(s): 95ffa1d

Update README.md

Files changed (1)
  1. README.md +91 -0
README.md CHANGED
@@ -1,4 +1,95 @@
1
  ---
2
  library_name: transformers
3
  pipeline_tag: fill-mask
4
  ---
 
1
  ---
2
  library_name: transformers
3
  pipeline_tag: fill-mask
4
+ license: mit
5
+ datasets:
6
+ - kd13/bookcorpus-clean
7
+ language:
8
+ - en
9
+ metrics:
10
+ - perplexity
11
+ tags:
12
+ - mlm
13
+ ---
14
+
15
+ # BERTmini — Custom BERT with RoPE & Pre-LN Trained from Scratch
16
+
17
+ A compact BERT-style masked language model trained entirely from scratch on BookCorpus. The architecture replaces the canonical absolute positional embeddings with **Rotary Position Embeddings (RoPE)** and adopts a **Pre-Layer Normalization** (Pre-LN) residual layout, both of which have become standard practice in modern transformer training.
18
+
19
+ ---
20
+
21
+ ## Architecture Design Choices
22
+
23
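+ **RoPE instead of learned absolute position embeddings.** Rotary embeddings encode positional information directly into the query–key dot product, so attention scores depend only on the relative offset between tokens, and no separate learnable position table is needed. In principle this permits some generalisation beyond the training window, although that has not been tested for this model.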
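The relative-offset property can be checked with a minimal, standard-library sketch. It assumes the adjacent-pair formulation of RoPE with the common base of 10000; the model's own remote code may pair dimensions differently, and the names `rope` and `dot` are illustrative only.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one frequency per pair of dimensions
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Both pairs of positions are 3 tokens apart, so the scores should agree.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(score_a, score_b)
```

The two printed scores agree up to floating-point error, which is the sense in which RoPE makes attention depend on relative rather than absolute position.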
24
+
25
+ **Pre-LN residual stream.** Layer normalisation is applied to the *input* of each sub-layer rather than the output. This stabilises gradient flow during early training and generally makes the loss curve smoother, at the cost of requiring an explicit final encoder normalisation before the prediction head.
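The difference between the two layouts can be sketched in a few lines of plain Python. This is an illustration only: `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, and the layer norm omits the learnable gain and bias.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_block(x, sublayer):
    # Normalise the sub-layer input; the residual path carries x unchanged.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x, sublayer):
    # Original Transformer layout: normalise after the residual addition.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def toy_sublayer(h):
    return [2.0 * v for v in h]  # hypothetical stand-in for attention / FFN

x = [1.0, 2.0, 3.0, 4.0]
for _ in range(3):
    x = pre_ln_block(x, toy_sublayer)

# The Pre-LN residual stream itself is never normalised, hence the
# explicit final normalisation before the prediction head.
encoder_out = layer_norm(x)
```

After the loop, `x` still carries its original mean and a growing scale, whereas `encoder_out` is normalised. That is why a Pre-LN encoder needs the extra final norm.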
26
+
27
+ **Embedding tying.** The MLM decoder projection matrix shares weights with the token embedding table, which reduces parameter count and typically improves token prediction quality.
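As a rough sketch of what tying means, the toy example below uses one table for both the input lookup and the output projection. The sizes and values are hypothetical and far smaller than the real model's.

```python
vocab_size, hidden = 5, 3

# A single table serves as both the input embedding and the output projection.
embedding = [[0.1 * (r + c) for c in range(hidden)] for r in range(vocab_size)]

def embed(token_id):
    return embedding[token_id]

def decode(h):
    # logits[v] = <h, embedding[v]>, i.e. h multiplied by the transposed table.
    return [sum(a * b for a, b in zip(h, row)) for row in embedding]

logits = decode(embed(2))

# One update to the shared table changes the input and output sides together.
embedding[4][0] += 1.0
logits_after = decode(embed(2))

# Parameters saved by not keeping a separate decoder matrix.
saved_parameters = vocab_size * hidden
```

With a realistic vocabulary the saving is one full `vocab_size × hidden` matrix, which is a noticeable share of the budget in a model this small.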
28
+
29
+ ---
30
+
31
+ ## Training Details
32
+
33
+ ### Dataset
34
+
35
+ | Split | Source | Packing |
36
+ |---|---|---|
37
+ | Train | BookCorpus | Fixed-length packed sequences (128 tokens) |
38
+ | Eval | BookCorpus (held-out) | Same packing strategy |
39
+
40
+ Sequences were packed (not padded) to maximise token utilisation per batch, similar in spirit to the full-sentence packing used for RoBERTa.
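A minimal sketch of fixed-length packing follows. It assumes documents are simply concatenated and the trailing remainder is dropped; the actual preprocessing script is not shown in this README, so `pack_sequences` and its behaviour are illustrative.

```python
def pack_sequences(tokenized_docs, block_size=128):
    """Concatenate token lists and cut the stream into equal-sized blocks."""
    stream = [tok for doc in tokenized_docs for tok in doc]
    usable = (len(stream) // block_size) * block_size  # drop the remainder
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

# Three toy "documents" of 100, 150 and 50 token ids (hypothetical data).
docs = [list(range(100)), list(range(100, 250)), list(range(250, 300))]
blocks = pack_sequences(docs)

print(len(blocks), [len(b) for b in blocks])
```

Every block is exactly 128 tokens long, so no positions are spent on padding. The trade-off is that a block can span a document boundary.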
41
+
42
+ ### Pre-training Objective
43
+
44
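+ Masked Language Modelling (MLM) with a masking probability of **30 %**, double the 15 % used by the original BERT. `DataCollatorForLanguageModeling` applies the standard 80/10/10 split to the selected tokens: 80 % become `[MASK]`, 10 % are replaced with a random token, and 10 % are left unchanged.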
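The masking rule can be approximated without any dependencies. The sketch below is a simplified re-implementation for illustration, not the collator's actual code: it does not skip special tokens, and `mask_tokens`, the seed, and the example ids are hypothetical. The label value `-100` is the ignore index conventionally used for positions that do not contribute to the loss.

```python
import random

def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.30, seed=0):
    """Return (corrupted_inputs, labels) following the 80/10/10 rule."""
    rng = random.Random(seed)
    inputs = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions ignored by the loss
    for i, tok in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80 %: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10 %: random token
        # remaining 10 %: keep the original token
    return inputs, labels

# Hypothetical ids; 103 and 30522 are the bert-base-uncased values, used
# here only as placeholders for this model's tokenizer.
ids = list(range(1000, 2000))
inputs, labels = mask_tokens(ids, mask_id=103, vocab_size=30522)
selected = sum(1 for lab in labels if lab != -100)
print(f"{selected / len(ids):.2%} of tokens selected for prediction")
```

In actual training, the equivalent setting would be `mlm_probability=0.3` on the collator.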
45
+
46
+ ---
47
+
48
+ ## Usage
49
+
50
+ ```python
51
+ import torch
52
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
53
+
54
+ model_name = "kd13/RoPERT-MLM-mini"
55
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
56
+ model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
57
+
58
+ text = "i don't have much [MASK]."
59
+ inputs = tokenizer(text, return_tensors="pt")
60
+
61
+ with torch.no_grad():
62
+     logits = model(**inputs).logits
63
+
64
+ mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
65
+ mask_token_logits = logits[0, mask_token_index, :]
66
+ top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
67
+
68
+ for token in top_5_tokens:
69
+     print(f">>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")
70
+
71
+ ```
72
+
73
+ ---
74
+
75
+ ## Limitations
76
+
77
+ - **Domain coverage.** The model was pre-trained exclusively on BookCorpus
78
+ (narrative English text). Performance on technical, scientific, or conversational
79
+ text may be weaker than models trained on broader corpora such as Wikipedia +
80
+ BookCorpus (original BERT) or C4 (T5).
81
+
82
+ - **Sequence length.** The RoPE cache and packing strategy are fixed at 128 tokens.
83
+ While RoPE theoretically supports length extrapolation, the model has not been
84
+ tested beyond this limit.
85
+
86
+ - **Scale.** At ~20 M parameters, this model is best suited for learning experiments,
87
+ fine-tuning research, or resource-constrained deployment scenarios. It is not
88
+ designed to compete with `bert-base-uncased` (110 M) or larger checkpoints on
89
+ downstream benchmarks.
90
+
91
+ - **No fine-tuning benchmarks.** GLUE/SuperGLUE evaluation has not been performed.
92
+ Perplexity on the held-out BookCorpus split (PPL ≈ 12.5) is the only reported
93
+ metric at this stage.
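For readers who want to relate the reported perplexity to a training-log loss value, the sketch below assumes the figure is the exponential of the mean cross-entropy (in nats) over masked tokens, which is the usual convention but is not stated explicitly in this README.

```python
import math

def perplexity(token_nlls):
    """Exponential of the mean negative log-likelihood (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Under that assumption, PPL ≈ 12.5 corresponds to an eval loss of ln(12.5).
implied_loss = math.log(12.5)
print(f"implied eval loss: {implied_loss:.3f} nats per masked token")

# Round trip on hypothetical per-token losses that all equal the mean.
ppl = perplexity([implied_loss] * 4)
```

Because only masked positions are scored, this number is not directly comparable to the perplexity of an autoregressive language model.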
94
+
95
  ---