kd13/RoPERT-MLM-small


kd13 committed 29 days ago

Commit 664d6f8 (verified)

1 Parent(s): c600bf2

Update README.md

Files changed (1)

README.md +75 -3


@@ -1,3 +1,75 @@
----
-license: mit
----

+---
+license: mit
+datasets:
+- kd13/bookcorpus-clean
+language:
+- en
+metrics:
+- perplexity
+pipeline_tag: fill-mask
+library_name: transformers
+tags:
+- mlm
+---
+# BERTsmall — Custom BERT with RoPE & Pre-LN Trained from Scratch
+A compact BERT-style masked language model trained entirely from scratch on BookCorpus. The architecture replaces the canonical absolute positional embeddings with **Rotary Position Embeddings (RoPE)** and adopts a **Pre-Layer Normalization** (Pre-LN) residual layout, both of which have become standard practice in modern transformer training.
+---
+### Architecture Design Choices
+**RoPE instead of learned absolute position embeddings.** Rotary embeddings rotate the query and key vectors by position-dependent angles, so the query–key dot product depends only on the relative offset between tokens. This removes the separate learnable position table; it can also help with inputs longer than the training window, although extrapolation is not guaranteed.
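
The relative-position property of rotary embeddings can be illustrated with a small pure-Python sketch. This is not the model's implementation: the helper names `rope_rotate` and `dot`, the base of 10000, and the pairing of adjacent dimensions are assumptions chosen for illustration.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(0, dim, 2):
        # Pair j = i / 2 uses frequency base^(-2j / dim) = base^(-i / dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors for a single attention head.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Under these assumptions `score_a` and `score_b` agree up to floating-point error, because only the offset between the two positions survives in the dot product.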
+**Pre-LN residual stream.** Layer normalisation is applied to the *input* of each sub-layer rather than the output. This stabilises gradient flow during early training and generally makes the loss curve smoother, at the cost of requiring an explicit final encoder normalisation before the prediction head.
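
The difference between the two residual layouts, and the reason a final normalisation is needed, can be sketched in plain Python. The function names and the `toy_sublayer` stand-in are hypothetical, and the learnable gain and bias of a real layer norm are omitted for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_block(x, sublayer):
    # Pre-LN: normalise the input, apply the sub-layer, add the untouched residual.
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x, sublayer):
    # Post-LN (original BERT layout): apply the sub-layer, add the residual, then normalise.
    return layer_norm([xi + si for xi, si in zip(x, sublayer(x))])

def toy_sublayer(h):
    # Hypothetical stand-in for self-attention or the feed-forward network.
    return [0.5 * v for v in h]

x = [1.0, 2.0, 3.0, 4.0]
hidden = x
for _ in range(3):
    hidden = pre_ln_block(hidden, toy_sublayer)

# The residual stream is never normalised inside Pre-LN blocks,
# so one explicit normalisation is applied before the prediction head.
encoded = layer_norm(hidden)
```

In this sketch the Pre-LN residual stream keeps its original mean across all blocks, whereas a Post-LN block returns an already normalised vector.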
+**Embedding tying.** The MLM decoder projection matrix shares weights with the token embedding table, which reduces parameter count and typically improves token prediction quality.
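
As a rough illustration of weight tying, the sketch below uses a single table both to look up token embeddings and to score the vocabulary. The class name `TiedEmbedding` and the toy values are hypothetical; a real MLM head usually adds a transform layer and an output bias, which are left out here.

```python
class TiedEmbedding:
    """One table serves as both the input embedding and the output projection."""

    def __init__(self, table):
        self.table = table  # vocab_size rows, each of length hidden_size

    def embed(self, token_id):
        return self.table[token_id]

    def decode(self, hidden):
        # logits[v] is the dot product between the hidden state and row v of the table.
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.table]

# Toy vocabulary of three tokens with hidden size 2.
emb = TiedEmbedding([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
logits = emb.decode(emb.embed(2))
```

Because both directions read the same rows, no second vocabulary-sized matrix has to be stored or trained.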
+---
+## Training Details
+### Dataset
+| Split | Source | Packing |
+|---|---|---|
+| Train | BookCorpus | Fixed-length packed sequences (128 tokens) |
+| Eval | BookCorpus (held-out) | Same packing strategy |
+Sequences were packed (not padded) to maximise token utilisation per batch, following the approach used in the original BERT paper.
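
A minimal sketch of fixed-length packing follows. It assumes the documents are already tokenized into id lists, that no separator tokens are inserted between documents, and that the incomplete tail is dropped; the source does not state these details, and `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(tokenized_docs, block_size=128):
    """Concatenate token ids from all documents and cut fixed-size blocks."""
    stream = [tok for doc in tokenized_docs for tok in doc]
    usable = (len(stream) // block_size) * block_size  # drop the ragged tail
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

# Three toy "documents" of 100, 150 and 50 token ids (300 tokens in total).
docs = [list(range(0, 100)), list(range(100, 250)), list(range(250, 300))]
blocks = pack_sequences(docs, block_size=128)
```

Every resulting block is exactly `block_size` tokens long, so no padding tokens or attention-mask bookkeeping are needed inside a batch.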
+### Pre-training Objective
+Masked Language Modelling (MLM) with a masking probability of **30%**. `DataCollatorForLanguageModeling` applies the standard 80/10/10 strategy to the selected tokens: 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged.
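
The masking procedure can be approximated in plain Python as shown below. This is a simplified sketch rather than the collator's code: `mask_tokens`, the mask id 103 and the vocabulary size 30000 are hypothetical, and special tokens are not excluded from masking as they would be in practice.

```python
import random

def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.30, seed=0):
    """Select tokens with probability `mlm_probability`, then apply 80/10/10."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions excluded from the loss
    for i, tok in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

ids = list(range(1000, 11000))  # toy sequence of 10,000 token ids
masked, labels = mask_tokens(ids, mask_id=103, vocab_size=30000)
selected = [i for i, label in enumerate(labels) if label != -100]
```

The model is trained to predict the original token only at the selected positions; all other positions carry the ignore label.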
+## Evaluation Details
+Perplexity: 7.63
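
The card does not say how the perplexity was computed. Assuming the common convention of exponentiating the mean cross-entropy (natural log) over the masked tokens, the relationship looks like this; `perplexity` is a hypothetical helper and the log-probabilities are made up.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the scored tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 1/8 to every correct token has perplexity 8.
uniform = perplexity([math.log(1 / 8)] * 10)

# Under the same assumption, the reported 7.63 implies this mean evaluation loss.
implied_loss = math.log(7.63)
```

Under that convention a perplexity of 7.63 corresponds to a mean masked-token loss of roughly 2.03 nats.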
+---
+## Usage
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+model_name = "kd13/RoPERT-MLM-small"
+# trust_remote_code=True allows custom model code hosted in the repo to be loaded.
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
+
+text = "i don't have much [MASK]."
+inputs = tokenizer(text, return_tensors="pt")
+with torch.no_grad():
+    logits = model(**inputs).logits
+
+# Locate the [MASK] position and take the five highest-scoring vocabulary ids there.
+mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
+mask_token_logits = logits[0, mask_token_index, :]
+top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
+
+# Print the sentence once per candidate, with the mask filled in.
+for token in top_5_tokens:
+    print(f">>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")
+```