BabyLM RoFormer (10M tokens)
A BERT-style masked language model trained from scratch on the BabyLM 10M dataset.
Model Details
- Architecture: RoFormer (BERT + Rotary Position Embeddings)
- Parameters: ~10M
- Training Data: BabyLM Strict-Small (10M tokens)
- Vocabulary: 16,384 tokens (WordPiece)
- Context Length: 128 tokens
Usage
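import torch
from transformers import RoFormerForMaskedLM, RoFormerTokenizer
model = RoFormerForMaskedLM.from_pretrained("bean4259/babylm-roformer")
tokenizer = RoFormerTokenizer.from_pretrained("bean4259/babylm-roformer")
# Fill-mask example
text = "The cat sat on the [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    logits = model(**inputs).logits
# Locate the [MASK] position and decode the highest-scoring token
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))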
Training
Trained using a custom training loop with:
- Sequence packing (8.26x compression)
- AdamW optimizer (lr=1e-4)
- Linear warmup + decay schedule
- 10 epochs
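The training code itself is not included in this card. As a rough illustration only, sequence packing and a linear warmup + decay schedule could look like the sketch below. The function names, the separator ID `sep_id=3`, and the warmup/total step counts are hypothetical placeholders, not values taken from the actual run; only the 128-token block size and the base learning rate of 1e-4 come from the details above.

```python
from typing import Iterable, List


def pack_sequences(
    sequences: Iterable[List[int]],
    block_size: int = 128,  # matches the 128-token context length
    sep_id: int = 3,        # hypothetical separator token ID
) -> List[List[int]]:
    """Concatenate tokenized sequences and cut them into full blocks.

    Each sequence is followed by a separator token. A trailing partial
    block is dropped, so every returned block has exactly block_size tokens.
    """
    buffer: List[int] = []
    blocks: List[List[int]] = []
    for seq in sequences:
        buffer.extend(seq)
        buffer.append(sep_id)
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks


def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier for the base learning rate at a given optimizer step.

    Rises linearly from 0 to 1 during warmup, then falls linearly to 0
    at total_steps.
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Toy data: 100 short "sentences" of 15 tokens each (made-up token IDs).
sequences = [[5] * 15 for _ in range(100)]
blocks = pack_sequences(sequences, block_size=128)

# One way to express a packing ratio: unpacked rows per packed row.
packing_ratio = len(sequences) / len(blocks)

# Learning rate at a few steps, assuming 10 warmup steps out of 100 total.
base_lr = 1e-4
lrs = [base_lr * linear_warmup_decay(s, 10, 100) for s in (0, 5, 10, 55, 100)]
```

Without packing, each short sentence would occupy its own mostly padded 128-token row; packing fills every row completely, which is presumably what the reported compression factor measures. The multiplier returned by `linear_warmup_decay` has the form expected by `torch.optim.lr_scheduler.LambdaLR`, so it could be attached to an AdamW optimizer in a custom loop.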