RoFormerBPE
A masked language model trained from scratch on Old East Slavic and Old Church Slavonic texts, using a RoFormer architecture with BPE tokenisation. Based on mini-roformer-ancient-rus-v2.
Note: BPE token boundaries do not always align with the lacuna boundaries given in editorial markup, so a predicted span can cover more or fewer characters than the actual gap, which inflates span-level CER. For character-level restoration tasks, consider using DualEmbLM instead.
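The boundary mismatch described in the note can be illustrated with a small sketch. The segmentation below is hypothetical and was not produced by this model's tokenizer; `tokens_covering_gap` and `overhang` are illustrative helper names, not part of any released code.

```python
def tokens_covering_gap(offsets, gap_start, gap_end):
    """Indices of tokens overlapping the half-open character span [gap_start, gap_end)."""
    return [i for i, (s, e) in enumerate(offsets) if s < gap_end and e > gap_start]


def overhang(offsets, gap_start, gap_end):
    """Extra characters (left, right) that the covering tokens span outside the gap."""
    idx = tokens_covering_gap(offsets, gap_start, gap_end)
    if not idx:
        return (0, 0)
    left = gap_start - offsets[idx[0]][0]
    right = offsets[idx[-1]][1] - gap_end
    return (max(left, 0), max(right, 0))


text = "господи помилуи"
# Hypothetical BPE pieces, chosen only to illustrate the mismatch.
pieces = ["госпо", "ди", " поми", "луи"]

# Build character offsets for each piece.
offsets, pos = [], 0
for piece in pieces:
    offsets.append((pos, pos + len(piece)))
    pos += len(piece)

# A lacuna covering characters 3..8 (the substring "поди п").
gap_start, gap_end = 3, 9
covering = tokens_covering_gap(offsets, gap_start, gap_end)
extra = overhang(offsets, gap_start, gap_end)
```

Here the gap cuts through the first and third pieces, so masking whole tokens would also hide three known characters on each side. Any character-level comparison against the editorial gap then has to account for that overhang.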
Architecture
- Tokenisation: BPE (Byte Pair Encoding), vocabulary size 50k
- Architecture: RoFormer encoder with Rotary Position Embeddings (RoPE)
- Size: 6 layers, hidden size 512, 8 attention heads
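For readers unfamiliar with RoPE, the following is a minimal pure-Python sketch of the rotation applied to one attention head. With hidden size 512 and 8 heads, each head has 64 dimensions. The frequency base of 10000 and the pairing of adjacent dimensions follow the original RoFormer formulation and are assumptions here; this model's actual configuration may differ.

```python
import math

HIDDEN_SIZE = 512
NUM_HEADS = 8
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # 64 dimensions per head


def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by an angle that grows with position."""
    dim = len(vec)
    if dim % 2:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors for a single head.
q = [math.sin(j) for j in range(HEAD_DIM)]
k = [math.cos(j) for j in range(HEAD_DIM)]

# The attention score depends only on the relative offset (2 in both cases).
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 10), rope(k, 8))
```

The two scores agree up to floating-point error, which is the property that lets the encoder represent relative position without a learned position table.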
Training
The model was trained on a corpus of Old Russian and Church Slavonic texts assembled from the following sources:
| Source | Language | Word Tokens | Link |
|---|---|---|---|
| Birchbark manuscripts | Old Novgorodian (mostly) | 21,464 | gramoty.ru |
| Epigraphy | Old Church Slavonic (mostly) | 8,102 | epigraphica.ru |
| DIACU | Old Church Slavonic; Church Slavonic (Old Russian, Middle Bulgarian, Serbian, Resava recensions); Middle Russian | 1,683,307 | ACL Anthology |
| TOROT | Old Russian; Church Slavonic | 682,430 | torottreebank.github.io |
| Bible (Ponomar) | Church Slavonic | 603,047 | GitHub |
| Byliny | Old Russian (XI–XVII c.) | 430,103 | rusneb.ru |
| Pushkin House | Old Russian | 256,503 | lib2.pushkinskijdom.ru |
| Military Statute (Part 2) | Old Russian | 49,787 | rusneb.ru |
| NKRYA (historical) | Old Russian; Old Rus (XI–XVIII c.) | 42,412 | ruscorpora.ru |
Masking details: MLM probability of 8%, applied with span masking, edge masking, and random gap augmentation.
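As a rough illustration of span masking at an 8% budget, here is a minimal sketch. It is not the training code: the span length range (`max_span`), the uniform sampling, and the `[MASK]` string are all assumptions, and edge masking and random gap augmentation are not covered because the card does not describe them in enough detail.

```python
import random


def span_mask(tokens, mask_prob=0.08, max_span=5, mask_token="[MASK]", seed=None):
    """Mask contiguous spans until about mask_prob of the positions are hidden."""
    rng = random.Random(seed)
    n = len(tokens)
    budget = min(n, max(1, round(n * mask_prob))) if n else 0
    masked = set()
    attempts = 0
    while len(masked) < budget and attempts < 10 * n:
        attempts += 1
        # Never draw a span longer than the remaining budget.
        length = min(rng.randint(1, max_span), budget - len(masked))
        start = rng.randrange(0, n - length + 1)
        masked.update(range(start, start + length))
    corrupted = [mask_token if i in masked else t for i, t in enumerate(tokens)]
    # Labels keep the original token at masked positions and None elsewhere.
    labels = [t if i in masked else None for i, t in enumerate(tokens)]
    return corrupted, labels


tokens = [f"t{i}" for i in range(100)]
corrupted, labels = span_mask(tokens, seed=0)
```

With 100 tokens and the default settings, eight positions are replaced, grouped into one or more contiguous runs.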
Usage
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained(
    "MaximEremeev/RoFormer-slav",
    trust_remote_code=True,
)
model = AutoModelForMaskedLM.from_pretrained(
    "MaximEremeev/RoFormer-slav",
    trust_remote_code=True,
)
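Once the model is loaded, masked positions can be filled roughly as follows. This is a sketch, assuming the repository's custom tokenizer exposes the standard `mask_token` and `mask_token_id` attributes and that the model output has a `.logits` field, as is usual for `AutoModelForMaskedLM`; `top_k_indices` and `predict_masks` are illustrative names.

```python
def top_k_indices(scores, k=5):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def predict_masks(text, k=5, repo="MaximEremeev/RoFormer-slav"):
    """Return the top-k candidate tokens for every mask token found in text."""
    # Imported lazily so top_k_indices stays usable without these packages.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]  # shape: (seq_len, vocab_size)

    results = []
    for pos, tok_id in enumerate(inputs["input_ids"][0].tolist()):
        if tok_id == tokenizer.mask_token_id:
            best = top_k_indices(logits[pos].tolist(), k)
            results.append(tokenizer.convert_ids_to_tokens(best))
    return results
```

The input text should contain the tokenizer's own mask string (check `tokenizer.mask_token`) at each lacuna. Because predictions are BPE tokens, a single mask may stand for several characters.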
Tasks
- Restoration of generated lacunae (Test A: Hit@1 0.267, CER 0.839)
- Restoration of real lacunae (Test B: character-level Hit@1 0.158, span-level Hit@1 0.063)
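The card does not spell out how these metrics were computed. One common reading, sketched below, takes Hit@1 as the share of gaps whose top prediction matches the reference exactly, and CER as total Levenshtein distance divided by total reference length. Both definitions are assumptions, so numbers from this sketch are not guaranteed to match the figures above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(predictions, references):
    """Character error rate, pooled over all reference characters."""
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


def hit_at_1(predictions, references):
    """Fraction of gaps whose top prediction equals the reference exactly."""
    if not references:
        return 0.0
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


preds = ["аз", "есмь"]
refs = ["аз", "еси"]
example_hit = hit_at_1(preds, refs)
example_cer = cer(preds, refs)
```

Under this definition CER can exceed 1.0 when predictions are much longer than the references, which is one way token-level overhang can raise span-level scores.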
Contact
Maxim Eremeev, maeremeev@edu.hse.ru