---
language:
- orv
- cu
tags:
- masked-language-modeling
- old-slavonic
- old-russian
- birchbark
- historical-nlp
- roformer
- rope
- bpe
license: apache-2.0
---
# RoFormerBPE
A masked language model trained from scratch on Old East Slavic and Old Church Slavonic texts,
using a RoFormer architecture with BPE tokenisation. Based on [mini-roformer-ancient-rus-v2](https://huggingface.co/AlexSychovUN/mini-roformer-ancient-rus-v2).

Note: BPE token boundaries do not always align with lacuna boundaries in editorial markup,
which inflates the span-level character error rate (CER). For character-level restoration tasks, consider using
[DualEmbLM](https://huggingface.co/MaximEremeev/DualEmb-slav) instead.
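
The boundary mismatch noted above can be illustrated with a minimal sketch. The token offsets and the lacuna span below are made up for illustration and are not output of this model's tokenizer; the helper names are hypothetical. The sketch finds the tokens that overlap a character-level lacuna and counts how many characters outside the lacuna would be masked along with it.

```python
def tokens_covering_span(token_offsets, span_start, span_end):
    """Return indices of tokens overlapping the character span [span_start, span_end)."""
    return [
        i
        for i, (tok_start, tok_end) in enumerate(token_offsets)
        if tok_start < span_end and tok_end > span_start
    ]


def overshoot(token_offsets, span_start, span_end):
    """Count characters outside the lacuna that are masked together with it."""
    covered = tokens_covering_span(token_offsets, span_start, span_end)
    if not covered:
        return 0
    masked_start = token_offsets[covered[0]][0]
    masked_end = token_offsets[covered[-1]][1]
    return max(0, span_start - masked_start) + max(0, masked_end - span_end)


# Hypothetical segmentation of a 12-character line into four BPE tokens,
# given as (start, end) character offsets with an exclusive end.
offsets = [(0, 3), (3, 7), (7, 10), (10, 12)]
# Hypothetical lacuna covering characters 5 to 8 (exclusive end 9).
lacuna = (5, 9)

covered = tokens_covering_span(offsets, *lacuna)  # tokens 1 and 2
extra = overshoot(offsets, *lacuna)               # 3 surviving characters get masked
```

In this toy case, restoring a 4-character gap requires predicting two whole tokens spanning 7 characters, so any error on the 3 surviving characters also counts against the span.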
## Architecture
- **Tokenisation**: BPE (Byte Pair Encoding), vocabulary size 50k
- **Architecture**: RoFormer encoder with Rotary Position Embeddings (RoPE)
- **Size**: 6 layers, hidden size 512, 8 attention heads
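
As a rough illustration of what rotary position embeddings do, the sketch below applies the rotation from the RoFormer paper to toy vectors in pure Python. It is not this model's implementation: the base of 10000 and the interleaved pairing of dimensions are assumptions taken from the paper's formulation, and the query and key vectors are arbitrary. Only the hidden size and head count come from the list above.

```python
import math

HIDDEN_SIZE = 512                    # from the model card
NUM_HEADS = 8                        # from the model card
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # dimensions per attention head


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[2k], vec[2k+1]) by the angle position * base**(-2k/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):  # i = 2k is the index of the first element of the pair
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Arbitrary toy query and key vectors for a single head.
q = [math.sin(i) for i in range(HEAD_DIM)]
k = [math.cos(i) for i in range(HEAD_DIM)]

# Both pairs of positions have the same offset (2), so the scores should match.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 10), rope(k, 8))
```

The two scores agree up to floating-point error, which is the property that makes the attention score depend on relative rather than absolute position.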
## Training
The model was trained on a corpus of Old Russian and Church Slavonic texts assembled from the following sources:
| Source | Language | Word Tokens | Link |
|--------|----------|--------|------|
| Birchbark manuscripts | Old Novgorodian (mostly) | 21,464 | [gramoty.ru](https://gramoty.ru) |
| Epigraphy | Old Church Slavonic (mostly) | 8,102 | [epigraphica.ru](https://epigraphica.ru) |
| DIACU | Old Church Slavonic; Church Slavonic (Old Russian, Middle Bulgarian, Serbian, Resava recensions); Middle Russian | 1,683,307 | [ACL Anthology](https://aclanthology.org/2025.bsnlp-1.12/) |
| TOROT | Old Russian; Church Slavonic | 682,430 | [torottreebank.github.io](https://torottreebank.github.io) |
| Bible (Ponomar) | Church Slavonic | 603,047 | [GitHub](https://github.com/typiconman/ponomar/tree/master/Ponomar/languages/cu/bible/elis) |
| Byliny | Old Russian (11th–17th c.) | 430,103 | [rusneb.ru](https://rusneb.ru/catalog/000199_000009_003636356/) |
| Pushkin House | Old Russian | 256,503 | [lib2.pushkinskijdom.ru](https://lib2.pushkinskijdom.ru) |
| Military Statute (Part 2) | Old Russian | 49,787 | [rusneb.ru](https://rusneb.ru/catalog/000199_000009_004093983/) |
| NKRYA (historical) | Old Russian (11th–18th c.) | 42,412 | [ruscorpora.ru](https://ruscorpora.ru) |

Masking details: MLM probability of 8%, with span masking, edge masking, and random gap augmentation.
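
The card does not specify how spans are sampled, so the following is only a sketch of one plausible span-masking scheme under an 8% budget. The function name, the maximum span length, and the uniform sampling of span lengths and positions are all assumptions; edge masking and random gap augmentation are not covered.

```python
import random


def span_mask(num_tokens, mlm_probability=0.08, max_span=3, seed=0):
    """Pick contiguous spans until about mlm_probability of the positions are masked."""
    if num_tokens <= 0:
        return []
    rng = random.Random(seed)
    budget = max(1, round(num_tokens * mlm_probability))
    masked = set()
    while len(masked) < budget:
        # Never draw a span longer than the remaining budget.
        length = min(rng.randint(1, max_span), budget - len(masked))
        start = rng.randrange(0, num_tokens - length + 1)
        masked.update(range(start, start + length))
    return sorted(masked)


# For a 100-token sequence, 8 positions are selected for masking.
masked_positions = span_mask(100)
```

Masking contiguous runs rather than isolated tokens is closer to the restoration setting, where a damaged passage typically spans several adjacent tokens.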
## Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# trust_remote_code=True allows transformers to run custom code from the model repository.
tokenizer = AutoTokenizer.from_pretrained(
    "MaximEremeev/RoFormer-slav",
    trust_remote_code=True,
)
model = AutoModelForMaskedLM.from_pretrained(
    "MaximEremeev/RoFormer-slav",
    trust_remote_code=True,
)
```
## Tasks
- **Generated lacunae restoration** (Test A Hit@1: 0.281, CER: 0.839)
- **Real lacunae restoration** (Test B char Hit@1: 0.145, span Hit@1: 0.021)
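
The exact evaluation protocol is not described here, so the sketch below only shows the conventional definitions of the two metrics: Hit@1 as the share of gaps whose top prediction matches the reference exactly, and CER as total edit distance divided by total reference length. The function names and the toy strings are placeholders, not data from the test sets.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(predictions, references):
    """Character error rate: total edits divided by total reference characters."""
    total_edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


def hit_at_1(predictions, references):
    """Share of items whose top prediction equals the reference exactly."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


# Placeholder strings standing in for restored and reference gap contents.
references = ["abc", "de", "fgh"]
predictions = ["abc", "dx", "gh"]

toy_cer = cer(predictions, references)       # 2 edits over 8 reference characters
toy_hit = hit_at_1(predictions, references)  # 1 exact match out of 3
```

Under this definition CER can be high even when predictions are partly right, because every inserted, deleted, or substituted character counts against the reference length.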
## Contact
Maxim Eremeev, maeremeev@edu.hse.ru