---
language:
- orv
- cu
tags:
- masked-language-modeling
- old-slavonic
- old-russian
- birchbark
- historical-nlp
- dual-embeddings
license: apache-2.0
---

# DualEmbLM

A masked language model trained from scratch on Old East Slavic and Old Church Slavonic texts,
using dual embeddings that combine character-level and word-level representations.

## Architecture

DualEmbLM combines:
- **Character-level tokenisation** (1 character = 1 token): enables lacuna restoration at the level of individual characters
- **Word-level context embeddings**: provide morphological and lexical context via a 50k-word vocabulary
- **Transformer encoder** (BERT architecture, trained from scratch): 6 layers, hidden size 512, 8 attention heads

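
The two input streams listed above can be pictured with a small sketch: the text is split into one token per character, and each character is paired with the id of the word that contains it. The helper name `encode_dual`, the whitespace-based word splitting, and the ids used for whitespace and unknown words are assumptions made for illustration; the model's actual preprocessing is defined in its remote code and may differ.

```python
import re


def encode_dual(text, word_vocab, unk_id=1, space_id=0):
    """Return (chars, word_ids) with one entry per character of `text`.

    Hypothetical helper, not the model's real preprocessing.
    """
    chars = list(text)  # 1 character = 1 token
    word_ids = []
    # Split into alternating runs of whitespace and non-whitespace,
    # so that the pieces add up to the original text length.
    for piece in re.findall(r"\s+|\S+", text):
        if piece.isspace():
            wid = space_id  # assumed id for gaps between words
        else:
            wid = word_vocab.get(piece, unk_id)  # assumed unknown-word fallback
        # Every character inherits the id of the word it sits in.
        word_ids.extend([wid] * len(piece))
    return chars, word_ids


toy_vocab = {"се": 2, "азъ": 3}  # toy vocabulary; the card states a 50k-word one
chars, word_ids = encode_dual("се азъ", toy_vocab)
print(chars)     # ['с', 'е', ' ', 'а', 'з', 'ъ']
print(word_ids)  # [2, 2, 0, 3, 3, 3]
```

Keeping both sequences the same length means a character position and its word context can be looked up side by side.
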
The dual embeddings are concatenated and projected into the shared hidden space before being passed to the transformer encoder.

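
A dependency-free sketch of the concatenate-then-project step is shown below. Plain Python lists stand in for tensors; in a deep-learning framework this would typically be two embedding lookups, a concatenation along the feature axis, and a linear layer. The embedding widths (`CHAR_DIM`, `WORD_DIM`), the tiny hidden size, and all function names are assumptions for illustration, since the card only states the final hidden size of 512.

```python
import random

HIDDEN = 8                 # the card states 512; kept tiny for readability
CHAR_DIM, WORD_DIM = 4, 6  # assumed widths, not stated in the card


def make_table(rows, dim, rng):
    """Random table with one `dim`-sized vector per row."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


def linear(vec, weight, bias):
    """Apply y = W x + b, with `weight` stored as out_dim rows of in_dim."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def fuse(char_ids, word_ids, char_table, word_table, weight, bias):
    """Per position: look up both embeddings, concatenate, project."""
    fused = []
    for c, w in zip(char_ids, word_ids):
        concat = char_table[c] + word_table[w]  # list concatenation
        fused.append(linear(concat, weight, bias))
    return fused


rng = random.Random(0)
char_table = make_table(40, CHAR_DIM, rng)
word_table = make_table(10, WORD_DIM, rng)
weight = make_table(HIDDEN, CHAR_DIM + WORD_DIM, rng)
bias = [0.0] * HIDDEN

hidden = fuse([5, 7, 0], [2, 2, 0], char_table, word_table, weight, bias)
print(len(hidden), len(hidden[0]))  # prints: 3 8
```

Each position ends up as a single vector of the encoder's hidden size, regardless of how wide the two source embeddings are.
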
## Training

The model was trained on a corpus of Old Russian and Church Slavonic texts assembled from the following sources:

| Source | Language | Word Tokens | Link |
|--------|----------|------------:|------|
| Birchbark manuscripts | Old Novgorodian (mostly) | 21,464 | [gramoty.ru](https://gramoty.ru) |
| Epigraphy | Old Church Slavonic (mostly) | 8,102 | [epigraphica.ru](https://epigraphica.ru) |
| DIACU | Old Church Slavonic; Church Slavonic (Old Russian, Middle Bulgarian, Serbian, Resava recensions); Middle Russian | 1,683,307 | [ACL Anthology](https://aclanthology.org/2025.bsnlp-1.12/) |
| TOROT | Old Russian; Church Slavonic | 682,430 | [torottreebank.github.io](https://torottreebank.github.io) |
| Bible (Ponomar) | Church Slavonic | 603,047 | [GitHub](https://github.com/typiconman/ponomar/tree/master/Ponomar/languages/cu/bible/elis) |
| Byliny | Old Russian (XI–XVII c.) | 430,103 | [rusneb.ru](https://rusneb.ru/catalog/000199_000009_003636356/) |
| Pushkin House | Old Russian | 256,503 | [lib2.pushkinskijdom.ru](https://lib2.pushkinskijdom.ru) |
| Military Statute (Part 2) | Old Russian | 49,787 | [rusneb.ru](https://rusneb.ru/catalog/000199_000009_004093983/) |
| NKRYA (historical) | Old Russian; Old Rus (XI–XVIII c.) | 42,412 | [ruscorpora.ru](https://ruscorpora.ru) |

Masking details: an MLM probability of 8%, combined with span masking, edge masking, and random gap augmentation.

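
To make the masking terms more concrete, here is a minimal sketch of span masking under an 8% budget, plus one possible reading of edge masking. The card does not define these procedures, so the maximum span length, the `[MASK]` placeholder, the interpretation of edge masking as hiding a run of characters at one margin, and all function names are assumptions. Random gap augmentation is not covered.

```python
import random

MASK = "[MASK]"  # placeholder; the real mask token comes from the model's tokenizer


def _apply(chars, masked):
    """Build (masked_chars, labels) from a set of masked positions."""
    out = [MASK if i in masked else ch for i, ch in enumerate(chars)]
    labels = [ch if i in masked else None for i, ch in enumerate(chars)]
    return out, labels


def span_mask(chars, mlm_prob=0.08, max_span=5, rng=random):
    """Mask contiguous spans until about `mlm_prob` of positions are hidden."""
    n = len(chars)
    budget = max(1, round(n * mlm_prob)) if n else 0
    masked = set()
    while len(masked) < budget:
        # Never draw a span longer than the remaining budget.
        span = min(rng.randint(1, max_span), budget - len(masked))
        start = rng.randrange(0, n - span + 1)
        masked.update(range(start, start + span))  # spans may overlap or merge
    return _apply(chars, masked)


def edge_mask(chars, width, side="left"):
    """Mask `width` characters at one edge, imitating a damaged margin."""
    n = len(chars)
    width = min(width, n)
    positions = range(width) if side == "left" else range(n - width, n)
    return _apply(chars, set(positions))


masked_chars, labels = edge_mask(list("abcdef"), 2, side="right")
print(masked_chars)  # ['a', 'b', 'c', 'd', '[MASK]', '[MASK]']
```

The labels keep the original character only at masked positions, which is the usual way to restrict the MLM loss to the hidden characters.
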
## Usage

```python
from transformers import AutoModelForMaskedLM

# trust_remote_code=True allows transformers to run the custom model
# code shipped in the model repository.
model = AutoModelForMaskedLM.from_pretrained(
    "MaximEremeev/DualEmb-slav",
    trust_remote_code=True,
)
```

## Tasks

- **Restoration of generated lacunae** (Test A: Hit@1 0.817, CER 0.183)
- **Restoration of real lacunae** (Test B: character-level Hit@1 0.466, span-level Hit@1 0.222)

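
For readers unfamiliar with the metrics, the sketch below shows common definitions of Hit@1 (share of items whose top prediction matches the reference exactly) and character error rate (total Levenshtein distance divided by total reference length). The card does not spell out its evaluation protocol, so these definitions and the function names are assumptions and may not match how the reported numbers were computed.

```python
def hit_at_1(predictions, references):
    """Fraction of items whose top-ranked prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def cer(predictions, references):
    """Total edit distance divided by total reference length."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    total_ref = sum(len(r) for r in references)
    if total_ref == 0:
        raise ValueError("references must contain at least one character")
    total_err = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    return total_err / total_ref


preds = ["а", "б", "в", "г"]
refs = ["а", "б", "в", "д"]
print(hit_at_1(preds, refs))  # 0.75
print(cer(preds, refs))       # 0.25
```

The same `hit_at_1` function works at span level if each prediction and reference is a whole restored span rather than a single character.
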
## Contact

Maxim Eremeev, maeremeev@edu.hse.ru