---
language:
- orv
- cu
tags:
- masked-language-modeling
- old-slavonic
- old-russian
- birchbark
- historical-nlp
- dual-embeddings
license: apache-2.0
---

# DualEmbLM

A masked language model trained from scratch on Old East Slavic and Old Church Slavonic texts,
with dual character-level + word-level embeddings.

## Architecture

DualEmbLM combines:
- **Character-level tokenisation** (1 character = 1 token): enables lacuna restoration character by character
- **Word-level context embeddings**: provide morphological and lexical context via a 50k-word vocabulary
- **Transformer encoder** (BERT architecture, trained from scratch): 6 layers, hidden size 512, 8 attention heads

The dual embeddings are concatenated and projected into the shared
hidden space before being passed to the transformer encoder.
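
The concatenate-and-project step can be illustrated with a minimal pure-Python sketch. It assumes that every character position receives the vector of the word that contains it. The model card does not spell out the alignment, so that assumption, the helper names `align_chars_to_words` and `concat_and_project`, the reserved ids, and the toy dimensions are all hypothetical.

```python
def align_chars_to_words(text, word_vocab, unk_id=0, space_id=1):
    """Return (chars, word_ids) with one entry per character of `text`.

    Assumption: each character is paired with the id of its containing
    word; spaces get a reserved id. The real model may align differently.
    """
    chars, word_ids = [], []
    for i, word in enumerate(text.split(" ")):
        if i > 0:  # re-insert the space that split() consumed
            chars.append(" ")
            word_ids.append(space_id)
        wid = word_vocab.get(word, unk_id)
        for ch in word:
            chars.append(ch)
            word_ids.append(wid)
    return chars, word_ids


def concat_and_project(char_vec, word_vec, weight):
    """Concatenate two vectors and apply a linear projection (no bias).

    `weight` has one row per output dimension and
    len(char_vec) + len(word_vec) columns.
    """
    joined = char_vec + word_vec  # list concatenation
    return [sum(w * x for w, x in zip(row, joined)) for row in weight]


# Toy usage: a two-word text and a hypothetical word vocabulary.
chars, word_ids = align_chars_to_words("се азъ", {"се": 2, "азъ": 3})
hidden = concat_and_project([1.0, 2.0], [3.0], [[1, 0, 0], [0, 1, 1]])
```

In the actual model the lookup tables and the projection would be learned parameters, and the projected vectors would have the encoder's hidden size of 512.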

## Training

The model was trained on a corpus of Old Russian and Church Slavonic texts assembled from the following sources:

| Source | Language | Word Tokens | Link |
|--------|----------|--------|------|
| Birchbark manuscripts | Old Novgorodian (mostly) | 21,464 | [gramoty.ru](https://gramoty.ru) |
| Epigraphy | Old Church Slavonic (mostly) | 8,102 | [epigraphica.ru](https://epigraphica.ru) |
| DIACU | Old Church Slavonic; Church Slavonic (Old Russian, Middle Bulgarian, Serbian, Resava recensions); Middle Russian | 1,683,307 | [ACL Anthology](https://aclanthology.org/2025.bsnlp-1.12/) |
| TOROT | Old Russian; Church Slavonic | 682,430 | [torottreebank.github.io](https://torottreebank.github.io) |
| Bible (Ponomar) | Church Slavonic | 603,047 | [GitHub](https://github.com/typiconman/ponomar/tree/master/Ponomar/languages/cu/bible/elis) |
| Byliny | Old Russian (XI–XVII c.) | 430,103 | [rusneb.ru](https://rusneb.ru/catalog/000199_000009_003636356/) |
| Pushkin House | Old Russian | 256,503 | [lib2.pushkinskijdom.ru](https://lib2.pushkinskijdom.ru) |
| Military Statute (Part 2) | Old Russian | 49,787 | [rusneb.ru](https://rusneb.ru/catalog/000199_000009_004093983/) |
| NKRYA (historical) | Old Russian; Old Rus (XI–XVIII c.) | 42,412 | [ruscorpora.ru](https://ruscorpora.ru) |

Masking details: MLM probability 8%, span masking, edge masking, random gap augmentation.
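
As a rough illustration of span masking at an 8% budget, the sketch below masks contiguous runs of characters until about 8% of positions are hidden. It is not the training code: the `[MASK]` symbol, the `max_span` limit, the uniform span-length sampling, and the function name `span_mask` are assumptions. Edge masking and random gap augmentation are not covered.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def span_mask(chars, mlm_prob=0.08, max_span=5, rng=None):
    """Mask contiguous character spans until ~mlm_prob of positions are hidden.

    Returns (masked, labels): `masked` is the corrupted sequence and
    `labels` holds the original character at masked positions, else None.
    """
    rng = rng or random.Random()
    n = len(chars)
    budget = max(1, round(n * mlm_prob))  # number of positions to mask
    masked = list(chars)
    labels = [None] * n
    attempts = 0
    while budget > 0 and attempts < 10 * n:
        attempts += 1
        span = min(rng.randint(1, max_span), budget)
        start = rng.randrange(0, n - span + 1)
        # Skip candidate spans that overlap an already masked position.
        if any(labels[i] is not None for i in range(start, start + span)):
            continue
        for i in range(start, start + span):
            labels[i] = chars[i]
            masked[i] = MASK
        budget -= span
    return masked, labels


# Toy usage with a fixed seed for reproducibility.
sample = list("посла грамоту ко тебе " * 5)
masked, labels = span_mask(sample, rng=random.Random(0))
```

Masking whole spans rather than isolated characters is closer to the lacuna setting, where damaged passages usually cover several adjacent characters.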

## Usage

```python
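from transformers import AutoModelForMaskedLM

# trust_remote_code=True lets transformers run the custom model code
# shipped in the repository, so review that code before loading.
model = AutoModelForMaskedLM.from_pretrained(
    "MaximEremeev/DualEmb-slav",
    trust_remote_code=True,
)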
```

## Tasks

- **Generated lacunae restoration** (Test A Hit@1: 0.817, CER: 0.183)
- **Real lacunae restoration** (Test B char Hit@1: 0.466, span Hit@1: 0.222)
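
For readers unfamiliar with the metrics, the following sketch shows how Hit@1 and CER are commonly computed. The model card does not define them, so the definitions here (Hit@1 as the share of items whose top prediction exactly matches the reference, CER as total Levenshtein distance divided by total reference length) are assumptions and may differ from the evaluation behind the numbers above. The function names are hypothetical.

```python
def hit_at_1(predictions, references):
    """Fraction of items whose top prediction equals the reference."""
    if not references:
        raise ValueError("references must not be empty")
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(predictions, references):
    """Character error rate: total edits divided by total reference length."""
    total = sum(len(r) for r in references)
    if total == 0:
        raise ValueError("references must contain at least one character")
    edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return edits / total


# Toy usage: per-character predictions for Hit@1, restored spans for CER.
char_score = hit_at_1(["а", "б", "в"], ["а", "б", "г"])
span_cer = cer(["грамота"], ["грамоту"])
```

Hit@1 can be applied per character or per span by changing what counts as one item, which is one plausible reading of the "char" and "span" figures reported for Test B.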

## Contact

Maxim Eremeev, maeremeev@edu.hse.ru