MaximEremeev committed · Commit 8aff420 · verified · 1 Parent(s): 7618c4c

Update README.md

Files changed (1)
  1. README.md +15 -2
README.md CHANGED
@@ -27,7 +27,20 @@ hidden space before being passed to the transformer encoder.
 
 ## Training
 
-Trained on a corpus of (MLM probability 8%, span masking, edge masking, random gap augmentation).
+The model was trained on a corpus of Old Russian and Church Slavonic texts assembled from the following sources:
+
+| Source | Language | Tokens | Link |
+|--------|----------|--------|------|
+| Birchbark manuscripts | Old Novgorodian | — | [gramoty.ru](https://gramoty.ru), [epigraphica.ru](https://epigraphica.ru) |
+| DIACU | Old Church Slavonic; Church Slavonic (Old Russian, Middle Bulgarian, Serbian, Resava recensions); Middle Russian | 1,683,307 | [ACL Anthology](https://aclanthology.org/2025.bsnlp-1.12/) |
+| TOROT | Old Russian; Church Slavonic | 682,430 | [torottreebank.github.io](https://torottreebank.github.io) |
+| Bible (Ponomar) | Church Slavonic | 603,047 | [GitHub](https://github.com/typiconman/ponomar/tree/master/Ponomar/languages/cu/bible/elis) |
+| Byliny | Old Russian (XI–XVII c.) | 430,103 | [rusneb.ru](https://rusneb.ru/catalog/000199_000009_003636356/) |
+| Pushkin House | Old Russian | 256,503 | [lib2.pushkinskijdom.ru](https://lib2.pushkinskijdom.ru) |
+| Military Statute (Part 2) | Old Russian | 49,787 | [rusneb.ru](https://rusneb.ru/catalog/000199_000009_004093983/) |
+| NKRYA (historical) | Old Russian; Old Rus (XI–XVIII c.) | 42,412 | [ruscorpora.ru](https://ruscorpora.ru) |
+
+Masking details: MLM probability 8%, span masking, edge masking, random gap augmentation.
 
 ## Usage
 
@@ -49,7 +62,7 @@ model = AutoModelForMaskedLM.from_pretrained(
 
 If you use this model, please cite:
 ```
-@thesis{...,
+@thesis{
 title = {Automatic Restoration and Analysis of Birchbark Manuscripts},
 author = {Maxim Eremeev},
 year = {2026},