Ancient Russian RoFormer v2 (Text Restoration Model)

This model is designed for the restoration and reconstruction of damaged Ancient (Old) Russian texts of the 11th–17th centuries. It is based on the RoFormer architecture (using Rotary Position Embedding, RoPE), which makes it highly robust on texts with missing beginnings, torn endings, or completely destroyed fragments.

📌 Model Features

  • Architecture: RoFormer (RoPE). Excels at relative token positioning in fragmented and discontinuous texts.
  • Vocabulary (Vocab): Custom BPE tokenizer with 50,000 tokens. Trained with strip_accents=False and lowercase=False to preserve titlos (҃) and historical orthography (ѣ, ѳ, ѵ, ѫ, ѧ).
  • Context Special Tokens: The model understands document context. Before feeding the text, prepend one of the following domain tags:
    • [CTX_DAILY] — Everyday birch bark manuscripts (gramoty)
    • [CTX_CHURCH] — Church Slavonic texts (Bible, hagiographies)
    • [CTX_LEGAL] — Legal documents (Russkaya Pravda, Sudebniks)
    • [CTX_LIT] — Literature and chronicles (Primary Chronicle)
    • [CTX_EPIC] — Epics and bylinas
    • [CTX_SCIENCE] — Military and technical manuals (e.g., 17th-century military regulations)
    • [GAP] — A special tag used to denote completely unreadable or physically destroyed spots in the manuscript.
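The strip_accents=False setting above is essential for this material: typical accent stripping (NFD decomposition followed by dropping combining marks) would silently delete the titlo, which is a combining character. A quick check with the standard library illustrates what would be lost (the example word is illustrative):

```python
import unicodedata

# "бг҃ъ" — an abbreviated "богъ" written under a titlo (U+0483), as in manuscripts
word = "бг҃ъ"

# Simulate accent stripping: NFD-normalize, then drop combining marks (category Mn)
stripped = "".join(
    ch for ch in unicodedata.normalize("NFD", word)
    if unicodedata.category(ch) != "Mn"
)
print(stripped)  # бгъ — the titlo is gone, destroying the abbreviation marker
```

This is why the tokenizer keeps accents and case intact: the titlo distinguishes abbreviated sacred words from ordinary spellings.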

🧠 Training & Dataset

The model was trained on a unique, carefully balanced corpus of approximately 20 million tokens, compiled from:

  • Birch bark manuscripts (NRIS, TorOt databases)
  • Major Chronicle compilations
  • The Bible and various Church Slavonic texts
  • 17th-century military and technical manuals

Physical Degradation Collator: During training, a custom DataCollator was implemented to simulate real physical damage to historical documents:

  1. Edge Masking (simulating torn edges at the beginning or end of the document).
  2. Span Masking (simulating faded or rubbed-out spots up to 3 words long).
  3. Standard random token masking.
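The collator itself is not published with the card; the sketch below is a minimal illustration of the three damage modes applied to a list of token ids, with assumed rates (edge_frac, span_prob, rand_prob) and a placeholder mask id — the real collator's hyperparameters may differ:

```python
import random

MASK_ID = 4  # placeholder; the real collator uses the tokenizer's mask-token id

def degrade(ids, rng, edge_frac=0.15, span_prob=0.1, max_span=3, rand_prob=0.05):
    """Apply the three simulated damage modes to a token-id sequence."""
    ids = list(ids)
    n = len(ids)

    # 1. Edge masking: tear off the beginning or the end of the document.
    edge = max(1, int(n * edge_frac))
    if rng.random() < 0.5:
        ids[:edge] = [MASK_ID] * edge        # torn beginning
    else:
        ids[n - edge:] = [MASK_ID] * edge    # torn ending

    # 2. Span masking: faded or rubbed-out runs of up to max_span tokens.
    i = 0
    while i < n:
        if rng.random() < span_prob:
            span = rng.randint(1, max_span)
            for j in range(i, min(i + span, n)):
                ids[j] = MASK_ID
            i += span
        else:
            i += 1

    # 3. Standard random token masking.
    return [MASK_ID if rng.random() < rand_prob else t for t in ids]

rng = random.Random(0)
print(degrade(list(range(100, 120)), rng))
```

At training time such a collator would also emit the matching labels (the original ids at masked positions) for the MLM loss; that bookkeeping is omitted here for brevity.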

Metrics (Epoch 15):

  • Perplexity: 10.73
  • Validation Loss: 2.37
  • Top-1 Accuracy: 61.01%
  • Top-5 Accuracy: 73.04%
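These numbers are internally consistent: perplexity is the exponential of the mean cross-entropy loss, and exp(2.37) ≈ 10.70, which matches the reported 10.73 up to rounding of the loss:

```python
import math

val_loss = 2.37                  # reported validation loss (mean cross-entropy, nats)
perplexity = math.exp(val_loss)  # perplexity = exp(loss)
print(round(perplexity, 2))      # 10.7 — close to the reported 10.73
```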

🚀 How to use

Use the <mask> token for missing words. Always remember to add a context tag at the beginning of the string!

```python
from transformers import pipeline

# Load the fill-mask pipeline
restorer = pipeline(
    "fill-mask",
    model="AlexSychovUN/ancient-russian-roformer-v2",
    device=0,  # set to -1 for CPU
)

# Example 1: birch bark manuscript (torn beginning)
text_daily = "[CTX_DAILY] <mask> <mask> ко василью . а серебро ми отдай."
print(restorer(text_daily))
# Output: [{'token_str': ' От', ...}, {'token_str': ' поклон', ...}]

# Example 2: legal text (Russkaya Pravda)
text_legal = "[CTX_LEGAL] Аже кто оубиеть <mask> , то платити виру 40 гривенъ."
print(restorer(text_legal))
# Output: [{'token_str': ' мужь', ...}, ...]
```
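When a fragment contains several masks, one simple strategy is to fill them greedily, left to right, re-running the model after each substitution so later predictions condition on earlier restorations. A minimal sketch (the helper is illustrative, not part of the released code; in practice pass the restorer pipeline from above):

```python
def fill_greedy(restorer, text, mask="<mask>"):
    """Replace each mask in turn with the model's top candidate."""
    while mask in text:
        preds = restorer(text)
        # With several masks, the HF fill-mask pipeline returns one candidate
        # list per mask; take the list for the first (leftmost) mask.
        first = preds[0] if isinstance(preds[0], list) else preds
        best = first[0]["token_str"].strip()
        text = text.replace(mask, best, 1)
    return text

# Stub restorer for demonstration only — always proposes " поклон"
def fake_restorer(text):
    return [[{"token_str": " поклон"}]]

print(fill_greedy(fake_restorer, "[CTX_DAILY] <mask> ко василью"))
# [CTX_DAILY] поклон ко василью
```

Greedy filling is a heuristic: it commits to the top candidate at each step, so for heavily damaged fragments it can help to inspect the full top-5 list per mask instead.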
Model size: 44.9M parameters (F32, Safetensors)