Ancient Russian RoFormer v2 (Text Restoration Model)

This model is designed for the restoration and reconstruction of damaged Ancient (Old) Russian texts of the 11th–17th centuries. It is based on the RoFormer architecture (using Rotary Position Embedding, RoPE), which makes it highly robust on texts with missing beginnings, torn endings, or completely destroyed fragments.

📌 Model Features

  • Architecture: RoFormer (RoPE). Excels at relative token positioning in fragmented and discontinuous texts.
  • Vocabulary (Vocab): Custom BPE tokenizer with 50,000 tokens. Trained with strip_accents=False and lowercase=False to preserve titlos (҃) and historical orthography (ѣ, ѳ, ѵ, ѫ, ѧ).
  • Context Special Tokens: The model understands document context. Before feeding the text, prepend one of the following domain tags:
    • [CTX_DAILY] — Everyday birch bark manuscripts (gramoty)
    • [CTX_CHURCH] — Church Slavonic texts (Bible, hagiographies)
    • [CTX_LEGAL] — Legal documents (Russkaya Pravda, Sudebniks)
    • [CTX_LIT] — Literature and chronicles (Primary Chronicle)
    • [CTX_EPIC] — Epics and bylinas
    • [CTX_SCIENCE] — Military and technical manuals (e.g., 17th-century military regulations)
    • [GAP] — A special tag used to denote completely unreadable or physically destroyed spots in the manuscript.
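The strip_accents=False setting above is essential for this material: typical accent stripping (NFD decomposition followed by dropping combining marks) would silently delete the titlo, which is a combining character. A quick check with the standard library illustrates what would be lost (the example word is illustrative):

```python
import unicodedata

# "бг҃ъ" — an abbreviated "богъ" written under a titlo (U+0483), as in manuscripts
word = "бг҃ъ"

# Simulate accent stripping: NFD-normalize, then drop combining marks (category Mn)
stripped = "".join(
    ch for ch in unicodedata.normalize("NFD", word)
    if unicodedata.category(ch) != "Mn"
)
print(stripped)  # бгъ — the titlo is gone, destroying the abbreviation marker
```

This is why the tokenizer keeps accents and case intact: the titlo distinguishes abbreviated sacred words from ordinary spellings.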

🧠 Training & Dataset

The model was trained on a unique, carefully balanced corpus of approximately 20 million tokens, compiled from:

  • Birch bark manuscripts (NRIS, TorOt databases)
  • Major Chronicle compilations
  • The Bible and various Church Slavonic texts
  • 17th-century military and technical manuals

Physical Degradation Collator: During training, a custom DataCollator was implemented to simulate real physical damage to historical documents:

  1. Edge Masking (simulating torn edges at the beginning or end of the document).
  2. Span Masking (simulating faded or rubbed-out spots up to 3 words long).
  3. Standard random token masking.
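The collator itself is not published with the card; the sketch below is a minimal illustration of the three damage modes applied to a list of token ids, with assumed rates (edge_frac, span_prob, rand_prob) and a placeholder mask id — the real collator's hyperparameters may differ:

```python
import random

MASK_ID = 4  # placeholder; the real collator uses the tokenizer's mask-token id

def degrade(ids, rng, edge_frac=0.15, span_prob=0.1, max_span=3, rand_prob=0.05):
    """Apply the three simulated damage modes to a token-id sequence."""
    ids = list(ids)
    n = len(ids)

    # 1. Edge masking: tear off the beginning or the end of the document.
    edge = max(1, int(n * edge_frac))
    if rng.random() < 0.5:
        ids[:edge] = [MASK_ID] * edge        # torn beginning
    else:
        ids[n - edge:] = [MASK_ID] * edge    # torn ending

    # 2. Span masking: faded or rubbed-out runs of up to max_span tokens.
    i = 0
    while i < n:
        if rng.random() < span_prob:
            span = rng.randint(1, max_span)
            for j in range(i, min(i + span, n)):
                ids[j] = MASK_ID
            i += span
        else:
            i += 1

    # 3. Standard random token masking.
    return [MASK_ID if rng.random() < rand_prob else t for t in ids]

rng = random.Random(0)
print(degrade(list(range(100, 120)), rng))
```

At training time such a collator would also emit the matching labels (the original ids at masked positions) for the MLM loss; that bookkeeping is omitted here for brevity.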

Metrics (Epoch 15):

  • Perplexity: 10.73
  • Validation Loss: 2.37
  • Top-1 Accuracy: 61.01%
  • Top-5 Accuracy: 73.04%
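These numbers are internally consistent: perplexity is the exponential of the mean cross-entropy loss, and exp(2.37) ≈ 10.70, which matches the reported 10.73 up to rounding of the loss:

```python
import math

val_loss = 2.37                  # reported validation loss (mean cross-entropy, nats)
perplexity = math.exp(val_loss)  # perplexity = exp(loss)
print(round(perplexity, 2))      # 10.7 — close to the reported 10.73
```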

🚀 How to use

Use the <mask> token for missing words. Always remember to add a context tag at the beginning of the string!

```python
from transformers import pipeline

# Load the fill-mask pipeline
restorer = pipeline(
    "fill-mask",
    model="AlexSychovUN/ancient-russian-roformer-v2",
    device=0,  # set to -1 for CPU
)

# Example 1: birch bark manuscript (torn beginning)
text_daily = "[CTX_DAILY] <mask> <mask> ко василью . а серебро ми отдай."
print(restorer(text_daily))
# Output: [{'token_str': ' От', ...}, {'token_str': ' поклон', ...}]

# Example 2: legal text (Russkaya Pravda)
text_legal = "[CTX_LEGAL] Аже кто оубиеть <mask> , то платити виру 40 гривенъ."
print(restorer(text_legal))
# Output: [{'token_str': ' мужь', ...}, ...]
```
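When a fragment contains several masks, one simple strategy is to fill them greedily, left to right, re-running the model after each substitution so later predictions condition on earlier restorations. A minimal sketch (the helper is illustrative, not part of the released code; in practice pass the restorer pipeline from above):

```python
def fill_greedy(restorer, text, mask="<mask>"):
    """Replace each mask in turn with the model's top candidate."""
    while mask in text:
        preds = restorer(text)
        # With several masks, the HF fill-mask pipeline returns one candidate
        # list per mask; take the list for the first (leftmost) mask.
        first = preds[0] if isinstance(preds[0], list) else preds
        best = first[0]["token_str"].strip()
        text = text.replace(mask, best, 1)
    return text

# Stub restorer for demonstration only — always proposes " поклон"
def fake_restorer(text):
    return [[{"token_str": " поклон"}]]

print(fill_greedy(fake_restorer, "[CTX_DAILY] <mask> ко василью"))
# [CTX_DAILY] поклон ко василью
```

Greedy filling is a heuristic: it commits to the top candidate at each step, so for heavily damaged fragments it can help to inspect the full top-5 list per mask instead.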
Model size: 44.9M parameters (F32, Safetensors)