Ancient Rus Text Restorer
This model is designed for the restoration and reconstruction of damaged Ancient Russian and Old Russian texts (11th to 17th centuries). It is based on the RoFormer architecture, which uses Rotary Position Embedding (RoPE), making it highly robust when working with texts that have missing beginnings, torn endings, or completely destroyed fragments.
The tokenizer is configured with strip_accents=False and lowercase=False to preserve titlos (҃) and historical orthography (ѣ, ѳ, ѵ, ѫ, ѧ).

Context tags:
[CTX_DAILY]: Everyday birch bark manuscripts (gramoty)
[CTX_CHURCH]: Church Slavonic texts (Bible, hagiographies)
[CTX_LEGAL]: Legal documents (Russkaya Pravda, Sudebniks)
[CTX_LIT]: Literature and chronicles (Primary Chronicle)
[CTX_EPIC]: Epics and bylinas
[CTX_SCIENCE]: Military and technical manuals (e.g., 17th-century military regulations)
[GAP]: A special tag used to denote completely unreadable or physically destroyed spots in the manuscript.

The model was trained on a unique, carefully balanced corpus of approximately 20 million tokens, compiled from:
Physical Degradation Collator: During training, a custom DataCollator was implemented to simulate real physical damage to historical documents:
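The details of the collator are not given here, so the following is only a minimal sketch of the general idea, not the author's implementation. It assumes three damage modes that mirror the cases named earlier (missing beginning, torn ending, destroyed fragment), works on plain word lists rather than token ID tensors, and assumes the mask token is spelled `<mask>`. The names `degrade`, `collate`, and `max_frac` are hypothetical.

```python
import random

MASK = "<mask>"  # assumed spelling of the mask token


def degrade(words, rng, max_frac=0.3):
    """Hide one contiguous run of words, imitating a single kind of damage.

    Modes (hypothetical, chosen for illustration):
      "head": the start of the line is torn off
      "tail": the end of the line is torn off
      "hole": a fragment somewhere in the line is destroyed
    Returns the damaged word list and the indices that were hidden.
    """
    n = len(words)
    if n == 0:
        return [], []
    # Damage at least one word; longer lines lose up to max_frac of their words.
    span = max(1, int(n * max_frac * rng.random()))
    mode = rng.choice(["head", "tail", "hole"])
    if mode == "head":
        start = 0
    elif mode == "tail":
        start = n - span
    else:
        start = rng.randint(0, n - span)
    hidden = list(range(start, start + span))
    damaged = [MASK if i in hidden else w for i, w in enumerate(words)]
    return damaged, hidden


def collate(batch, seed=None):
    """Apply `degrade` to every line; labels keep the original words."""
    rng = random.Random(seed)
    out = []
    for words in batch:
        damaged, hidden = degrade(words, rng)
        out.append({"input": damaged, "labels": {i: words[i] for i in hidden}})
    return out


batch = [["а", "серебро", "ми", "отдай"]]
for item in collate(batch, seed=0):
    print(item)
```

Hiding a contiguous run, rather than scattering independent masks, is what makes the damage resemble a tear or a hole; a real collator would do the same on token IDs and emit label tensors.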
Use the <mask> token for each missing word. Always remember to add the context tag at the beginning of the string!
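Since a forgotten or misspelled tag is easy to miss, a small helper can assemble the input string and reject unknown tags. This is a sketch only: `build_prompt` and `CONTEXT_TAGS` are hypothetical names, `None` is an arbitrary convention for "missing word", and the mask token is assumed to be spelled `<mask>`.

```python
# Context tags listed in this model card.
CONTEXT_TAGS = {
    "[CTX_DAILY]", "[CTX_CHURCH]", "[CTX_LEGAL]",
    "[CTX_LIT]", "[CTX_EPIC]", "[CTX_SCIENCE]",
}


def build_prompt(tag, words, mask_token="<mask>"):
    """Join words into a tagged prompt; None marks a missing word."""
    if tag not in CONTEXT_TAGS:
        raise ValueError(f"unknown context tag: {tag!r}")
    filled = [mask_token if w is None else w for w in words]
    return " ".join([tag] + filled)


prompt = build_prompt(
    "[CTX_LEGAL]",
    ["Аже", "кто", "оубиеть", None, ",", "то", "платити", "виру", "40", "гривенъ."],
)
print(prompt)
# [CTX_LEGAL] Аже кто оубиеть <mask> , то платити виру 40 гривенъ.
```

Once a pipeline is loaded, its `tokenizer.mask_token` attribute reports the exact spelling of the mask token and can be passed as `mask_token` instead of the assumed default.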
from transformers import pipeline
# Load the pipeline
restorer = pipeline(
"fill-mask",
model="AlexSychovUN/ancient-russian-roformer-v2",
device=0 # Set to -1 for CPU
)
# Example 1: Birch bark manuscript (torn beginning)
text_daily = "[CTX_DAILY] <mask> <mask> ко василью . а серебро ми отдай."
print(restorer(text_daily))
# Output: [{'token_str': ' От', ...}, {'token_str': ' поклон', ...}]
# Example 2: Legal text (Russkaya Pravda)
text_legal = "[CTX_LEGAL] Аже кто оубиеть <mask> , то платити виру 40 гривенъ."
print(restorer(text_legal))
# Output: [{'token_str': ' мужь', ...}, ...]
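The fill-mask pipeline returns a flat list of candidate dicts for an input with one mask, and a list of such lists (one per mask) when the input contains several. The sketch below normalizes both shapes and picks the highest-scoring token for each gap. `best_per_mask` is a hypothetical helper, and the scores in the sample data are invented for illustration; they are not real model output.

```python
def best_per_mask(result):
    """Return the top-scoring token string for each mask in a fill-mask result."""
    if not result:
        return []
    # One mask: flat list of dicts. Several masks: list of lists of dicts.
    per_mask = result if isinstance(result[0], list) else [result]
    return [
        max(cands, key=lambda c: c["score"])["token_str"].strip()
        for cands in per_mask
    ]


# Hand-written stand-ins for pipeline output (scores are made up).
single = [{"token_str": " мужь", "score": 0.6}]
double = [
    [{"token_str": " поклон", "score": 0.1}, {"token_str": " От", "score": 0.7}],
    [{"token_str": " поклон", "score": 0.6}, {"token_str": " От", "score": 0.05}],
]

print(best_per_mask(single))  # ['мужь']
print(best_per_mask(double))  # ['От', 'поклон']
```

With a loaded pipeline, the same helper can be applied directly, for example `best_per_mask(restorer(text_daily))`. Note that each mask is scored independently, so the top tokens for neighbouring gaps are not guaranteed to fit together.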