# Mini-BERT for Ancient Rus Texts (V2)
This is a custom-trained Masked Language Model (MLM) built for restoring and analyzing Old Russian and Old Church Slavonic texts. It was trained from scratch on a balanced, multi-domain corpus of historical documents.
## 📊 Model Architecture
- Type: Mini-BERT (from scratch)
- Hidden Size: 512
- Layers: 6
- Attention Heads: 8
- Parameters: ~30M
- Block Size: 256 tokens
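As a sanity check on the ~30M figure, the parameter count can be estimated from the hyperparameters above. Note that the card does not state the vocabulary size; the `vocab=20_000` below is an assumed value, and the formula ignores biases and LayerNorm weights, so this is a rough sketch rather than the exact count.

```python
# Rough BERT parameter-count estimate from the architecture above.
# ASSUMPTION: vocab size of 20_000 (not stated in the model card);
# biases and LayerNorm parameters are ignored for simplicity.

def bert_param_estimate(hidden=512, layers=6, vocab=20_000, max_pos=256):
    # Token, position, and segment embeddings.
    embeddings = vocab * hidden + max_pos * hidden + 2 * hidden
    # Per encoder layer: attention projections (4*h^2) + feed-forward (8*h^2).
    per_layer = 12 * hidden * hidden
    return embeddings + layers * per_layer

print(f"~{bert_param_estimate() / 1e6:.1f}M parameters")  # → ~29.2M parameters
```

With these assumptions the estimate lands close to the reported ~30M, most of it in the six encoder layers.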
## 📚 Training Data
The model was trained on a custom, meticulously cleaned corpus of 11.5 million tokens. The dataset is balanced across six distinct historical domains, each marked with a special context tag:
- `[CTX_CHURCH]` (27.4%): Religious texts, prayers, Bible.
- `[CTX_DAILY]` (26.6%): Everyday correspondence, Novgorod birch bark letters.
- `[CTX_LIT]` (17.6%): Chronicles, literature (e.g., Tale of Igor's Campaign).
- `[CTX_LEGAL]` (12.9%): Law codes (Russkaya Pravda), court documents.
- `[CTX_EPIC]` (12.3%): Epics, bylinas, folklore.
- `[CTX_SCIENCE]` (2.9%): Ancient medical texts, herbals.
## 🏆 Training Results
Trained for 15 epochs with a cosine learning-rate scheduler (peak LR 5e-4).
- Final Validation Loss: 3.4683
- Perplexity: 32.08
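The two numbers above are consistent: for an MLM, perplexity is the exponential of the cross-entropy validation loss, which can be verified in one line.

```python
import math

val_loss = 3.4683               # final validation loss reported above
perplexity = math.exp(val_loss) # perplexity = e^loss for cross-entropy
print(round(perplexity, 2))     # → 32.08
```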
The model demonstrates a strong grasp of historical grammar and case morphology (e.g., correctly predicting Old Russian dative forms such as "гюргю" / "юрью") as well as domain-specific vocabulary.
## 🚀 How to Use
You must prefix your text with one of the six context tags so the model can identify the intended style.
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="AlexSychovUN/mini-bert-ancient-rus-v2",
    tokenizer="AlexSychovUN/mini-bert-ancient-rus-v2",
)

# Example: birch bark letter (daily context)
text = "[CTX_DAILY] Поклонъ ѿ бориса ко [MASK] съ бг҃омъ."
print(fill_mask(text))
# Top predictions: "брату", "гюргю"

# Example: epic context
text_epic = "[CTX_EPIC] Гой еси ты добрый [MASK] , куда путь держишь?"
print(fill_mask(text_epic))
# Top prediction: "конь" (94.9%)
```
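Each `fill_mask(...)` call returns a list of candidate dicts with `score` (softmax probability), `token_str`, and `sequence` keys. A small post-processing sketch for keeping only high-confidence fills is shown below; the sample output is illustrative, hand-written data, not actual model output.

```python
# Illustrative (made-up) fill-mask output for the epic example above;
# the real pipeline returns dicts with the same keys.
sample_output = [
    {"score": 0.949, "token_str": "конь", "sequence": "..."},
    {"score": 0.021, "token_str": "молодецъ", "sequence": "..."},
    {"score": 0.004, "token_str": "путь", "sequence": "..."},
]

def confident_fills(candidates, threshold=0.05):
    """Keep only predicted tokens whose probability exceeds the threshold."""
    return [c["token_str"] for c in candidates if c["score"] >= threshold]

print(confident_fills(sample_output))  # → ['конь']
```

Raising or lowering `threshold` trades recall for precision when restoring damaged passages.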