# Mini-BERT for Ancient Rus Texts (V2)
This is a custom-trained Masked Language Model (MLM) built for restoring and analyzing Old Russian and Old Church Slavonic texts. It was trained from scratch on a balanced, multi-domain corpus of historical documents.
## 📊 Model Architecture
- Type: Mini-BERT (from scratch)
- Hidden Size: 512
- Layers: 6
- Attention Heads: 8
- Parameters: ~30M
- Block Size: 256 tokens
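As a sanity check on the ~30M figure, the parameter count can be estimated from the hyperparameters above. Note that the card does not state the vocabulary size; the `vocab=20_000` below is an assumed value, and the formula ignores biases and LayerNorm weights, so this is a rough sketch rather than the exact count.

```python
# Rough BERT parameter-count estimate from the architecture above.
# ASSUMPTION: vocab size of 20_000 (not stated in the model card);
# biases and LayerNorm parameters are ignored for simplicity.

def bert_param_estimate(hidden=512, layers=6, vocab=20_000, max_pos=256):
    # Token, position, and segment embeddings.
    embeddings = vocab * hidden + max_pos * hidden + 2 * hidden
    # Per encoder layer: attention projections (4*h^2) + feed-forward (8*h^2).
    per_layer = 12 * hidden * hidden
    return embeddings + layers * per_layer

print(f"~{bert_param_estimate() / 1e6:.1f}M parameters")  # → ~29.2M parameters
```

With these assumptions the estimate lands close to the reported ~30M, most of it in the six encoder layers.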
## 📚 Training Data
The model was trained on a custom, meticulously cleaned corpus of 11.5 million tokens. The dataset is balanced across six distinct historical domains, each marked with a special context tag:
- `[CTX_CHURCH]` (27.4%): Religious texts, prayers, Bible.
- `[CTX_DAILY]` (26.6%): Everyday correspondence, Novgorod birch bark letters.
- `[CTX_LIT]` (17.6%): Chronicles, literature (e.g., Tale of Igor's Campaign).
- `[CTX_LEGAL]` (12.9%): Law codes (Russkaya Pravda), court documents.
- `[CTX_EPIC]` (12.3%): Epics, bylinas, folklore.
- `[CTX_SCIENCE]` (2.9%): Ancient medical texts, herbals.
## 🏆 Training Results
Trained for 15 epochs with a cosine learning-rate scheduler (peak LR 5e-4).
- Final Validation Loss: 3.4683
- Perplexity: 32.08
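The two numbers above are consistent: for an MLM, perplexity is the exponential of the cross-entropy validation loss, which can be verified in one line.

```python
import math

val_loss = 3.4683               # final validation loss reported above
perplexity = math.exp(val_loss) # perplexity = e^loss for cross-entropy
print(round(perplexity, 2))     # → 32.08
```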
The model demonstrates a strong grasp of historical grammar and case morphology (e.g., correctly predicting Old Russian dative forms such as "гюргю" / "юрью") as well as domain-specific vocabulary.
## 🚀 How to Use
You must prefix your text with one of the six context tags so the model can identify the intended style.
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="AlexSychovUN/mini-bert-ancient-rus-v2",
    tokenizer="AlexSychovUN/mini-bert-ancient-rus-v2",
)

# Example: birch bark letter (daily context)
text = "[CTX_DAILY] Поклонъ ѿ бориса ко [MASK] съ бг҃омъ."
print(fill_mask(text))
# Top predictions: "брату", "гюргю"

# Example: epic context
text_epic = "[CTX_EPIC] Гой еси ты добрый [MASK] , куда путь держишь?"
print(fill_mask(text_epic))
# Top prediction: "конь" (94.9%)
```
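Each `fill_mask(...)` call returns a list of candidate dicts with `score` (softmax probability), `token_str`, and `sequence` keys. A small post-processing sketch for keeping only high-confidence fills is shown below; the sample output is illustrative, hand-written data, not actual model output.

```python
# Illustrative (made-up) fill-mask output for the epic example above;
# the real pipeline returns dicts with the same keys.
sample_output = [
    {"score": 0.949, "token_str": "конь", "sequence": "..."},
    {"score": 0.021, "token_str": "молодецъ", "sequence": "..."},
    {"score": 0.004, "token_str": "путь", "sequence": "..."},
]

def confident_fills(candidates, threshold=0.05):
    """Keep only predicted tokens whose probability exceeds the threshold."""
    return [c["token_str"] for c in candidates if c["score"] >= threshold]

print(confident_fills(sample_output))  # → ['конь']
```

Raising or lowering `threshold` trades recall for precision when restoring damaged passages.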