# Mini-BERT for Ancient Rus Texts (V2)

This is a custom masked language model (MLM) designed for the restoration and analysis of Old Russian and Old Church Slavonic texts. It was trained from scratch on a balanced, multi-domain corpus of historical documents.

## 📊 Model Architecture

- Type: Mini-BERT (trained from scratch)
- Hidden size: 512
- Layers: 6
- Attention heads: 8
- Parameters: ~35M (34.8M in the Safetensors checkpoint)
- Block size: 256 tokens
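As a sanity check, these hyperparameters can be plugged into a rough BERT parameter estimate. The vocabulary size is not stated in this card, so the 30,000 below is an assumption chosen purely to illustrate how a figure near the checkpoint's 34.8M can arise:

```python
def bert_param_estimate(vocab=30_000, hidden=512, layers=6, max_pos=256, ffn_mult=4):
    """Rough parameter count for a BERT-style MLM; vocab size is a guess."""
    emb = vocab * hidden + max_pos * hidden + 2 * hidden + 2 * hidden  # token/position/type embeddings + LayerNorm
    attn = 4 * (hidden * hidden + hidden)                              # Q, K, V and output projections
    ffn = 2 * ffn_mult * hidden * hidden + ffn_mult * hidden + hidden  # up/down projections with biases
    norms = 2 * 2 * hidden                                             # two LayerNorms per layer
    head = hidden * hidden + hidden + 2 * hidden + vocab               # MLM transform + LayerNorm + tied-decoder bias
    return emb + layers * (attn + ffn + norms) + head

print(f"{bert_param_estimate() / 1e6:.1f}M")  # ≈ 34.7M under these assumptions
```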

## 📚 Training Data

The model was trained on a custom, meticulously cleaned corpus of 11.5 million tokens. The dataset is balanced across 6 distinct historical domains, each marked with a special context tag:

- `[CTX_CHURCH]` (27.4%): religious texts, prayers, the Bible.
- `[CTX_DAILY]` (26.6%): everyday correspondence, Novgorod birch bark letters.
- `[CTX_LIT]` (17.6%): chronicles, literature (e.g., the Tale of Igor's Campaign).
- `[CTX_LEGAL]` (12.9%): law codes (Russkaya Pravda), court documents.
- `[CTX_EPIC]` (12.3%): epics, bylinas, folklore.
- `[CTX_SCIENCE]` (2.9%): ancient medical texts, herbals.
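Since every training example carries one of these tags, inference inputs should carry one too. A minimal helper (hypothetical, not part of the released code) that validates and prepends a tag:

```python
# The six domain tags used during training (from the list above)
CONTEXT_TAGS = {
    "[CTX_CHURCH]", "[CTX_DAILY]", "[CTX_LIT]",
    "[CTX_LEGAL]", "[CTX_EPIC]", "[CTX_SCIENCE]",
}

def tag_text(text: str, tag: str) -> str:
    """Prepend a domain context tag to the input, mirroring the training format."""
    if tag not in CONTEXT_TAGS:
        raise ValueError(f"unknown context tag: {tag}")
    return f"{tag} {text}"

print(tag_text("Поклонъ ѿ бориса", "[CTX_DAILY]"))  # [CTX_DAILY] Поклонъ ѿ бориса
```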

## 🏆 Training Results

Trained for 15 epochs with a cosine learning-rate schedule (peak LR 5e-4).

- Final validation loss: 3.4683
- Perplexity: 32.08
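The two reported numbers are linked: perplexity is the exponential of the cross-entropy loss. A quick stdlib check, plus a sketch of the cosine schedule (warmup is omitted below because its settings are not stated in this card):

```python
import math

# Perplexity is exp(cross-entropy loss)
perplexity = math.exp(3.4683)
print(round(perplexity, 2))  # 32.08, matching the reported value

def cosine_lr(step: int, total_steps: int, peak_lr: float = 5e-4) -> float:
    """Cosine decay from peak_lr to 0; warmup omitted (an assumption)."""
    return peak_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))

print(cosine_lr(0, 1000))     # peak LR at the start
print(cosine_lr(1000, 1000))  # ~0 at the end
```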

The model shows a strong grasp of historical grammar, including case morphology (e.g., predicting Old Russian dative forms such as "гюргю" / "юрью") and domain-specific vocabulary.

## 🚀 How to Use

You must prepend one of the 6 context tags to your text so the model can identify the correct style.

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="AlexSychovUN/mini-bert-ancient-rus-v2",
    tokenizer="AlexSychovUN/mini-bert-ancient-rus-v2",
)

# Example: birch bark letter (daily context)
text = "[CTX_DAILY] Поклонъ ѿ бориса ко [MASK] съ бг҃омъ."
print(fill_mask(text))
# Top predictions: "брату", "гюргю"

# Example: epic context
text_epic = "[CTX_EPIC] Гой еси ты добрый [MASK] , куда путь держишь?"
print(fill_mask(text_epic))
# Top prediction: "конь" (94.9%)
```
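The pipeline returns a list of candidate dicts with `score` and `token_str` keys. A small helper (hypothetical, but matching the standard fill-mask output shape) to pull out the top candidates:

```python
def top_predictions(results, k=2):
    """Return the k highest-scoring token strings from a fill-mask result."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [r["token_str"] for r in ranked[:k]]

# Works on the list of dicts that fill_mask(...) returns:
sample = [
    {"score": 0.10, "token_str": "гюргю"},
    {"score": 0.42, "token_str": "брату"},
]
print(top_predictions(sample))  # ['брату', 'гюргю']
```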