RoFormerBPE

A masked language model trained from scratch on Old East Slavic and Old Church Slavonic texts, using a RoFormer architecture with BPE tokenisation. Based on mini-roformer-ancient-rus-v2.

Note: BPE token boundaries do not always align with lacuna boundaries in the editorial markup, which inflates the span-level character error rate (CER). For character-level restoration tasks, consider using DualEmbLM instead.
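The misalignment mentioned in the note can be illustrated with a toy example. The sketch below is not the model's tokenizer: the segmentation of the word and the helper names `token_offsets` and `split_by_lacuna` are hypothetical, chosen only to show how a lacuna can cut through subword tokens instead of coinciding with them.

```python
from typing import List, Tuple


def token_offsets(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets, assuming the tokens concatenate to the text."""
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    return offsets


def split_by_lacuna(tokens: List[str], lacuna_start: int, lacuna_end: int):
    """Split overlapping tokens into those fully inside the lacuna and those only partly covered."""
    inside, partial = [], []
    for tok, (start, end) in zip(tokens, token_offsets(tokens)):
        if end <= lacuna_start or start >= lacuna_end:
            continue  # token does not touch the lacuna
        if start >= lacuna_start and end <= lacuna_end:
            inside.append(tok)
        else:
            partial.append(tok)
    return inside, partial


# Hypothetical BPE segmentation of "поклонъ"; the lacuna covers characters 2..4 ("кло").
tokens = ["пок", "лон", "ъ"]
inside, partial = split_by_lacuna(tokens, lacuna_start=2, lacuna_end=5)
print("fully inside:", inside)
print("partly covered:", partial)
```

In this toy case no token lies fully inside the lacuna, so masking whole tokens would either hide surviving characters or leave part of the gap unmasked. That is one way span-level scores can be penalised.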

Architecture

  • Tokenisation: BPE (Byte Pair Encoding), vocabulary size 50k
  • Architecture: RoFormer encoder with Rotary Position Embeddings (RoPE)
  • Size: 6 layers, hidden size 512, 8 attention heads
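As a rough sanity check on these figures, the encoder's parameter count can be estimated by hand. The sketch below rests on assumptions the card does not state: a vocabulary of exactly 50,000 entries, a feed-forward size of 4 × hidden (2048), a two-entry token-type table, and biases on every linear layer. The function name `estimate_encoder_params` is illustrative only.

```python
def estimate_encoder_params(
    vocab_size: int = 50_000,   # assumption: "50k" taken as exactly 50,000
    hidden: int = 512,
    layers: int = 6,
    intermediate: int = 2048,   # assumption: 4 x hidden, not stated in the card
    type_vocab: int = 2,        # assumption: BERT-style token-type table
) -> int:
    """Approximate parameter count of the encoder, excluding the MLM head."""
    # Token and token-type embeddings plus one LayerNorm (weight + bias).
    # Rotary position embeddings add no learned position table.
    embeddings = vocab_size * hidden + type_vocab * hidden + 2 * hidden
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections with biases.
    ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    # Two LayerNorms per layer (weight + bias each).
    norms = 2 * 2 * hidden
    return embeddings + layers * (attention + ffn + norms)


print(f"{estimate_encoder_params() / 1e6:.1f}M parameters (encoder only)")
```

The number of attention heads does not change the total, since the heads partition the same projection matrices. Under these assumptions, about 25.6M of the parameters sit in the token embedding table, so the vocabulary size dominates the model size.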

Training

The model was trained on a corpus of Old Russian and Church Slavonic texts assembled from the following sources:

| Source | Language | Word tokens | Link |
|---|---|---|---|
| Birchbark manuscripts | Old Novgorodian (mostly) | 21,464 | gramoty.ru |
| Epigraphy | Old Church Slavonic (mostly) | 8,102 | epigraphica.ru |
| DIACU | Old Church Slavonic; Church Slavonic (Old Russian, Middle Bulgarian, Serbian, Resava recensions); Middle Russian | 1,683,307 | ACL Anthology |
| TOROT | Old Russian; Church Slavonic | 682,430 | torottreebank.github.io |
| Bible (Ponomar) | Church Slavonic | 603,047 | GitHub |
| Byliny | Old Russian (XI–XVII c.) | 430,103 | rusneb.ru |
| Pushkin House | Old Russian | 256,503 | lib2.pushkinskijdom.ru |
| Military Statute (Part 2) | Old Russian | 49,787 | rusneb.ru |
| NKRYA (historical) | Old Russian; Old Rus (XI–XVIII c.) | 42,412 | ruscorpora.ru |

Masking details: MLM probability 8%, span masking, edge masking, random gap augmentation.
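The card does not spell out how these masking strategies were implemented. The following is a minimal sketch of one possible reading, not the author's training code. It assumes that span masking means covering contiguous runs of tokens until roughly 8% are masked, and that edge masking means hiding tokens at the start or end of a sequence, as on a damaged margin. The function names, the `[MASK]` string, and the `max_span` and `width` parameters are all hypothetical. Random gap augmentation is not covered.

```python
import random
from typing import List


def span_mask(tokens: List[str], mask_prob: float = 0.08, max_span: int = 3,
              mask_token: str = "[MASK]", seed: int = 0) -> List[str]:
    """Mask contiguous spans until about mask_prob of the tokens are covered."""
    if not tokens:
        return []
    rng = random.Random(seed)
    budget = max(1, round(len(tokens) * mask_prob))  # how many tokens to mask
    masked = set()
    while len(masked) < budget:
        # Never draw a span longer than what is left of the budget.
        span = min(rng.randint(1, max_span), budget - len(masked))
        start = rng.randrange(0, len(tokens) - span + 1)
        masked.update(range(start, start + span))
    return [mask_token if i in masked else tok for i, tok in enumerate(tokens)]


def edge_mask(tokens: List[str], width: int = 2, side: str = "left",
              mask_token: str = "[MASK]") -> List[str]:
    """Mask `width` tokens at the left or right edge of the sequence."""
    out = list(tokens)
    width = min(width, len(out))
    positions = range(width) if side == "left" else range(len(out) - width, len(out))
    for i in positions:
        out[i] = mask_token
    return out
```

For example, `span_mask([f"t{i}" for i in range(50)])` masks four of the fifty tokens, and `edge_mask(["a", "b", "c"], side="right")` replaces the last two tokens.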

Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained(
    "MaximEremeev/RoFormer-slav",
    trust_remote_code=True,
)
model = AutoModelForMaskedLM.from_pretrained(
    "MaximEremeev/RoFormer-slav",
    trust_remote_code=True,
)

Tasks

  • Generated lacunae restoration (Test A Hit@1: 0.267, CER: 0.839)
  • Real lacunae restoration (Test B char Hit@1: 0.158, span Hit@1: 0.063)
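The card does not define these metrics precisely. The sketch below shows one common way to compute them, assuming CER is the Levenshtein distance between prediction and reference divided by the reference length, and Hit@1 is the share of lacunae whose top prediction matches the reference exactly. The actual evaluation may aggregate differently, for instance over a whole test set rather than per span.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    """Character error rate of one prediction against its reference."""
    if not reference:
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, reference) / len(reference)


def hit_at_1(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of items whose top prediction equals the reference exactly."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)
```

Under this definition CER can exceed 1.0 when a prediction is much longer than the reference, so it is not bounded like Hit@1.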

Contact

Maxim Eremeev, maeremeev@edu.hse.ru

Model size

44.9M params, tensor type F32 (Safetensors)