metadata
language:
  - orv
  - cu
tags:
  - masked-language-modeling
  - old-slavonic
  - old-russian
  - birchbark
  - historical-nlp
  - roformer
  - rope
  - bpe
license: apache-2.0

RoFormerBPE

A masked language model trained from scratch on Old East Slavic and Old Church Slavonic texts, using a RoFormer architecture with BPE tokenisation. Based on mini-roformer-ancient-rus-v2.

Note: BPE token boundaries do not always align with lacuna boundaries in editorial markup, which inflates span-level CER. For character-level restoration tasks consider using DualEmbLM instead.
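The effect described in the note can be seen in a small sketch that maps a character-level lacuna onto the tokens it overlaps. The segmentation below is invented for illustration and is not the output of this model's tokenizer; the helper name `tokens_overlapping_span` is likewise hypothetical.

```python
def tokens_overlapping_span(tokens, start, end):
    """Find the tokens that overlap the character span [start, end).

    Returns (first_token, stop_token, covered_start, covered_end), where
    stop_token is exclusive and covered_* are character offsets.
    """
    offsets = []
    pos = 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    hit = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    if not hit:
        raise ValueError("span does not overlap any token")
    first, last = hit[0], hit[-1]
    return first, last + 1, offsets[first][0], offsets[last][1]


# Hypothetical BPE segmentation of a 7-character word (assumption, not real output).
tokens = ["пок", "лон", "ъ"]
lacuna_start, lacuna_end = 2, 4  # the editor marks 2 characters as lost

first, stop, cov_start, cov_end = tokens_overlapping_span(tokens, lacuna_start, lacuna_end)
# Characters the model must regenerate beyond the lacuna itself.
extra_chars = (cov_end - cov_start) - (lacuna_end - lacuna_start)
```

Here a two-character lacuna straddles two tokens, so masking whole tokens forces the model to regenerate six characters. Any error in the four extra characters counts against the span-level score even though those characters were never missing.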

Architecture

  • Tokenisation: BPE (Byte Pair Encoding), vocabulary size 50k
  • Architecture: RoFormer encoder with Rotary Position Embeddings (RoPE)
  • Size: 6 layers, hidden size 512, 8 attention heads
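With hidden size 512 and 8 heads, each attention head works on 512 / 8 = 64 dimensions. The sketch below applies rotary position embeddings to vectors of that size in plain Python and checks the property RoPE is designed for: the query-key score depends only on the relative offset between positions. It uses the interleaved-pair formulation and a base of 10000, both of which are assumptions; the README does not state which variant or base this model uses.

```python
import math

HIDDEN_SIZE = 512
NUM_HEADS = 8
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # 64 dimensions per attention head


def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair (vec[i], vec[i+1]) by a position-dependent angle.

    For even i the angle is position * base ** (-i / d), i.e. the pair with
    index i // 2 uses frequency base ** (-2 * (i // 2) / d).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Arbitrary toy query and key vectors for one head.
q = [math.sin(0.1 * n) for n in range(HEAD_DIM)]
k_vec = [math.cos(0.2 * n) for n in range(HEAD_DIM)]

# Both pairs of positions are 3 apart, so the attention scores should match.
score_a = dot(rope(q, 5), rope(k_vec, 2))
score_b = dot(rope(q, 13), rope(k_vec, 10))
```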

Training

The model was trained on a corpus of Old Russian and Church Slavonic texts assembled from the following sources:

| Source | Language | Word Tokens | Link |
|---|---|---|---|
| Birchbark manuscripts | Old Novgorodian (mostly) | 21,464 | gramoty.ru |
| Epigraphy | Old Church Slavonic (mostly) | 8,102 | epigraphica.ru |
| DIACU | Old Church Slavonic; Church Slavonic (Old Russian, Middle Bulgarian, Serbian, Resava recensions); Middle Russian | 1,683,307 | ACL Anthology |
| TOROT | Old Russian; Church Slavonic | 682,430 | torottreebank.github.io |
| Bible (Ponomar) | Church Slavonic | 603,047 | GitHub |
| Byliny | Old Russian (11th–17th c.) | 430,103 | rusneb.ru |
| Pushkin House | Old Russian | 256,503 | lib2.pushkinskijdom.ru |
| Military Statute (Part 2) | Old Russian | 49,787 | rusneb.ru |
| NKRYA (historical) | Old Russian (11th–18th c.) | 42,412 | ruscorpora.ru |

Masking details: MLM probability 8%, span masking, edge masking, random gap augmentation.
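The README does not define these masking strategies, so the following is only one plausible reading of span masking with an 8% budget, not the training code. The maximum span length, the `[MASK]` string and the function name `span_mask` are all assumptions; edge masking and random gap augmentation are not covered.

```python
import random


def span_mask(tokens, mask_prob=0.08, max_span=3, mask_token="[MASK]", seed=None):
    """Mask contiguous spans until about mask_prob of the tokens are hidden.

    Returns (corrupted, labels): labels holds the original token at masked
    positions and None elsewhere.
    """
    rng = random.Random(seed)
    n = len(tokens)
    # At least one token is masked in a non-empty sequence, never more than n.
    budget = min(n, max(1, round(n * mask_prob)))
    masked = set()
    attempts = 0
    while len(masked) < budget and attempts < 10 * n:
        attempts += 1
        # Never draw a span longer than what is left of the budget.
        span_len = min(rng.randint(1, max_span), budget - len(masked))
        start = rng.randrange(0, n - span_len + 1)
        masked.update(range(start, start + span_len))
    corrupted = [mask_token if i in masked else t for i, t in enumerate(tokens)]
    labels = [t if i in masked else None for i, t in enumerate(tokens)]
    return corrupted, labels


# Toy input: 100 placeholder tokens, so the budget is 8 masked positions.
tokens = [f"t{i}" for i in range(100)]
corrupted, labels = span_mask(tokens, seed=0)
```

Capping each span at the remaining budget keeps the masked fraction from overshooting 8%, at the cost of shorter spans near the end of sampling.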

Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained(
    "MaximEremeev/RoFormer-slav",
    trust_remote_code=True,
)
model = AutoModelForMaskedLM.from_pretrained(
    "MaximEremeev/RoFormer-slav",
    trust_remote_code=True,
)
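Once the tokenizer and model are loaded, masked positions can be ranked roughly as follows. This is a sketch that assumes the tokenizer exposes the usual `mask_token_id` and `convert_ids_to_tokens` members and that the model returns standard `logits`; since the repository loads custom code via `trust_remote_code`, its actual interface may differ. The helper names `top_k_tokens` and `predict_masked` are hypothetical.

```python
def top_k_tokens(scores, id_to_token, k=5):
    """Return the k highest-scoring (token, score) pairs from a list of logits."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(id_to_token(i), scores[i]) for i in ranked[:k]]


def predict_masked(text, tokenizer, model, k=5):
    """Rank candidate tokens for every mask token found in `text`."""
    import torch  # imported lazily so top_k_tokens stays dependency-free

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]  # shape: (sequence_length, vocab_size)
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_positions = is_mask.nonzero(as_tuple=True)[0].tolist()
    return [
        top_k_tokens(logits[pos].tolist(), tokenizer.convert_ids_to_tokens, k)
        for pos in mask_positions
    ]
```

To query the model, insert `tokenizer.mask_token` where text is missing and call `predict_masked(text, tokenizer, model)`. The result contains one ranked candidate list per mask token.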

Tasks

  • Generated lacunae restoration (Test A Hit@1: 0.281, CER: 0.839)
  • Real lacunae restoration (Test B char Hit@1: 0.145, span Hit@1: 0.021)
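The README does not spell out how these metrics are computed. Under the common definitions, Hit@1 is the share of examples whose top prediction matches the reference exactly, and CER is the Levenshtein distance divided by the reference length. The sketch below implements those definitions on toy strings; whether the reported numbers follow exactly this procedure is an assumption.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def cer(prediction, reference):
    """Character error rate: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(prediction, reference) / len(reference)


def hit_at_1(predictions, references):
    """Fraction of examples whose top prediction equals the reference exactly."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equally sized, non-empty lists")
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


# Toy data, unrelated to the actual test sets.
references = ["abc", "de"]
predictions = ["abd", "de"]
toy_hit = hit_at_1(predictions, references)
toy_cer = cer(predictions[0], references[0])
```

Because CER is normalised by the reference length, it can exceed 1.0 when a prediction is much longer than the reference.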

Contact

Maxim Eremeev, maeremeev@edu.hse.ru