DualEmbLM

A masked language model trained from scratch on Old East Slavic and Old Church Slavonic texts, with dual character-level + word-level embeddings.

Architecture

DualEmbLM combines:

  • Character-level tokenisation (1 character = 1 token): enables precise lacuna restoration at the character level
  • Word-level context embeddings: provide morphological and lexical context via a 50k-word vocabulary
  • Transformer encoder (BERT architecture, trained from scratch): 6 layers, hidden size 512, 8 attention heads

The dual embeddings are concatenated and projected into the shared hidden space before being passed to the transformer encoder.
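The concatenate-and-project step can be illustrated with a minimal pure-Python sketch. It is not the model's actual implementation: the helper names (char_to_word_ids, make_table, dual_embed), the whitespace-based alignment of characters to words, the use of index 0 for unknown items, and the toy dimensions are all assumptions made for illustration. The real model uses a hidden size of 512 and a 50k word vocabulary.

```python
import random

def char_to_word_ids(text):
    """Map each character to the index of the whitespace-delimited word
    that contains it; whitespace characters get -1 (assumed alignment)."""
    ids, word_idx, in_word = [], -1, False
    for ch in text:
        if ch.isspace():
            ids.append(-1)
            in_word = False
        else:
            if not in_word:
                word_idx += 1
                in_word = True
            ids.append(word_idx)
    return ids

def make_table(rows, dim, rng):
    """Random matrix standing in for a learned embedding or projection."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

def dual_embed(text, char_vocab, word_vocab, char_table, word_table, proj):
    """For each character: look up its character embedding and the embedding
    of its containing word, concatenate them, and project to the hidden size."""
    words = text.split()
    hidden = []
    for ch, w in zip(text, char_to_word_ids(text)):
        c_vec = char_table[char_vocab.get(ch, 0)]  # 0 = unknown character
        w_id = word_vocab.get(words[w], 0) if w >= 0 else 0  # 0 = unknown / no word
        w_vec = word_table[w_id]
        joint = c_vec + w_vec  # list concatenation: [char features | word features]
        hidden.append([sum(wt * x for wt, x in zip(row, joint)) for row in proj])
    return hidden

# Toy sizes for readability only.
rng = random.Random(0)
CHAR_DIM, WORD_DIM, HIDDEN = 4, 4, 8
text = "се азъ"
char_vocab = {ch: i + 1 for i, ch in enumerate(sorted(set(text)))}
word_vocab = {w: i + 1 for i, w in enumerate(sorted(set(text.split())))}
char_table = make_table(len(char_vocab) + 1, CHAR_DIM, rng)
word_table = make_table(len(word_vocab) + 1, WORD_DIM, rng)
proj = make_table(HIDDEN, CHAR_DIM + WORD_DIM, rng)

hidden = dual_embed(text, char_vocab, word_vocab, char_table, word_table, proj)
print(len(hidden), len(hidden[0]))
```

The result holds one vector of size HIDDEN per input character, which is the shape a character-level encoder would consume.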

Training

The model was trained on a corpus of Old Russian and Church Slavonic texts assembled from the following sources:

| Source | Language | Word Tokens | Link |
|---|---|---|---|
| Birchbark manuscripts | Old Novgorodian (mostly) | 19,045 | gramoty.ru |
| Epigraphy | Old Church Slavonic (mostly) | 7,095 | epigraphica.ru |
| DIACU | Old Church Slavonic; Church Slavonic (Old Russian, Middle Bulgarian, Serbian, Resava recensions); Middle Russian | 1,588,323 | ACL Anthology |
| TOROT | Old Russian; Church Slavonic | 603,047 | torottreebank.github.io |
| Bible (Ponomar) | Church Slavonic | 682,430 | GitHub |
| Byliny | Old Russian (11th–17th c.) | 42,412 | rusneb.ru |
| Pushkin House | Old Russian | 430,103 | lib2.pushkinskijdom.ru |
| Military Statute (Part 2) | Old Russian | 49,787 | rusneb.ru |
| NKRYA (historical) | Old Russian (11th–18th c.), Old Novgorodian | 327,315 | ruscorpora.ru |

Masking details: MLM probability 8%, span masking, edge masking, random gap augmentation.
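As an illustration of character-level span masking at an 8% rate, here is a minimal sketch. It is not the training code: the function name mask_spans, the "[MASK]" symbol, the maximum span length of 5, and the uniform choice of span length and position are all assumptions. Edge masking and random gap augmentation are not reproduced here because the source does not describe them in detail.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol; the real one depends on the model's vocabulary

def mask_spans(chars, mlm_prob=0.08, max_span=5, rng=None):
    """Mask contiguous character spans until about mlm_prob of the sequence
    is covered. Returns (inputs, labels); labels are None where unmasked."""
    rng = rng or random.Random()
    n = len(chars)
    budget = max(1, round(n * mlm_prob))  # number of characters to mask
    masked = set()
    attempts = 0
    while len(masked) < budget and attempts < 10 * n:
        attempts += 1
        # Never draw a span longer than what is left of the budget.
        span = rng.randint(1, min(max_span, budget - len(masked)))
        start = rng.randint(0, n - span)
        masked.update(range(start, start + span))
    inputs = [MASK if i in masked else c for i, c in enumerate(chars)]
    labels = [c if i in masked else None for i, c in enumerate(chars)]
    return inputs, labels

# Example: mask a 100-character sequence (budget = 8 characters).
sample = list("а" * 100)
inputs, labels = mask_spans(sample, rng=random.Random(0))
print(sum(tok == MASK for tok in inputs))
```

Because spans may overlap, the number of masked characters never exceeds the budget, and the loss would be computed only at positions whose label is not None.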

Usage

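from transformers import AutoModelForMaskedLM

# trust_remote_code=True lets transformers run the custom model code
# shipped in the model repository.
model = AutoModelForMaskedLM.from_pretrained(
    "MaximEremeev/DualEmb-slav",
    trust_remote_code=True,
)
model.eval()  # switch to inference mode (disables dropout)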

Tasks

  • Generated lacunae restoration (Test A Hit@1: 0.822, CER: 0.179)
  • Real lacunae restoration (Test B char Hit@1: 0.47, span Hit@1: 0.232)
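For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute Hit@1 and character error rate (CER). The exact definitions used for Test A and Test B are not given in the source, so the choices here are assumptions: Hit@1 is taken as the share of gaps whose top prediction matches the reference exactly, and CER as total edit distance divided by total reference length. The function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def hit_at_1(preds, refs):
    """Share of examples whose top prediction equals the reference exactly."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def cer(preds, refs):
    """Total edit distance divided by total reference length (corpus level)."""
    edits = sum(levenshtein(p, r) for p, r in zip(preds, refs))
    return edits / sum(len(r) for r in refs)

# Toy example: one exact restoration, one with a single wrong character.
preds = ["азъ", "миръ"]
refs = ["азъ", "мирь"]
print(hit_at_1(preds, refs), cer(preds, refs))
```

In the toy example, one of two predictions is exact and one of seven reference characters is wrong.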

Contact

Maxim Eremeev, maeremeev@edu.hse.ru

Model size: 29M params (Safetensors; tensor types I64 and F32)