LatinCy Diacritics: Polytonic Greek Diacritics Restorer

A CANINE-S character-level model that restores polytonic diacritics to undiacriticized Ancient Greek text. Fine-tuned on Universal Dependencies Ancient Greek treebanks.

Model Details

  • Architecture: CANINE-S encoder + linear classification head
  • Base model: google/canine-s
  • Task: Per-character multi-class classification (~148 polytonic output classes)
  • Training data: 27,219 sentences from UD Ancient Greek treebanks (PROIEL, Perseus, PTNK)
  • Language: Ancient Greek (grc)
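To illustrate what per-character classification means here, the sketch below splits each polytonic character into a base letter and a label made of its combining marks, then reassembles the text from those pairs. This is only an assumed labeling scheme based on Unicode decomposition; the model's actual class inventory and encoding are not described in this card, and the helper names are hypothetical.

```python
import unicodedata

def split_char(ch):
    """Split one character into (base letter, combining marks) using NFD."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = "".join(c for c in decomposed if unicodedata.combining(c))
    return base, marks

def to_labels(polytonic):
    """Return one (base, label) pair per precomposed character.

    Assumption: every letter plus its marks composes to a single NFC
    code point, which holds for standard polytonic Greek.
    """
    return [split_char(ch) for ch in unicodedata.normalize("NFC", polytonic)]

def apply_labels(pairs):
    """Rebuild polytonic text from (base, label) pairs."""
    return unicodedata.normalize("NFC", "".join(b + m for b, m in pairs))

pairs = to_labels("λόγος")
print([base for base, _ in pairs])   # the undiacriticized input characters
print(apply_labels(pairs))           # the restored text
```

Under this view, the empty label means "leave the character unchanged", which is the correct class for consonants, spaces and punctuation.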

Usage

pip install https://huggingface.co/latincy/latincy-diacritics/resolve/main/latincy_diacritics-0.1.0-py3-none-any.whl

from latincy_diacritics import DiacriticRestorer

restorer = DiacriticRestorer()
restorer.restore("εν αρχη ην ο λογος")
# → 'ἐν ἀρχῇ ἦν ὁ λόγος'
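The restorer takes undiacriticized input. If your text is already polytonic and you want to produce such input (for example, to compare the model's output against the original), a minimal sketch using only the standard library follows. The function name strip_diacritics is hypothetical, and whether this matches the preprocessing used to build the training data is an assumption.

```python
import unicodedata

def strip_diacritics(text):
    """Remove accents, breathings, diaeresis and iota subscript.

    Decomposes to NFD, drops nonspacing marks (category Mn), and
    recomposes to NFC. Base letters, spaces and punctuation are kept.
    """
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", bare)

print(strip_diacritics("ἐν ἀρχῇ ἦν ὁ λόγος"))
# → εν αρχη ην ο λογος
```

The result can then be passed to restorer.restore(...) as in the example above.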

Evaluation

Evaluated on the UD Ancient Greek test set (2,763 sentences, 116,129 mutable characters).

Metric             Score
Mutable accuracy   96.42%
Sentence accuracy  44.80%

Mutable accuracy is computed only over vowels and rho, the characters that can carry diacritics. Overall character accuracy would be inflated by trivial identity predictions on consonants and punctuation.
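The two metrics can be sketched as follows. This is not the project's evaluation script: it assumes gold and predicted sentences align character by character after NFC normalization, and it assumes sentence accuracy means an exact match of the whole sentence, which the card does not state. The names MUTABLE_BASES, base_letter and score are hypothetical.

```python
import unicodedata

# Base letters that can carry diacritics: the seven vowels plus rho.
MUTABLE_BASES = set("αεηιουωρ")

def base_letter(ch):
    """Lowercased base letter of a character, with combining marks removed."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def score(gold_sents, pred_sents):
    """Return (mutable accuracy, sentence accuracy) as fractions."""
    correct = total = exact = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold = unicodedata.normalize("NFC", gold)
        pred = unicodedata.normalize("NFC", pred)
        if len(gold) != len(pred):
            raise ValueError("gold and prediction must align per character")
        for g, p in zip(gold, pred):
            if base_letter(g) in MUTABLE_BASES:
                total += 1
                correct += (g == p)
        exact += (gold == pred)
    mutable_acc = correct / total if total else 0.0
    sentence_acc = exact / len(gold_sents) if gold_sents else 0.0
    return mutable_acc, sentence_acc

# One missed diacritic (ῇ predicted as η) out of seven mutable characters.
print(score(["ἐν ἀρχῇ", "ὁ λόγος"], ["ἐν ἀρχη", "ὁ λόγος"]))
```

A single wrong character makes the whole sentence count as incorrect, which is one reason sentence accuracy sits far below mutable accuracy.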

Per-Character Accuracy

Character  Accuracy  Count   Top confusion
ρ          99.93%    8,045   ῤ↔ῥ
ο          98.06%    19,543  ό↔ο
ε          97.76%    20,141  έ↔ε
α          96.50%    23,190  ά↔α, ἁ↔ἀ
υ          96.18%    10,909  ύ↔υ, ῦ↔υ
ι          94.53%    19,404  ι↔ί, ι↔ῖ
η          93.75%    8,373   η↔ὴ, η↔ή
ω          92.17%    6,524   ῶ↔ω
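As a consistency check, the per-character rows can be combined into a count-weighted average, which should reproduce the headline mutable accuracy. The short sketch below uses only the figures from the table above; the variable names are illustrative.

```python
# (accuracy in percent, count) for each base character, copied from the table.
rows = {
    "ρ": (99.93, 8045),
    "ο": (98.06, 19543),
    "ε": (97.76, 20141),
    "α": (96.50, 23190),
    "υ": (96.18, 10909),
    "ι": (94.53, 19404),
    "η": (93.75, 8373),
    "ω": (92.17, 6524),
}

total = sum(count for _, count in rows.values())
weighted = sum(acc * count for acc, count in rows.values()) / total

print(total)               # → 116129
print(round(weighted, 2))  # → 96.42
```

The counts sum to the 116,129 mutable characters reported for the test set, and the weighted average matches the 96.42% mutable accuracy.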

Limitations

  • Accent placement is the main error source. Deciding whether a vowel carries an accent requires morphological knowledge (verb forms, enclitics, proclitics) that character context alone does not always supply.
  • Breathing marks (smooth vs. rough) account for systematic but less frequent errors.
  • Mid-length words (5-8 characters) are hardest: enough mutable positions for ambiguity, but insufficient length for morphological suffixes to disambiguate.

Training

Fine-tuned for 30 epochs on a single L40S GPU using AdamW (lr=2e-5) with a cosine learning-rate schedule and warmup. The best checkpoint was selected by mutable accuracy on the dev set.
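For readers unfamiliar with the schedule, the sketch below shows one common form of cosine decay with linear warmup, using the peak learning rate of 2e-5 given above. The warmup length, the total step count, and the decay to zero are assumptions chosen for illustration; the card does not state them, and lr_at is a hypothetical helper rather than part of the training code.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to peak_lr over warmup_steps, then follows
    half a cosine wave down to 0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 1,000 steps in total, the first 100 used for warmup.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, warmup_steps=100))
```

The rate peaks at the end of warmup, reaches half the peak midway through the decay phase, and approaches zero at the final step.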

Authors

Patrick J. Burns and Virgilia Antonucci

Citation

@software{burns_latincy_diacritics,
  author = {Burns, Patrick J. and Antonucci, Virgilia},
  title = {LatinCy Diacritics},
  url = {https://github.com/diyclassics/latincy-diacritics},
  license = {MIT}
}