LatinCy Diacritics: Polytonic Greek Diacritics Restorer
A CANINE-S character-level model that restores polytonic diacritics to undiacriticized Ancient Greek text. It is fine-tuned on the Universal Dependencies (UD) Ancient Greek treebanks.
Model Details
- Architecture: CANINE-S encoder + linear classification head
- Base model: google/canine-s
- Task: Per-character multi-class classification (~148 polytonic output classes)
- Training data: 27,219 sentences from UD Ancient Greek treebanks (PROIEL, Perseus, PTNK)
- Language: Ancient Greek (grc)
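To make the per-character classification task concrete, here is a minimal sketch of how a diacritized sentence could be split into an undiacriticized input and one label per character. It is an illustration only: the function names are hypothetical, and labelling each character with its combining-mark sequence is an assumption, since the model's actual label inventory is not described here.

```python
import unicodedata


def split_marks(ch: str) -> tuple[str, str]:
    """Split one character into (base letter, combining marks) via NFD."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = "".join(c for c in decomposed if unicodedata.combining(c))
    return base, marks


def to_training_pair(text: str) -> tuple[str, list[str]]:
    """Return the stripped input string and one mark-label per character.

    Assumes every diacritized character has a precomposed NFC form,
    which holds for standard polytonic Greek.
    """
    pairs = [split_marks(ch) for ch in unicodedata.normalize("NFC", text)]
    plain = "".join(base for base, _ in pairs)
    labels = [marks for _, marks in pairs]
    return plain, labels


def apply_labels(plain: str, labels: list[str]) -> str:
    """Recompose each base character with its predicted marks."""
    return "".join(
        unicodedata.normalize("NFC", ch + marks)
        for ch, marks in zip(plain, labels)
    )


plain, labels = to_training_pair("ἐν ἀρχῇ ἦν ὁ λόγος")
print(plain)
print(apply_labels(plain, labels))
```

Under this framing, consonants and spaces always receive the empty label, which is why accuracy is more informative when restricted to characters that can actually change.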
Usage
pip install https://huggingface.co/latincy/latincy-diacritics/resolve/main/latincy_diacritics-0.1.0-py3-none-any.whl
from latincy_diacritics import DiacriticRestorer
restorer = DiacriticRestorer()
restorer.restore("εν αρχη ην ο λογος")
# → 'ἐν ἀρχῇ ἦν ὁ λόγος'
Evaluation
Evaluated on the UD Ancient Greek test set (2,763 sentences, 116,129 mutable characters).
| Metric | Score |
|---|---|
| Mutable accuracy | 96.42% |
| Sentence accuracy | 44.80% |
Mutable accuracy is computed only over the characters that can carry diacritics (the vowels and rho). Accuracy over all characters would be inflated by trivial identity predictions on consonants and punctuation.
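The two metrics can be sketched as follows. This is not the project's evaluation script: the function names are hypothetical, the comparison assumes predictions and references align position by position after NFC normalization, and treating sentence accuracy as the share of exactly matching sentences is an assumption.

```python
import unicodedata

# Base letters that can carry diacritics: the seven vowels and rho.
MUTABLE_BASES = set("αεηιουωρ")


def base_letter(ch: str) -> str:
    """Strip combining marks and lowercase (case folding is an assumption)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def mutable_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Character accuracy restricted to positions whose gold base is mutable."""
    correct = total = 0
    for pred, ref in zip(predicted, gold):
        pred = unicodedata.normalize("NFC", pred)
        ref = unicodedata.normalize("NFC", ref)
        for p, r in zip(pred, ref):
            if base_letter(r) in MUTABLE_BASES:
                total += 1
                correct += p == r
    return correct / total if total else 0.0


def sentence_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Share of sentences restored with no errors at all."""
    pairs = list(zip(predicted, gold))
    exact = sum(
        unicodedata.normalize("NFC", p) == unicodedata.normalize("NFC", g)
        for p, g in pairs
    )
    return exact / len(pairs) if pairs else 0.0


gold = ["ἐν ἀρχῇ ἦν ὁ λόγος"]
pred = ["ἐν ἀρχῇ ἦν ὁ λογος"]  # one missing accent
print(mutable_accuracy(pred, gold), sentence_accuracy(pred, gold))
```

A single missed accent leaves mutable accuracy high but drops the sentence to zero, which is consistent with the gap between the two scores in the table.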
Per-Character Accuracy
| Base character | Accuracy | Count | Top confusion |
|---|---|---|---|
| ρ | 99.93% | 8,045 | ῤ↔ῥ |
| ο | 98.06% | 19,543 | ό↔ο |
| ε | 97.76% | 20,141 | έ↔ε |
| α | 96.50% | 23,190 | ά↔α, ἁ↔ἀ |
| υ | 96.18% | 10,909 | ύ↔υ, ῦ↔υ |
| ι | 94.53% | 19,404 | ι↔ί, ι↔ῖ |
| η | 93.75% | 8,373 | η↔ὴ, η↔ή |
| ω | 92.17% | 6,524 | ῶ↔ω |
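As a consistency check, the rows of the table above can be combined into a count-weighted average. The sketch below uses only the figures from the table; the variable names are illustrative.

```python
# (base character, accuracy in percent, count) copied from the table above.
rows = [
    ("ρ", 99.93, 8045),
    ("ο", 98.06, 19543),
    ("ε", 97.76, 20141),
    ("α", 96.50, 23190),
    ("υ", 96.18, 10909),
    ("ι", 94.53, 19404),
    ("η", 93.75, 8373),
    ("ω", 92.17, 6524),
]

total_count = sum(count for _, _, count in rows)
weighted_accuracy = sum(acc * count for _, acc, count in rows) / total_count

print(total_count)
print(round(weighted_accuracy, 2))
```

The counts sum to the 116,129 mutable characters in the test set, and the weighted average agrees with the reported 96.42% mutable accuracy to two decimal places.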
Limitations
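- Accent placement is the main error source. Deciding whether a vowel carries an accent requires morphological knowledge (verb forms, enclitics, proclitics) that character context alone often cannot supply. For example, unaccented τις may be the enclitic indefinite τις or the interrogative τίς.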
- Breathing marks (smooth vs. rough) account for systematic but less frequent errors.
- Mid-length words (5-8 characters) are hardest: enough mutable positions for ambiguity, but insufficient length for morphological suffixes to disambiguate.
Training
Fine-tuned for 30 epochs on a single L40S GPU using AdamW (lr=2e-5) with a cosine schedule and warmup. The best checkpoint was selected by mutable accuracy on the dev set.
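For readers unfamiliar with the schedule, the following is a minimal sketch of a cosine learning-rate schedule with linear warmup. Only the peak rate of 2e-5 comes from the description above. The warmup length, the step counts, and the decay to zero are assumptions made for illustration, and `lr_at_step` is a hypothetical helper rather than part of the training code.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float = 2e-5) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Illustrative numbers only: 1,000 total steps with 100 warmup steps.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000, warmup_steps=100))
```

The rate rises linearly during warmup, peaks at 2e-5, and then falls along a half cosine, so most of the fine-tuning updates happen at a moderate rate rather than at the peak.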
Authors
- Patrick J. Burns (@diyclassics)
- Virgilia Antonucci
Citation
@software{burns_latincy_diacritics,
author = {Burns, Patrick J. and Antonucci, Virgilia},
title = {LatinCy Diacritics},
url = {https://github.com/diyclassics/latincy-diacritics},
license = {MIT}
}