---
language:
- grc
- el
- en
license: cc-by-4.0
library_name: transformers
tags:
- word-alignment
- ancient-greek
- modern-greek
- simalign
- xlm-roberta
- homer
- iliad
- digital-humanities
base_model: UGARIT/grc-alignment
pipeline_tag: feature-extraction
---
# Dragoman: Multilingual Greek Word Alignment
Multilingual word alignment model for Ancient Greek (grc), Modern Greek (el), and English (en). Fine-tuned from UGARIT/grc-alignment on Iliad parallel text with contrastive alignment training.
Dragoman extends UGARIT's AG-EN alignment to also handle AG-MG (Ancient to Modern Greek) alignment, with significant improvements on both axes. The model is designed for the Iliad Parallel Reader but generalizes to any AG-EN or AG-MG parallel text.
## Model Details
- Base model: UGARIT/grc-alignment (XLM-RoBERTa, 278M params)
- Fine-tuning: Contrastive alignment loss (InfoNCE) on layer 8 embeddings
- Lemma head: ~200K param disambiguation module trained alongside alignment
- Training data: ~104K pairs from Iliad silver standard, Cunliffe + LSJ lexicons, Perseus prose, NT (OpenGNT), UGARIT gold, AG-MG gold
- Training time: ~4-5 hours on RTX 2080 Ti
- License: CC-BY-4.0 (following UGARIT)
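The contrastive objective can be sketched as an InfoNCE loss over aligned word pairs: each gold or silver link treats the linked target token as the positive and all other target tokens in the sentence as negatives. The sketch below is a minimal NumPy illustration of that idea; the actual training code, batching, and temperature value are assumptions, not the released implementation.

```python
import numpy as np

def info_nce_loss(src_emb, tgt_emb, pairs, temperature=0.07):
    """InfoNCE-style contrastive loss over aligned word pairs.

    src_emb: (S, d) source-side layer-8 token embeddings
    tgt_emb: (T, d) target-side layer-8 token embeddings
    pairs:   list of (i, j) alignment links treated as positives
    """
    # Cosine similarity between every source and target token
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T / temperature  # (S, T) scaled similarities

    losses = []
    for i, j in pairs:
        # Cross-entropy of the softmax over target tokens, with the
        # aligned token j as the correct class; other tokens act as
        # in-sentence negatives.
        log_probs = sim[i] - np.log(np.exp(sim[i]).sum())
        losses.append(-log_probs[j])
    return float(np.mean(losses))
```

Minimizing this pulls aligned words together in embedding space at layer 8, which is what simalign's similarity-based matching then exploits at inference time.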
## Intended Use
Word-level alignment between:
- Ancient Greek ↔ English translations
- Ancient Greek ↔ Modern Greek translations (including katharevousa)
Primary use case: parallel text readers, digital humanities tools, and Greek NLP pipelines. The model produces word-level alignment pairs via simalign's itermax algorithm.
## Alignment Results

### AG-EN
| Model | Precision | Recall | F1 |
|---|---|---|---|
| UGARIT/grc-alignment (base) | 0.586 | 0.723 | 0.648 |
| Dragoman + post-processing | 0.708 | 0.908 | 0.796 |
### AG-MG
| Model | Precision | Recall | F1 |
|---|---|---|---|
| UGARIT/grc-alignment (base) | 0.492 | 0.812 | 0.613 |
| Dragoman + post-processing | 0.726 | 0.949 | 0.823 |
Evaluated on AG-MG gold standard (Iliad book 2, held back from training).
### AER (vs UGARIT gold standard)
Raw model output (no post-processing), scored with standard AER (Och & Ney 2003):
| Model | Iliad (148 sent) | Prose (126 sent) | All (274 sent) |
|---|---|---|---|
| UGARIT base | 19.47% | 24.71% | 22.59% |
| Dragoman | 20.39% | 26.92% | 24.27% |
Prose portion (Plato's Crito, Xenophon) held out from training.
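For reference, the standard AER of Och & Ney (2003) compares a hypothesis alignment against sure and possible gold links:

```python
def aer(hypothesis, sure, possible):
    """Alignment Error Rate (Och & Ney 2003).

    hypothesis: set of predicted (i, j) links A
    sure:       set of sure gold links S
    possible:   set of possible gold links P (a superset of S)

    AER = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    """
    a, s, p = set(hypothesis), set(sure), set(possible)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

Lower is better; a hypothesis that reproduces all sure links and adds only possible ones scores 0.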
### Latest Pipeline Output
| Axis | Alignments |
|---|---|
| AG-EN (post-processed) | 128,695 |
| AG-MG (post-processed) | 94,653 |
## Training Data
| Source | Pairs | Description |
|---|---|---|
| EN silver standard | ~15,700 | Passage-level AG-EN pairs from model output + matrix fallback |
| MG silver standard | ~15,700 | AG-MG pairs from base model + 7 post-processing heuristics |
| Cunliffe lexicon | ~24,100 | AG lemma + EN gloss pairs from Cunliffe's Homeric dictionary |
| Perseus prose | 20,000 | Sentence-aligned Plato, Xenophon, Herodotus, etc. via ancient-greek-datasets |
| NT (OpenGNT) | 26,600 | Clause-aligned Greek NT + English via OpenGNT |
| UGARIT gold | 148 | Hand-annotated AG-EN, Iliad portion only (Palladino et al. 2023) |
| AG-MG gold | 1,720 | AG-MG pairs from Iliad books 1, 6, 18 (book 2 held for eval) |
Training uses all sources simultaneously with contrastive loss. The model learns cross-lingual representations where aligned words (in any language pair) have similar embeddings at layer 8.
## Post-Processing Pipeline
Raw model output is enhanced by a multi-pass post-processing pipeline. These heuristics are applied at inference time, not during training.
### AG-MG (7 passes)
- Deduplication — one AG word per MG word, prefer same-line
- Identical/proper noun matching — surface form or 4+ char prefix match
- Lemma matching — align words sharing any candidate lemma via Dilemma verbose mode
- Proper noun misalignment fix — reassign MG proper nouns to correct AG proper nouns
- Cross-boundary recovery — match unaligned words on adjacent lines
- Wiktionary AG→MG pull — AG lemma Greek glosses match unaligned MG words
- Wiktionary MG→AG pull — MG word looked up in Wiktionary, Greek glosses matched against AG (with katharevousa fallback to AG Wiktionary)
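As an illustration of the first pass, a deduplication step keeping one AG word per MG word and preferring same-line links might look like the sketch below. The function name and interface are hypothetical; the pipeline's actual pass differs in detail.

```python
def deduplicate(pairs, ag_lines, mg_lines):
    """Keep at most one AG word per MG word, preferring same-line links.

    pairs:    list of (ag_idx, mg_idx) alignment links
    ag_lines: mapping ag_idx -> line number of that AG word
    mg_lines: mapping mg_idx -> line number of that MG word
    """
    best = {}
    for ag, mg in pairs:
        same_line = ag_lines[ag] == mg_lines[mg]
        # A same-line link displaces an earlier cross-line link;
        # otherwise the first link seen for this MG word wins.
        if mg not in best or (same_line and not best[mg][1]):
            best[mg] = (ag, same_line)
    return sorted((ag, mg) for mg, (ag, _) in best.items())
```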
### AG-EN (14 passes + scored matrix)
Uses a scored AlignmentMatrix where each AG word has ranked EN candidates. Heuristic-added pairs score 3.0, raw model pairs 1.0. Displaced model pairs are suppressed but recover as fallbacks when both sides are unaligned after all passes. Three dictionary sources (Cunliffe, LSJ, Wiktionary) provide gloss-based signal at different score levels, with shared suffix stripping and hyphen-part matching.
- Patronymic expansion (Ἀτρεΐδης → "son of Atreus")
- Compound verb pull (προιάπτω → "sent forth")
- Case-preposition pull (genitive → "of")
- Bookending (fill gaps between aligned endpoints)
- Epithet/vocabulary pull (15 Murray-specific translations)
- Cunliffe lexicon pull (line-cited definitions)
- Cunliffe short gloss pull
- LSJ line-cited pull (8,227 Iliad lines indexed from LSJLogeion)
- LSJ short gloss pull (117K entries)
- Wiktionary EN gloss pull (first 5 glosses, skips paradigm noise)
- Multiword expression pull
- Same-word propagation
- Cross-boundary recovery
- Matrix fallback recovery (suppressed pairs where both sides unaligned)
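The scored-matrix mechanics described above can be sketched as follows. The class name and methods are illustrative, not the pipeline's actual API; only the scoring scheme (heuristic pairs at 3.0, raw model pairs at 1.0, displaced pairs kept as suppressed fallbacks) is taken from the description.

```python
from collections import defaultdict

class AlignmentMatrix:
    """Minimal sketch of a scored alignment matrix: each AG index maps
    to scored EN candidates; heuristic pairs outrank raw model pairs,
    which are retained as suppressed fallbacks when displaced."""

    MODEL_SCORE = 1.0
    HEURISTIC_SCORE = 3.0

    def __init__(self):
        self.candidates = defaultdict(dict)  # ag -> {en: score}
        self.suppressed = []                 # displaced model pairs

    def add(self, ag, en, score):
        self.candidates[ag][en] = max(score, self.candidates[ag].get(en, 0.0))

    def add_heuristic(self, ag, en):
        # Displace any raw model pair for this AG word; keep it as a
        # fallback candidate rather than discarding it outright.
        for other, score in list(self.candidates[ag].items()):
            if other != en and score == self.MODEL_SCORE:
                self.suppressed.append((ag, other))
                del self.candidates[ag][other]
        self.add(ag, en, self.HEURISTIC_SCORE)

    def best_pairs(self):
        # Highest-scoring EN candidate per AG word
        return sorted(
            (ag, max(cands, key=cands.get))
            for ag, cands in self.candidates.items() if cands
        )
```

A final recovery pass would then re-admit entries from `suppressed` whenever both the AG and EN side remain unaligned after all fourteen passes.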
## Lemmatization Integration
Dragoman integrates with Dilemma, a Greek lemmatizer covering Modern Greek, Ancient Greek, and Medieval Greek with a 5.2M form lookup table. Dilemma's verbose mode returns multiple candidate lemmas with metadata (language, proper noun status), which the post-processing pipeline uses for cross-matching.
The model includes a lemma disambiguation head (lemma_head.pt) — a small
(~200K param) module trained alongside the alignment loss on 52,978 ambiguous
targets from the Perseus treebank. It uses Dragoman's contextual embeddings
to pick among Dilemma's candidates for ambiguous forms like ἔρις (strife)
vs Ἔρις (the goddess).
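Conceptually, the disambiguation step selects the candidate lemma that best fits the token's contextual embedding. The cosine-similarity function below is a rough stand-in for the trained ~200K-parameter head; the interface and the idea of per-lemma reference embeddings are assumptions for illustration.

```python
import numpy as np

def pick_lemma(token_emb, candidates, lemma_embs):
    """Pick the candidate lemma whose reference embedding is closest
    (by cosine similarity) to the token's contextual embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(candidates, key=lambda lem: cos(token_emb, lemma_embs[lem]))
```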
### Lemmatization Results (Dilemma, standalone)
| Benchmark | Tokens | Accuracy |
|---|---|---|
| DiGreC treebank (context heuristics) | 118,894 | 83.9% |
| Iliad treebank (context + resolve_articles) | 112,653 | 82.9% |
## Usage

### Word alignment with simalign
```python
from simalign import SentenceAligner

aligner = SentenceAligner(
    model="ciscoriordan/dragoman",
    token_type="bpe",
    matching_methods="i",  # itermax
    device="cuda",
    layer=8,
)

# AG-EN alignment
ag = ["μῆνιν", "ἄειδε", "θεὰ", "Πηληϊάδεω", "Ἀχιλῆος"]
en = ["Sing", "O", "goddess", "the", "wrath", "of", "Achilles"]
result = aligner.get_word_aligns(ag, en)
print(result["itermax"])
# [(0, 4), (1, 0), (2, 2), (4, 6)]
# μῆνιν-wrath, ἄειδε-Sing, θεὰ-goddess, Ἀχιλῆος-Achilles

# AG-MG alignment
mg = ["Ψάλλε", "θεά", "τον", "τρομερό", "θυμό", "του", "Αχιλλέα"]
result = aligner.get_word_aligns(ag, mg)
print(result["itermax"])
# [(0, 4), (1, 0), (2, 1), (4, 6)]
```
### With post-processing (full pipeline)
```bash
# See the iliad-align repository for the full pipeline:
# https://github.com/ciscoriordan/iliad-align

python run_alignments.py --axis mg --threshold 0.7
python align_words.py 1 24
```
## Limitations
- Optimized for Homer: trained exclusively on Iliad text. Performance on other AG texts (Plato, Thucydides, NT Greek) is untested.
- Post-processing is Iliad-specific: patronymic pull, epithet tables, and Cunliffe/LSJ lexicons are Homer-specific heuristics.
- Raw model quality: without post-processing, alignment F1 is lower (~0.65 for EN, ~0.61 for MG with base model). The post-processing pipeline contributes significantly to the final numbers.
- MG text: tested only on the Polylas translation (1875, literary katharevousa/demotic). Modern prose translations may differ.
## Related Work
- UGARIT/grc-alignment (Palladino et al. 2023): Base model for AG-EN word alignment. Dragoman extends this with AG-MG support, additional training data, and post-processing.
- Dilemma (Riordan 2026): Greek lemmatizer with 5.2M form lookup + character transformer. Provides lemma matching for alignment post-processing.
- simalign (Jalili Sabet et al. 2020): Word alignment tool using contextual embeddings. Dragoman uses simalign's itermax algorithm.
- GreTa/PhilTa (Celano 2025): State-of-the-art AG morphosyntactic parsing. Context-aware lemmatization at 95.6% F1, but AG-only and requires dedicated inference.
## Citation
```bibtex
@misc{dragoman2026,
  title={Dragoman: Multilingual Word Alignment for Ancient and Modern Greek},
  author={Riordan, Francisco},
  year={2026},
  url={https://huggingface.co/ciscoriordan/dragoman}
}
```
## Acknowledgments
- Training data from the Iliad Parallel Reader project
- Base model from UGARIT (Palladino et al.)
- Lexicon data from Cunliffe's Lexicon of the Homeric Dialect, Liddell-Scott-Jones via LSJLogeion, and Wiktionary via kaikki.org
- Ancient Greek treebank data from the Perseus Digital Library
- Modern Greek translation by Iakovos Polylas (1875), revised by Francisco Riordan
- English translation by A.T. Murray (1924), revised by Francisco Riordan