Dragoman: Diachronic Greek Word Alignment

Diachronic word alignment model for Ancient Greek (grc), Modern Greek (el), and English (en). Fine-tuned from UGARIT/grc-alignment on Iliad parallel text with contrastive alignment training.

Dragoman extends UGARIT's AG-EN alignment to also handle AG-MG (Ancient to Modern Greek) alignment, with significant improvements on both axes. The model is designed for Iliad Aligned but generalizes to any AG-EN or AG-MG parallel text.

Model Details

Base model: UGARIT/grc-alignment (XLM-RoBERTa, 278M params)
Fine-tuning: Contrastive alignment loss (InfoNCE) on layer 8 embeddings
Lemma head: ~200K param disambiguation module trained alongside alignment
Training data: ~120K pairs from Iliad silver standard, Cunliffe + LSJ lexicons, Perseus prose, NT (OpenGNT), UGARIT gold, AG-MG gold, Attic drama (15.8K theatre pairs)
Training time: ~20 min on RTX 2080 Ti (training only; full pipeline including inference is ~2 hours)
License: CC-BY-4.0 (following UGARIT)

Intended Use

Word-level alignment between:

Ancient Greek ↔ English translations
Ancient Greek ↔ Modern Greek translations (including Katharevousa)

Primary use case: parallel text readers, digital humanities tools, and Greek NLP pipelines. The model produces word-level alignment pairs via simalign's argmax algorithm.

Alignment Results

AG-EN

Model	Precision	Recall	F1
UGARIT/grc-alignment (base)	0.586	0.723	0.648
Dragoman + post-processing	0.708	0.908	0.796

AG-MG

Model	Precision	Recall	F1
UGARIT/grc-alignment (base, t=0.0)	0.384	0.634	0.478
Dragoman v2 (Homer only, t=0.81)	0.480	0.823	0.606
Dragoman v3 + post-processing (Homer + drama)	0.680	0.948	0.792

v3 evaluated on AG-MG drama gold standard (83 lines across 9 Attic plays, held back from training). Homer gold (Iliad book 2): F1 0.549 (precision 0.387, recall 0.949).

AER (vs UGARIT gold standard)

Raw model output (no post-processing), standard AER (Och & Ney 2003):

	Iliad (148 sent)	Prose (126 sent)	All (274 sent)
UGARIT base	19.47%	24.71%	22.59%
Dragoman	20.39%	26.92%	24.27%

Prose portion (Plato's Crito, Xenophon) held out from training.

Latest Pipeline Output

Axis	Alignments
AG-EN (post-processed)	118,681
AG-MG (post-processed)	107,223

Training Data

Source	Pairs	Description
EN silver standard	~15,700	Passage-level AG-EN pairs from model output + matrix fallback
MG silver standard	~15,700	AG-MG pairs from base model + 7 post-processing heuristics
Cunliffe lexicon	~24,100	AG lemma + EN gloss pairs from Cunliffe's Homeric dictionary
Perseus prose	20,000	Sentence-aligned Plato, Xenophon, Herodotus, etc. via ancient-greek-datasets
NT (OpenGNT)	26,600	Clause-aligned Greek NT + English via OpenGNT
UGARIT gold	148	Hand-annotated AG-EN, Iliad portion only (Palladino et al. 2023)
AG-MG gold	1,720	AG-MG pairs from Iliad books 1, 6, 18 (book 2 held for eval)
Attic drama	15,800	Word-aligned AG-MG pairs from Sophocles, Euripides, Aristophanes

Training uses all sources simultaneously with contrastive loss. The model learns cross-lingual representations where aligned words (in any language pair) have similar embeddings at layer 8.

Post-Processing Pipeline

Raw model output is enhanced by a multi-pass post-processing pipeline. These heuristics are applied at inference time, not during training.

AG-MG (15 passes)

Deduplication - one AG word per MG word, prefer same-line
Identical/proper noun matching - surface form or 4+ char prefix match
Lemma matching - align words sharing any candidate lemma via Dilemma verbose mode
Cross-line lemma fix - correct misalignments where lemma match exists on a closer line
Reverse cross-line fix - fix MG words pulled to wrong AG line
Proper noun misalignment fix - reassign MG proper nouns to correct AG proper nouns
Cross-boundary recovery - match unaligned words on adjacent lines
Wiktionary AG→MG pull - AG lemma Greek glosses match unaligned MG words
Wiktionary MG→AG pull - MG word looked up in Wiktionary, Greek glosses matched against AG (with Katharevousa fallback)
LSJ Greek glosses pull - AG lemma looked up in lsj.gr Greek-to-Greek definitions
English bridge pull - AG and MG words sharing an EN gloss via Cunliffe/Wiktionary
Cognate misalignment fix - correct pairs where a better cognate match exists nearby
Cognate stem matching - Greek sound-change normalization (β→v, θ→th, αι→ε, etc.) with compound prefix stripping
Epithet equivalences - Homeric epithet/formula matching (e.g. πόδας ὠκύς → γοργοπόδαρος)
Dependency propagation - extend alignments along MG dependency arcs

AG-EN (14 passes + scored matrix)

Uses a scored AlignmentMatrix where each AG word has ranked EN candidates. Heuristic-added pairs score 3.0, raw model pairs 1.0. Displaced model pairs are suppressed but recover as fallbacks when both sides are unaligned after all passes. Three dictionary sources (Cunliffe, LSJ, Wiktionary) provide gloss-based signal at different score levels, with shared suffix stripping and hyphen-part matching.

PhilBerta ensemble - agreement boost (1.5) where Dragoman and PhilBerta agree; PROPN-guarded fallback (0.3) for unaligned AG words
Patronymic expansion (Ἀτρεΐδης → "son of Atreus")
Compound verb pull (προιάπτω → "sent forth")
Case-preposition pull (genitive → "of")
Bookending (fill gaps between aligned endpoints)
Epithet/vocabulary pull (15 Murray-specific translations)
Cunliffe lexicon pull (line-cited definitions)
Cunliffe short gloss pull
LSJ line-cited pull (8,227 Iliad lines indexed from lsj9)
LSJ short gloss pull (117K entries)
Wiktionary EN gloss pull (first 5 glosses, skips paradigm noise)
Multiword expression pull
Same-word propagation
Cross-boundary recovery
Matrix fallback recovery (suppressed pairs where both sides unaligned)

Lemmatization Integration

Dragoman integrates with Dilemma, a Greek lemmatizer covering Modern Greek, Ancient Greek, and Medieval Greek with a 12.5M form lookup table, plus a GPU-optimized POS tagger and dependency parser (96.8% UPOS on AG) available via dilemma[tagger]. Dilemma's verbose mode returns multiple candidate lemmas with metadata (language, proper noun status), which the post-processing pipeline uses for cross-matching.

The alignment pipeline passes the preceding token as prev_word to Dilemma's lemmatize_verbose(), enabling article-agreement disambiguation. When a Greek article (e.g. τῆς, τόν, τῶν) precedes an ambiguous form, Dilemma boosts candidates whose gender/number matches the article. This helps distinguish cases like ἔρις (strife) vs Ἔρις (the goddess Strife) based on the syntactic context. Dilemma also supports a dialect parameter (e.g. dialect="ionic") for Ionic/Epic forms common in Homer, though the pipeline currently uses the default lookup which already covers most Homeric vocabulary via the AG treebank.

The model includes a lemma disambiguation head (lemma_head.pt) - a small (~200K param) module trained alongside the alignment loss on 52,978 ambiguous targets from the Perseus treebank. It uses Dragoman's contextual embeddings to pick among Dilemma's candidates for ambiguous forms like ἔρις (strife) vs Ἔρις (the goddess).

Lemmatization Results (Dilemma, standalone)

Benchmark	Tokens	Accuracy
AG Classical (Sextus Empiricus)	357	99.7%
Byzantine (DBBE gold standard)	8,342	92.7%
Katharevousa (Sathas)	318	95.6%
Demotic MG (triantafyllidis convention)	400	96.0%
DiGreC treebank (equiv-adjusted)	118,894	93.7%

Usage

Word alignment with simalign

from simalign import SentenceAligner

aligner = SentenceAligner(
    model="ciscoriordan/dragoman",
    token_type="bpe",
    matching_methods="a",  # argmax
    device="cuda",
    layer=8,
)

# AG-EN alignment
ag = ["μῆνιν", "ἄειδε", "θεὰ", "Πηληϊάδεω", "Ἀχιλῆος"]
en = ["Sing", "O", "goddess", "the", "wrath", "of", "Achilles"]
result = aligner.get_word_aligns(ag, en)
print(result["inter"])
# [(0, 4), (1, 0), (2, 2), (4, 6)]
# μῆνιν-wrath, ἄειδε-Sing, θεὰ-goddess, Ἀχιλῆος-Achilles

# AG-MG alignment
mg = ["Ψάλλε", "θεά", "τον", "τρομερό", "θυμό", "του", "Αχιλλέα"]
result = aligner.get_word_aligns(ag, mg)
print(result["inter"])
# [(0, 4), (1, 0), (2, 1), (4, 6)]

With post-processing (full pipeline)

# See iliad-align repository for the full pipeline:
# https://github.com/ciscoriordan/iliad-align
#
# python run_alignments.py --axis mg --threshold 0.7
# python align_words.py 1 24

Limitations

Optimized for Homeric and Attic Greek: trained on Iliad text plus Attic drama (Sophocles, Euripides, Aristophanes). Performance on other AG texts (Plato, Thucydides, NT Greek) is untested.
Post-processing is Iliad-specific: patronymic pull, epithet tables, and Cunliffe/LSJ lexicons are Homer-specific heuristics.
Raw model quality: without post-processing, alignment F1 is lower (~0.65 for EN, ~0.61 for MG with base model). The post-processing pipeline contributes significantly to the final numbers.
MG text: tested only on the Polylas translation (1875, literary Katharevousa/demotic). Modern prose translations may differ.

Related Work

UGARIT/grc-alignment (Palladino et al. 2023): Base model for AG-EN word alignment. Dragoman extends this with AG-MG support, additional training data, and post-processing.
Dilemma (Riordan 2026): Greek lemmatizer with 12.5M form lookup + character transformer. Provides lemma matching for alignment post-processing.
simalign (Jalili Sabet et al. 2020): Word alignment tool using contextual embeddings. Dragoman uses simalign's argmax algorithm.
Dilemma tagger (Riordan 2026, dilemma[tagger]): GPU-optimized Greek POS tagger + dependency parser. 96.8% UPOS on AG (Perseus + PROIEL + Gorman), 90.6% on MG. Used for syntactic bonding in redistribution and MG morphological tagging.
GreTa/PhilTa (Celano 2025): State-of-the-art AG morphosyntactic parsing. Context-aware lemmatization at 95.6% F1, but AG-only and requires dedicated inference.

Testing

The repo includes a test suite that validates model card metadata, config, tokenizer, weights, and (optionally) end-to-end alignment inference.

# Fast tests only (no model loading, ~3s)
python -m pytest tests/ -x -v

# All tests including model loading and inference (~10s)
python -m pytest tests/ -x -v --slow

Requires: pytest, tokenizers, torch, safetensors. The slow tests additionally need transformers and simalign.

Citation

@misc{dragoman2026,
  title={Dragoman: Diachronic Word Alignment for Ancient and Modern Greek},
  author={Riordan, Francisco},
  year={2026},
  url={https://huggingface.co/ciscoriordan/dragoman}
}

Acknowledgments

Training data from the Iliad Aligned project
Base model from UGARIT (Palladino et al.)
Lexicon data from Cunliffe's Lexicon of the Homeric Dialect, Liddell-Scott-Jones via lsj9, and Wiktionary via kaikki.org
Ancient Greek treebank data from the Perseus Digital Library
Modern Greek translation by Iakovos Polylas (1875), revised by Francisco Riordan
English translation by A.T. Murray (1924), revised by Francisco Riordan

Downloads last month: 32

Safetensors

Model size

0.3B params

Tensor type

F32

Model tree for ciscoriordan/dragoman

Base model

UGARIT/grc-alignment

Finetuned

(6)

this model