# Latin Lemmatizer (Flair)
This model is a specialized Sequence-to-Sequence (Seq2Seq) Lemmatizer for Latin. Unlike simple lookup-based lemmatizers, this model uses an encoder-decoder architecture with attention to "translate" inflected Latin word forms into their dictionary headwords (lemmas), making it highly effective for the complex morphology of medieval texts.
The model was developed as part of the projects "Embedding the Past" (LOEWE-Exploration, TU Darmstadt) and "Burchards Dekret Digital" (Langzeitvorhaben, Akademie der Wissenschaften und der Literatur | Mainz).
## Technical Details
- Architecture: Seq2Seq Lemmatizer (RNN-based encoder-decoder with attention as implemented by Flair Lemmatizer).
- Hidden Size: 2048 (4 layers).
- Base Embeddings: Stacked Latin Legal Forward and Backward contextual string embeddings.
- Data Source: ~1.59M sentences from medieval texts.
- Beam Size: 1.
## Data Source and Acknowledgements
The training data originates from the Latin Text Archive (LTA) (Prof. Dr. Bernhard Jussen, Dr. Tim Geelhaar), including data from the Monumenta Germaniae Historica, Corpus Corporum, and the IRHT. We gratefully acknowledge these contributions.
## Evaluation
Token-level exact-match lemma accuracy on the held-out test split is 95.93% (~4.6M tokens; 199,037 sentences).
This score is computed with flair.models.Lemmatizer.evaluate() and corresponds to the proportion of tokens where the predicted lemma string exactly matches the gold lemma string.
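As a sketch, token-level exact-match accuracy is simply the fraction of tokens whose predicted lemma string equals the gold lemma string. The helper and toy data below are illustrative, not the actual evaluation code:

```python
def exact_match_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of tokens whose predicted lemma exactly matches the gold lemma."""
    pairs = list(zip(predicted, gold))
    correct = sum(p == g for p, g in pairs)
    return correct / len(pairs)

# Toy example: comparison is a strict string match, so case matters.
pred = ["sum", "verbum", "et", "deus"]
gold = ["sum", "verbum", "et", "Deus"]
print(exact_match_accuracy(pred, gold))  # 0.75
```

Note that because the match is an exact string comparison, casing and normalization conventions in the gold data directly affect the score.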
### Important notes regarding evaluation
- Micro-F1 vs. accuracy: Flair's evaluation report shows micro-/macro-F1. For lemmatization, these are derived by treating each unique lemma string as a "class". In this setting, micro-F1 ≈ accuracy by construction, while macro-F1 is typically much lower due to the extremely long-tailed lemma inventory and is not the primary lemmatization metric.
- Filtered tokens: Tokens with missing/invalid gold lemmas (e.g., `None` or `<UNK>`) are excluded by the dataset loader. This can make results slightly optimistic compared to evaluating on the raw, unfiltered token stream.
- Tokenization conventions: The model is sensitive to the tokenization used during training (e.g., punctuation separated by spaces). Different tokenization may reduce accuracy.
- Decoding and length limits: Decoding is greedy (`beam_size=1`), and very long tokens may be affected by the model's maximum sequence length settings.
- Domain shift: Trained on medieval Latin. Performance may drop on texts with different orthography or lexicon (e.g., classical poetry, heavily abbreviated editions, mixed-language passages).
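To illustrate the tokenization convention mentioned above, a minimal pre-tokenizer that separates common punctuation marks with spaces might look like the following. The regex and function name are assumptions for illustration, not the actual training pipeline:

```python
import re

def pre_tokenize(text: str) -> list[str]:
    """Separate common punctuation from words, then split on whitespace."""
    spaced = re.sub(r"([.,;:!?])", r" \1 ", text)
    return spaced.split()

print(pre_tokenize("In principio erat uerbum, et uerbum erat apud Deum."))
# ['In', 'principio', 'erat', 'uerbum', ',', 'et', 'uerbum', 'erat', 'apud', 'Deum', '.']
```

Feeding the model text tokenized differently (e.g., with punctuation attached to words) may degrade accuracy relative to the reported numbers.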
## Performance Metrics
The model was evaluated on a test set of 199,037 sentences (~4.6M tokens).
| Metric | Score |
|---|---|
| Token exact-match accuracy | 95.93% |
## Usage
You can use this model with the Flair library.
See the notebook in this repository for a complete example.
## Training Configuration
- Learning Rate: 0.05
- Mini Batch Size: 768 (with AMP enabled)
- Max Epochs: 15
- Optimizer: Standard SGD (via Flair's `ModelTrainer`)
- Character Dictionary: Custom-built covering the Latin alphabet and special diplomatic characters.
## Citation
If you use this model, please cite the specific model DOI and the Flair framework:
```bibtex
@software{schonhardt_michael_2026_latin_lemma,
  author    = "Schonhardt, Michael",
  title     = "Latin Lemmatizer (Flair)",
  year      = "2026",
  publisher = "Zenodo",
  doi       = "10.5281/zenodo.18632650",
  url       = "https://huggingface.co/mschonhardt/latin-lemmatizer"
}
```
```bibtex
@inproceedings{akbik-etal-2018-contextual,
  title     = "Contextual String Embeddings for Sequence Labeling",
  author    = "Akbik, Alan and Blythe, Duncan and Vollgraf, Roland",
  booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
  year      = "2018",
  pages     = "1638--1649",
  publisher = "Association for Computational Linguistics"
}
```