# Latin Contextual POS Tagger (Flair)
This model is a Part-of-Speech (POS) tagger for Latin, specifically optimized for medieval and early modern legal texts. It uses a Bi-LSTM-CRF architecture based on domain-specific contextual string embeddings.
The model was developed as part of the projects "Embedding the Past" (LOEWE-Exploration, TU Darmstadt) and "Burchards Dekret Digital" (Langzeitvorhaben, Akademie der Wissenschaften und der Literatur | Mainz).
## Technical Details
- Architecture: Bi-LSTM + CRF Sequence Tagger.
- Hidden Size: 1024 (2 layers).
- Base Embeddings: Stacked Latin Legal Forward and Backward contextual string embeddings.
- Data Source: Corpus of ~1.59M training sentences from medieval texts.
- Accuracy: 95.88% (micro F1, which equals token-level accuracy in single-label tagging).
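The architecture above can be sketched in Flair as follows. This is a minimal sketch, not the project's actual training code: the embedding model names are hypothetical placeholders (the released model ships its own domain-specific Latin legal language models), and `tag_dictionary` must be built from a corpus.

```python
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger

# Hypothetical placeholders for the domain-specific forward/backward
# contextual string embeddings; the published model bundles its own.
embeddings = StackedEmbeddings([
    FlairEmbeddings("latin-forward"),   # assumption: placeholder LM name
    FlairEmbeddings("latin-backward"),  # assumption: placeholder LM name
])

# Bi-LSTM-CRF tagger matching the card: hidden size 1024, 2 RNN layers, CRF on top
tagger = SequenceTagger(
    hidden_size=1024,
    rnn_layers=2,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,  # e.g. corpus.make_label_dictionary("upos")
    tag_type="upos",
    use_crf=True,
)
```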
## Data Source and Acknowledgements
We gratefully acknowledge that the training data originates from the Latin Text Archive (LTA) (Prof. Dr. Bernhard Jussen, Dr. Tim Geelhaar), including data from the Monumenta Germaniae Historica, Corpus Corporum, and the IRHT.
## Performance Metrics
Results:
- F-score (micro): 0.9588
- F-score (macro): 0.9397
- Accuracy: 0.9588
By class:

| Class | Precision | Recall | F1-score | Support |
|-------|----------:|-------:|---------:|--------:|
| NOUN | 0.9444 | 0.9480 | 0.9462 | 1036164 |
| PUNCT | 0.9999 | 1.0000 | 1.0000 | 831460 |
| VERB | 0.9657 | 0.9465 | 0.9560 | 810899 |
| CCONJ | 0.9833 | 0.9920 | 0.9877 | 463354 |
| PRON | 0.9657 | 0.9631 | 0.9644 | 405738 |
| ADP | 0.9786 | 0.9886 | 0.9835 | 296947 |
| ADV | 0.9300 | 0.9264 | 0.9282 | 285781 |
| ADJ | 0.8347 | 0.8443 | 0.8395 | 273219 |
| PROPN | 0.9428 | 0.9623 | 0.9525 | 128068 |
| NUM | 0.9771 | 0.9913 | 0.9842 | 58389 |
| ORD | 0.8362 | 0.9223 | 0.8771 | 8534 |
| ITJ | 0.9088 | 0.8821 | 0.8953 | 4554 |
| PART | 0.9509 | 0.9307 | 0.9407 | 3202 |
| FM | 0.9226 | 0.8804 | 0.9010 | 2491 |
| accuracy | | | 0.9588 | 4608800 |
| macro avg | 0.9386 | 0.9413 | 0.9397 | 4608800 |
| weighted avg | 0.9589 | 0.9588 | 0.9588 | 4608800 |
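The macro average is the unweighted mean of the per-class F1 scores (so rare tags like `FM` count as much as `NOUN`), while the micro average weights every token equally and therefore equals accuracy here. The macro figure can be checked directly against the table:

```python
# Per-class F1 scores copied from the table above
f1_by_class = {
    "NOUN": 0.9462, "PUNCT": 1.0000, "VERB": 0.9560, "CCONJ": 0.9877,
    "PRON": 0.9644, "ADP": 0.9835, "ADV": 0.9282, "ADJ": 0.8395,
    "PROPN": 0.9525, "NUM": 0.9842, "ORD": 0.8771, "ITJ": 0.8953,
    "PART": 0.9407, "FM": 0.9010,
}

# Macro F1: unweighted mean over the 14 classes
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)
print(f"{macro_f1:.4f}")  # matches the reported macro F-score of 0.9397
```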
## Confusion Matrix
## Model Limitations
While the model achieves a high micro-F1 of 95.88%, users should be aware of the following:
- Adjective/Noun Distinction: Most misclassifications occur between `ADJ` and `NOUN`, due to the morphological overlap common in Latin.
- Ordinal Numbers: The `ORD` tag (87.71% F1) is occasionally confused with standard adjectives.
- Domain Specificity: The model is trained on legal and diplomatic corpora; performance may vary slightly on classical poetry or highly informal Neo-Latin.
## Usage
You can use this model directly with the Flair library.
```python
from flair.models import SequenceTagger
from flair.data import Sentence

# Load the tagger from the Hugging Face Hub
tagger = SequenceTagger.load("mschonhardt/latin-pos-tagger")

# Create a sentence and predict POS tags
sentence = Sentence("In nomine sanctae et individuae trinitatis .")
tagger.predict(sentence)

# Print each token with its predicted tag and confidence score
for token in sentence:
    tag = token.get_label("upos")  # on older Flair versions: token.get_tag("upos")
    print(f"{token.text}\t{tag.value}\t{tag.score:.4f}")
```
## Training Parameters
- Learning Rate: 0.1
- Mini Batch Size: 512
- Max Epochs: 15
- Optimizer: AnnealOnPlateau
- Hardware: a single GPU (NVIDIA Blackwell 6000 Pro)
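With these hyperparameters, a Flair training run would look roughly like the sketch below. This is a configuration sketch under stated assumptions, not the project's actual script: the corpus path, column format, and output directory are hypothetical.

```python
from flair.datasets import ColumnCorpus
from flair.trainers import ModelTrainer

# Assumption: a CoNLL-style corpus with token and UPOS columns at this path
corpus = ColumnCorpus("data/", {0: "text", 1: "upos"})

# `tagger` is a SequenceTagger built as in the Technical Details section
trainer = ModelTrainer(tagger, corpus)

# Flair's default training loop uses SGD with the AnnealOnPlateau scheduler,
# matching the parameters listed above
trainer.train(
    "resources/taggers/latin-upos",  # output directory (assumption)
    learning_rate=0.1,
    mini_batch_size=512,
    max_epochs=15,
)
```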
## Citation
If you use this model, please cite the specific model DOI and the Flair framework:
```bibtex
@software{schonhardt_michael_2026_latin_pos,
  author    = "Schonhardt, Michael",
  title     = "Latin POS Tagger (Flair)",
  year      = "2026",
  publisher = "Zenodo",
  doi       = "10.5281/zenodo.18631267",
  url       = "https://huggingface.co/mschonhardt/latin-pos-tagger"
}

@inproceedings{akbik-etal-2018-contextual,
  title     = "Contextual String Embeddings for Sequence Labeling",
  author    = "Akbik, Alan and Blythe, Duncan and Vollgraf, Roland",
  booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
  year      = "2018",
  pages     = "1638--1649",
  publisher = "Association for Computational Linguistics"
}
```
