---
language: la
library_name: flair
license: cc-by-sa-4.0
tags:
- flair
- token-classification
- sequence-tagger
- latin
- medieval-latin
- legal-history
- pos-tagging
widget:
- text: "In nomine sanctae et individuae trinitatis ."
---

# Latin Contextual POS Tagger (Flair)

This model is a Part-of-Speech (POS) tagger for Latin, specifically optimized for medieval and early modern legal texts. It uses a Bi-LSTM-CRF architecture based on domain-specific contextual string embeddings.

The model was developed as part of the projects **"Embedding the Past"** (LOEWE-Exploration, TU Darmstadt) and **"Burchards Dekret Digital"** (Langzeitvorhaben, Akademie der Wissenschaften und der Literatur | Mainz).

## Technical Details

- **Architecture:** Bi-LSTM + CRF Sequence Tagger.
- **Hidden Size:** 1024 (2 layers).
- **Base Embeddings:** Stacked [Latin Legal Forward](https://huggingface.co/mschonhardt/latin-legal-forward) and [Backward](https://huggingface.co/mschonhardt/latin-legal-backward) contextual string embeddings.
- **Data Source:** Corpus of ~1.59M training sentences from medieval texts.
- **Accuracy:** 95.88% (Micro F1-score / Accuracy).

## Data Source and Acknowledgements

We gratefully acknowledge that the training data originates from the **[Latin Text Archive (LTA)](http://lta.bbaw.de)** (**Prof. Dr. Bernhard Jussen**, **Dr. Tim Geelhaar**) including data from Monumenta Germaniae Historica, Corpus Corporum and IRHT.

## Performance Metrics

Results:

- F-score (micro): 0.9588
- F-score (macro): 0.9397
- Accuracy: 0.9588

By class:

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| NOUN | 0.9444 | 0.9480 | 0.9462 | 1036164 |
| PUNCT | 0.9999 | 1.0000 | 1.0000 | 831460 |
| VERB | 0.9657 | 0.9465 | 0.9560 | 810899 |
| CCONJ | 0.9833 | 0.9920 | 0.9877 | 463354 |
| PRON | 0.9657 | 0.9631 | 0.9644 | 405738 |
| ADP | 0.9786 | 0.9886 | 0.9835 | 296947 |
| ADV | 0.9300 | 0.9264 | 0.9282 | 285781 |
| ADJ | 0.8347 | 0.8443 | 0.8395 | 273219 |
| PROPN | 0.9428 | 0.9623 | 0.9525 | 128068 |
| NUM | 0.9771 | 0.9913 | 0.9842 | 58389 |
| ORD | 0.8362 | 0.9223 | 0.8771 | 8534 |
| ITJ | 0.9088 | 0.8821 | 0.8953 | 4554 |
| PART | 0.9509 | 0.9307 | 0.9407 | 3202 |
| FM | 0.9226 | 0.8804 | 0.9010 | 2491 |
| accuracy | | | 0.9588 | 4608800 |
| macro avg | 0.9386 | 0.9413 | 0.9397 | 4608800 |
| weighted avg | 0.9589 | 0.9588 | 0.9588 | 4608800 |

### Confusion Matrix

![Confusion Matrix](confusion_matrix.png)

### Model Limitations

While the model achieves a high micro-F1 of 95.88%, users should be aware of the following:

* **Adjective/Noun Distinction:** Most misclassifications occur between `ADJ` and `NOUN` due to the morphological overlap common in Latin.
* **Ordinal Numbers:** The `ORD` tag (87.71% F1) is occasionally confused with standard adjectives.
* **Domain Specificity:** The model is trained on legal and diplomatic corpora; performance may vary slightly on classical poetry or highly informal neo-Latin.

## Usage

You can use this model directly with the [Flair](https://github.com/flairNLP/flair) library.

```python
from flair.models import SequenceTagger
from flair.data import Sentence

# Load the tagger from the Hugging Face Hub
tagger = SequenceTagger.load("mschonhardt/latin-pos-tagger")

# Tag the sentence in place
sentence = Sentence("In nomine sanctae et individuae trinitatis .")
tagger.predict(sentence)

# Print each token with its predicted tag and confidence score
for token in sentence:
    tag = token.get_tag("upos")
    print(f"{token.text}\t{tag.value}\t{tag.score:.4f}")
```

## Training Parameters

* Learning Rate: 0.1
* Mini Batch Size: 512
* Max Epochs: 15
* Learning Rate Scheduler: AnnealOnPlateau
* Trained on a single GPU.
* Device: NVIDIA Blackwell 6000 Pro

## Citation

If you use this model, please cite the specific model DOI and the Flair framework:

```bibtex
@software{schonhardt_michael_2026_latin_pos,
  author = "Schonhardt, Michael",
  title = "Latin POS Tagger (Flair)",
  year = "2026",
  publisher = "Zenodo",
  doi = "10.5281/zenodo.18631267",
  url = "https://huggingface.co/mschonhardt/latin-pos-tagger"
}
```

```bibtex
@inproceedings{akbik-etal-2018-contextual,
  title = "Contextual String Embeddings for Sequence Labeling",
  author = "Akbik, Alan and Blythe, Duncan and Vollgraf, Roland",
  booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
  year = "2018",
  pages = "1638--1649",
  publisher = "Association for Computational Linguistics"
}
```
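
## Recomputing the Aggregate Scores

The aggregate figures under Performance Metrics can be re-derived from the per-class rows, which may help when comparing this model against other taggers. The following is a minimal sketch in plain Python, not part of the model or of Flair: the F1 and support values are hand-copied from the "By class" table, and the helper names `macro_f1` and `weighted_f1` are illustrative assumptions.

```python
# Per-class (F1, support), hand-copied from the "By class" table
# in the Performance Metrics section of this model card.
PER_CLASS = {
    "NOUN": (0.9462, 1036164),
    "PUNCT": (1.0000, 831460),
    "VERB": (0.9560, 810899),
    "CCONJ": (0.9877, 463354),
    "PRON": (0.9644, 405738),
    "ADP": (0.9835, 296947),
    "ADV": (0.9282, 285781),
    "ADJ": (0.8395, 273219),
    "PROPN": (0.9525, 128068),
    "NUM": (0.9842, 58389),
    "ORD": (0.8771, 8534),
    "ITJ": (0.8953, 4554),
    "PART": (0.9407, 3202),
    "FM": (0.9010, 2491),
}


def macro_f1(rows):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)


def weighted_f1(rows):
    """Mean of the per-class F1 scores, weighted by class support."""
    total = sum(support for _, support in rows.values())
    return sum(f1 * support for f1, support in rows.values()) / total


total_support = sum(support for _, support in PER_CLASS.values())

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1(PER_CLASS):.4f}")
print(f"weighted F1:   {weighted_f1(PER_CLASS):.4f}")
```

With the values above, this prints a total support of 4608800, a macro F1 of 0.9397, and a weighted F1 of 0.9588, matching the reported aggregates. Because the table rows are themselves rounded to four decimals, agreement should only be expected up to rounding. The gap between macro and weighted F1 reflects the lower scores on `ADJ` and on rare classes such as `ORD`, `ITJ`, and `FM`, which count equally in the macro average but carry little weight by support.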