---
language: la
library_name: flair
license: cc-by-sa-4.0
tags:
- flair
- token-classification
- sequence-tagger
- latin
- medieval-latin
- legal-history
- pos-tagging
widget:
- text: "In nomine sanctae et individuae trinitatis ."
---

# Latin Contextual POS Tagger (Flair)

This model is a part-of-speech (POS) tagger for Latin, optimized for medieval and early modern legal texts. It uses a Bi-LSTM-CRF architecture built on domain-specific contextual string embeddings.

The model was developed as part of the projects **"Embedding the Past"** (LOEWE-Exploration, TU Darmstadt) and **"Burchards Dekret Digital"** (long-term project, Akademie der Wissenschaften und der Literatur | Mainz).

## Technical Details

- **Architecture:** Bi-LSTM + CRF sequence tagger.
- **Hidden Size:** 1024 (2 layers).
- **Base Embeddings:** Stacked [Latin Legal Forward](https://huggingface.co/mschonhardt/latin-legal-forward) and [Backward](https://huggingface.co/mschonhardt/latin-legal-backward) contextual string embeddings.
- **Data Source:** Corpus of ~1.59M training sentences from medieval texts.
- **Accuracy:** 95.88% (equal to the micro-averaged F1 score, since every token receives exactly one tag).

## Data Source and Acknowledgements

We gratefully acknowledge that the training data originates from the **[Latin Text Archive (LTA)](http://lta.bbaw.de)** (**Prof. Dr. Bernhard Jussen**, **Dr. Tim Geelhaar**), including data from Monumenta Germaniae Historica, Corpus Corporum and IRHT.

## Performance Metrics

Results:
- F-score (micro): 0.9588
- F-score (macro): 0.9397
- Accuracy: 0.9588

By class:

| Class | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| NOUN | 0.9444 | 0.9480 | 0.9462 | 1036164 |
| PUNCT | 0.9999 | 1.0000 | 1.0000 | 831460 |
| VERB | 0.9657 | 0.9465 | 0.9560 | 810899 |
| CCONJ | 0.9833 | 0.9920 | 0.9877 | 463354 |
| PRON | 0.9657 | 0.9631 | 0.9644 | 405738 |
| ADP | 0.9786 | 0.9886 | 0.9835 | 296947 |
| ADV | 0.9300 | 0.9264 | 0.9282 | 285781 |
| ADJ | 0.8347 | 0.8443 | 0.8395 | 273219 |
| PROPN | 0.9428 | 0.9623 | 0.9525 | 128068 |
| NUM | 0.9771 | 0.9913 | 0.9842 | 58389 |
| ORD | 0.8362 | 0.9223 | 0.8771 | 8534 |
| ITJ | 0.9088 | 0.8821 | 0.8953 | 4554 |
| PART | 0.9509 | 0.9307 | 0.9407 | 3202 |
| FM | 0.9226 | 0.8804 | 0.9010 | 2491 |
| accuracy | | | 0.9588 | 4608800 |
| macro avg | 0.9386 | 0.9413 | 0.9397 | 4608800 |
| weighted avg | 0.9589 | 0.9588 | 0.9588 | 4608800 |
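
The aggregate rows can be re-derived from the per-class figures. The sketch below is illustrative only and is not the evaluation script used for this model; the variable names are hypothetical, and the F1 and support values are copied from the per-class report above.

```python
# Per-class (F1, support) pairs copied from the per-class report.
per_class = {
    "NOUN": (0.9462, 1036164),
    "PUNCT": (1.0000, 831460),
    "VERB": (0.9560, 810899),
    "CCONJ": (0.9877, 463354),
    "PRON": (0.9644, 405738),
    "ADP": (0.9835, 296947),
    "ADV": (0.9282, 285781),
    "ADJ": (0.8395, 273219),
    "PROPN": (0.9525, 128068),
    "NUM": (0.9842, 58389),
    "ORD": (0.8771, 8534),
    "ITJ": (0.8953, 4554),
    "PART": (0.9407, 3202),
    "FM": (0.9010, 2491),
}

# Total number of evaluated tokens.
total_support = sum(support for _, support in per_class.values())

# Macro average: unweighted mean, so every tag counts equally.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: each tag contributes in proportion to its token count.
weighted_f1 = (
    sum(f1 * support for f1, support in per_class.values()) / total_support
)

print(f"tokens:      {total_support}")
print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

The macro average falls below the weighted average because weaker classes such as `ADJ`, `ORD`, and `ITJ` count as much as `NOUN` or `PUNCT`, regardless of how many tokens they cover.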

### Confusion Matrix


### Model Limitations

While the model achieves a high micro-F1 of 95.88%, users should be aware of the following:

* **Adjective/Noun Distinction:** Most misclassifications occur between `ADJ` and `NOUN` due to the morphological overlap common in Latin; `ADJ` has the lowest per-class F1 (83.95%).
* **Ordinal Numbers:** The `ORD` tag (87.71% F1) is occasionally confused with standard adjectives.
* **Domain Specificity:** The model is trained on legal and diplomatic corpora; performance may be lower on out-of-domain text such as classical poetry or highly informal Neo-Latin.
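
One way to act on these limitations is to route low-confidence predictions for the affected tags to manual review. The following is a minimal sketch and not part of the model or of Flair: the function name `flag_for_review`, the `(text, tag, score)` tuple layout, and the 0.90 threshold are assumptions chosen for illustration, and the example scores are invented.

```python
from typing import Iterable, List, Tuple

# Tags singled out as error-prone in the limitations list.
REVIEW_TAGS = {"ADJ", "NOUN", "ORD"}

Prediction = Tuple[str, str, float]  # (token text, predicted tag, confidence)


def flag_for_review(
    predictions: Iterable[Prediction],
    threshold: float = 0.90,  # hypothetical cut-off; tune on your own data
) -> List[Prediction]:
    """Return predictions with an error-prone tag and confidence below threshold."""
    return [
        (text, tag, score)
        for text, tag, score in predictions
        if tag in REVIEW_TAGS and score < threshold
    ]


# Invented example values, shaped like (token.text, tag.value, tag.score).
example = [
    ("In", "ADP", 0.9991),
    ("nomine", "NOUN", 0.9987),
    ("sanctae", "ADJ", 0.8412),
    ("et", "CCONJ", 0.9999),
    ("individuae", "ADJ", 0.9530),
    ("trinitatis", "NOUN", 0.8875),
]

flagged = flag_for_review(example)
for text, tag, score in flagged:
    print(f"{text}\t{tag}\t{score:.4f}")
```

A fixed threshold is the simplest choice; a per-tag threshold may suit corpora where `ORD` is rare but important.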

## Usage

You can use this model directly with the [Flair](https://github.com/flairNLP/flair) library.

```python
from flair.models import SequenceTagger
from flair.data import Sentence

# Load the tagger from the Hugging Face Hub.
tagger = SequenceTagger.load("mschonhardt/latin-pos-tagger")

# Tag a single sentence.
sentence = Sentence("In nomine sanctae et individuae trinitatis .")
tagger.predict(sentence)

# Print each token with its predicted tag and confidence score.
for token in sentence:
    tag = token.get_tag("upos")
    print(f"{token.text}\t{tag.value}\t{tag.score:.4f}")
```

## Training Parameters

* Learning Rate: 0.1
* Mini Batch Size: 512
* Max Epochs: 15
* Learning Rate Scheduler: AnnealOnPlateau
* Trained on a single GPU. Device: NVIDIA RTX PRO 6000 Blackwell
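
Flair's `AnnealOnPlateau` lowers the learning rate when the monitored score stops improving. The sketch below is a simplified, stand-alone illustration of that idea and not Flair's implementation: the function name, the anneal factor of 0.5, the patience values, and the dev scores are all assumptions, not settings or results reported for this model.

```python
from typing import List


def anneal_on_plateau(
    dev_scores: List[float],
    initial_lr: float = 0.1,     # learning rate listed in the training parameters
    anneal_factor: float = 0.5,  # assumed value
    patience: int = 3,           # assumed value
) -> List[float]:
    """Return the learning rate in effect after each epoch's dev evaluation."""
    lr = initial_lr
    best = float("-inf")
    bad_epochs = 0
    schedule = []
    for score in dev_scores:
        if score > best:
            # New best score: reset the plateau counter.
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        # Anneal once more than `patience` epochs pass without a new best.
        if bad_epochs > patience:
            lr *= anneal_factor
            bad_epochs = 0
        schedule.append(lr)
    return schedule


# Invented dev scores for illustration only.
scores = [0.950, 0.955, 0.954, 0.955, 0.953, 0.954, 0.956]
schedule = anneal_on_plateau(scores, patience=2)
print(schedule)
```

With a starting rate of 0.1, each annealing step halves the rate under these assumptions, so a run that plateaus repeatedly ends with a much smaller step size than it started with.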

## Citation

If you use this model, please cite the specific model DOI and the Flair framework:

```bibtex
@software{schonhardt_michael_2026_latin_pos,
  author = "Schonhardt, Michael",
  title = "Latin POS Tagger (Flair)",
  year = "2026",
  publisher = "Zenodo",
  doi = "10.5281/zenodo.18631267",
  url = "https://huggingface.co/mschonhardt/latin-pos-tagger"
}
```

```bibtex
@inproceedings{akbik-etal-2018-contextual,
  title = "Contextual String Embeddings for Sequence Labeling",
  author = "Akbik, Alan and Blythe, Duncan and Vollgraf, Roland",
  booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
  year = "2018",
  pages = "1638--1649",
  publisher = "Association for Computational Linguistics"
}
```