	---
	language: la
	library_name: flair
	license: cc-by-sa-4.0
	tags:
	- flair
	- token-classification
	- sequence-tagger
	- latin
	- medieval-latin
	- legal-history
	- pos-tagging
	widget:
	- text: "In nomine sanctae et individuae trinitatis ."
	---

	# Latin Contextual POS Tagger (Flair)

	This model is a Part-of-Speech (POS) tagger for Latin, specifically optimized for medieval and early modern legal texts. It uses a Bi-LSTM-CRF architecture based on domain-specific contextual string embeddings.

	The model was developed as part of the projects "Embedding the Past" (LOEWE-Exploration, TU Darmstadt) and "Burchards Dekret Digital" (Langzeitvorhaben, Akademie der Wissenschaften und der Literatur | Mainz).

	## Technical Details

	- Architecture: Bi-LSTM + CRF Sequence Tagger.
	- Hidden Size: 1024 (2 layers).
	- Base Embeddings: Stacked [Latin Legal Forward](https://huggingface.co/mschonhardt/latin-legal-forward) and [Backward](https://huggingface.co/mschonhardt/latin-legal-backward) contextual string embeddings.
	- Data Source: Corpus of ~1.59M training sentences from medieval texts.
	- Accuracy: 95.88% (equal to the micro-averaged F1-score, since every token receives exactly one tag).

	## Data Source and Acknowledgements

	We gratefully acknowledge that the training data originates from the [Latin Text Archive (LTA)](http://lta.bbaw.de) (Prof. Dr. Bernhard Jussen, Dr. Tim Geelhaar), including data from Monumenta Germaniae Historica, Corpus Corporum, and IRHT.

	## Performance Metrics

	Results:
	- F-score (micro) 0.9588
	- F-score (macro) 0.9397
	- Accuracy 0.9588

	By class:

	| Class | Precision | Recall | F1-score | Support |
	|---|---|---|---|---|
	| NOUN | 0.9444 | 0.9480 | 0.9462 | 1036164 |
	| PUNCT | 0.9999 | 1.0000 | 1.0000 | 831460 |
	| VERB | 0.9657 | 0.9465 | 0.9560 | 810899 |
	| CCONJ | 0.9833 | 0.9920 | 0.9877 | 463354 |
	| PRON | 0.9657 | 0.9631 | 0.9644 | 405738 |
	| ADP | 0.9786 | 0.9886 | 0.9835 | 296947 |
	| ADV | 0.9300 | 0.9264 | 0.9282 | 285781 |
	| ADJ | 0.8347 | 0.8443 | 0.8395 | 273219 |
	| PROPN | 0.9428 | 0.9623 | 0.9525 | 128068 |
	| NUM | 0.9771 | 0.9913 | 0.9842 | 58389 |
	| ORD | 0.8362 | 0.9223 | 0.8771 | 8534 |
	| ITJ | 0.9088 | 0.8821 | 0.8953 | 4554 |
	| PART | 0.9509 | 0.9307 | 0.9407 | 3202 |
	| FM | 0.9226 | 0.8804 | 0.9010 | 2491 |
	| accuracy | | | 0.9588 | 4608800 |
	| macro avg | 0.9386 | 0.9413 | 0.9397 | 4608800 |
	| weighted avg | 0.9589 | 0.9588 | 0.9588 | 4608800 |
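
The two averaged F1 rows answer different questions: the macro average treats every tag equally, so rare tags such as `ORD` weigh as much as `NOUN`, while the weighted average scales each tag by its support. The following is a minimal sketch that recomputes both from the per-class figures above; the names `REPORT`, `macro_f1`, and `weighted_f1` are illustrative and not part of the model or of Flair.

```python
# Per-class (F1-score, support) pairs copied from the table above.
REPORT = {
    "NOUN": (0.9462, 1036164),
    "PUNCT": (1.0000, 831460),
    "VERB": (0.9560, 810899),
    "CCONJ": (0.9877, 463354),
    "PRON": (0.9644, 405738),
    "ADP": (0.9835, 296947),
    "ADV": (0.9282, 285781),
    "ADJ": (0.8395, 273219),
    "PROPN": (0.9525, 128068),
    "NUM": (0.9842, 58389),
    "ORD": (0.8771, 8534),
    "ITJ": (0.8953, 4554),
    "PART": (0.9407, 3202),
    "FM": (0.9010, 2491),
}


def macro_f1(report):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for f1, _ in report.values()) / len(report)


def weighted_f1(report):
    """Mean of the per-class F1 scores, weighted by class support."""
    total = sum(support for _, support in report.values())
    return sum(f1 * support for f1, support in report.values()) / total


print(f"macro F1:    {macro_f1(REPORT):.4f}")
print(f"weighted F1: {weighted_f1(REPORT):.4f}")
```

The gap between the two averages comes almost entirely from the low-support tags (`ORD`, `ITJ`, `FM`) and from `ADJ`, which pull the macro average down without much affecting the weighted one.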

	### Confusion Matrix
	![Confusion Matrix](confusion_matrix.png)

	### Model Limitations

	While the model achieves a high micro-F1 of 95.88%, users should be aware of the following:

	* Adjective/Noun Distinction: Most misclassifications occur between `ADJ` and `NOUN` due to the morphological overlap common in Latin.
	* Ordinal Numbers: The `ORD` tag (87.71% F1) is occasionally confused with standard adjectives.
	* Domain Specificity: The model is trained on legal and diplomatic corpora; performance may vary slightly on classical poetry or highly informal neo-Latin.
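
To check how these confusions play out on your own material, you can tally mismatched tag pairs between a gold annotation and the model output. The sketch below assumes you already have two aligned tag lists; the helper `confusion_pairs` and the toy data are hypothetical and only illustrate the counting.

```python
from collections import Counter


def confusion_pairs(gold_tags, predicted_tags):
    """Count mismatched tag pairs, ignoring the direction of the error."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted tag lists must be aligned")
    return Counter(
        tuple(sorted((gold, pred)))
        for gold, pred in zip(gold_tags, predicted_tags)
        if gold != pred
    )


# Hypothetical toy data, not taken from the evaluation set.
gold = ["ADP", "NOUN", "ADJ", "CCONJ", "ADJ", "NOUN", "PUNCT", "ORD", "NOUN"]
pred = ["ADP", "NOUN", "NOUN", "CCONJ", "ADJ", "ADJ", "PUNCT", "ADJ", "NOUN"]

errors = confusion_pairs(gold, pred)
for (tag_a, tag_b), count in errors.most_common():
    print(f"{tag_a} <-> {tag_b}: {count}")
```

Sorting each pair merges `ADJ` tagged as `NOUN` with `NOUN` tagged as `ADJ`; drop the `sorted` call if the direction of the error matters for your analysis.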

	## Usage

	You can use this model directly with the [Flair](https://github.com/flairNLP/flair) library.

	```python
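	from flair.data import Sentence
	from flair.models import SequenceTagger

	# Load the tagger from the Hugging Face Hub (downloaded on first use)
	tagger = SequenceTagger.load("mschonhardt/latin-pos-tagger")

	# Tag a sample sentence in place
	sentence = Sentence("In nomine sanctae et individuae trinitatis .")
	tagger.predict(sentence)

	# Print each token with its predicted tag and confidence score.
	# "upos" is the label type used in this card; recent Flair releases expose
	# labels via get_label(), while older ones used get_tag().
	for token in sentence:
	    tag = token.get_label("upos")
	    print(f"{token.text}\t{tag.value}\t{tag.score:.4f}")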
	```

	## Training Parameters
	* Learning Rate: 0.1
	* Mini Batch Size: 512
	* Max Epochs: 15
	* Learning Rate Scheduler: AnnealOnPlateau
	* Hardware: single GPU (NVIDIA Blackwell 6000 Pro)
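
In Flair, `AnnealOnPlateau` is a learning-rate scheduler: it starts from the configured rate and shrinks it whenever the monitored score stops improving. The following is a simplified, framework-free sketch of that idea, not Flair's implementation; the annealing factor, the patience, and the dev scores are assumptions chosen for illustration and are not reported for this model.

```python
def anneal_on_plateau(scores, initial_lr=0.1, factor=0.5, patience=1):
    """Return the learning rate in effect for each epoch.

    `scores` are per-epoch dev scores (higher is better). The rate is
    multiplied by `factor` once the score has failed to improve for more
    than `patience` consecutive epochs.
    """
    lr = initial_lr
    best = float("-inf")
    bad_epochs = 0
    history = []
    for score in scores:
        history.append(lr)  # rate used while training this epoch
        if score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor  # anneal and start counting again
            bad_epochs = 0
    return history


# Hypothetical dev scores for eight epochs, starting at the card's rate of 0.1.
dev_scores = [0.90, 0.93, 0.94, 0.94, 0.94, 0.95, 0.95, 0.95]
schedule = anneal_on_plateau(dev_scores)
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr={lr}")
```

With these toy scores, the rate is halved once, after two consecutive epochs without improvement. A larger patience delays each reduction.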

	## Citation

	If you use this model, please cite the specific model DOI and the Flair framework:

	```bibtex
	@software{schonhardt_michael_2026_latin_pos,
	  author = "Schonhardt, Michael",
	  title = "Latin POS Tagger (Flair)",
	  year = "2026",
	  publisher = "Zenodo",
	  doi = "10.5281/zenodo.18631267",
	  url = "https://huggingface.co/mschonhardt/latin-pos-tagger"
	}
	```

	```bibtex
	@inproceedings{akbik-etal-2018-contextual,
	  title = "Contextual String Embeddings for Sequence Labeling",
	  author = "Akbik, Alan and Blythe, Duncan and Vollgraf, Roland",
	  booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
	  year = "2018",
	  pages = "1638--1649",
	  publisher = "Association for Computational Linguistics"
	}
	```