# LatinCy Stanza (la_stanza_latincy)
A Stanza (Stanford NLP) model suite for Latin trained on harmonized Universal Dependencies treebanks from LatinCy. Provides tokenization, POS tagging, morphological features, lemmatization, dependency parsing, and named entity recognition.
## Highlights
- Full NLP pipeline -- tokenizer, POS/morph tagger, lemmatizer, dependency parser, NER
- 6 UD treebanks + LASLA: POS/morph/lemma trained on ~2.87M tokens (UD+LASLA combined)
- Custom character language models trained on 1.6 GB of curated Latin text (13.7M sentences)
- Custom word vectors (CBOW-300, trained on curated Latin corpus)
- NER with 3 entity types: PERSON, LOC, NORP
## Quick Start

```python
import stanza
from huggingface_hub import snapshot_download

# Download models (one time)
model_dir = snapshot_download("latincy/la_stanza_latincy")

# Load pipeline
nlp = stanza.Pipeline("la", dir=model_dir, download_method=None)

# Annotate
doc = nlp("Gallia est omnis divisa in partes tres.")
for sent in doc.sentences:
    for word in sent.words:
        print(f"{word.text:12s} {word.upos:6s} {word.lemma:12s} {word.deprel}")
```
Output:

```text
Gallia       PROPN  Gallia       nsubj
est          AUX    sum          cop
omnis        DET    omnis        det
divisa       ADJ    divido       root
in           ADP    in           case
partes       NOUN   pars         obl
tres         NUM    tres         nummod
.            PUNCT  .            punct
```
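Stanza also exposes each word's morphological features as a pipe-delimited UD `FEATS` string via `word.feats` (`None` when a word carries no features). A minimal helper to unpack such a string into a dict; the sample string below is illustrative, not actual model output:

```python
def parse_feats(feats):
    """Unpack a UD FEATS string like "Case=Nom|Number=Sing" into a dict."""
    if not feats:  # Stanza uses None for words without features
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# Illustrative FEATS string for a feminine nominative singular form:
print(parse_feats("Case=Nom|Gender=Fem|Number=Sing"))
# {'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'}
```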
## NER

```python
nlp = stanza.Pipeline("la", dir=model_dir, download_method=None,
                      processors="tokenize,ner")
doc = nlp("Caesar in Galliam cum legionibus contendit.")
for ent in doc.ents:
    print(f"{ent.text:20s} {ent.type}")
```
## Loading from a Local Directory

If you have the models locally (e.g., after cloning the HuggingFace repo):

```python
nlp = stanza.Pipeline("la", dir="/path/to/la_stanza_latincy",
                      download_method=None)
```
## Model Description
| Property | Value |
|---|---|
| Author | Patrick J. Burns / LatinCy |
| Model type | Stanza neural pipeline (BiLSTM-CRF, biaffine parser) |
| Language | Latin |
| License | MIT |
| Total size | ~1.1 GB (8 model files) |
| Framework | Stanza (Stanford NLP) |
## Pipeline Components

| Component | Model File | Architecture |
|---|---|---|
| Tokenizer | tokenize/latincy.pt (11 MB) | BiLSTM segmenter |
| POS/Morph | pos/latincy.pt (151 MB) | BiLSTM tagger with CharLM + pretrained vectors |
| Lemmatizer | lemma/latincy.pt (46 MB) | Seq2seq with edit classifier |
| Dep. Parser | depparse/latincy.pt (170 MB) | Deep biaffine attention parser |
| NER | ner/latincy.pt (151 MB) | BiLSTM-CRF with CharLM + pretrained vectors |
| CharLM (fwd) | forward_charlm/latincy.pt (197 MB) | Character-level LSTM language model |
| CharLM (bwd) | backward_charlm/latincy.pt (197 MB) | Character-level LSTM language model |
| Pretrain | pretrain/latincy.pt (174 MB) | Word2Vec CBOW-300 embeddings |
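As a quick sanity check, the per-component sizes above account for the stated ~1.1 GB total (a sketch using the megabyte figures from the table):

```python
# Per-component model sizes in MB, taken from the table above.
sizes_mb = {
    "tokenize": 11, "pos": 151, "lemma": 46, "depparse": 170,
    "ner": 151, "forward_charlm": 197, "backward_charlm": 197,
    "pretrain": 174,
}
total_mb = sum(sizes_mb.values())
print(total_mb)  # 1097 MB across 8 files, i.e. ~1.1 GB
```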
## Training Data

### POS, Morphology, Lemmatization (UD + LASLA)
Trained on harmonized data from 6 Universal Dependencies Latin treebanks combined with the LASLA corpus (~1.84M tokens of classical Latin with POS, morphological features, and lemmas).
| Treebank | Full Name | Domain |
|---|---|---|
| ITTB | Index Thomisticus Treebank | Scholastic Latin (Thomas Aquinas) |
| LLCT | Late Latin Charter Treebank | Medieval legal charters |
| PROIEL | PROIEL Treebank | Vulgate Bible, historical texts |
| Perseus | Perseus Latin Treebank | Classical Latin (Caesar, Cicero, etc.) |
| UDante | UDante Treebank | Dante Alighieri (De vulgari eloquentia, etc.) |
| CIRCSE | CIRCSE Latin Treebank | LASLA-derived classical texts |
| LASLA | LASLA corpus | Classical Latin (morphology only, no deps) |
Combined: ~2.87M tokens for POS/morph/lemma; ~1.03M tokens (UD only) for tokenizer and dependency parsing.
### NER
Trained on LatinCy NER annotations from 4 sources: 13,493 train / 3,195 dev sentences. Entity types: PERSON (79%), LOC (14%), NORP (7%).
### Character Language Models
Trained on 1.6 GB of curated Latin text (13.7M sentences from 9 sources) for 15 epochs. Forward and backward CharLMs provide contextualized character-level features to the POS tagger, lemmatizer, parser, and NER.
## Training Procedure

- Tokenizer: BiLSTM segmenter trained on UD-only data.
- POS/Morph tagger: BiLSTM with CharLM features and pretrained word vectors, trained on UD+LASLA combined data.
- Lemmatizer: Seq2seq model with edit classifier and CharLM features, trained on UD+LASLA combined data.
- Dependency parser: Deep biaffine attention parser with CharLM features and pretrained word vectors, trained on UD-only data.
- NER tagger: BiLSTM-CRF with CharLM features and pretrained word vectors, 8,500 training steps with early stopping.
## Evaluation Results

### Overall Scores
| Component | Metric | Score | Split |
|---|---|---|---|
| Tokenizer | Token F1 | 98.24 | dev |
| Tokenizer | Sentence F1 | 86.59 | dev |
| POS | UPOS | 97.39 | test |
| POS | UFeats | 92.20 | test |
| Lemma | Accuracy | 97.79 | test |
| Dep. Parse | UAS | 86.73 | test |
| Dep. Parse | LAS | 83.23 | test |
| Dep. Parse | CLAS | 79.45 | test |
| Dep. Parse | MLAS | 77.17 | test |
| Dep. Parse | BLEX | 79.45 | test |
| NER | Entity F1 | 90.22 | dev |
| NER | PERSON F1 | 93.01 | dev |
| NER | LOC F1 | 80.88 | dev |
| NER | NORP F1 | 78.44 | dev |
### Cross-Framework Comparison (LatinCy v3.8)

All models were trained on the same harmonized treebank data. Scores are on held-out test sets unless noted; NER scores are on dev (no test set exists).

| Metric | LatinCy Stanza 0.1 | LatinCy Flair 0.1 | LatinCy UDPipe 0.1 | LatinCy spaCy lg 3.8.0 |
|---|---|---|---|---|
| UPOS | 97.39 | 97.11 | 93.28 | 97.26 |
| UFeats | 92.20 | -- | 82.48 | 92.58 |
| Lemma | 97.79 | 96.52 | 93.05 | 94.87 |
| UAS | 86.73 | -- | 76.11 | 84.03 |
| LAS | 83.23 | -- | 71.29 | 78.89 |
| NER F1 | 90.22 | 90.48 | -- | 82.26 |
Stanza leads on UPOS, lemma, and dependency parsing. spaCy leads on morphological features (UFeats). Flair is competitive on POS/lemma and leads on NER. UDPipe offers single-file portability usable from R, Python, CLI, and other platforms.
### vs. Stanford's Official Latin Package (stanfordnlp/stanza-la)
Stanford distributes separate per-treebank models (ITTB, LLCT, Perseus, PROIEL, UDante) without character language models (nocharlm variants) and without NER. LatinCy Stanza trains a single unified model across all treebanks plus LASLA, with custom forward/backward CharLMs and pretrained word vectors. A direct benchmark comparison is planned for a future release.
## Limitations
- No test split for NER: NER scores are on the dev set; no held-out test evaluation is available.
- Tokenizer scores on dev: No separate test evaluation was run for the tokenizer.
- LASLA data is morphology-only: Dependency parsing trained on UD data only (~1.03M tokens), not the full 2.87M token corpus.
- No transformer features: This is a Phase 1 model using BiLSTM + CharLM. Phase 2 will integrate a transformer model.
- Large total size: The full model suite is ~1.1 GB due to 8 separate model files (including 2 CharLMs at 197 MB each). Individual components can be loaded selectively.
## Future Development
The following Stanza processors are not yet implemented for Latin in this release but will be considered for future development:
- Constituency parsing (phrase structure)
- Coreference resolution
- Sentiment analysis
- Multi-word token (MWT) expansion
Also, we expect to train the next version of LatinCy Stanza using a transformer model for improved accuracy on morphological features and dependency parsing.
## References
- Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. 2020. "Stanza: A Python Natural Language Processing Toolkit for Many Human Languages." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
## Citation

```bibtex
@misc{burns2026latincystanza,
  author = {Burns, Patrick J.},
  title = {{LatinCy Stanza (la\_stanza\_latincy)}},
  year = {2026},
  url = {https://huggingface.co/latincy/la_stanza_latincy},
}
```
## Acknowledgments
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.