LatinCy Stanza (la_stanza_latincy)

A Stanza (Stanford NLP) model suite for Latin trained on harmonized Universal Dependencies treebanks from LatinCy. Provides tokenization, POS tagging, morphological features, lemmatization, dependency parsing, and named entity recognition.

Highlights

  • Full NLP pipeline -- tokenizer, POS/morph tagger, lemmatizer, dependency parser, NER
  • 6 UD treebanks + LASLA: POS/morph/lemma trained on ~2.87M tokens (UD+LASLA combined)
  • Custom character language models trained on 1.6 GB of curated Latin text (13.7M sentences)
  • Custom word vectors (CBOW-300, trained on curated Latin corpus)
  • NER with 3 entity types: PERSON, LOC, NORP

Quick Start

import stanza
from huggingface_hub import snapshot_download

# Download models (one time)
model_dir = snapshot_download("latincy/la_stanza_latincy")

# Load pipeline
nlp = stanza.Pipeline("la", dir=model_dir, download_method=None)

# Annotate
doc = nlp("Gallia est omnis divisa in partes tres.")
for sent in doc.sentences:
    for word in sent.words:
        print(f"{word.text:12s} {word.upos:6s} {word.lemma:12s} {word.deprel}")

Output:

Gallia       PROPN  Gallia       nsubj
est          AUX    sum          cop
omnis        DET    omnis        det
divisa       ADJ    divido       root
in           ADP    in           case
partes       NOUN   pars         obl
tres         NUM    tres         nummod
.            PUNCT  .            punct

NER

nlp = stanza.Pipeline("la", dir=model_dir, download_method=None,
                       processors="tokenize,ner")
doc = nlp("Caesar in Galliam cum legionibus contendit.")
for ent in doc.ents:
    print(f"{ent.text:20s} {ent.type}")

Loading from a Local Directory

If you already have the models locally (e.g., after cloning the Hugging Face repo):

nlp = stanza.Pipeline("la", dir="/path/to/la_stanza_latincy",
                       download_method=None)

Model Description

Property     Value
Author       Patrick J. Burns / LatinCy
Model type   Stanza neural pipeline (BiLSTM-CRF, biaffine parser)
Language     Latin
License      MIT
Total size   ~1.1 GB (8 model files)
Framework    Stanza (Stanford NLP)

Pipeline Components

Component      Model file                   Size     Architecture
Tokenizer      tokenize/latincy.pt          11 MB    BiLSTM segmenter
POS/Morph      pos/latincy.pt               151 MB   BiLSTM tagger with CharLM + pretrained vectors
Lemmatizer     lemma/latincy.pt             46 MB    Seq2seq with edit classifier
Dep. Parser    depparse/latincy.pt          170 MB   Deep biaffine attention parser
NER            ner/latincy.pt               151 MB   BiLSTM-CRF with CharLM + pretrained vectors
CharLM (fwd)   forward_charlm/latincy.pt    197 MB   Character-level LSTM language model
CharLM (bwd)   backward_charlm/latincy.pt   197 MB   Character-level LSTM language model
Pretrain       pretrain/latincy.pt          174 MB   Word2Vec CBOW-300 embeddings
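The ~1.1 GB total quoted above is simply the sum of the eight component files; a quick sanity check (sizes taken from the table):

```python
# File sizes in MB, copied from the component table above.
sizes_mb = {
    "tokenize": 11, "pos": 151, "lemma": 46, "depparse": 170,
    "ner": 151, "forward_charlm": 197, "backward_charlm": 197,
    "pretrain": 174,
}

total = sum(sizes_mb.values())
print(f"{total} MB ≈ {total / 1024:.1f} GB")
```

The two CharLMs and the pretrained embeddings account for over half of the footprint, which is why loading components selectively (see Limitations) saves the most space when those processors are not needed.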

Training Data

POS, Morphology, Lemmatization (UD + LASLA)

Trained on harmonized data from 6 Universal Dependencies Latin treebanks combined with the LASLA corpus (~1.84M tokens of classical Latin with POS, morphological features, and lemmas).

Treebank   Full name                     Domain
ITTB       Index Thomisticus Treebank    Scholastic Latin (Thomas Aquinas)
LLCT       Late Latin Charter Treebank   Medieval legal charters
PROIEL     PROIEL Treebank               Vulgate Bible, historical texts
Perseus    Perseus Latin Treebank        Classical Latin (Caesar, Cicero, etc.)
UDante     UDante Treebank               Dante Alighieri (De vulgari eloquentia, etc.)
CIRCSE     CIRCSE Latin Treebank         LASLA-derived classical texts
LASLA      LASLA corpus                  Classical Latin (morphology only, no deps)

Combined: ~2.87M tokens for POS/morph/lemma; ~1.03M tokens (UD only) for tokenizer and dependency parsing.
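The UD treebanks above distribute their annotations in the 10-column CoNLL-U format. As a rough sketch (field names follow the CoNLL-U specification; the sample line is illustrative, not taken from a treebank), one token line can be unpacked like this:

```python
# Minimal sketch of parsing one token line of the 10-column CoNLL-U
# format used by the UD treebanks.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_token_line(line):
    """Split a tab-separated CoNLL-U token line into a field dict."""
    values = line.rstrip("\n").split("\t")
    assert len(values) == 10, "CoNLL-U token lines have exactly 10 columns"
    return dict(zip(FIELDS, values))

sample = "1\tGallia\tGallia\tPROPN\t_\tCase=Nom|Gender=Fem|Number=Sing\t4\tnsubj\t_\t_"
token = parse_token_line(sample)
print(token["form"], token["upos"], token["lemma"], token["deprel"])
# Gallia PROPN Gallia nsubj
```

LASLA rows carry only the lemma, POS, and feats columns with meaningful values, which is why that corpus contributes to tagging and lemmatization but not to parsing.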

NER

Trained on LatinCy NER annotations from 4 sources: 13,493 train / 3,195 dev sentences. Entity types: PERSON (79%), LOC (14%), NORP (7%).

Character Language Models

Trained on 1.6 GB of curated Latin text (13.7M sentences from 9 sources) for 15 epochs. Forward and backward CharLMs provide contextualized character-level features to the POS tagger, lemmatizer, parser, and NER.

Training Procedure

Tokenizer: BiLSTM segmenter trained on UD-only data.

POS/Morph tagger: BiLSTM with CharLM features and pretrained word vectors, trained on UD+LASLA combined data.

Lemmatizer: Seq2seq model with edit classifier, CharLM features, trained on UD+LASLA combined data.

Dependency parser: Deep biaffine attention parser with CharLM features and pretrained word vectors, trained on UD-only data.

NER tagger: BiLSTM-CRF with CharLM features and pretrained word vectors, 8,500 training steps with early stopping.
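The early-stopping criterion mentioned for the NER tagger can be sketched as follows. This is a generic dev-score patience loop, not Stanza's actual training code, and the patience value is illustrative:

```python
# Sketch of dev-score early stopping: keep the best checkpoint and
# halt once `patience` consecutive evaluations fail to improve on it.
def train_with_early_stopping(dev_scores, patience=3):
    """Return (step, score) of the best dev evaluation."""
    best_step, best_score, bad_evals = 0, float("-inf"), 0
    for step, score in enumerate(dev_scores):
        if score > best_score:
            best_step, best_score, bad_evals = step, score, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break  # dev score has plateaued; stop training
    return best_step, best_score

# Dev F1 peaks at step 3 and then plateaus, so training halts early.
print(train_with_early_stopping([80.1, 85.4, 88.0, 90.2, 89.9, 90.0, 89.5]))
# (3, 90.2)
```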

Evaluation Results

Overall Scores

Component    Metric        Score   Split
Tokenizer    Token F1      98.24   dev
Tokenizer    Sentence F1   86.59   dev
POS          UPOS          97.39   test
POS          UFeats        92.20   test
Lemma        Accuracy      97.79   test
Dep. Parse   UAS           86.73   test
Dep. Parse   LAS           83.23   test
Dep. Parse   CLAS          79.45   test
Dep. Parse   MLAS          77.17   test
Dep. Parse   BLEX          79.45   test
NER          Entity F1     90.22   dev
NER          PERSON F1     93.01   dev
NER          LOC F1        80.88   dev
NER          NORP F1       78.44   dev
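The entity F1 reported above is standard exact-match span F1: a predicted entity counts as correct only if both its span and its type match a gold entity. A minimal sketch of the metric (the sample spans are illustrative):

```python
def entity_f1(gold, pred):
    """Exact-match entity F1 over (start, end, type) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)          # spans correct in both extent and type
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [(0, 1, "PERSON"), (2, 3, "LOC"), (5, 6, "NORP")]
pred = [(0, 1, "PERSON"), (2, 3, "NORP")]  # second span has the wrong type
print(round(entity_f1(gold, pred), 2))
# 0.4
```

Note that a correct span with the wrong label scores as both a false positive and a false negative, which is why per-type scores (PERSON vs. LOC vs. NORP) can diverge sharply from the overall F1.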

Cross-Framework Comparison (LatinCy v3.8)

All models trained on the same harmonized treebank data. Scores on held-out test sets unless noted. NER scores are on dev (no test set exists).

Metric   LatinCy Stanza 0.1   LatinCy Flair 0.1   LatinCy UDPipe 0.1   LatinCy spaCy lg 3.8.0
UPOS     97.39                97.11               93.28                97.26
UFeats   92.20                --                  82.48                92.58
Lemma    97.79                96.52               93.05                94.87
UAS      86.73                --                  76.11                84.03
LAS      83.23                --                  71.29                78.89
NER F1   90.22                90.48               --                   82.26

Stanza leads on UPOS, lemmatization, and dependency parsing; spaCy leads on morphological features (UFeats); Flair is competitive on POS and lemmas and leads on NER. UDPipe trails on accuracy but offers single-file portability, usable from R, Python, the command line, and other platforms.

vs. Stanford's Official Latin Package (stanfordnlp/stanza-la)

Stanford distributes separate per-treebank models (ITTB, LLCT, Perseus, PROIEL, UDante) without character language models (nocharlm variants) and without NER. LatinCy Stanza trains a single unified model across all treebanks plus LASLA, with custom forward/backward CharLMs and pretrained word vectors. A direct benchmark comparison is planned for a future release.

Limitations

  • No test split for NER: NER scores are on the dev set; no held-out test evaluation is available.
  • Tokenizer scores on dev: No separate test evaluation was run for the tokenizer.
  • LASLA data is morphology-only: Dependency parsing trained on UD data only (~1.03M tokens), not the full 2.87M token corpus.
  • No transformer features: This is a Phase 1 model using BiLSTM + CharLM. Phase 2 will integrate a transformer model.
  • Large total size: The full model suite is ~1.1 GB due to 8 separate model files (including 2 CharLMs at 197 MB each). Individual components can be loaded selectively.

Future Development

The following Stanza processors are not yet implemented for Latin in this release but will be considered for future development:

  • Constituency parsing (phrase structure)
  • Coreference resolution
  • Sentiment analysis
  • Multi-word token (MWT) expansion

We also plan to train the next version of LatinCy Stanza with a transformer model, for improved accuracy on morphological features and dependency parsing.

References

  • Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. 2020. "Stanza: A Python Natural Language Processing Toolkit for Many Human Languages." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. https://nlp.stanford.edu/pubs/qi2020stanza.pdf.

Citation

@misc{burns2026latincystanza,
  author = {Burns, Patrick J.},
  title = {{LatinCy Stanza (la\_stanza\_latincy)}},
  year = {2026},
  url = {https://huggingface.co/latincy/la_stanza_latincy},
}

Acknowledgments

This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
