# LatinCy UDPipe (la_udpipe_latincy_multi)

A UDPipe 1 model for Latin trained on six harmonized Universal Dependencies treebanks from LatinCy. Provides tokenization, POS tagging, lemmatization, morphological features, and dependency parsing in a single file.

## Highlights
- Single-file model (71.3 MB) -- works offline, no GPU needed
- 6 treebanks: ITTB, LLCT, PROIEL, Perseus, UDante, CIRCSE (48,754 training sentences)
- Drop-in replacement for existing UDPipe Latin workflows in R, Python, CLI, Java, C#, and Perl
- Optimized with pre-trained word embeddings, swap parser, and two-model tagger
## Quick Start (R)

The primary audience for this model is R users working with the udpipe package.

```r
library(udpipe)

# Download the model (one time)
model_url <- "https://huggingface.co/latincy/la_udpipe_latincy/resolve/main/la_udpipe_latincy_multi.udpipe"
model_path <- "la_udpipe_latincy_multi.udpipe"
if (!file.exists(model_path)) {
  download.file(model_url, model_path, mode = "wb")
}

# Load and annotate
model <- udpipe_load_model(model_path)
text <- "Gallia est omnis divisa in partes tres."
result <- udpipe_annotate(model, x = text)
df <- as.data.frame(result)
print(df[, c("token", "upos", "lemma", "dep_rel")])
```
Output:

```
   token  upos  lemma dep_rel
1 Gallia PROPN Gallia   nsubj
2    est   AUX    sum     cop
3  omnis   DET  omnis     det
4 divisa   ADJ divisa    root
5     in   ADP     in    case
6 partes  NOUN   pars     obl
7   tres   NUM   tres  nummod
8      . PUNCT      .   punct
```
## Quick Start (Python)

```python
from ufal.udpipe import Model, Pipeline

model = Model.load("la_udpipe_latincy_multi.udpipe")
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
result = pipeline.process("Gallia est omnis divisa in partes tres.")
print(result)
```

Install the Python bindings with `pip install ufal.udpipe`.
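The pipeline returns plain CoNLL-U text, which is easy to consume without extra dependencies. Below is a minimal, standard-library-only sketch of a CoNLL-U reader; the `sample` string is an abbreviated, hand-written stand-in for real pipeline output, not something the model produced.

```python
# Minimal CoNLL-U reader. Comment lines start with "#"; token lines are
# tab-separated with FORM in column 2, LEMMA in column 3, UPOS in column 4.
def read_conllu(conllu_text):
    rows = []
    for line in conllu_text.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence metadata and blank separators
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        rows.append({"form": cols[1], "lemma": cols[2], "upos": cols[3]})
    return rows

# Abbreviated sample modeled on the Quick Start sentence (illustrative only).
sample = "\n".join([
    "# text = Gallia est omnis divisa.",
    "\t".join(["1", "Gallia", "Gallia", "PROPN", "_", "_", "4", "nsubj", "_", "_"]),
    "\t".join(["2", "est", "sum", "AUX", "_", "_", "4", "cop", "_", "_"]),
    "\t".join(["3", "omnis", "omnis", "DET", "_", "_", "4", "det", "_", "_"]),
    "\t".join(["4", "divisa", "divisa", "ADJ", "_", "_", "0", "root", "_", "_"]),
])

print([t["lemma"] for t in read_conllu(sample)])
# ['Gallia', 'sum', 'omnis', 'divisa']
```

In practice you would pass `pipeline.process(...)` output straight to `read_conllu`; the `conllu` library on PyPI offers a fuller parser if you prefer a dependency.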
## Quick Start (CLI)

```bash
# Download the model
curl -L -o la_udpipe_latincy_multi.udpipe \
  https://huggingface.co/latincy/la_udpipe_latincy/resolve/main/la_udpipe_latincy_multi.udpipe

# Annotate text
echo "Gallia est omnis divisa in partes tres." | \
  udpipe --tokenize --tag --parse la_udpipe_latincy_multi.udpipe
```
## Model Description
| Property | Value |
|---|---|
| Author | Patrick J. Burns / LatinCy |
| Model type | UDPipe 1 (MorphoDiTa tagger + Parsito parser) |
| Language | Latin |
| License | MIT |
| File size | 71.3 MB (larger than typical single-treebank UDPipe models due to multi-treebank training data and embedded word vectors) |
| Framework | UDPipe 1 |
## Training Data
Trained on harmonized data from 6 Universal Dependencies Latin treebanks, prepared by the LatinCy treebank pipeline. Treebanks are harmonized to consistent annotation standards before combining.
| Treebank | Full Name | Domain |
|---|---|---|
| ITTB | Index Thomisticus Treebank | Scholastic Latin (Thomas Aquinas) |
| LLCT | Late Latin Charter Treebank | Medieval legal charters |
| PROIEL | PROIEL Treebank | Vulgate Bible, historical texts |
| Perseus | Perseus Latin Treebank | Classical Latin (Caesar, Cicero, etc.) |
| UDante | UDante Treebank | Dante Alighieri (De vulgari eloquentia, etc.) |
| CIRCSE | CIRCSE Latin Treebank | LASLA-derived classical texts |
Split sizes: 48,754 train / 4,613 dev / 6,423 test sentences.
## Training Procedure

Trained using the ufal.udpipe Python bindings (UDPipe 1, morphodita_parsito backend).

**Tokenizer:**
- 100 epochs max, batch size 50, learning rate 0.005, dropout 0.1
- Early stopping enabled (stopped at epoch 42)

**Tagger (two-model MorphoDiTa configuration):**
- Model 1: UPOS, XPOS, morphological features (20 iterations, tagger templates)
- Model 2: lemma (20 iterations, lemmatizer templates)

**Parser (Parsito):**
- Swap transition system (handles non-projective dependencies)
- Static lazy oracle, 200-unit hidden layer
- Pre-trained word2vec embeddings (CBOW-300, 142K vocab, 96.1% form coverage)
- 10 iterations max, early stopping (stopped at iteration 9)
## Evaluation Results
Evaluated on held-out test data (6,423 sentences) with gold tokenization using conll18_ud_eval.py.
### Overall Scores
| Metric | F1 |
|---|---|
| UPOS | 93.28 |
| UFeats | 82.48 |
| Lemma | 93.05 |
| UAS | 76.11 |
| LAS | 71.29 |
| CLAS | 65.83 |
| MLAS | 52.52 |
| BLEX | 61.77 |
### Per-Treebank Breakdown
| Treebank | UPOS | UFeats | Lemma | UAS | LAS |
|---|---|---|---|---|---|
| ITTB | 96.37 | 91.51 | 97.96 | 82.58 | 79.29 |
| LLCT | 95.49 | 90.68 | 96.94 | 90.48 | 88.95 |
| PROIEL | 93.42 | 78.94 | 94.45 | 67.76 | 62.45 |
| Perseus | 89.80 | 72.13 | 85.95 | 64.52 | 56.18 |
| UDante | 87.44 | 72.79 | 85.39 | 63.57 | 55.45 |
| CIRCSE | 87.50 | 68.14 | 86.21 | 48.09 | 40.38 |
## Comparison with Stock UDPipe UD 2.5 Models
The UDPipe UD 2.5 models (Straka & Straková, December 2019) are single-treebank models trained with default hyperparameters and distributed with UDPipe. Three exist for Latin: ITTB, Perseus, and PROIEL. No stock models exist for LLCT, CIRCSE, or UDante.
This model differs from stock models in three ways: (1) training on all 6 treebanks rather than one, (2) harmonized annotations across treebanks, and (3) optimized hyperparameters including pre-trained embeddings and a swap parser.
LAS F1 comparison:
| Test Set | LatinCy | Stock-ITTB | Stock-Perseus | Stock-PROIEL |
|---|---|---|---|---|
| ITTB | 79.29 | 66.09 | 36.18 | 39.31 |
| LLCT | 88.95 | 30.84 | 22.80 | 29.35 |
| PROIEL | 62.45 | 39.83 | 33.84 | 50.67 |
| Perseus | 56.18 | 38.13 | 40.74 | 34.49 |
| UDante | 55.45 | 42.46 | 26.63 | 30.69 |
| CIRCSE | 40.38 | 23.32 | 21.90 | 27.43 |
Observations:
- The LatinCy model outperforms stock models across all treebanks, including on each stock model's own training domain (ITTB +13.20, Perseus +15.44, PROIEL +11.78 over best stock)
- Single-treebank stock models show expected domain sensitivity when applied cross-domain (e.g., Stock-ITTB on LLCT: 30.84 LAS)
- Multi-treebank training on harmonized data provides broader coverage across Latin text types
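The in-domain gains quoted above can be recomputed directly from the LAS table:

```python
# Recompute each stock model's in-domain LAS against the LatinCy model,
# using the numbers from the comparison table above.
las = {
    # test set: (LatinCy LAS, best stock LAS on its own training domain)
    "ITTB":    (79.29, 66.09),  # Stock-ITTB
    "Perseus": (56.18, 40.74),  # Stock-Perseus
    "PROIEL":  (62.45, 50.67),  # Stock-PROIEL
}
for treebank, (latincy, stock) in las.items():
    print(f"{treebank}: +{latincy - stock:.2f} LAS")
# ITTB: +13.20 LAS
# Perseus: +15.44 LAS
# PROIEL: +11.78 LAS
```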
## Cross-Framework Comparison (LatinCy v3.8)

All models trained on the same harmonized treebank data. Scores are on held-out test sets unless noted; NER scores are on dev (no NER test set exists).

| Metric | LatinCy UDPipe 0.1 | LatinCy Stanza 0.1 | LatinCy Flair 0.1 | LatinCy spaCy lg 3.8.0 |
|---|---|---|---|---|
| UPOS | 93.28 | 97.39 | 97.11 | 97.26 |
| UFeats | 82.48 | 92.20 | -- | 92.58 |
| Lemma | 93.05 | 97.79 | 96.52 | 94.87 |
| UAS | 76.11 | 86.73 | -- | 84.03 |
| LAS | 71.29 | 83.23 | -- | 78.89 |
| NER F1 | -- | 90.22 | 90.48 | 82.26 |
UDPipe's strength is portability: a single file usable from R, Python, CLI, Java, C#, and Perl with no GPU and no framework dependencies. Stanza and Flair offer higher accuracy when Python and GPU resources are available.
## R Usage Guide

### Installation

```r
install.packages("udpipe")
```

### Download and Load

```r
library(udpipe)

# Download the model
model_url <- "https://huggingface.co/latincy/la_udpipe_latincy/resolve/main/la_udpipe_latincy_multi.udpipe"
model_path <- "la_udpipe_latincy_multi.udpipe"
if (!file.exists(model_path)) {
  download.file(model_url, model_path, mode = "wb")
}

# Load
model <- udpipe_load_model(model_path)
```
### Basic Annotation

```r
text <- "Arma virumque cano, Troiae qui primus ab oris
Italiam fato profugus Laviniaque venit litora."
result <- udpipe_annotate(model, x = text)
df <- as.data.frame(result)
```
### Tidyverse Integration

```r
library(dplyr)

# Count POS tags
df %>%
  count(upos, sort = TRUE)

# Extract nouns with their lemmas
df %>%
  filter(upos == "NOUN") %>%
  select(token, lemma, feats)

# Get dependency relations
df %>%
  select(token_id, token, head_token_id, dep_rel, upos)
```
### Replacing a Stock Model

If you previously used a stock UDPipe Latin model, replace it by pointing to this model file instead:

```r
# Before (stock model)
# model <- udpipe_download_model(language = "latin-ittb")
# model <- udpipe_load_model(model$file_model)

# After (LatinCy model)
model <- udpipe_load_model("la_udpipe_latincy_multi.udpipe")

# Everything else stays the same
```
## Limitations

- **UPOS/UFeats slightly below baseline:** The two-model tagger configuration trades ~1.3 UPOS / ~1.6 UFeats points for +4.2 LAS and +1.9 Lemma. The overall trade-off is positive for downstream tasks.
- **UDPipe 1 architecture:** Feature-based model (no transformers). For higher accuracy, consider LatinCy spaCy models (`la_core_web_trf`: 79.8 LAS).
- **Gold tokenization evaluation:** Reported scores use gold tokenization. Real-world performance on raw text depends on the tokenizer (99.6% F1 on held-out data).
## References
- Straka, M., Hajič, J., and Straková, J. 2016. "UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing." In Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., et al. eds. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). Portorož, Slovenia: European Language Resources Association (ELRA). 4290–97. https://aclanthology.org/L16-1680/.
## Citation

```bibtex
@misc{burns2026latincyudpipe,
  author    = {Burns, Patrick J.},
  title     = {{LatinCy UDPipe (la\_udpipe\_latincy\_multi)}},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/latincy/la_udpipe_latincy},
}
```
## Acknowledgments
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.