# LatinCy UDPipe (la_udpipe_latincy_multi)

A UDPipe 1 model for Latin trained on six harmonized Universal Dependencies treebanks from LatinCy. Provides tokenization, POS tagging, lemmatization, morphological features, and dependency parsing in a single file.

## Highlights
- Single-file model (71.3 MB) -- works offline, no GPU needed
- 6 treebanks: ITTB, LLCT, PROIEL, Perseus, UDante, CIRCSE (48,754 training sentences)
- Drop-in replacement for existing UDPipe Latin workflows in R, Python, CLI, Java, C#, and Perl
- Optimized with pre-trained word embeddings, swap parser, and two-model tagger
## Quick Start (R)

The primary audience for this model is R users working with the udpipe package.

```r
library(udpipe)

# Download the model (one time)
model_url <- "https://huggingface.co/latincy/la_udpipe_latincy/resolve/main/la_udpipe_latincy_multi.udpipe"
model_path <- "la_udpipe_latincy_multi.udpipe"
if (!file.exists(model_path)) {
  download.file(model_url, model_path, mode = "wb")
}

# Load and annotate
model <- udpipe_load_model(model_path)
text <- "Gallia est omnis divisa in partes tres."
result <- udpipe_annotate(model, x = text)
df <- as.data.frame(result)
print(df[, c("token", "upos", "lemma", "dep_rel")])
```
Output:

```
   token  upos  lemma dep_rel
1 Gallia PROPN Gallia   nsubj
2    est   AUX    sum     cop
3  omnis   DET  omnis     det
4 divisa   ADJ divisa    root
5     in   ADP     in    case
6 partes  NOUN   pars     obl
7   tres   NUM   tres  nummod
8      . PUNCT      .   punct
```
## Quick Start (Python)

```python
from ufal.udpipe import Model, Pipeline

model = Model.load("la_udpipe_latincy_multi.udpipe")
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
result = pipeline.process("Gallia est omnis divisa in partes tres.")
print(result)
```

Install the Python bindings with `pip install ufal.udpipe`.
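The pipeline returns plain CoNLL-U text, which is easy to consume without extra dependencies. Below is a minimal, standard-library-only sketch of a CoNLL-U reader; the `sample` string is an abbreviated, hand-written stand-in for real pipeline output, not something the model produced.

```python
# Minimal CoNLL-U reader. Comment lines start with "#"; token lines are
# tab-separated with FORM in column 2, LEMMA in column 3, UPOS in column 4.
def read_conllu(conllu_text):
    rows = []
    for line in conllu_text.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence metadata and blank separators
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        rows.append({"form": cols[1], "lemma": cols[2], "upos": cols[3]})
    return rows

# Abbreviated sample modeled on the Quick Start sentence (illustrative only).
sample = "\n".join([
    "# text = Gallia est omnis divisa.",
    "\t".join(["1", "Gallia", "Gallia", "PROPN", "_", "_", "4", "nsubj", "_", "_"]),
    "\t".join(["2", "est", "sum", "AUX", "_", "_", "4", "cop", "_", "_"]),
    "\t".join(["3", "omnis", "omnis", "DET", "_", "_", "4", "det", "_", "_"]),
    "\t".join(["4", "divisa", "divisa", "ADJ", "_", "_", "0", "root", "_", "_"]),
])

print([t["lemma"] for t in read_conllu(sample)])
# ['Gallia', 'sum', 'omnis', 'divisa']
```

In practice you would pass `pipeline.process(...)` output straight to `read_conllu`; the `conllu` library on PyPI offers a fuller parser if you prefer a dependency.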
## Quick Start (CLI)

```bash
# Download the model
curl -L -o la_udpipe_latincy_multi.udpipe \
  https://huggingface.co/latincy/la_udpipe_latincy/resolve/main/la_udpipe_latincy_multi.udpipe

# Annotate text
echo "Gallia est omnis divisa in partes tres." | \
  udpipe --tokenize --tag --parse la_udpipe_latincy_multi.udpipe
```
## Model Description
| Property | Value |
|---|---|
| Author | Patrick J. Burns / LatinCy |
| Model type | UDPipe 1 (MorphoDiTa tagger + Parsito parser) |
| Language | Latin |
| License | MIT |
| File size | 71.3 MB (larger than typical single-treebank UDPipe models due to multi-treebank training data and embedded word vectors) |
| Framework | UDPipe 1 |
## Training Data
Trained on harmonized data from 6 Universal Dependencies Latin treebanks, prepared by the LatinCy treebank pipeline. Treebanks are harmonized to consistent annotation standards before combining.
| Treebank | Full Name | Domain |
|---|---|---|
| ITTB | Index Thomisticus Treebank | Scholastic Latin (Thomas Aquinas) |
| LLCT | Late Latin Charter Treebank | Medieval legal charters |
| PROIEL | PROIEL Treebank | Vulgate Bible, historical texts |
| Perseus | Perseus Latin Treebank | Classical Latin (Caesar, Cicero, etc.) |
| UDante | UDante Treebank | Dante Alighieri (De vulgari eloquentia, etc.) |
| CIRCSE | CIRCSE Latin Treebank | LASLA-derived classical texts |
Split sizes: 48,754 train / 4,613 dev / 6,423 test sentences.
## Training Procedure

Trained using the ufal.udpipe Python bindings (UDPipe 1, morphodita_parsito backend).

**Tokenizer:**
- 100 epochs max, batch size 50, learning rate 0.005, dropout 0.1
- Early stopping enabled (stopped at epoch 42)

**Tagger (two-model MorphoDiTa configuration):**
- Model 1: UPOS, XPOS, morphological features (20 iterations, tagger templates)
- Model 2: lemma (20 iterations, lemmatizer templates)

**Parser (Parsito):**
- Swap transition system (handles non-projective dependencies)
- Static lazy oracle, 200-unit hidden layer
- Pre-trained word2vec embeddings (CBOW-300, 142K vocab, 96.1% form coverage)
- 10 iterations max, early stopping (stopped at iteration 9)
## Evaluation Results
Evaluated on held-out test data (6,423 sentences) with gold tokenization using conll18_ud_eval.py.
### Overall Scores
| Metric | F1 |
|---|---|
| UPOS | 93.28 |
| UFeats | 82.48 |
| Lemma | 93.05 |
| UAS | 76.11 |
| LAS | 71.29 |
| CLAS | 65.83 |
| MLAS | 52.52 |
| BLEX | 61.77 |
### Per-Treebank Breakdown
| Treebank | UPOS | UFeats | Lemma | UAS | LAS |
|---|---|---|---|---|---|
| ITTB | 96.37 | 91.51 | 97.96 | 82.58 | 79.29 |
| LLCT | 95.49 | 90.68 | 96.94 | 90.48 | 88.95 |
| PROIEL | 93.42 | 78.94 | 94.45 | 67.76 | 62.45 |
| Perseus | 89.80 | 72.13 | 85.95 | 64.52 | 56.18 |
| UDante | 87.44 | 72.79 | 85.39 | 63.57 | 55.45 |
| CIRCSE | 87.50 | 68.14 | 86.21 | 48.09 | 40.38 |
## Comparison with Stock UDPipe UD 2.5 Models
The UDPipe UD 2.5 models (Straka & Straková, December 2019) are single-treebank models trained with default hyperparameters and distributed with UDPipe. Three exist for Latin: ITTB, Perseus, and PROIEL. No stock models exist for LLCT, CIRCSE, or UDante.
This model differs from stock models in three ways: (1) training on all 6 treebanks rather than one, (2) harmonized annotations across treebanks, and (3) optimized hyperparameters including pre-trained embeddings and a swap parser.
LAS F1 comparison:
| Test Set | LatinCy | Stock-ITTB | Stock-Perseus | Stock-PROIEL |
|---|---|---|---|---|
| ITTB | 79.29 | 66.09 | 36.18 | 39.31 |
| LLCT | 88.95 | 30.84 | 22.80 | 29.35 |
| PROIEL | 62.45 | 39.83 | 33.84 | 50.67 |
| Perseus | 56.18 | 38.13 | 40.74 | 34.49 |
| UDante | 55.45 | 42.46 | 26.63 | 30.69 |
| CIRCSE | 40.38 | 23.32 | 21.90 | 27.43 |
Observations:
- The LatinCy model outperforms stock models across all treebanks, including on each stock model's own training domain (ITTB +13.20, Perseus +15.44, PROIEL +11.78 over best stock)
- Single-treebank stock models show expected domain sensitivity when applied cross-domain (e.g., Stock-ITTB on LLCT: 30.84 LAS)
- Multi-treebank training on harmonized data provides broader coverage across Latin text types
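The in-domain gains quoted above can be recomputed directly from the LAS table:

```python
# Recompute each stock model's in-domain LAS against the LatinCy model,
# using the numbers from the comparison table above.
las = {
    # test set: (LatinCy LAS, best stock LAS on its own training domain)
    "ITTB":    (79.29, 66.09),  # Stock-ITTB
    "Perseus": (56.18, 40.74),  # Stock-Perseus
    "PROIEL":  (62.45, 50.67),  # Stock-PROIEL
}
for treebank, (latincy, stock) in las.items():
    print(f"{treebank}: +{latincy - stock:.2f} LAS")
# ITTB: +13.20 LAS
# Perseus: +15.44 LAS
# PROIEL: +11.78 LAS
```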
## Cross-Framework Comparison (LatinCy v3.8)

All models trained on the same harmonized treebank data. Scores are on held-out test sets unless noted; NER scores are on dev (no NER test set exists).

| Metric | LatinCy UDPipe 0.1 | LatinCy Stanza 0.1 | LatinCy Flair 0.1 | LatinCy spaCy lg 3.8.0 |
|---|---|---|---|---|
| UPOS | 93.28 | 97.39 | 97.11 | 97.26 |
| UFeats | 82.48 | 92.20 | -- | 92.58 |
| Lemma | 93.05 | 97.79 | 96.52 | 94.87 |
| UAS | 76.11 | 86.73 | -- | 84.03 |
| LAS | 71.29 | 83.23 | -- | 78.89 |
| NER F1 | -- | 90.22 | 90.48 | 82.26 |
UDPipe's strength is portability: a single file usable from R, Python, CLI, Java, C#, and Perl with no GPU and no framework dependencies. Stanza and Flair offer higher accuracy when Python and GPU resources are available.
## R Usage Guide

### Installation

```r
install.packages("udpipe")
```

### Download and Load

```r
library(udpipe)

# Download the model
model_url <- "https://huggingface.co/latincy/la_udpipe_latincy/resolve/main/la_udpipe_latincy_multi.udpipe"
model_path <- "la_udpipe_latincy_multi.udpipe"
if (!file.exists(model_path)) {
  download.file(model_url, model_path, mode = "wb")
}

# Load
model <- udpipe_load_model(model_path)
```
### Basic Annotation

```r
text <- "Arma virumque cano, Troiae qui primus ab oris
Italiam fato profugus Laviniaque venit litora."
result <- udpipe_annotate(model, x = text)
df <- as.data.frame(result)
```
### Tidyverse Integration

```r
library(dplyr)

# Count POS tags
df %>%
  count(upos, sort = TRUE)

# Extract nouns with their lemmas
df %>%
  filter(upos == "NOUN") %>%
  select(token, lemma, feats)

# Get dependency relations
df %>%
  select(token_id, token, head_token_id, dep_rel, upos)
```
### Replacing a Stock Model

If you previously used a stock UDPipe Latin model, replace it by pointing to this model file instead:

```r
# Before (stock model)
# model <- udpipe_download_model(language = "latin-ittb")
# model <- udpipe_load_model(model$file_model)

# After (LatinCy model)
model <- udpipe_load_model("la_udpipe_latincy_multi.udpipe")

# Everything else stays the same
```
## Limitations

- **UPOS/UFeats slightly below baseline:** The two-model tagger configuration trades ~1.3 UPOS / ~1.6 UFeats points for +4.2 LAS and +1.9 Lemma. The overall trade-off is positive for downstream tasks.
- **UDPipe 1 architecture:** Feature-based model (no transformers). For higher accuracy, consider LatinCy spaCy models (`la_core_web_trf`: 79.8 LAS).
- **Gold tokenization evaluation:** Reported scores use gold tokenization. Real-world performance on raw text depends on the tokenizer (99.6% F1 on held-out data).
## References
- Straka, M., Hajič, J., and Straková, J. 2016. "UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing." In Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., et al. eds. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). Portorož, Slovenia: European Language Resources Association (ELRA). 4290–97. https://aclanthology.org/L16-1680/.
## Citation

```bibtex
@misc{burns2026latincyudpipe,
  author    = {Burns, Patrick J.},
  title     = {{LatinCy UDPipe (la\_udpipe\_latincy\_multi)}},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/latincy/la_udpipe_latincy},
}
```
## Acknowledgments
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.