LatinCy UDPipe (la_udpipe_latincy_multi)

A UDPipe 1 model for Latin trained on six Universal Dependencies treebanks harmonized by the LatinCy pipeline. It provides tokenization, POS tagging, lemmatization, morphological features, and dependency parsing in a single file.

Highlights

  • Single-file model (71.3 MB) -- works offline, no GPU needed
  • 6 treebanks: ITTB, LLCT, PROIEL, Perseus, UDante, CIRCSE (48,754 training sentences)
  • Drop-in replacement for existing UDPipe Latin workflows in R, Python, CLI, Java, C#, and Perl
  • Optimized with pre-trained word embeddings, swap parser, and two-model tagger

Quick Start (R)

The primary audience for this model is R users working with the udpipe package.

library(udpipe)

# Download the model (one time)
model_url <- "https://huggingface.co/latincy/la_udpipe_latincy/resolve/main/la_udpipe_latincy_multi.udpipe"
model_path <- "la_udpipe_latincy_multi.udpipe"
if (!file.exists(model_path)) {
  download.file(model_url, model_path, mode = "wb")
}

# Load and annotate
model <- udpipe_load_model(model_path)
text <- "Gallia est omnis divisa in partes tres."
result <- udpipe_annotate(model, x = text)
df <- as.data.frame(result)
print(df[, c("token", "upos", "lemma", "dep_rel")])

Output:

    token  upos  lemma dep_rel
1  Gallia PROPN Gallia   nsubj
2     est   AUX    sum     cop
3   omnis   DET  omnis     det
4  divisa   ADJ divisa    root
5      in   ADP     in    case
6  partes  NOUN   pars     obl
7    tres   NUM   tres  nummod
8       .  PUNCT     .   punct

Quick Start (Python)

from ufal.udpipe import Model, Pipeline

model = Model.load("la_udpipe_latincy_multi.udpipe")
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
result = pipeline.process("Gallia est omnis divisa in partes tres.")
print(result)

Install the Python bindings with pip install ufal.udpipe.
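The pipeline returns plain CoNLL-U text, which can be parsed with nothing beyond the standard library. A minimal sketch (the sample string below is abbreviated and illustrative, not actual model output):

```python
# Minimal CoNLL-U reader: each token line has 10 tab-separated fields
# (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).

def parse_conllu(conllu: str):
    """Yield one list of token dicts per sentence."""
    sentence = []
    for line in conllu.splitlines():
        if line.startswith("#"):          # sentence-level comment
            continue
        if not line.strip():              # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append({
            "id": fields[0], "form": fields[1], "lemma": fields[2],
            "upos": fields[3], "head": fields[6], "deprel": fields[7],
        })
    if sentence:
        yield sentence

sample = (
    "# text = Gallia est.\n"
    "1\tGallia\tGallia\tPROPN\t_\t_\t0\troot\t_\t_\n"
    "2\test\tsum\tAUX\t_\t_\t1\tcop\t_\t_\n"
    "\n"
)
for sent in parse_conllu(sample):
    for tok in sent:
        print(tok["form"], tok["upos"], tok["lemma"], tok["deprel"])
# Gallia PROPN Gallia root
# est AUX sum cop
```

In real use, pass the string returned by pipeline.process() instead of the hand-written sample.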

Quick Start (CLI)

# Download the model
curl -L -o la_udpipe_latincy_multi.udpipe \
  https://huggingface.co/latincy/la_udpipe_latincy/resolve/main/la_udpipe_latincy_multi.udpipe

# Annotate text
echo "Gallia est omnis divisa in partes tres." | \
  udpipe --tokenize --tag --parse la_udpipe_latincy_multi.udpipe

Model Description

Property    Value
Author      Patrick J. Burns / LatinCy
Model type  UDPipe 1 (MorphoDiTa tagger + Parsito parser)
Language    Latin
License     MIT
File size   71.3 MB (larger than typical single-treebank UDPipe models due to multi-treebank training data and embedded word vectors)
Framework   UDPipe 1

Training Data

Trained on harmonized data from 6 Universal Dependencies Latin treebanks, prepared by the LatinCy treebank pipeline. Treebanks are harmonized to consistent annotation standards before combining.

Treebank Full Name Domain
ITTB Index Thomisticus Treebank Scholastic Latin (Thomas Aquinas)
LLCT Late Latin Charter Treebank Medieval legal charters
PROIEL PROIEL Treebank Vulgate Bible, historical texts
Perseus Perseus Latin Treebank Classical Latin (Caesar, Cicero, etc.)
UDante UDante Treebank Dante Alighieri (De vulgari eloquentia, etc.)
CIRCSE CIRCSE Latin Treebank LASLA-derived classical texts

Split sizes: 48,754 train / 4,613 dev / 6,423 test sentences.

Training Procedure

Trained using ufal.udpipe Python bindings (UDPipe 1, morphodita_parsito backend).

Tokenizer:

  • 100 epochs max, batch size 50, learning rate 0.005, dropout 0.1
  • Early stopping enabled (stopped at epoch 42)

Tagger (two-model MorphoDiTa configuration):

  • Model 1: UPOS, XPOS, morphological features (20 iterations, tagger templates)
  • Model 2: Lemma (20 iterations, lemmatizer templates)

Parser (Parsito):

  • Swap transition system (handles non-projective dependencies)
  • Static lazy oracle, 200-unit hidden layer
  • Pre-trained word2vec embeddings (CBOW-300, 142K vocab, 96.1% form coverage)
  • 10 iterations max, early stopping (stopped at iteration 9)
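Under UDPipe 1's training interface, hyperparameters like the ones above are passed as semicolon-separated option strings for each component. The sketch below maps the listed settings onto option names from the UDPipe manual; the exact strings used for this model are not published here, and the embeddings filename is a placeholder:

```
tokenizer: epochs=100;batch_size=50;learning_rate=0.005;dropout=0.1;early_stopping=1
tagger:    models=2;templates_1=tagger;iterations_1=20;templates_2=lemmatizer;iterations_2=20
parser:    transition_system=swap;transition_oracle=static_lazy;hidden_layer=200;
           iterations=10;embedding_form_file=latin_cbow300.vec;early_stopping=1
```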

Evaluation Results

Evaluated on held-out test data (6,423 sentences) with gold tokenization using conll18_ud_eval.py.

Overall Scores

Metric F1
UPOS 93.28
UFeats 82.48
Lemma 93.05
UAS 76.11
LAS 71.29
CLAS 65.83
MLAS 52.52
BLEX 61.77
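For readers unfamiliar with the attachment metrics: UAS counts a word as correct when its predicted head matches gold, while LAS additionally requires the relation label to match. A toy sketch of the computation (this is not the official conll18_ud_eval.py implementation, which also aligns tokenizations; gold tokenization is assumed, as in the scores above):

```python
# Toy UAS/LAS over gold vs. predicted (head, deprel) pairs.

def uas_las(gold, pred):
    """Return (UAS, LAS) for two equal-length lists of (head, deprel)."""
    assert len(gold) == len(pred)
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    both_ok = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return head_ok / n, both_ok / n

# Hypothetical 4-token sentence: one wrong label, one wrong head
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]

uas, las = uas_las(gold, pred)
print(uas, las)  # 0.75 0.5
```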

Per-Treebank Breakdown

Treebank UPOS UFeats Lemma UAS LAS
ITTB 96.37 91.51 97.96 82.58 79.29
LLCT 95.49 90.68 96.94 90.48 88.95
PROIEL 93.42 78.94 94.45 67.76 62.45
Perseus 89.80 72.13 85.95 64.52 56.18
UDante 87.44 72.79 85.39 63.57 55.45
CIRCSE 87.50 68.14 86.21 48.09 40.38

Comparison with Stock UDPipe UD 2.5 Models

The UDPipe UD 2.5 models (Straka & Straková, December 2019) are single-treebank models trained with default hyperparameters and distributed with UDPipe. Three exist for Latin: ITTB, Perseus, and PROIEL. No stock models exist for LLCT, CIRCSE, or UDante.

This model differs from stock models in three ways: (1) training on all 6 treebanks rather than one, (2) harmonized annotations across treebanks, and (3) optimized hyperparameters including pre-trained embeddings and a swap parser.

LAS F1 comparison:

Test Set LatinCy Stock-ITTB Stock-Perseus Stock-PROIEL
ITTB 79.29 66.09 36.18 39.31
LLCT 88.95 30.84 22.80 29.35
PROIEL 62.45 39.83 33.84 50.67
Perseus 56.18 38.13 40.74 34.49
UDante 55.45 42.46 26.63 30.69
CIRCSE 40.38 23.32 21.90 27.43

Observations:

  • The LatinCy model outperforms stock models across all treebanks, including on each stock model's own training domain (ITTB +13.20, Perseus +15.44, PROIEL +11.78 over best stock)
  • Single-treebank stock models show expected domain sensitivity when applied cross-domain (e.g., Stock-ITTB on LLCT: 30.84 LAS)
  • Multi-treebank training on harmonized data provides broader coverage across Latin text types
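The per-domain gains quoted above can be recomputed directly from the rows of the LAS table. A small sketch:

```python
# LAS F1 on each stock model's home treebank: LatinCy vs. the three
# stock UD 2.5 models (values copied from the comparison table).
las = {
    "ITTB":    {"latincy": 79.29, "stock": [66.09, 36.18, 39.31]},
    "PROIEL":  {"latincy": 62.45, "stock": [39.83, 33.84, 50.67]},
    "Perseus": {"latincy": 56.18, "stock": [38.13, 40.74, 34.49]},
}

# Gain over the best stock model on each test set
for name, row in las.items():
    delta = row["latincy"] - max(row["stock"])
    print(f"{name}: +{delta:.2f}")
# ITTB: +13.20
# PROIEL: +11.78
# Perseus: +15.44
```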

Cross-Framework Comparison (LatinCy v3.8)

All models trained on the same harmonized treebank data. Scores on held-out test sets unless noted. NER scores are on dev (no test set exists).

Metric   UDPipe 0.1   Stanza 0.1   Flair 0.1   spaCy lg 3.8.0
UPOS     93.28        97.39        97.11       97.26
UFeats   82.48        92.20        --          92.58
Lemma    93.05        97.79        96.52       94.87
UAS      76.11        86.73        --          84.03
LAS      71.29        83.23        --          78.89
NER F1   --           90.22        90.48       82.26

(All four columns are LatinCy models; version numbers are LatinCy release tags.)

UDPipe's strength is portability: a single file usable from R, Python, CLI, Java, C#, and Perl with no GPU and no framework dependencies. Stanza and Flair offer higher accuracy when Python and GPU resources are available.

R Usage Guide

Installation

install.packages("udpipe")

Download and Load

library(udpipe)

# Download the model
model_url <- "https://huggingface.co/latincy/la_udpipe_latincy/resolve/main/la_udpipe_latincy_multi.udpipe"
model_path <- "la_udpipe_latincy_multi.udpipe"
if (!file.exists(model_path)) {
  download.file(model_url, model_path, mode = "wb")
}

# Load
model <- udpipe_load_model(model_path)

Basic Annotation

text <- "Arma virumque cano, Troiae qui primus ab oris
Italiam fato profugus Laviniaque venit litora."
result <- udpipe_annotate(model, x = text)
df <- as.data.frame(result)

Tidyverse Integration

library(dplyr)

# Count POS tags
df %>%
  count(upos, sort = TRUE)

# Extract nouns with their lemmas
df %>%
  filter(upos == "NOUN") %>%
  select(token, lemma, feats)

# Get dependency relations
df %>%
  select(token_id, token, head_token_id, dep_rel, upos)

Replacing a Stock Model

If you previously used a stock UDPipe Latin model, replace it by pointing to this model file instead:

# Before (stock model)
# model <- udpipe_download_model(language = "latin-ittb")
# model <- udpipe_load_model(model$file_model)

# After (LatinCy model)
model <- udpipe_load_model("la_udpipe_latincy_multi.udpipe")
# Everything else stays the same

Limitations

  • UPOS/UFeats slightly below a single-model baseline: Relative to a single-model tagger configuration, the two-model setup trades ~1.3 UPOS / ~1.6 UFeats points for +4.2 LAS and +1.9 Lemma. The overall trade-off is positive for downstream parsing tasks.
  • UDPipe 1 architecture: Feature-based model (no transformers). For higher accuracy, consider LatinCy spaCy models (la_core_web_trf: 79.8 LAS).
  • Gold tokenization evaluation: Reported scores use gold tokenization. Real-world performance on raw text depends on the tokenizer (99.6% F1 on held-out data).

References

  • Straka, M., Hajič, J., and Straková, J. 2016. "UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing." In Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., et al. eds. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). Portorož, Slovenia: European Language Resources Association (ELRA). 4290–97. https://aclanthology.org/L16-1680/.

Citation

@misc{burns2026latincyudpipe,
  author = {Burns, Patrick J.},
  title = {{LatinCy UDPipe (la\_udpipe\_latincy\_multi)}},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/latincy/la_udpipe_latincy},
}

Acknowledgments

This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
