---
language: la
license: apache-2.0
tags:
- latin
- lemmatization
- byt5
- nlp
- sota
datasets:
- universal_dependencies
metrics:
- accuracy
---

# THIVLVC: Latin ByT5 Lemmatizer

**THIVLVC** is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed at **LISN (CNRS)** to provide a single high-performance model for diverse Latin corpora.

## Performance Analysis

The following table compares **THIVLVC** with widely used open-source lemmatizers on the five Universal Dependencies (UD) Latin treebanks (lemma accuracy).

| Benchmark | **THIVLVC** | UDPipe 2.0 | Trankit (XLM-R) | Stanza (v1.5) | GreTa (T5) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Perseus (Poetry) | **93.48%** | 91.04% | 70.34% | 91.44% | 91.14% |
| UDante (Medieval) | **85.85%** | 84.80% | - | 78.08% | - |
| PROIEL (Classical) | **97.29%** | 96.65% | 97.21% | 90.88% | - |
| ITTB (Scholastic) | 98.64% | 99.03% | **99.13%** | 96.50% | - |
| LLCT (Late Latin) | 88.92% | **97.40%** | 96.2% | 97.10% | - |

**THIVLVC** achieves state-of-the-art results on three of the five benchmarks: Perseus (Classical poetry), UDante (Medieval prose), and PROIEL (Biblical/Classical prose). It is particularly effective on complex literary and medieval texts.

## Usage

**Important**: For best results, especially on short sentences or fragments, use **beam search** (`num_beams=5`).
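A side note on inputs: THIVLVC inherits ByT5's byte-level vocabulary, so there is no Latin-specific tokenizer or vocabulary file involved. In ByT5, each UTF-8 byte maps to token id `byte + 3`, with ids 0-2 reserved for `<pad>`, `</s>`, and `<unk>`. A minimal, model-free illustration of that mapping (the helper name is ours, for demonstration only):

```python
def byt5_byte_ids(text: str) -> list[int]:
    # ByT5 has no learned subword vocabulary: each UTF-8 byte of the
    # input becomes token id (byte + 3); ids 0-2 are reserved for
    # <pad>, </s>, and <unk>.
    return [b + 3 for b in text.encode("utf-8")]

print(byt5_byte_ids("amor"))  # [100, 112, 114, 117]
```

Because every byte is a valid token, the model never hits out-of-vocabulary errors on unusual spellings, which is one reason a byte-level model is a natural fit for orthographically varied medieval Latin.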
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Zual/THIVLVC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def lemmatize(text):
    inputs = tokenizer(text, return_tensors="pt")
    # Beam search (num_beams=5) improves accuracy, especially on short inputs
    outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
print(lemmatize("Amorem canat"))
# Expected output: "amor cano"
```

This model was produced by **Luc Pommeret** at LISN (CNRS, Université Paris-Saclay).
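For lemmatizing many sentences, a batched variant of the helper above may be more efficient. This is a sketch, not part of the official API: the function name is ours, and padding behavior follows the standard `transformers` batching conventions.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Zual/THIVLVC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.eval()

def lemmatize_batch(sentences, num_beams=5):
    # Pad to the longest sentence so the batch forms a rectangular tensor
    inputs = tokenizer(sentences, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model.generate(
            **inputs, max_length=128, num_beams=num_beams, early_stopping=True
        )
    # Decode every sequence in the batch at once
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(lemmatize_batch(["Amorem canat", "arma virumque cano"]))
```

On CPU, beam search over a batch is the main cost; reducing `num_beams` trades some accuracy for speed.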