---
language: la
license: apache-2.0
tags:
- latin
- lemmatization
- byt5
- nlp
- sota
datasets:
- universal_dependencies
metrics:
- accuracy
---

# Latin ByT5 Lemmatizer (SOTA)

This model is a state-of-the-art Latin lemmatizer based on the **ByT5** (base) architecture. It was trained as part of a research project at **LISN (CNRS)** to create a high-performance, unified lemmatizer for all major Latin Universal Dependencies (UD) benchmarks.

## 📊 Performance (Accuracy)

This model currently holds the **World Record** on three of the five major Latin UD benchmarks. A sketch of how lemma accuracy of this kind is typically computed is shown below the table.

| Benchmark | Domain | Accuracy | Status | Previous Best |
| :--- | :--- | :---: | :---: | :---: |
| **Perseus** | Classical Poetry | **93.48%** | 🥇 **World Record** | 91.14% (GreTa) |
| **UDante** | Medieval Prose | **85.85%** | 🥇 **World Record** | 84.80% (UDPipe 2.0) |
| **PROIEL** | Biblical / Classical | **97.29%** | 🥇 **World Record** | 97.21% (Trankit) |
| **ITTB** | Scholastic (Aquinas) | **98.64%** | Élite | 99.13% (Trankit) |
| **LLCT** | Late Latin Charters | **88.92%** | High | 97.40% (UDPipe 2.0) |

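Accuracy here is presumably per-token exact match between predicted and gold lemmas, the usual UD lemmatization metric. As a rough, self-contained illustration (this is not the official UD evaluation script, and the gold/predicted pairs below are placeholders), the metric can be computed like this:

```python
def lemma_accuracy(predicted_lemmas, gold_lemmas):
    """Fraction of tokens whose predicted lemma exactly matches the gold lemma."""
    assert len(predicted_lemmas) == len(gold_lemmas)
    correct = sum(p == g for p, g in zip(predicted_lemmas, gold_lemmas))
    return correct / len(gold_lemmas)

# Toy example; the reported scores are computed over full UD test treebanks.
print(lemma_accuracy(["amor", "cano"], ["amor", "canto"]))  # 0.5
```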

## 🚀 Usage

You can use this model with the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Zual/latin-byt5-lemmatizer-sota"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def lemmatize(text):
    # ByT5 works directly on UTF-8 bytes, so no Latin-specific tokenizer is needed.
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
print(lemmatize("Amorem canat"))
# Expected output: "amor cano" (may vary with context and decoding settings)
```

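For lemmatizing many tokens or sentences at once, batching is usually much faster than calling the model one string at a time. The snippet below is a minimal sketch reusing the `tokenizer` and `model` objects from the example above; the `lemmatize_batch` helper and the batch size are illustrative choices, not part of the released code.

```python
def lemmatize_batch(texts, batch_size=32):
    """Lemmatize a list of strings in padded batches."""
    lemmas = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Pad within the batch so all byte sequences have equal length.
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        outputs = model.generate(**inputs, max_length=128)
        lemmas.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return lemmas

print(lemmatize_batch(["Amorem", "canat", "puellae"]))
```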

## 🛠️ Training Details

- **Base Model**: `google/byt5-base`
- **Data**: Unified dataset combining gold UD data, large-scale silver data, and targeted distillation from Gemini.
- **Epochs**: 13 (best Perseus checkpoint)
- **Training Strategy**: Optimized for classical poetry (Perseus) while maintaining high performance across the other benchmarks; a schematic version of this fine-tuning setup is sketched below.

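The full training pipeline (data mixing, silver-data filtering, distillation prompts) is not included in this repository. Purely to illustrate the seq2seq setup described above, here is a minimal fine-tuning sketch starting from `google/byt5-base`; the toy (form, lemma) pairs, batch size, and learning rate are placeholder assumptions, not the values used for this model.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    T5ForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Toy stand-in for the unified UD data: surface forms paired with gold lemmas.
train_data = Dataset.from_dict({
    "text": ["Amorem", "canat"],
    "lemma": ["amor", "cano"],
})

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-base")

def preprocess(batch):
    # ByT5 consumes raw UTF-8 bytes, so no subword vocabulary is involved.
    model_inputs = tokenizer(batch["text"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["lemma"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = train_data.map(preprocess, batched=True, remove_columns=train_data.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="byt5-latin-lemmatizer",
    per_device_train_batch_size=8,   # placeholder
    learning_rate=1e-4,              # placeholder
    num_train_epochs=13,             # matches the "best Perseus checkpoint" above
    logging_steps=100,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```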

## 🏛️ Acknowledgments

Developed by **Zual** at **LISN (CNRS, Université Paris-Saclay)**. Special thanks to the UD Latin community.

---

*Results verified on January 10, 2026.*