Upload README.md with huggingface_hub

metrics:
- accuracy
---

# Latin ByT5 Lemmatizer

This is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed at LISN (CNRS) to provide a high-performance, unified model for diverse Latin corpora.
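
ByT5 is a byte-level model: input text is consumed as raw UTF-8 bytes, so there is no Latin-specific tokenizer or vocabulary, and unusual or archaic spellings never fall out of vocabulary. The short sketch below (it uses the public `google/byt5-base` tokenizer and is an illustration added here, not code from the original card) shows what byte-level input looks like:

```python
from transformers import AutoTokenizer

# Illustration only: the public ByT5 tokenizer, not this repository's own code.
tok = AutoTokenizer.from_pretrained("google/byt5-base")

enc = tok("amor")
# One ID per UTF-8 byte of the input, plus the end-of-sequence token.
print(enc.input_ids)
```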

## Performance Analysis

The following table compares this model against established lemmatization systems across the five Universal Dependencies (UD) Latin benchmarks; all scores are accuracy.

| Benchmark | Our ByT5 | UDPipe 2.0 | Trankit (XLM-R) | Stanza (v1.5) | GreTa (T5) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Perseus (Poetry) | **93.48%** | 91.04% | 70.34% | 91.44% | 91.14% |
| UDante (Medieval) | **85.85%** | 84.80% | - | 78.08% | - |
| PROIEL (Classical) | **97.29%** | 96.65% | 97.21% | 90.88% | - |
| ITTB (Scholastic) | 98.64% | 99.03% | **99.13%** | 96.50% | - |
| LLCT (Late Latin) | 88.92% | **97.40%** | 96.2% | 97.10% | - |

The model achieves state-of-the-art results on three of the five benchmarks: Perseus, UDante, and PROIEL. It is particularly effective for complex literary and medieval texts.
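
The scores above are reported as accuracy (the metric declared in this card's metadata). As a minimal sketch of how such a figure is computed, assuming gold and predicted lemmas that are already aligned token by token (treebank loading and prediction are omitted here and are not part of the original card):

```python
# Sketch: token-level lemma accuracy over aligned gold/predicted lemma lists.
def lemma_accuracy(gold_lemmas, predicted_lemmas):
    assert len(gold_lemmas) == len(predicted_lemmas)
    correct = sum(g == p for g, p in zip(gold_lemmas, predicted_lemmas))
    return correct / len(gold_lemmas)

# Toy illustration with made-up values, not benchmark data:
gold = ["amor", "cano", "et", "bellum"]
pred = ["amor", "cano", "et", "bellus"]
print(f"{lemma_accuracy(gold, pred):.2%}")  # 75.00%
```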

## Usage

Installation:

```bash
pip install transformers torch
```

Basic usage in Python:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the lemmatizer from this repository. The placeholder below must be
# replaced with this repo's actual model ID, which is not spelled out here.
model_id = "<this-repo-id>"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

def lemmatize(text):
    # ByT5 consumes raw text (bytes), so no Latin-specific preprocessing is needed.
    inputs = tokenizer(text, return_tensors="pt")
    # max_new_tokens is an illustrative default, not a value from the original card.
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
print(lemmatize("Amorem canat"))
# Output: "amor cano" (depending on the context and training)
```
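
For many sentences, batched generation along the following lines should work; this sketch reuses `tokenizer` and `model` from the snippet above, and the padding and generation settings are illustrative assumptions rather than recommendations from the original card.

```python
# Sketch: batched lemmatization (reuses `tokenizer` and `model` from above).
sentences = ["Arma virumque cano", "Gallia est omnis divisa in partes tres"]

batch = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**batch, max_new_tokens=128)
lemmas = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for sentence, lemma_string in zip(sentences, lemmas):
    print(f"{sentence!r} -> {lemma_string!r}")
```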

## Dataset and Training

- **Model Architecture**: ByT5-base
- **Training Data**: Unified corpus combining the Universal Dependencies gold standard, large-scale silver data from the Latin Library, and targeted distillation (a sketch of the assumed example format follows after this list).
- **Scope**: Unified lemmatization across multiple historical periods and genres of Latin.
- **Training Strategy**: Optimized for classical poetry (Perseus) while maintaining high performance across other benchmarks.
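
The card does not spell out the exact sequence-to-sequence format used for fine-tuning. As a rough sketch under that caveat, training examples for a ByT5 lemmatizer typically pair an inflected input text with its lemmatized counterpart, both as plain text; the pairs below are illustrative, not excerpts from the actual corpus.

```python
from transformers import AutoTokenizer

# Assumed training-pair format: inflected text in, lemma sequence out.
train_pairs = [
    ("Amorem canat", "amor cano"),
    ("Gallia est omnis divisa", "Gallia sum omnis divido"),
]

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")  # byte-level, matching the architecture
for source, target in train_pairs:
    model_inputs = tokenizer(source)
    labels = tokenizer(target)
    # model_inputs.input_ids feed the encoder; labels.input_ids serve as decoder
    # targets when fine-tuning T5ForConditionalGeneration.
    print(len(model_inputs.input_ids), len(labels.input_ids))
```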

## Acknowledgments

This model was produced by Zual at LISN (CNRS, Université Paris-Saclay).

---

*Results verified on January 10, 2026.*