Zual committed on
Commit c25e61f · verified · 1 Parent(s): e780532

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +27 -25
README.md CHANGED
@@ -13,25 +13,32 @@ metrics:
  - accuracy
  ---

- # Latin ByT5 Lemmatizer (SOTA)

- This model is a state-of-the-art Latin lemmatizer based on the **ByT5** (base) architecture. It was trained as part of a research project at **LISN (CNRS)** to create a high-performance, unified lemmatizer for all major Latin Universal Dependencies (UD) benchmarks.

- ## 📊 Performance (Accuracy)

- This model currently holds the **World Record** for three out of five major Latin UD benchmarks.

- | Benchmark | Domain | Accuracy | Status | Previous Best |
- | :--- | :--- | :---: | :---: | :---: |
- | **Perseus** | Classical Poetry | **93.48%** | 🥇 **World Record** | 91.14% (GreTa) |
- | **UDante** | Medieval Prose | **85.85%** | 🥇 **World Record** | 84.80% (UDPipe 2.0) |
- | **PROIEL** | Biblical / Classical | **97.29%** | 🥇 **World Record** | 97.21% (Trankit) |
- | **ITTB** | Scholastic (Aquinas) | **98.64%** | Élite | 99.13% (Trankit) |
- | **LLCT** | Late Latin Charters | **88.92%** | High | 97.40% (UDPipe 2.0) |

- ## 🚀 Usage

- You can use this model with the Hugging Face `transformers` library:

  ```python
  from transformers import AutoTokenizer, T5ForConditionalGeneration
@@ -46,20 +53,15 @@ def lemmatize(text):
  return tokenizer.decode(outputs[0], skip_special_tokens=True)

  # Example
- print(lemmatize("Amorem canat"))
- # Output: "amor cano" (depending on the context and training)
  ```

- ## 🛠️ Training Details

- - **Base Model**: `google/byt5-base`
- - **Data**: Unified dataset including Gold UD data, Massive Silver data, and Targeted Distillation from Gemini.
- - **Epochs**: 13 (Best Perseus checkpoint)
- - **Training Strategy**: Optimized for classical poetry (Perseus) while maintaining high performance across other benchmarks.

- ## 🏛️ Acknowledgments

- Developed by **Zual** at **LISN (CNRS, Université Paris-Saclay)**. Special thanks to the UD Latin community.
-
- ---
- *Results verified on January 10, 2026.*
+ # Latin ByT5 Lemmatizer

+ This is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed at LISN (CNRS) to provide a single high-performance model covering diverse Latin corpora.

+ ## Performance Analysis

+ The following table reports lemmatization accuracy for this model and established baseline systems on the five Universal Dependencies (UD) Latin treebanks.

+ | Benchmark | Our ByT5 | UDPipe 2.0 | Trankit (XLM-R) | Stanza (v1.5) | GreTa (T5) |
+ | :--- | :---: | :---: | :---: | :---: | :---: |
+ | Perseus (Poetry) | **93.48%** | 91.04% | 70.34% | 91.44% | 91.14% |
+ | UDante (Medieval) | **85.85%** | 84.80% | - | 78.08% | - |
+ | PROIEL (Classical) | **97.29%** | 96.65% | 97.21% | 90.88% | - |
+ | ITTB (Scholastic) | 98.64% | 99.03% | **99.13%** | 96.50% | - |
+ | LLCT (Late Latin) | 88.92% | **97.40%** | 96.2% | 97.10% | - |

+ The model achieves state-of-the-art accuracy on three of the five benchmarks (Perseus, UDante, and PROIEL) and is particularly effective on complex literary and medieval texts.
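
+ Lemma accuracy here is the share of tokens whose predicted lemma exactly matches the gold LEMMA column of a UD treebank. As a minimal sketch (not the evaluation script behind the table above), accuracy over a `.conllu` test file could be computed as follows; the file path and the per-token use of the `lemmatize` helper from the Usage section below are assumptions:
+
+ ```python
+ # Minimal sketch: exact-match lemma accuracy over a CoNLL-U test file.
+ # Assumes lemmatize(form) -> str as defined in the Usage section below.
+ correct = total = 0
+ with open("la_perseus-ud-test.conllu", encoding="utf-8") as f:  # hypothetical path
+     for line in f:
+         if not line.strip() or line.startswith("#"):
+             continue  # skip blank lines and sentence-level comments
+         cols = line.rstrip("\n").split("\t")
+         if not cols[0].isdigit():
+             continue  # skip multi-word ranges (1-2) and empty nodes (1.1)
+         form, gold_lemma = cols[1], cols[2]
+         total += 1
+         if lemmatize(form) == gold_lemma:
+             correct += 1
+
+ print(f"Lemma accuracy: {correct / total:.2%}")
+ ```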

+ ## Usage
+
+ Installation:
+ ```bash
+ pip install transformers torch
+ ```
+
+ Basic usage in Python:

  ```python
  from transformers import AutoTokenizer, T5ForConditionalGeneration

  return tokenizer.decode(outputs[0], skip_special_tokens=True)

  # Example
+ print(lemmatize("Amorem canat"))

  ```
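
+ The diff view truncates the middle of this snippet (loading the checkpoint and the body of `lemmatize`). For convenience, here is a self-contained sketch of the same usage pattern; the checkpoint id `Zual/latin-byt5-lemmatizer` is a placeholder for this repository's actual id, and the generation settings are illustrative:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, T5ForConditionalGeneration
+
+ MODEL_ID = "Zual/latin-byt5-lemmatizer"  # placeholder: substitute this repo's id
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+ model = T5ForConditionalGeneration.from_pretrained(MODEL_ID)
+ model.eval()
+
+ def lemmatize(text):
+     # ByT5 works directly on UTF-8 bytes, so the raw text needs no Latin-specific pre-tokenization.
+     inputs = tokenizer(text, return_tensors="pt")
+     with torch.no_grad():
+         outputs = model.generate(**inputs, max_new_tokens=64)  # illustrative limit
+     return tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ # Example
+ print(lemmatize("Amorem canat"))  # e.g. "amor cano", per the model card's example
+ ```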

+ ## Dataset and Training

+ - **Model Architecture**: ByT5-base
+ - **Training Data**: Unified corpus including Universal Dependencies gold standard, massive silver data from the Latin Library, and targeted distillation.
+ - **Scope**: Unified lemmatization across multiple historical periods and genres of Latin.

+ ## Acknowledgments

+ This model was produced by Zual at LISN (CNRS, Université Paris-Saclay).