Commit e780532 by Zual (verified) · 1 Parent(s): d6d970b

Upload README.md with huggingface_hub

Files changed (1): README.md (+65 −0)
---
language: la
license: apache-2.0
tags:
- latin
- lemmatization
- byt5
- nlp
- sota
datasets:
- universal_dependencies
metrics:
- accuracy
---

# Latin ByT5 Lemmatizer (SOTA)

This model is a state-of-the-art Latin lemmatizer based on the **ByT5** (base) architecture. It was trained as part of a research project at **LISN (CNRS)** to create a high-performance, unified lemmatizer for all major Latin Universal Dependencies (UD) benchmarks.
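
Because ByT5 operates directly on UTF-8 bytes rather than a subword vocabulary, no Latin-specific tokenizer is required. A quick way to see this (a minimal illustration using the public `google/byt5-base` tokenizer, independent of this checkpoint):

```python
from transformers import AutoTokenizer

# ByT5 has no learned vocabulary: each UTF-8 byte is mapped to byte value + 3
# (ids 0-2 are reserved for <pad>, </s> and <unk>).
tok = AutoTokenizer.from_pretrained("google/byt5-base")
print(tok("amor").input_ids)
# [100, 112, 114, 117, 1] -> the four bytes of "amor" shifted by 3, plus </s>
```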

## 📊 Performance (Accuracy)

This model sets a new state of the art on three of the five major Latin UD benchmarks.

| Benchmark | Domain | Accuracy | Status | Previous Best |
| :--- | :--- | :---: | :---: | :---: |
| **Perseus** | Classical Poetry | **93.48%** | 🥇 **New SOTA** | 91.14% (GreTa) |
| **UDante** | Medieval Prose | **85.85%** | 🥇 **New SOTA** | 84.80% (UDPipe 2.0) |
| **PROIEL** | Biblical / Classical | **97.29%** | 🥇 **New SOTA** | 97.21% (Trankit) |
| **ITTB** | Scholastic (Aquinas) | **98.64%** | Competitive | 99.13% (Trankit) |
| **LLCT** | Late Latin Charters | **88.92%** | Below previous best | 97.40% (UDPipe 2.0) |
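
Accuracy here is assumed to be token-level exact match between the predicted and the gold lemma, the standard UD lemmatization metric. A minimal illustration:

```python
def lemma_accuracy(predicted, gold):
    """Share of tokens whose predicted lemma exactly matches the gold lemma."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Toy example: 1 mismatch out of 3 tokens -> ~66.7% accuracy.
print(lemma_accuracy(["amor", "cano", "arma"], ["amor", "canto", "arma"]))
```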

## 🚀 Usage

You can use this model with the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Zual/latin-byt5-lemmatizer-sota"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def lemmatize(text):
    # ByT5 consumes raw UTF-8 bytes, so no pretokenization is needed.
    inputs = tokenizer(text, return_tensors="pt")
    # 128 output bytes is ample for short inputs.
    outputs = model.generate(**inputs, max_length=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
print(lemmatize("Amorem canat"))
# Output: "amor cano" (depending on the context and training)
```
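
For many words or sentences, batching the calls to `generate` is much faster than lemmatizing one string at a time. A minimal sketch (the helper below is hypothetical, not part of this repository; it relies only on standard `transformers` padding and batch decoding):

```python
# Hypothetical convenience helper: lemmatize several inputs per forward pass.
def lemmatize_batch(texts, batch_size=16):
    lemmas = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Pad to the longest item in the batch; the attention mask
        # keeps the padding out of the computation.
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        outputs = model.generate(**inputs, max_length=128)
        lemmas.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return lemmas

print(lemmatize_batch(["arma", "virumque", "cano"]))
```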

## 🛠️ Training Details

- **Base Model**: `google/byt5-base`
- **Data**: A unified dataset combining gold UD data, large-scale silver data, and targeted distillation from Gemini.
- **Epochs**: 13 (best Perseus checkpoint)
- **Training Strategy**: Optimized for classical poetry (Perseus) while maintaining high performance across the other benchmarks; a rough fine-tuning sketch follows this list.
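
The training script itself is not included in this card; the following is a minimal sketch of how such a character-level seq2seq fine-tune is typically set up with `Seq2SeqTrainer` (the toy data, column names, batch size, and learning rate are illustrative assumptions; only the 13 epochs come from the card):

```python
from datasets import Dataset
from transformers import (AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          T5ForConditionalGeneration)

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-base")

# Toy stand-in for the real (form -> lemma) training pairs from UD.
pairs = Dataset.from_dict({
    "form":  ["amorem", "canat", "arma"],
    "lemma": ["amor",   "cano",  "arma"],
})

def preprocess(batch):
    enc = tokenizer(batch["form"], truncation=True, max_length=64)
    enc["labels"] = tokenizer(batch["lemma"], truncation=True,
                              max_length=64)["input_ids"]
    return enc

train_ds = pairs.map(preprocess, batched=True, remove_columns=pairs.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="byt5-latin-lemmatizer",
    num_train_epochs=13,            # from the card; everything else is assumed
    per_device_train_batch_size=2,  # illustrative
    learning_rate=1e-4,             # illustrative
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    # Pads inputs and labels per batch (labels padded with -100, ignored by the loss).
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```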

## 🏛️ Acknowledgments

Developed by **Zual** at **LISN (CNRS, Université Paris-Saclay)**. Special thanks to the UD Latin community.

---
*Results verified on January 10, 2026.*