Zual committed
Commit 1802a5c · verified
1 Parent(s): c25e61f

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +9 -7
README.md CHANGED
@@ -13,15 +13,15 @@ metrics:
 - accuracy
 ---

-# Latin ByT5 Lemmatizer
+# THIVLVC: Latin ByT5 Lemmatizer

-This is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed at LISN (CNRS) to provide a high-performance, unified model for diverse Latin corpora.
+**THIVLVC** is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed by **Luc Pommeret** at **LISN (CNRS)** to provide a high-performance, unified model for diverse Latin corpora.

 ## Performance Analysis

-The following table compares this model against major industry standards across the five Universal Dependencies (UD) Latin benchmarks.
+The following table compares **THIVLVC** against major industry standards across the five Universal Dependencies (UD) Latin benchmarks.

-| Benchmark | Our ByT5 | UDPipe 2.0 | Trankit (XLM-R) | Stanza (v1.5) | GreTa (T5) |
+| Benchmark | **THIVLVC** | UDPipe 2.0 | Trankit (XLM-R) | Stanza (v1.5) | GreTa (T5) |
 | :--- | :---: | :---: | :---: | :---: | :---: |
 | Perseus (Poetry) | **93.48%** | 91.04% | 70.34% | 91.44% | 91.14% |
 | UDante (Medieval) | **85.85%** | 84.80% | - | 78.08% | - |
@@ -29,7 +29,7 @@ The following table compares this model against major industry standards across
 | ITTB (Scholastic) | 98.64% | 99.03% | **99.13%** | 96.50% | - |
 | LLCT (Late Latin) | 88.92% | **97.40%** | 96.2% | 97.10% | - |

-The model achieves state-of-the-art results on three major benchmarks: Perseus, UDante, and PROIEL. It is particularly effective for complex literary and medieval texts.
+**THIVLVC** achieves state-of-the-art results on three major benchmarks: Perseus (Classical Poetry), UDante (Medieval Prose), and PROIEL (Biblical/Classical). It is particularly effective for complex literary and medieval texts.

 ## Usage

@@ -59,9 +59,11 @@ print(lemmatize("Amorem canat"))
 ## Dataset and Training

 - **Model Architecture**: ByT5-base
-- **Training Data**: Unified corpus including Universal Dependencies gold standard, massive silver data from the Latin Library, and targeted distillation.
+- **Author**: Luc Pommeret
+- **Institution**: LISN (CNRS, Université Paris-Saclay)
+- **Training Data**: Unified corpus including Universal Dependencies gold standard, massive silver data from the Latin Library, and targeted distillation from Gemini.
 - **Scope**: Unified lemmatization across multiple historical periods and genres of Latin.

 ## Acknowledgments

-This model was produced by Zual at LISN (CNRS, Université Paris-Saclay).
+This model was produced by **Luc Pommeret** at LISN (CNRS, Université Paris-Saclay).
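For context: the README's Usage section is untouched by this commit and therefore not shown above; the only visible piece is the call `print(lemmatize("Amorem canat"))` in the last hunk header. Below is a minimal sketch of how a ByT5 seq2seq lemmatizer like this one is typically invoked through the transformers API. The repo id is a placeholder and the `lemmatize` helper and generation settings are illustrative assumptions, not the README's documented code.

```python
# Hypothetical sketch: invoking a ByT5 lemmatizer via the standard
# transformers seq2seq API. MODEL_ID is a placeholder, not the real repo id.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "namespace/latin-byt5-lemmatizer"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

def lemmatize(text: str) -> str:
    # ByT5 operates directly on UTF-8 bytes, so no Latin-specific
    # tokenization rules are needed; the tokenizer maps bytes to ids.
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Mirrors the call visible in the hunk header; for "Amorem canat" the
# expected lemmas would be along the lines of "amor cano".
print(lemmatize("Amorem canat"))
```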