Zual committed (verified)
Commit df78c0f · 1 Parent(s): cc5e2d2

Upload README.md with huggingface_hub

Files changed (1): README.md (+13 -3)
README.md CHANGED
@@ -15,11 +15,11 @@ metrics:
 
 # THIVLVC: Latin ByT5 Lemmatizer
 
-**THIVLVC** is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed at **LISN (CNRS)** to provide a high-performance, unified model for diverse Latin corpora.
+**THIVLVC** is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed by **Luc Pommeret** at **LISN (CNRS)** to provide a high-performance, unified model for diverse Latin corpora.
 
 ## Performance Analysis
 
-The following table compares **THIVLVC** against industry standards on the five Universal Dependencies (UD) Latin benchmarks.
+The following table compares **THIVLVC** against major industry standards across the five Universal Dependencies (UD) Latin benchmarks.
 
 | Benchmark | **THIVLVC** | UDPipe 2.0 | Trankit (XLM-R) | Stanza (v1.5) | GreTa (T5) |
 | :--- | :---: | :---: | :---: | :---: | :---: |
@@ -43,7 +43,7 @@ Basic usage in Python:
 ```python
 from transformers import AutoTokenizer, T5ForConditionalGeneration
 
-model_name = "Zual/latin-byt5-lemmatizer-sota"
+model_name = "Zual/THIVLVC"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = T5ForConditionalGeneration.from_pretrained(model_name)
 
@@ -56,4 +56,14 @@ def lemmatize(text):
 print(lemmatize("Amorem canat"))
 ```
 
+## Dataset and Training
+
+- **Model Architecture**: ByT5-base
+- **Author**: Luc Pommeret
+- **Institution**: LISN (CNRS, Université Paris-Saclay)
+- **Training Data**: Unified corpus including Universal Dependencies gold standard, massive silver data from the Latin Library, and targeted distillation from Gemini.
+- **Scope**: Unified lemmatization across multiple historical periods and genres of Latin.
+
+## Acknowledgments
+
 This model was produced by **Luc Pommeret** at LISN (CNRS, Université Paris-Saclay).
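
For reference, the hunks above elide the body of `lemmatize`, so the snippet is not runnable as shown in the diff. Below is a minimal sketch of how such a helper is typically written against the `transformers` generation API; the generation settings and the output format noted in the comments are assumptions, not details taken from the model card.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Zual/THIVLVC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def lemmatize(text):
    # ByT5 operates directly on UTF-8 bytes, so tokenization needs
    # no Latin-specific vocabulary.
    inputs = tokenizer(text, return_tensors="pt")
    # max_new_tokens=64 is an assumed bound, not from the model card.
    outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# "Amorem canat" should map to the lemmas of "amorem" (amor) and
# "canat" (cano); the exact output format depends on the model's training.
print(lemmatize("Amorem canat"))
```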