---
language: la
license: apache-2.0
tags:
- latin
- lemmatization
- byt5
- nlp
- sota
datasets:
- universal_dependencies
metrics:
- accuracy
---

# THIVLVC: Latin ByT5 Lemmatizer

**THIVLVC** is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed at **LISN (CNRS)** to provide a high-performance, unified model for diverse Latin corpora.
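Because ByT5 is tokenizer-free and consumes raw UTF-8 bytes, it handles Latin orthographic variation (ligatures, diacritics, medieval spellings) without out-of-vocabulary issues. As an illustration, here is a minimal sketch of the standard ByT5 byte-to-id mapping (the `byt5_encode` helper name is ours, not part of this model's API): each byte maps to `byte value + 3`, since ids 0–2 are reserved for the `<pad>`, `</s>`, and `<unk>` special tokens.

```python
def byt5_encode(text: str) -> list[int]:
    """Sketch of ByT5's byte-level encoding (no tokenizer vocabulary needed).

    Token id = UTF-8 byte value + 3; ids 0, 1, 2 are reserved for
    <pad>, </s>, <unk>. A </s> (id 1) terminates the sequence.
    """
    return [b + 3 for b in text.encode("utf-8")] + [1]


# Any Unicode input works, since it is reduced to bytes first:
ids = byt5_encode("amāre")
```

This is why a single checkpoint can serve classical, medieval, and late Latin corpora without retraining a subword vocabulary.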

## Performance Analysis

The following table compares **THIVLVC** against established lemmatization systems across the five Universal Dependencies (UD) Latin benchmarks.

| Benchmark | **THIVLVC** | UDPipe 2.0 | Trankit (XLM-R) | Stanza (v1.5) | GreTa (T5) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Perseus (Poetry) | **93.48%** | 91.04% | 70.34% | 91.44% | 91.14% |
| UDante (Medieval) | **85.85%** | 84.80% | - | 78.08% | - |
| PROIEL (Classical) | **97.29%** | 96.65% | 97.21% | 90.88% | - |
| ITTB (Scholastic) | 98.64% | 99.03% | **99.13%** | 96.50% | - |
| LLCT (Late Latin) | 88.92% | **97.40%** | 96.2% | 97.10% | - |

**THIVLVC** achieves state-of-the-art results on three major benchmarks: Perseus (Classical Poetry), UDante (Medieval Prose), and PROIEL (Biblical/Classical). It is particularly effective for complex literary and medieval texts.

## Usage

**Important**: For best results, especially on short sentences or fragments, use **beam search** (`num_beams=5`).

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Zual/THIVLVC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def lemmatize(text):
    inputs = tokenizer(text, return_tensors="pt")
    # Beam search (num_beams=5) improves accuracy, especially on short inputs
    outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
print(lemmatize("Amorem canat"))
# Expected Output: "amor cano"
```

This model was produced by **Luc Pommeret** at LISN (CNRS, Université Paris-Saclay).