---
language: la
license: apache-2.0
tags:
- latin
- lemmatization
- byt5
- nlp
- sota
datasets:
- universal_dependencies
metrics:
- accuracy
---
# THIVLVC: Latin ByT5 Lemmatizer
**THIVLVC** is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed at **LISN (CNRS)** to provide a high-performance, unified model for diverse Latin corpora.
## Performance Analysis
The following table compares the lemmatization accuracy of **THIVLVC** against widely used Latin NLP pipelines across the five Universal Dependencies (UD) Latin benchmarks.
| Benchmark | **THIVLVC** | UDPipe 2.0 | Trankit (XLM-R) | Stanza (v1.5) | GreTa (T5) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Perseus (Poetry) | **93.48%** | 91.04% | 70.34% | 91.44% | 91.14% |
| UDante (Medieval) | **85.85%** | 84.80% | - | 78.08% | - |
| PROIEL (Classical) | **97.29%** | 96.65% | 97.21% | 90.88% | - |
| ITTB (Scholastic) | 98.64% | 99.03% | **99.13%** | 96.50% | - |
| LLCT (Late Latin) | 88.92% | **97.40%** | 96.20% | 97.10% | - |
**THIVLVC** achieves state-of-the-art accuracy on three of the five benchmarks: Perseus (Classical poetry), UDante (Medieval prose), and PROIEL (Biblical/Classical prose). It is particularly effective on complex literary and medieval texts.
## Usage
**Important**: For best results, especially on short sentences or fragments, use **beam search** (`num_beams=5`).
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration
model_name = "Zual/THIVLVC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
def lemmatize(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt")
    # Beam search (num_beams=5) improves accuracy, especially on short inputs
    outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Example
print(lemmatize("Amorem canat"))
# Expected Output: "amor cano"
```
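The `lemmatize` function above processes one sentence at a time. Because ByT5 operates on raw bytes, longer passages can quickly exceed `max_length=128`, so it is safer to split a passage into sentences first and lemmatize each one. Below is a minimal sketch of that pattern; the regex splitter and the helper names (`split_sentences`, `lemmatize_batch`) are illustrative assumptions, not part of the model's API:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after sentence-final punctuation followed by whitespace.
    # Assumes plain, already-punctuated text; adapt for your corpus conventions.
    parts = re.split(r"(?<=[.!?;])\s+", text.strip())
    return [p for p in parts if p]

def lemmatize_batch(sentences: list[str], lemmatize_fn) -> list[str]:
    # Apply a sentence-level lemmatizer (e.g. the lemmatize() function above)
    # to each sentence independently, keeping outputs aligned with inputs.
    return [lemmatize_fn(s) for s in sentences]
```

For example, `lemmatize_batch(split_sentences(passage), lemmatize)` returns one lemmatized string per sentence of `passage`.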
This model was produced by **Luc Pommeret** at LISN (CNRS, Université Paris-Saclay).