---
language: la
license: apache-2.0
tags:
- latin
- lemmatization
- byt5
- nlp
- sota
datasets:
- universal_dependencies
metrics:
- accuracy
---

# THIVLVC: Latin ByT5 Lemmatizer

**THIVLVC** is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed at **LISN (CNRS)** to provide a high-performance, unified model for diverse Latin corpora.

## Performance Analysis

The following table compares **THIVLVC** against widely used NLP pipelines across the five Universal Dependencies (UD) Latin benchmarks.

| Benchmark | **THIVLVC** | UDPipe 2.0 | Trankit (XLM-R) | Stanza (v1.5) | GreTa (T5) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Perseus (Poetry) | **93.48%** | 91.04% | 70.34% | 91.44% | 91.14% |
| UDante (Medieval) | **85.85%** | 84.80% | - | 78.08% | - |
| PROIEL (Classical) | **97.29%** | 96.65% | 97.21% | 90.88% | - |
| ITTB (Scholastic) | 98.64% | 99.03% | **99.13%** | 96.50% | - |
| LLCT (Late Latin) | 88.92% | **97.40%** | 96.20% | 97.10% | - |

**THIVLVC** achieves state-of-the-art results on three major benchmarks: Perseus (Classical Poetry), UDante (Medieval Prose), and PROIEL (Biblical/Classical). It is particularly effective for complex literary and medieval texts.
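The accuracy figures above are token-level exact-match scores: each predicted lemma is compared against the gold lemma for the same token. As a minimal sketch (the helper below is illustrative only, not part of this repository or the UD evaluation scripts), the metric can be computed like this:

```python
def lemma_accuracy(predicted, gold):
    """Token-level exact-match lemmatization accuracy.

    Both lists must be aligned token-for-token, as in the UD treebanks.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must align token-for-token")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Example: 3 of 4 lemmas match the gold annotation -> 0.75
pred = ["amor", "cano", "arma", "viri"]
gold = ["amor", "cano", "arma", "vir"]
print(lemma_accuracy(pred, gold))  # 0.75
```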

## Usage

**Important**: For best results, especially on short sentences or fragments, use **beam search** (`num_beams=5`).

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Zual/THIVLVC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def lemmatize(text):
    inputs = tokenizer(text, return_tensors="pt")
    # Using beam search (num_beams=5) for better accuracy
    outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
print(lemmatize("Amorem canat")) 
# Expected Output: "amor cano"
```

This model was produced by **Luc Pommeret** at LISN (CNRS, Université Paris-Saclay).