---
language: la
license: apache-2.0
tags:
- latin
- lemmatization
- byt5
- nlp
- sota
datasets:
- universal_dependencies
metrics:
- accuracy
---

# THIVLVC: Latin ByT5 Lemmatizer

**THIVLVC** is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed at **LISN (CNRS)** to provide a high-performance, unified model for diverse Latin corpora.
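Because ByT5 is tokenizer-free and consumes raw UTF-8 bytes, it handles Latin orthographic variation (ligatures, diacritics, medieval spellings) without out-of-vocabulary issues. As an illustration, here is a minimal sketch of the standard ByT5 byte-to-id mapping (the `byt5_encode` helper name is ours, not part of this model's API): each byte maps to `byte value + 3`, since ids 0–2 are reserved for the `<pad>`, `</s>`, and `<unk>` special tokens.

```python
def byt5_encode(text: str) -> list[int]:
    """Sketch of ByT5's byte-level encoding (no tokenizer vocabulary needed).

    Token id = UTF-8 byte value + 3; ids 0, 1, 2 are reserved for
    <pad>, </s>, <unk>. A </s> (id 1) terminates the sequence.
    """
    return [b + 3 for b in text.encode("utf-8")] + [1]


# Any Unicode input works, since it is reduced to bytes first:
ids = byt5_encode("amāre")
```

This is why a single checkpoint can serve classical, medieval, and late Latin corpora without retraining a subword vocabulary.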

## Performance Analysis

The following table compares **THIVLVC** against established lemmatization systems across the five Universal Dependencies (UD) Latin benchmarks.

| Benchmark | **THIVLVC** | UDPipe 2.0 | Trankit (XLM-R) | Stanza (v1.5) | GreTa (T5) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Perseus (Poetry) | **93.48%** | 91.04% | 70.34% | 91.44% | 91.14% |
| UDante (Medieval) | **85.85%** | 84.80% | - | 78.08% | - |
| PROIEL (Classical) | **97.29%** | 96.65% | 97.21% | 90.88% | - |
| ITTB (Scholastic) | 98.64% | 99.03% | **99.13%** | 96.50% | - |
| LLCT (Late Latin) | 88.92% | **97.40%** | 96.2% | 97.10% | - |

**THIVLVC** achieves state-of-the-art results on three major benchmarks: Perseus (Classical Poetry), UDante (Medieval Prose), and PROIEL (Biblical/Classical). It is particularly effective for complex literary and medieval texts.

## Usage

**Important**: For best results, especially on short sentences or fragments, use **beam search** (`num_beams=5`).

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Zual/THIVLVC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def lemmatize(text):
    inputs = tokenizer(text, return_tensors="pt")
    # Beam search (num_beams=5) improves accuracy, especially on short inputs
    outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
print(lemmatize("Amorem canat"))
# Expected Output: "amor cano"
```

This model was produced by **Luc Pommeret** at LISN (CNRS, Université Paris-Saclay).