---
license: cc-by-nc-4.0
pipeline_tag: text-generation
language:
- la
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/TzFC1Lo2kZ_38hAay9Wf2.png"
style="float:left;width:200px;height:200px;object-fit:cover;border-radius:50%;margin-right:16px;" />
**Emendator** is a [byt5-xl](https://huggingface.co/google/byt5-xl) model finetuned to correct OCR artifacts in Latin text.
**This model cannot reconstruct every orthography faithfully. Applied at scale, it shifts the distribution of tokens toward the one it was trained on.**
This is to say: **Emendator will take editorial liberties with your data.** It is fond of introducing abbreviations. As such, use it only when the goal is to recover intelligible Latin, not intelligible Latin of a *particular* style.
The model is intended for segments of **250** characters. Inputs of other lengths will degrade performance.
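Longer documents therefore need to be split before they reach the model. The helper below is a minimal sketch of one way to do that, not part of Emendator itself: `split_into_segments` is a hypothetical name, and it assumes that segments of *at most* 250 characters (with a shorter final segment) are acceptable, which this card does not confirm. It also collapses runs of whitespace, including line breaks, into single spaces.

```python
from typing import List


def split_into_segments(text: str, max_chars: int = 250) -> List[str]:
    """Split text into segments of at most max_chars characters.

    Breaks fall on whitespace where possible. Runs of whitespace
    (including newlines) are collapsed to single spaces.
    """
    segments: List[str] = []
    current = ""
    for word in text.split():
        # A single "word" longer than the limit (common in OCR output
        # with lost spaces) is hard-split at the character limit.
        while len(word) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(word[:max_chars])
            word = word[max_chars:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this word would overflow: close the segment.
            segments.append(current)
            current = word
    if current:
        segments.append(current)
    return segments
```

Each returned segment can then be passed to the model as one element of the input batch. If your pipeline must preserve the original line breaks, record them separately before splitting, since this sketch discards them.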
### Lightly Corrupted Text
Original: "atque optimo viro, peterem; superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"
OCR: "atqu optimo viro, peterem, superavi tamen dignitate Catilinam, qatia Gdbam. Quod si id crimen homini novo esse deberet, protecto\n"
Emendator Reconstruction: "atque optimo viro, peterem, superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"
### Heavily Corrupted Text
Original: "Aegidius Lusitanus Olysipone natus Sacrae Theologiae Magister scripsit Commentarium in libros Sententiarum."
OCR: "/Egidinslufitanusolufiponc natusS.Thcol.Mag.fcripfitCorn rncntariurninlibros>Scntcntiarurn«"
Emendator Reconstruction: "Egidius Lusitanus olusipone natus S. Theol. Mag. scripsit Commentarium in libros Sententiarum"
### Severely Corrupted Text
Original: "Diligam te Domine fortitudo mea. Dominus firmamentum meum et refugium meum et liberator meus."
OCR: "Diligamtc Do rnincfortitudomca:Dominusfirma rncntumrncurn&rcfugiurnmcurn&libcratorrnc9«"
Emendator Reconstruction: "Diligam te Domine fortitudo mea Dominus firmamentum meum refugium meum liberator meus"
To use Emendator, you can load it via the Transformers library:
```python
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

model_path = "aimgo/Emendator"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
model.eval()  # inference mode: disables dropout

# Two OCR-damaged lines, corrected together as one batch.
texts = ["Nil igirur rnors cft ad nos ncq;pcrtinct hilurn»", "Vt quod ali cibus eft aliis fuat acre uenenurn."]

# Pad to the longest input in the batch; inputs over 256 tokens are truncated.
enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256).to(device)

with torch.no_grad():
    outputs = model.generate(
        enc["input_ids"],
        attention_mask=enc["attention_mask"],
        # Let the output run somewhat longer than the padded input.
        max_new_tokens=enc["input_ids"].shape[1] + 32,
        num_beams=4,
        do_sample=False,
        early_stopping=True,
        repetition_penalty=1.15,
    )

corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
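Note that `max_length=256` in the tokenizer call above counts tokens, not characters. ByT5 tokenizers work on UTF-8 bytes and append one end-of-sequence token, so a segment containing non-ASCII characters (for example `æ` or `»`) uses more than one token per character and may be truncated silently. The check below is a rough sketch for flagging such segments before inference; `byt5_token_count` and `would_be_truncated` are hypothetical helper names, and the count is an estimate that assumes the input contains no special-token strings.

```python
MAX_LENGTH = 256  # matches max_length in the tokenizer call above


def byt5_token_count(text: str) -> int:
    """Estimate ByT5 input length: one token per UTF-8 byte, plus EOS."""
    return len(text.encode("utf-8")) + 1


def would_be_truncated(text: str, max_length: int = MAX_LENGTH) -> bool:
    """Return True if the estimated token count exceeds max_length."""
    return byt5_token_count(text) > max_length
```

For pure ASCII input, a 250-character segment fits comfortably. Segments that are flagged can be shortened before being passed to the tokenizer.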
If you use this in your work, please cite:
```
@misc{mccarthy2026Emendator,
author = {McCarthy, A. M.},
title = {{Emendator}: Latin OCR Artifact Correction},
year = {2026},
howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
note = {Model}
}
```