---
license: cc-by-nc-4.0
pipeline_tag: text-generation
language:
- la
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/TzFC1Lo2kZ_38hAay9Wf2.png"
style="float:left;width:200px;height:200px;object-fit:cover;border-radius:50%;margin-right:16px;" />
**Emendator** is a [byt5-xl](https://huggingface.co/google/byt5-xl) model finetuned to correct OCR artifacts in Latin text.
**This model cannot reconstruct every orthography faithfully: at scale, it shifts the distribution of tokens towards the one it was trained on.**
In other words, **Emendator will take editorial liberties with your data.** It is fond of introducing abbreviations. Use it only when the goal is to recover intelligible Latin, not intelligible Latin of a *particular* style.
The model is intended to be used on segments of **250** characters. Anything else will compromise performance.
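A longer text has to be cut into segments before it reaches the model. The following is a minimal sketch of one way to do that, assuming it is acceptable to break at whitespace and to pack each segment as close to 250 characters as word boundaries allow; the model card does not say how training segments were cut. The helper `split_into_segments` and its `max_chars` parameter are hypothetical names, not part of Emendator.

```python
def split_into_segments(text: str, max_chars: int = 250) -> list[str]:
    """Greedily pack whitespace-separated words into segments of at most max_chars.

    Illustrative helper only. Runs of whitespace (including newlines) are
    collapsed to single spaces, and a word longer than max_chars is hard-split.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    segments: list[str] = []
    current = ""
    for word in text.split():
        # A word that can never fit in one segment is cut into max_chars pieces.
        while len(word) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The segment is full: close it and start a new one with this word.
            segments.append(current)
            current = word
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    sample = " ".join(["Diligam te Domine fortitudo mea."] * 20)
    for segment in split_into_segments(sample):
        print(len(segment), segment[:40])
```

Because OCR output often loses spaces altogether, a whitespace-based splitter may fall back to hard splits on heavily corrupted input; that trade-off is part of the assumption above.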
### Lightly Corrupted Text
Original: "atque optimo viro, peterem; superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"
OCR: "atqu optimo viro, peterem, superavi tamen dignitate Catilinam, qatia Gdbam. Quod si id crimen homini novo esse deberet, protecto\n"
Emendator Reconstruction: "atque optimo viro, peterem, superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"
### Heavily Corrupted Text
Original: "Aegidius Lusitanus Olysipone natus Sacrae Theologiae Magister scripsit Commentarium in libros Sententiarum."
OCR: "/Egidinslufitanusolufiponc natusS.Thcol.Mag.fcripfitCorn rncntariurninlibros>Scntcntiarurn«"
Emendator Reconstruction: "Egidius Lusitanus olusipone natus S. Theol. Mag. scripsit Commentarium in libros Sententiarum"
### Severely Corrupted Text
Original: "Diligam te Domine fortitudo mea. Dominus firmamentum meum et refugium meum et liberator meus."
OCR: "Diligamtc Do rnincfortitudomca:Dominusfirma rncntumrncurn&rcfugiurnmcurn&libcratorrnc9«"
Emendator Reconstruction: "Diligam te Domine fortitudo mea Dominus firmamentum meum refugium meum liberator meus"
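To judge how much a reconstruction departs from a known original, one option is character error rate (CER): the Levenshtein edit distance divided by the length of the reference. The sketch below is an assumption about how one might measure this, not the evaluation used for Emendator; `levenshtein` and `cer` are hypothetical helper names, and the strings are the lightly corrupted example from this card.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


reference = "atque optimo viro, peterem; superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"
ocr = "atqu optimo viro, peterem, superavi tamen dignitate Catilinam, qatia Gdbam. Quod si id crimen homini novo esse deberet, protecto\n"
reconstruction = "atque optimo viro, peterem, superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"

if __name__ == "__main__":
    print(f"OCR CER:            {cer(reference, ocr):.3f}")
    print(f"Reconstruction CER: {cer(reference, reconstruction):.3f}")
```

CER penalizes every departure from the reference, including the expansions, abbreviations, and punctuation changes that Emendator may introduce deliberately, so a nonzero score on a reconstruction does not by itself mean the Latin is unintelligible.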
To use Emendator, you can load it via the Transformers library:
```python
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

model_path = "aimgo/Emendator"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and the model weights in bfloat16, then switch to inference mode.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
model.eval()

# OCR output to correct, one segment per string.
texts = ["Nil igirur rnors cft ad nos ncq;pcrtinct hilurn»", "Vt quod ali cibus eft aliis fuat acre uenenurn."]

# ByT5 works on UTF-8 bytes, so max_length counts bytes rather than characters.
enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256).to(device)

with torch.no_grad():
    outputs = model.generate(
        enc["input_ids"],
        attention_mask=enc["attention_mask"],
        # Allow the output to run slightly longer than the padded input.
        max_new_tokens=enc["input_ids"].shape[1] + 32,
        num_beams=4,
        do_sample=False,
        early_stopping=True,
        repetition_penalty=1.15,
    )

# Decode the generated byte sequences back into strings.
corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
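For more than a handful of segments, the tokenize, generate, and decode steps can be run over fixed-size batches to keep memory use bounded. The following is a minimal sketch of that loop, assuming a hypothetical `correct_batch` callable that wraps those three steps and returns one string per input; `batched`, `correct_segments`, and the default `batch_size` of 8 are illustrative choices, not part of Emendator or Transformers. A stand-in function replaces the model so the sketch runs without downloading weights.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of items, each holding at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def correct_segments(
    segments: Sequence[str],
    correct_batch: Callable[[Sequence[str]], list[str]],
    batch_size: int = 8,
) -> list[str]:
    """Run correct_batch over segments in order, one batch at a time."""
    corrected: list[str] = []
    for batch in batched(segments, batch_size):
        outputs = correct_batch(batch)
        if len(outputs) != len(batch):
            raise RuntimeError("correct_batch must return one output per input")
        corrected.extend(outputs)
    return corrected


def fake_correct_batch(batch: Sequence[str]) -> list[str]:
    """Stand-in for the model call (hypothetical): undoes one common OCR confusion."""
    return [segment.replace("rn", "m") for segment in batch]


if __name__ == "__main__":
    print(correct_segments(["hilurn", "uenenurn"], fake_correct_batch, batch_size=1))
```

In real use, `fake_correct_batch` would be swapped for a function that tokenizes its batch, calls `model.generate`, and decodes the result; a suitable batch size depends on available memory.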
If you use this in your work, please cite:
```
@misc{mccarthy2026Emendator,
author = {McCarthy, A. M.},
title = {{Emendator}: Latin OCR Artifact Correction},
year = {2026},
howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
note = {Model}
}
```