---
license: cc-by-nc-4.0
pipeline_tag: text-generation
language:
- la
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/TzFC1Lo2kZ_38hAay9Wf2.png" style="float:left;width:200px;height:200px;object-fit:cover;border-radius:50%;margin-right:16px;" />

**Emendator** is a [byt5-xl](https://huggingface.co/google/byt5-xl) model finetuned to correct OCR artifacts in Latin text.


**This model cannot faithfully reconstruct every orthography. Applied at scale, it shifts the distribution of tokens toward the distribution it was trained on.**

This is to say: **Emendator will take editorial liberties with your data.** It is fond of introducing abbreviations. Use it only when the primary concern is to recover intelligible Latin, not intelligible Latin of a *particular* style.

The model is intended to be used on segments of **250** characters. Inputs of any other length will compromise performance.

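
Longer documents therefore need to be cut into segments before correction. The following is a minimal sketch of one way to do that, not the procedure used to prepare the training data: `split_into_segments` is a hypothetical helper, and breaking at whitespace is an assumption made here so that words are not split across segments.

```python
def split_into_segments(text: str, size: int = 250) -> list[str]:
    """Split text into segments of at most `size` characters.

    Assumption: breaking at the last space inside each window is acceptable;
    the model card does not say how segment boundaries should be chosen.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    segments = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # Prefer the last space inside the window so words stay intact.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1  # keep the space with the left-hand segment
        segments.append(text[start:end])
        start = end
    return segments


page = "Quod si id crimen homini novo esse deberet, profecto " * 20
segments = split_into_segments(page)
print(len(segments), max(len(s) for s in segments))
```

Because every character ends up in exactly one segment, joining the segments reproduces the input, so corrected segments can be concatenated in order afterwards. The final segment is usually shorter than 250 characters.
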
### Lightly Corrupted Text

Original: "atque optimo viro, peterem; superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"

OCR: "atqu optimo viro, peterem, superavi tamen dignitate Catilinam, qatia Gdbam. Quod si id crimen homini novo esse deberet, protecto\n"

Emendator Reconstruction: "atque optimo viro, peterem, superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"

### Heavily Corrupted Text

Original: "Aegidius Lusitanus Olysipone natus Sacrae Theologiae Magister scripsit Commentarium in libros Sententiarum."

OCR: "/Egidinslufitanusolufiponc natusS.Thcol.Mag.fcripfitCorn rncntariurninlibros>Scntcntiarurn«"

Emendator Reconstruction: "Egidius Lusitanus olusipone natus S. Theol. Mag. scripsit Commentarium in libros Sententiarum"

### Severely Corrupted Text

Original: "Diligam te Domine fortitudo mea. Dominus firmamentum meum et refugium meum et liberator meus."

OCR: "Diligamtc Do rnincfortitudomca:Dominusfirma rncntumrncurn&rcfugiurnmcurn&libcratorrnc9«"

Emendator Reconstruction: "Diligam te Domine fortitudo mea Dominus firmamentum meum refugium meum liberator meus"

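
The examples above show that a reconstruction can differ from the original in spelling, punctuation, and abbreviation. A word-level diff is one way to review such differences before accepting the output. This is a rough sketch using Python's standard `difflib`; `word_changes` is a hypothetical helper, not part of Emendator.

```python
import difflib


def word_changes(before: str, after: str) -> list[tuple[str, str]]:
    """Return (before_words, after_words) for each span where the texts differ."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Insertions and deletions leave one side as an empty string.
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


original = "Aegidius Lusitanus Olysipone natus Sacrae Theologiae Magister scripsit Commentarium in libros Sententiarum."
reconstruction = "Egidius Lusitanus olusipone natus S. Theol. Mag. scripsit Commentarium in libros Sententiarum"

for old, new in word_changes(original, reconstruction):
    print(f"{old!r} -> {new!r}")
```

Comparing whole words keeps the report readable. For a finer view, the same function can be applied to lists of characters instead of the results of `split()`.
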

To use Emendator, you can load it via the Transformers library:

```python
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

model_path = "aimgo/Emendator"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
model.eval()

# Two short OCR-damaged lines to correct.
texts = ["Nil igirur rnors cft ad nos ncq;pcrtinct hilurn»", "Vt quod ali cibus eft aliis fuat acre uenenurn."]
# Pad to the longest input in the batch and truncate anything beyond 256 tokens.
enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256).to(device)

with torch.no_grad():
    outputs = model.generate(
        enc["input_ids"],
        attention_mask=enc["attention_mask"],
        # Let the output run somewhat longer than the padded input.
        max_new_tokens=enc["input_ids"].shape[1] + 32,
        num_beams=4,
        do_sample=False,
        early_stopping=True,
        repetition_penalty=1.15,
    )

# Decode the generated ids back into strings, one per input text.
corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
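
When there are many segments, it may help to feed them through the model in fixed-size batches so that memory use stays bounded. The sketch below is only an illustration: `correct_batch` is a hypothetical callable that would wrap the tokenize, `generate`, and `batch_decode` steps shown above, and the batch size of 8 is an arbitrary assumption, not a recommendation from the model's author.

```python
from typing import Callable, Iterator


def iter_batches(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def correct_all(
    segments: list[str],
    correct_batch: Callable[[list[str]], list[str]],
    batch_size: int = 8,  # arbitrary default; tune to the available memory
) -> list[str]:
    """Apply `correct_batch` to every segment, preserving the input order."""
    corrected: list[str] = []
    for batch in iter_batches(segments, batch_size):
        out = correct_batch(batch)
        if len(out) != len(batch):
            raise ValueError("correct_batch must return one output per input")
        corrected.extend(out)
    return corrected


# Stand-in for the model so the sketch runs without downloading weights.
def fake_correct(batch: list[str]) -> list[str]:
    return [s.strip() for s in batch]


print(correct_all([" Nil igitur ", " mors est "], fake_correct, batch_size=1))
```

Passing the correction step in as a callable keeps the batching logic independent of the model, so it can be tried with a stub first.
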

If you use this in your work, please cite:

```
@misc{mccarthy2026Emendator,
  author = {McCarthy, A. M.},
  title = {{Emendator}: Latin OCR Artifact Correction},
  year = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
  note = {Model}
}
```