---
license: cc-by-nc-4.0
pipeline_tag: text-generation
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/TzFC1Lo2kZ_38hAay9Wf2.png"
style="float:left;width:200px;height:200px;object-fit:cover;border-radius:50%;margin-right:16px;" />

**Emendator** is a [ByT5-XL](https://huggingface.co/google/byt5-xl) model fine-tuned to correct OCR artifacts in Latin text.

**This model cannot reconstruct every orthography faithfully. At scale, it shifts the distribution of tokens towards the one it was trained on.**
That is to say: **Emendator will take editorial liberties with your data.** It is fond of introducing abbreviations. Use it only when the primary concern is to recover intelligible Latin, not intelligible Latin of a *particular* style.

The model is intended for input segments of **250** characters. Other segment lengths will compromise performance.
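
Longer passages therefore need to be cut into segments before correction. The following is a minimal sketch, not part of the Emendator repository: `split_into_segments` and `target_len` are hypothetical names, and treating 250 characters as an upper bound with breaks on whitespace is an assumption. Badly damaged OCR may lack reliable spaces, so words longer than the limit are hard-split.

```python
def split_into_segments(text: str, target_len: int = 250) -> list[str]:
    """Split text into segments of at most target_len characters.

    Assumption: breaking on whitespace is acceptable. Runs of whitespace
    (including newlines) are collapsed to single spaces.
    """
    segments: list[str] = []
    current = ""
    for word in text.split():
        # A single "word" longer than the limit is cut into fixed-size pieces.
        while len(word) > target_len:
            if current:
                segments.append(current)
                current = ""
            segments.append(word[:target_len])
            word = word[target_len:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= target_len:
            current = candidate
        else:
            # Adding this word would exceed the limit, so start a new segment.
            segments.append(current)
            current = word
    if current:
        segments.append(current)
    return segments
```

Because ByT5 tokenizes UTF-8 bytes, a segment with many non-ASCII characters can encode to more tokens than it has characters, which matters if the tokenizer is called with truncation.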

### Lightly Corrupted Text

Original: "atque optimo viro, peterem; superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"

OCR: "atqu optimo viro, peterem, superavi tamen dignitate Catilinam, qatia Gdbam. Quod si id crimen homini novo esse deberet, protecto\n"

Emendator Reconstruction: "atque optimo viro, peterem, superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"

### Heavily Corrupted Text

Original: "Aegidius Lusitanus Olysipone natus Sacrae Theologiae Magister scripsit Commentarium in libros Sententiarum."

OCR: "/Egidinslufitanusolufiponc natusS.Thcol.Mag.fcripfitCorn rncntariurninlibros>Scntcntiarurn«"

Emendator Reconstruction: "Egidius Lusitanus olusipone natus S. Theol. Mag. scripsit Commentarium in libros Sententiarum"

### Severely Corrupted Text

Original: "Diligam te Domine fortitudo mea. Dominus firmamentum meum et refugium meum et liberator meus."

OCR: "Diligamtc Do rnincfortitudomca:Dominusfirma rncntumrncurn&rcfugiurnmcurn&libcratorrnc9«"

Emendator Reconstruction: "Diligam te Domine fortitudo mea Dominus firmamentum meum refugium meum liberator meus"
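
The gap between an original and its reconstruction in the examples above can be quantified. This is a minimal sketch, assuming character error rate (CER, edit distance divided by reference length) as the measure; this card does not state which metric, if any, was used for evaluation, and `edit_distance` and `char_error_rate` are illustrative names. The strings are copied from the lightly corrupted example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


original = ("atque optimo viro, peterem; superavi tamen dignitate Catilinam, "
            "gratia Galbam. Quod si id crimen homini novo esse deberet, profecto")
ocr = ("atqu optimo viro, peterem, superavi tamen dignitate Catilinam, "
       "qatia Gdbam. Quod si id crimen homini novo esse deberet, protecto\n")
reconstruction = ("atque optimo viro, peterem, superavi tamen dignitate Catilinam, "
                  "gratia Galbam. Quod si id crimen homini novo esse deberet, profecto")

cer_before = char_error_rate(original, ocr)
cer_after = char_error_rate(original, reconstruction)
print(f"CER before correction: {cer_before:.4f}, after: {cer_after:.4f}")
```

CER against the original penalizes stylistic changes such as abbreviations as well as genuine errors, so a high value on its own does not distinguish the two.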

To use Emendator, load it with the Transformers library:

```python
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

model_path = "aimgo/Emendator"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
model.eval()

# OCR-damaged input segments to correct
texts = ["Nil igirur rnors cft ad nos ncq;pcrtinct hilurn»", "Vt quod ali cibus eft aliis fuat acre uenenurn."]
enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256).to(device)

# Deterministic beam search; no gradients are needed for inference
with torch.no_grad():
    outputs = model.generate(
        enc["input_ids"],
        attention_mask=enc["attention_mask"],
        max_new_tokens=enc["input_ids"].shape[1] + 32,
        num_beams=4,
        do_sample=False,
        early_stopping=True,
        repetition_penalty=1.15,
    )

# Decode the generated ids back into corrected strings
corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
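
For more than a handful of segments, the generation call can be run in fixed-size batches. This is a sketch under assumptions: `correct_in_batches` and its `batch_size` default of 8 are hypothetical, and `correct_batch` stands for any function that maps a list of strings to a list of corrected strings, for example a wrapper around the tokenizer, `model.generate`, and `batch_decode` calls in the snippet above.

```python
from typing import Callable


def correct_in_batches(
    segments: list[str],
    correct_batch: Callable[[list[str]], list[str]],
    batch_size: int = 8,  # assumed default; tune to available memory
) -> list[str]:
    """Apply correct_batch to segments in order, batch_size at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    corrected: list[str] = []
    for start in range(0, len(segments), batch_size):
        batch = segments[start:start + batch_size]
        out = correct_batch(batch)
        # Guard against a corrector that drops or adds items.
        if len(out) != len(batch):
            raise ValueError("correct_batch must return one output per input")
        corrected.extend(out)
    return corrected


# Stand-in corrector for demonstration only; it is not the model.
demo = correct_in_batches(
    ["a", "b", "c"],
    lambda batch: [s.upper() for s in batch],
    batch_size=2,
)
```

Passing the corrector in as a function keeps the batching logic independent of the model, so it can be checked with a stub before any weights are loaded.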

If you use this in your work, please cite:
```
@misc{mccarthy2026Emendator,
  author = {McCarthy, A. M.},
  title = {{Emendator}: Latin OCR Artifact Correction},
  year = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
  note = {Model}
}
```