---
license: cc-by-nc-4.0
pipeline_tag: text-generation
language:
- la
---

**Emendator** is a [byt5-xl](https://huggingface.co/google/byt5-xl) model finetuned to correct OCR artifacts in Latin text.

**This model cannot provide a completely faithful reconstruction for all orthographies: at scale, it will shift the distribution of tokens towards the one it was trained on.** That is to say: **Emendator will take editorial liberties with your data.** It is fond of introducing abbreviations. As such, use it only when the primary concern is recovering intelligible Latin, not recovering intelligible Latin of a *particular* style.

The model is intended to be used on segments of **250** characters. Inputs of other lengths will compromise performance.

### Lightly Corrupted Text

Original: "atque optimo viro, peterem; superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"

OCR: "atqu optimo viro, peterem, superavi tamen dignitate Catilinam, qatia Gdbam. Quod si id crimen homini novo esse deberet, protecto\n"

Emendator Reconstruction: "atque optimo viro, peterem, superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"

### Heavily Corrupted Text

Original: "Aegidius Lusitanus Olysipone natus Sacrae Theologiae Magister scripsit Commentarium in libros Sententiarum."

OCR: "/Egidinslufitanusolufiponc natusS.Thcol.Mag.fcripfitCorn rncntariurninlibros>Scntcntiarurn«"

Emendator Reconstruction: "Egidius Lusitanus olusipone natus S. Theol. Mag. scripsit Commentarium in libros Sententiarum"

### Severely Corrupted Text

Original: "Diligam te Domine fortitudo mea. Dominus firmamentum meum et refugium meum et liberator meus."
OCR: "Diligamtc Do rnincfortitudomca:Dominusfirma rncntumrncurn&rcfugiurnmcurn&libcratorrnc9«"

Emendator Reconstruction: "Diligam te Domine fortitudo mea Dominus firmamentum meum refugium meum liberator meus"

### Usage

To use Emendator, load it via the Transformers library:

```python
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

model_path = "aimgo/Emendator"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
).to(device)
model.eval()

texts = [
    "Nil igirur rnors cft ad nos ncq;pcrtinct hilurn»",
    "Vt quod ali cibus eft aliis fuat acre uenenurn.",
]

# ByT5 tokenizes at the byte level, so max_length=256 comfortably covers
# the intended 250-character input segments.
enc = tokenizer(
    texts, return_tensors="pt", padding=True, truncation=True, max_length=256
).to(device)

with torch.no_grad():
    outputs = model.generate(
        enc["input_ids"],
        attention_mask=enc["attention_mask"],
        max_new_tokens=enc["input_ids"].shape[1] + 32,
        num_beams=4,
        do_sample=False,
        early_stopping=True,
        repetition_penalty=1.15,
    )

corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

If you use this in your work, please cite:

```
@misc{mccarthy2026Emendator,
  author       = {McCarthy, A. M.},
  title        = {{Emendator}: Latin OCR Artifact Correction},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
  note         = {Model}
}
```
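Since the model expects segments of roughly 250 characters, longer documents must be chunked before correction. The helper below is a minimal sketch of one way to do this (the `segment_text` function and its defaults are illustrative, not part of the model's API): it cuts at whitespace where possible so words are not split across segments.

```python
def segment_text(text: str, max_len: int = 250) -> list[str]:
    """Split text into chunks of at most max_len characters,
    breaking at whitespace where possible so words stay intact."""
    segments = []
    text = text.strip()
    while len(text) > max_len:
        # Prefer the last space inside the window; fall back to a hard cut
        # if a single unbroken run exceeds max_len.
        cut = text.rfind(" ", 0, max_len + 1)
        if cut <= 0:
            cut = max_len
        segments.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        segments.append(text)
    return segments
```

The resulting segments can be passed as the `texts` list in the generation example above and the corrected outputs rejoined with spaces; note that correction quality may drop near segment boundaries, since the model sees no context across them.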