---
license: cc-by-nc-4.0
pipeline_tag: text-generation
language:
- la
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/TzFC1Lo2kZ_38hAay9Wf2.png"
     style="float:left;width:200px;height:200px;object-fit:cover;border-radius:50%;margin-right:16px;" />
     
**Emendator** is a [byt5-xl](https://huggingface.co/google/byt5-xl) model finetuned to correct OCR artifacts in Latin text. 

**This model cannot faithfully reconstruct every orthography: at scale, it will shift the distribution of tokens towards that of its training data.**
This is to say: **Emendator will take editorial liberties with your data.** It is fond of introducing abbreviations. As such, use it only when the primary concern is recovering intelligible Latin, not recovering intelligible Latin of a *particular* style.


The model is intended to be used on segments of **250** characters; longer or shorter inputs will degrade performance.
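To feed longer documents to the model, you can first split them into segments of at most 250 characters. The helper below is a hypothetical sketch (not part of the model's tooling) that splits greedily at whitespace so words are never cut mid-token:

```python
def segment(text: str, max_len: int = 250) -> list[str]:
    """Greedily split text into whitespace-delimited chunks of at most max_len chars.

    Note: a single word longer than max_len will still produce an oversized chunk;
    handle such cases separately if your corpus contains them.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be passed to the model as one input in the batch, and the corrected chunks rejoined with spaces.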

### Lightly Corrupted Text

      Original:       "atque optimo viro, peterem; superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"

      OCR:            "atqu optimo viro, peterem, superavi tamen dignitate Catilinam, qatia Gdbam. Quod si id crimen homini novo esse deberet, protecto"

      Reconstruction: "atque optimo viro, peterem, superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto"

### Heavily Corrupted Text

      Original:       "Aegidius Lusitanus Olysipone natus Sacrae Theologiae Magister scripsit Commentarium in libros Sententiarum."

      OCR:            "/Egidinslufitanusolufiponc natusS.Thcol.Mag.fcripfitCorn rncntariurninlibros>Scntcntiarurn«"

      Reconstruction: "Egidius Lusitanus olusipone natus S. Theol. Mag. scripsit Commentarium in libros Sententiarum"

### Severely Corrupted Text

      Original:       "Diligam te Domine fortitudo mea. Dominus firmamentum meum et refugium meum et liberator meus."

      OCR:            "Diligamtc Do rnincfortitudomca:Dominusfirma rncntumrncurn&rcfugiurnmcurn&libcratorrnc9«"

      Reconstruction: "Diligam te Domine fortitudo mea Dominus firmamentum meum refugium meum liberator meus"
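How much a reconstruction improves on the raw OCR can be quantified with character error rate (CER): edit distance divided by reference length. A minimal pure-Python sketch, not tied to any evaluation used by the model's authors:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # distances for an empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / max(m, 1)
```

Comparing `cer(original, ocr)` against `cer(original, reconstruction)` on held-out text gives a rough sense of how much noise the model removes.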

To use Emendator, you can load it via the Transformers library:

```python
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

model_path = "aimgo/Emendator"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
model.eval()

# ByT5 tokenizes at the byte level, so max_length=256 covers the
# recommended 250-character segment size.
texts = ["Nil igirur rnors cft ad nos ncq;pcrtinct hilurn»", "Vt quod ali cibus eft aliis fuat acre uenenurn."]
enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256).to(device)

with torch.no_grad():
    outputs = model.generate(
        enc["input_ids"],
        attention_mask=enc["attention_mask"],
        # Corrected text is roughly as long as the input; allow some headroom.
        max_new_tokens=enc["input_ids"].shape[1] + 32,
        num_beams=4,             # deterministic beam search
        do_sample=False,
        early_stopping=True,
        repetition_penalty=1.15,
    )

corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

If you use this in your work, please cite: 
```
@misc{mccarthy2026Emendator,
  author       = {McCarthy, A. M.},
  title        = {{Emendator}: Latin OCR Artifact Correction},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
  note         = {Model}
}
```