---
license: cc-by-nc-nd-4.0
pipeline_tag: token-classification
language:
- la
---


<img src="https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/1cUXLP7zGJuWf3MPLv5_m.png"
     style="float:left;width:200px;height:200px;object-fit:cover;border-radius:50%;margin-right:16px;" />
     
**CaputEmendatoris** is a projection head for [Emendator](https://huggingface.co/aimgo/Emendator) trained to identify OCR artifacts in Latin text at the character level.

The model is intended to be used on segments of **250** characters; segments of any other length will degrade performance.

In initial testing, using **0.25** as a probability threshold typically produced the best F1 score across all degrees of corruption.
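Longer inputs therefore need to be split into 250-character windows before inference. A minimal sketch (the `segment` helper below is illustrative, not part of the released code):

```python
def segment(text: str, size: int = 250) -> list[str]:
    """Split text into fixed-size windows; the last window may be shorter."""
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = segment("a" * 600)
# -> three windows of 250, 250, and 100 characters
```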

---
### Light Corruption

      Orig:       Antistes mihi milibus trecentis.
      OCR:        Antiftes mihi milibus trecentis: " . .. .ijiscnn p inr: h
                      ^                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
          
### Heavy Corruption

      Orig:       Cognoscenda virtute circumscripta est scientia, quae ad experientiam pertinet et ad rationem.
      OCR:        C0gn0fccndauirtutccircurnfcriptacftfcientia:quacadcxpcricntiarnpcrtinct&adrationcrn«
                   ^  ^^^^   ^    ^^   ^^^ ^     ^^^^         ^  ^^^ ^ ^   ^^  ^   ^   ^          ^^^^

To use CaputEmendatoris, you can load it via the Transformers library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained("aimgo/CaputEmendatoris", trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained("aimgo/Emendator")
model.eval()

text = "quandoquidcrn natura anirni rnortalis habctur."
enc = tokenizer(text, return_tensors="pt").to(device)

# error probability for each byte
with torch.no_grad():
    probs = model.detect(enc["input_ids"], enc["attention_mask"])

# collapse byte-level probabilities to character level:
# a character is as suspicious as its most suspicious byte
byte_probs = probs[0][:-1].cpu().tolist()  # drop the trailing special token
char_probs = []
byte_idx = 0
for c in text:
    n = len(c.encode("utf-8"))
    if byte_idx + n <= len(byte_probs):
        char_probs.append(max(byte_probs[byte_idx:byte_idx + n]))
    else:
        char_probs.append(0.0)
    byte_idx += n

# flag characters above the recommended 0.25 threshold
flagged = [(i, c) for i, (c, p) in enumerate(zip(text, char_probs)) if p > 0.25]
```
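For highlighting, it can help to group consecutive suspicious characters into spans. A sketch, assuming the per-character probabilities computed above (the `flagged_spans` helper is illustrative, not part of the released code; the 0.25 threshold follows the note above):

```python
def flagged_spans(char_probs: list[float], threshold: float = 0.25) -> list[tuple[int, int]]:
    """Group consecutive above-threshold characters into (start, end) index spans."""
    spans = []
    start = None
    for i, p in enumerate(char_probs):
        if p > threshold and start is None:
            start = i                    # a suspicious run begins
        elif p <= threshold and start is not None:
            spans.append((start, i))     # the run ends before index i
            start = None
    if start is not None:
        spans.append((start, len(char_probs)))
    return spans

flagged_spans([0.1, 0.9, 0.9, 0.1, 0.8, 0.1])
# -> [(1, 3), (4, 5)]
```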

If you use this in your work, please cite: 
```
@misc{mccarthy2026Emendator,
  author       = {McCarthy, A. M.},
  title        = {{Emendator}: Latin OCR Artifact Correction},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
  note         = {Model}
}
```