aimgo committed on
Commit 9bcbe3e · verified · 1 Parent(s): 3f2e8cc

Update README.md

Files changed (1)
  1. README.md +56 -1
README.md CHANGED
@@ -1,4 +1,59 @@
  ---
  license: cc-by-nc-nd-4.0
  pipeline_tag: token-classification
- ---
+ ---
+ ---
+ license: cc-by-nc-4.0
+ pipeline_tag: text-generation
+
+
+
+ ---
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/hxR6Khg6ny2rn5_pP4YIX.png"
+ style="float:left;width:200px;height:200px;object-fit:cover;border-radius:50%;margin-right:16px;" />
+
+ **CaputEmendatoris** is a projection head for [Emendator](https://huggingface.co/aimgo/Emendator) trained to identify OCR artifacts in Latin text at the character level.
+
+ The model is intended for segments of **250** characters; inputs of any other length will compromise performance.
+
+ To use CaputEmendatoris, load it via the Transformers library:
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = AutoModel.from_pretrained("aimgo/CaputEmendatoris", trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
+ tokenizer = AutoTokenizer.from_pretrained("aimgo/Emendator")
+ model.eval()
+
+ text = "quandoquidcrn natura anirni rnortalis habctur."
+ enc = tokenizer(text, return_tensors="pt").to(device)
+
+ # detect errors at each byte (one probability per input position)
+ with torch.no_grad():
+     probs = model.detect(enc["input_ids"], enc["attention_mask"])
+
+ # byte probability -> character: each character takes the max over its UTF-8 bytes
+ byte_probs = probs[0][:-1].cpu().tolist()
+ char_probs = []
+ byte_idx = 0
+ for c in text:
+     n = len(c.encode("utf-8"))
+     char_probs.append(max(byte_probs[byte_idx:byte_idx + n]) if byte_idx + n <= len(byte_probs) else 0.0)
+     byte_idx += n
+
+ output = char_probs
+ ```
+
+ If you use this in your work, please cite:
+ ```
+ @misc{mccarthy2026Emendator,
+   author = {McCarthy, A. M.},
+   title = {{Emendator}: Latin OCR Artifact Correction},
+   year = {2026},
+   howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
+   note = {Model}
+ }
+ ```
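
Note (not part of the commit): the README added above recommends 250-character segments and produces one probability per character, but it does not show how to segment longer text or how to act on the probabilities. The following is a minimal sketch of both steps under stated assumptions. The helper names `split_segments` and `flag_characters` and the `0.5` threshold are hypothetical, chosen for illustration; neither the README nor the model card specifies them.

```python
from typing import List, Tuple

SEGMENT_CHARS = 250  # segment length recommended by the README


def split_segments(text: str, size: int = SEGMENT_CHARS) -> List[str]:
    """Cut text into consecutive chunks of at most `size` characters.

    Hypothetical helper: the final chunk may be shorter than `size`,
    and the README does not say how such a remainder should be handled.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def flag_characters(
    text: str,
    char_probs: List[float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the README
) -> List[Tuple[int, str, float]]:
    """Return (index, character, probability) for every character whose
    probability is at or above `threshold`."""
    if len(text) != len(char_probs):
        raise ValueError("text and char_probs must have the same length")
    return [
        (i, c, p)
        for i, (c, p) in enumerate(zip(text, char_probs))
        if p >= threshold
    ]
```

Each chunk returned by `split_segments` could be passed through the README's detection snippet in place of `text`, and the resulting `char_probs` handed to `flag_characters` along with the same chunk. Splitting at fixed offsets can cut a word in half, so splitting at a nearby space may be preferable; whether that matters for this model is not stated in the source.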