---
license: cc-by-nc-nd-4.0
pipeline_tag: token-classification
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/hxR6Khg6ny2rn5_pP4YIX.png"
     style="float:left;width:200px;height:200px;object-fit:cover;border-radius:50%;margin-right:16px;" />

**CaputEmendatoris** is a projection head for [Emendator](https://huggingface.co/aimgo/Emendator) trained to identify OCR artifacts in Latin text at the character level.

The model is intended to be used on segments of **250** characters; inputs of any other length will degrade detection accuracy.
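
For longer documents, one straightforward approach is to split the text into consecutive 250-character segments before running detection. A minimal sketch (the `segment_text` helper is illustrative and not part of the model's API; whether to pad or merge a shorter final segment is left to the caller):

```python
def segment_text(text: str, size: int = 250) -> list[str]:
    """Split text into consecutive fixed-size character segments.

    The final segment may be shorter than `size`; how to handle it
    (padding, merging) is left to the caller.
    """
    return [text[i:i + size] for i in range(0, len(text), size)]


segments = segment_text("a" * 600)
# Three segments of 250, 250, and 100 characters.
```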

To use CaputEmendatoris, you can load it via the Transformers library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained("aimgo/CaputEmendatoris", trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained("aimgo/Emendator")
model.eval()

text = "quandoquidcrn natura anirni rnortalis habctur."
enc = tokenizer(text, return_tensors="pt").to(device)

# Detect errors at each byte.
with torch.no_grad():
    probs = model.detect(enc["input_ids"], enc["attention_mask"])

# Map byte probabilities back to characters: a character spans one or more
# UTF-8 bytes, so pool with max over its bytes. The final model position
# does not correspond to a text byte and is dropped.
byte_probs = probs[0][:-1].cpu().tolist()
char_probs = []
byte_idx = 0
for c in text:
    n = len(c.encode("utf-8"))
    char_probs.append(max(byte_probs[byte_idx:byte_idx + n]) if byte_idx + n <= len(byte_probs) else 0.0)
    byte_idx += n

# char_probs[i] is the probability that text[i] is an OCR artifact.
```
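
Once per-character probabilities are available, a simple way to surface suspected artifacts is to threshold them. A minimal sketch (the `flag_artifacts` helper and the 0.5 threshold are illustrative choices, not part of the model's API):

```python
def flag_artifacts(text: str, char_probs: list[float], threshold: float = 0.5):
    """Return (index, character) pairs whose artifact probability meets the threshold."""
    return [(i, c) for i, (c, p) in enumerate(zip(text, char_probs)) if p >= threshold]


# Example with made-up probabilities standing in for model output:
suspects = flag_artifacts("habctur", [0.01, 0.02, 0.03, 0.92, 0.05, 0.04, 0.02])
# suspects -> [(3, 'c')]
```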

If you use this in your work, please cite:

```
@misc{mccarthy2026Emendator,
  author       = {McCarthy, A. M.},
  title        = {{Emendator}: Latin OCR Artifact Correction},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
  note         = {Model}
}
```