---
license: cc-by-nc-nd-4.0
pipeline_tag: token-classification
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/hxR6Khg6ny2rn5_pP4YIX.png"
     style="float:left;width:200px;height:200px;object-fit:cover;border-radius:50%;margin-right:16px;" />

**CaputEmendatoris** is a projection head for [Emendator](https://huggingface.co/aimgo/Emendator) trained to identify OCR artifacts in Latin text at the character level.

The model is intended to be used on segments of **250** characters; inputs of any other length will degrade detection accuracy.
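
For longer documents, one straightforward approach is to split the text into consecutive 250-character segments before running detection. A minimal sketch (the `segment_text` helper is illustrative and not part of the model's API; whether to pad or merge a shorter final segment is left to the caller):

```python
def segment_text(text: str, size: int = 250) -> list[str]:
    """Split text into consecutive fixed-size character segments.

    The final segment may be shorter than `size`; how to handle it
    (padding, merging) is left to the caller.
    """
    return [text[i:i + size] for i in range(0, len(text), size)]


segments = segment_text("a" * 600)
# Three segments of 250, 250, and 100 characters.
```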

To use CaputEmendatoris, you can load it via the Transformers library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained("aimgo/CaputEmendatoris", trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained("aimgo/Emendator")
model.eval()

text = "quandoquidcrn natura anirni rnortalis habctur."
enc = tokenizer(text, return_tensors="pt").to(device)

# Detect errors at each byte.
with torch.no_grad():
    probs = model.detect(enc["input_ids"], enc["attention_mask"])

# Map byte probabilities back to characters: a character spans one or more
# UTF-8 bytes, so pool with max over its bytes. The final model position
# does not correspond to a text byte and is dropped.
byte_probs = probs[0][:-1].cpu().tolist()
char_probs = []
byte_idx = 0
for c in text:
    n = len(c.encode("utf-8"))
    char_probs.append(max(byte_probs[byte_idx:byte_idx + n]) if byte_idx + n <= len(byte_probs) else 0.0)
    byte_idx += n

# char_probs[i] is the probability that text[i] is an OCR artifact.
```
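
Once per-character probabilities are available, a simple way to surface suspected artifacts is to threshold them. A minimal sketch (the `flag_artifacts` helper and the 0.5 threshold are illustrative choices, not part of the model's API):

```python
def flag_artifacts(text: str, char_probs: list[float], threshold: float = 0.5):
    """Return (index, character) pairs whose artifact probability meets the threshold."""
    return [(i, c) for i, (c, p) in enumerate(zip(text, char_probs)) if p >= threshold]


# Example with made-up probabilities standing in for model output:
suspects = flag_artifacts("habctur", [0.01, 0.02, 0.03, 0.92, 0.05, 0.04, 0.02])
# suspects -> [(3, 'c')]
```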

If you use this in your work, please cite:

```
@misc{mccarthy2026Emendator,
  author       = {McCarthy, A. M.},
  title        = {{Emendator}: Latin OCR Artifact Correction},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
  note         = {Model}
}
```