---
license: cc-by-nc-nd-4.0
pipeline_tag: token-classification
language:
- la
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/1cUXLP7zGJuWf3MPLv5_m.png"
style="float:left;width:200px;height:200px;object-fit:cover;border-radius:50%;margin-right:16px;" />

**CaputEmendatoris** is a projection head for [Emendator](https://huggingface.co/aimgo/Emendator), trained to identify OCR artifacts in Latin text at the character level.

The model is intended for segments of **250** characters. Other segment lengths will compromise performance.

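One way to prepare longer texts is to cut them into 250-character segments before running the model. The sketch below is a minimal illustration, not part of the released code: `split_into_segments` is a hypothetical helper, and it simply leaves the final segment shorter when the text length is not a multiple of 250. How best to handle that remainder is not specified here.

```python
SEGMENT_LEN = 250  # segment length recommended in this model card


def split_into_segments(text: str, size: int = SEGMENT_LEN) -> list[str]:
    """Split text into consecutive, non-overlapping segments of `size` characters.

    Hypothetical helper for illustration; the last segment may be shorter.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


segments = split_into_segments("a" * 600)
print([len(s) for s in segments])  # [250, 250, 100]
```

Joining the segments back together reproduces the input, so per-character results can be concatenated in order.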
In initial testing, a probability threshold of **0.25** typically produced the best F1 score across all degrees of corruption.

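To make the threshold and the F1 score concrete, here is a small sketch under stated assumptions: the probabilities and gold labels are made up, `flag_characters` and `f1_score` are hypothetical helpers, and treating a probability equal to the threshold as flagged is an assumption rather than documented behavior.

```python
THRESHOLD = 0.25  # threshold reported in this model card


def flag_characters(char_probs, threshold=THRESHOLD):
    """Mark a character as an artifact when its probability meets the threshold.

    Using >= (rather than >) is an assumption made for this sketch.
    """
    return [p >= threshold for p in char_probs]


def f1_score(predicted, actual):
    """Character-level F1 computed as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


probs = [0.05, 0.40, 0.10, 0.30, 0.90]    # made-up per-character probabilities
gold = [False, True, False, False, True]  # made-up ground-truth labels
flags = flag_characters(probs)            # [False, True, False, True, True]
print(f1_score(flags, gold))              # 0.8 (TP=2, FP=1, FN=0)
```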

---

### Light Corruption

Orig: Antistes mihi milibus trecentis.
OCR: Antiftes mihi milibus trecentis: " . .. .ijiscnn p inr: h
^ ^^^^^^^^^^^^^^^^^^^^^^^^^^

### Heavy Corruption

Orig: Cognoscenda virtute circumscripta est scientia, quae ad experientiam pertinet et ad rationem.
OCR: C0gn0fccndauirtutccircurnfcriptacftfcientia:quacadcxpcricntiarnpcrtinct&adrationcrn«
^ ^^^^ ^ ^^ ^^^ ^ ^^^^ ^ ^^^ ^ ^ ^^ ^ ^ ^ ^^^^

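The caret rows in the examples above mark the characters flagged as artifacts. As a rough sketch of how such a row could be built from a per-character boolean mask (this is not necessarily how the examples were generated, and `caret_line` is a hypothetical helper), each flagged position gets a `^` and every other position a space. The result only lines up when displayed in a monospaced font.

```python
def caret_line(flags):
    """Build a marker row: '^' under flagged characters, spaces elsewhere."""
    return "".join("^" if flagged else " " for flagged in flags).rstrip()


ocr = "Antiftes mihi"
# Hand-made mask for illustration: only the 'f' at index 4 is flagged.
flags = [i == 4 for i in range(len(ocr))]

print(ocr)
print(caret_line(flags))
```

Printed one under the other, the caret falls beneath the `f` that replaced the `s` of "Antistes".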
To use CaputEmendatoris, load it with the Transformers library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
    "aimgo/CaputEmendatoris",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(device)
tokenizer = AutoTokenizer.from_pretrained("aimgo/Emendator")
model.eval()

text = "quandoquidcrn natura anirni rnortalis habctur."
enc = tokenizer(text, return_tensors="pt").to(device)

# detect errors at each byte
with torch.no_grad():
    probs = model.detect(enc["input_ids"], enc["attention_mask"])

# byte probability -> character ([:-1] drops the final position)
byte_probs = probs[0][:-1].cpu().tolist()
char_probs = []
byte_idx = 0
for c in text:
    n = len(c.encode("utf-8"))  # number of UTF-8 bytes in this character
    # a character takes the maximum probability over its bytes;
    # characters past the end of byte_probs default to 0.0
    char_probs.append(max(byte_probs[byte_idx:byte_idx + n]) if byte_idx + n <= len(byte_probs) else 0.0)
    byte_idx += n

output = char_probs
```

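The per-character probabilities are often easier to work with once grouped into contiguous runs. The following sketch assumes a list like `char_probs` from the snippet above and the 0.25 threshold mentioned earlier; `flagged_spans` is a hypothetical helper, and the example probabilities are made up.

```python
def flagged_spans(char_probs, threshold=0.25):
    """Return (start, end) index pairs, end exclusive, for runs of flagged characters."""
    spans = []
    start = None
    for i, p in enumerate(char_probs):
        if p >= threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            spans.append((start, i))  # the run ended at the previous character
            start = None
    if start is not None:
        spans.append((start, len(char_probs)))  # run extends to the end
    return spans


example_probs = [0.1, 0.6, 0.7, 0.1, 0.1, 0.9]  # made-up values
print(flagged_spans(example_probs))  # [(1, 3), (5, 6)]
```

Slicing the input with each pair, for example `text[start:end]`, gives the suspect substring for that run.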
If you use this in your work, please cite:

```
@misc{mccarthy2026Emendator,
  author = {McCarthy, A. M.},
  title = {{Emendator}: Latin OCR Artifact Correction},
  year = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
  note = {Model}
}
```