metadata
license: cc-by-nc-nd-4.0
pipeline_tag: token-classification
language:
- la

CaputEmendatoris is a projection head for Emendator, trained to identify OCR artifacts in Latin text at the character level.
The model is intended for segments of 250 characters; inputs of other lengths will degrade performance.
In initial testing, a probability threshold of 0.25 typically produced the best F1 score across all degrees of corruption.
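Because the model expects 250-character segments, longer documents need to be split before detection. The following is a minimal sketch, assuming fixed-width, non-overlapping windows; `split_into_segments` is a hypothetical helper, not part of the model's API, and the segmentation used during training is not documented here. The final segment will usually be shorter than 250 characters.

```python
def split_into_segments(text: str, size: int = 250) -> list[str]:
    """Split text into consecutive, non-overlapping segments of at most `size` characters.

    Hypothetical helper: the windowing strategy is an assumption, not the
    author's documented preprocessing.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# Illustrative input only: 600 placeholder characters
segments = split_into_segments("a" * 600)
print([len(s) for s in segments])
```

Each segment can then be passed to the model separately, and the per-character results concatenated in order.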
Light Corruption
Orig: Antistes mihi milibus trecentis.
OCR: Antiftes mihi milibus trecentis: " . .. .ijiscnn p inr: h
^ ^^^^^^^^^^^^^^^^^^^^^^^^^^
Heavy Corruption
Orig: Cognoscenda virtute circumscripta est scientia, quae ad experientiam pertinet et ad rationem.
OCR: C0gn0fccndauirtutccircurnfcriptacftfcientia:quacadcxpcricntiarnpcrtinct&adrationcrn«
^ ^^^^ ^ ^^ ^^^ ^ ^^^^ ^ ^^^ ^ ^ ^^ ^ ^ ^ ^^^^
To use CaputEmendatoris, you can load it via the Transformers library:
import torch
from transformers import AutoModel, AutoTokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained("aimgo/CaputEmendatoris", trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained("aimgo/Emendator")
model.eval()
text = "quandoquidcrn natura anirni rnortalis habctur."
enc = tokenizer(text, return_tensors="pt").to(device)
# detect errors at each byte
with torch.no_grad():
    probs = model.detect(enc["input_ids"], enc["attention_mask"])
# byte probability -> character: drop the last position, then score each
# character by the highest probability among its UTF-8 bytes
byte_probs = probs[0][:-1].cpu().tolist()
char_probs = []
byte_idx = 0
for c in text:
    n = len(c.encode("utf-8"))
    char_probs.append(max(byte_probs[byte_idx:byte_idx + n]) if byte_idx + n <= len(byte_probs) else 0.0)
    byte_idx += n
output = char_probs
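The per-character probabilities can be turned into the caret markers shown in the examples above by applying the 0.25 threshold. This is a minimal sketch: `flag_characters` and `caret_line` are hypothetical helpers, the probabilities below are made-up illustrative values rather than model output, and treating a probability equal to the threshold as flagged is an assumption.

```python
def flag_characters(char_probs, threshold=0.25):
    """Return one boolean per character: True where the probability meets the threshold.

    Using >= rather than > at the boundary is an assumption.
    """
    return [p >= threshold for p in char_probs]


def caret_line(flags):
    """Build a marker line with '^' under flagged characters and spaces elsewhere."""
    return "".join("^" if flagged else " " for flagged in flags)


# Made-up probabilities for illustration; in practice use `char_probs` from the model
example_text = "anirni"
example_probs = [0.02, 0.10, 0.05, 0.61, 0.74, 0.08]

flags = flag_characters(example_probs)
print(example_text)
print(caret_line(flags))
```

In real use, pass the `char_probs` list computed for the segment in place of `example_probs`.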
If you use this in your work, please cite:
@misc{mccarthy2026Emendator,
author = {McCarthy, A. M.},
title = {{Emendator}: Latin OCR Artifact Correction},
year = {2026},
howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
note = {Model}
}