CaputEmendatoris is a projection head for Emendator, trained to identify OCR artifacts in Latin text at the character level.

The model is intended for segments of 250 characters; inputs of other lengths will compromise performance.
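As a minimal sketch of how longer text might be prepared, the helper below cuts a string into consecutive 250-character segments. The function name `split_into_segments` is hypothetical and not part of the model's API, and the final segment will usually be shorter than 250 characters; how best to handle that tail (padding, overlapping windows, or accepting reduced accuracy) is not specified here.

```python
def split_into_segments(text: str, size: int = 250) -> list[str]:
    """Split text into consecutive segments of at most `size` characters.

    Hypothetical helper: the model card only states the 250-character
    segment length, not how segments should be produced.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# Example: a 600-character input yields two full segments and a short tail.
segments = split_into_segments("a" * 600)
print([len(s) for s in segments])  # [250, 250, 100]
```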

In initial testing, using 0.25 as a character probability threshold typically produced the best F1 score across all degrees of corruption.
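For readers who want to check a threshold on their own data, here is a minimal sketch of character-level F1 at a fixed threshold. It assumes a character counts as flagged when its probability is greater than or equal to the threshold (the card does not say whether the comparison is inclusive), and the probabilities and labels in the example are invented purely for illustration.

```python
def f1_at_threshold(char_probs, labels, threshold=0.25):
    """Character-level F1 for binary artifact labels.

    Assumption: a character is predicted as an artifact when its
    probability is >= threshold.
    """
    preds = [p >= threshold for p in char_probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred and y)
    fp = sum(1 for pred, y in zip(preds, labels) if pred and not y)
    fn = sum(1 for pred, y in zip(preds, labels) if not pred and y)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative values only, not model output.
example_probs = [0.9, 0.1, 0.3, 0.2]
example_labels = [True, False, False, True]
print(f1_at_threshold(example_probs, example_labels))  # 0.5
```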

Light Corruption

  Orig:       Antistes mihi milibus trecentis.
  OCR:        Antiftes mihi milibus trecentis: " . .. .ijiscnn p inr: h
                  ^                          ^^^^^^^^^^^^^^^^^^^^^^^^^^

Heavy Corruption

  Orig:       Cognoscenda virtute circumscripta est scientia, quae ad experientiam pertinet et ad rationem.
  OCR:        C0gn0fccndauirtutccircurnfcriptacftfcientia:quacadcxpcricntiarnpcrtinct&adrationcrn«
               ^  ^^^^   ^    ^^   ^^^ ^     ^^^^         ^  ^^^ ^ ^   ^^  ^   ^   ^          ^^^^

To use CaputEmendatoris, you can load it via the Transformers library:

import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

model_repo = "aimgo/CaputEmendatoris"
tokenizer_repo = "aimgo/Emendator"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_repo)

model = AutoModel.from_pretrained(
    model_repo,
    trust_remote_code=True, # <=== NECESSARY, THIS HEAD HAS A CUSTOM MODELING FILE
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
).to(device)

model.eval()

text = "quandoquidcrn natura anirni rnortalis habctur."

enc = tokenizer(text, return_tensors="pt").to(device)

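# Run the detector to get one artifact probability per byte position
with torch.no_grad():
    probs = model.detect(enc["input_ids"], enc.get("attention_mask", None))

# Take the first (only) sequence in the batch and drop its final position
byte_probs = probs[0][:-1].detach().cpu().tolist()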

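# Collapse byte-level probabilities to one score per character:
# a multi-byte UTF-8 character takes the maximum over its bytes.
char_probs = []
byte_idx = 0
for c in text:
    n = len(c.encode("utf-8"))
    if byte_idx + n <= len(byte_probs):
        char_probs.append(max(byte_probs[byte_idx:byte_idx + n]))
    else:
        # Ran past the end of the model output: score the character as 0.0
        char_probs.append(0.0)
    byte_idx += n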

print(char_probs)
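To turn the per-character probabilities into the caret-style display used in the examples above, one option is to threshold them and print a marker line under the text. This is a sketch only: `mark_artifacts` is a hypothetical helper, the inclusive `>=` comparison is an assumption, and the probabilities below are illustrative stand-ins for real model output.

```python
def mark_artifacts(text, char_probs, threshold=0.25):
    """Return a line with '^' under each character whose probability
    meets the threshold (assumed inclusive), and spaces elsewhere."""
    marks = ["^" if p >= threshold else " " for _, p in zip(text, char_probs)]
    return "".join(marks).rstrip()


# Illustrative probabilities: only the long-s misread as "f" scores high.
sample_text = "Antiftes mihi"
sample_probs = [0.02] * len(sample_text)
sample_probs[4] = 0.91

print(sample_text)
print(mark_artifacts(sample_text, sample_probs))
```

With real output, pass the `text` and `char_probs` from the snippet above instead of the sample values. Alignment assumes a monospaced font and one display column per character.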

If you use this in your work, please cite:

@misc{mccarthy2026Emendator,
  author       = {McCarthy, A. M.},
  title        = {{Emendator}: Latin OCR Artifact Correction},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/CaputEmendatoris}},
  note         = {Model}
}
