aimgo/CaputEmendatoris


	---
	license: cc-by-nc-nd-4.0
	pipeline_tag: token-classification
	language:
	- la
	---


	<img src="https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/1cUXLP7zGJuWf3MPLv5_m.png"
	style="float:left;width:200px;height:200px;object-fit:cover;border-radius:50%;margin-right:16px;" />

	CaputEmendatoris is a projection head for [Emendator](https://huggingface.co/aimgo/Emendator), trained to identify OCR artifacts in Latin text at the character level.

	The model is intended for segments of 250 characters; inputs of any other length will degrade performance.
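
One way to respect the 250-character segment length is to cut longer text into consecutive fixed-size pieces before running detection. The sketch below is a minimal illustration, not the author's preprocessing: `split_segments` is a hypothetical helper, and it assumes plain character slicing with no attempt to break at word or sentence boundaries. The final segment may be shorter than 250 characters, and the card does not say how such a remainder should be handled.

```python
def split_segments(text: str, size: int = 250) -> list[str]:
    """Split text into consecutive segments of at most `size` characters.

    Hypothetical helper: slices by character count only and ignores
    word boundaries. The last segment may be shorter than `size`.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


segments = split_segments("a" * 600)
print([len(s) for s in segments])  # [250, 250, 100]
```

Joining the segments back together reproduces the input, so per-character results can be concatenated in order.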

	In initial testing, a probability threshold of 0.25 typically produced the best F1 score across all degrees of corruption.
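
Applying the threshold could look like the sketch below, which groups consecutive flagged characters into spans. `flagged_spans` is a hypothetical helper, the input values are made up for illustration rather than taken from the model, and treating a probability equal to the threshold as flagged is an assumption.

```python
def flagged_spans(char_probs, threshold=0.25):
    """Return (start, end) index pairs, end exclusive, for runs of
    characters whose probability is at or above the threshold."""
    spans = []
    start = None
    for i, p in enumerate(char_probs):
        if p >= threshold:
            if start is None:
                start = i  # a new flagged run begins here
        elif start is not None:
            spans.append((start, i))  # the run ended at the previous index
            start = None
    if start is not None:
        spans.append((start, len(char_probs)))  # run reaches the end
    return spans


# Made-up probabilities for illustration only, not model output.
example = [0.02, 0.40, 0.90, 0.10, 0.05, 0.30]
print(flagged_spans(example))  # [(1, 3), (5, 6)]
```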





	---
	### Light Corruption

	Orig: Antistes mihi milibus trecentis.
	OCR: Antiftes mihi milibus trecentis: " . .. .ijiscnn p inr: h
	^ ^^^^^^^^^^^^^^^^^^^^^^^^^^

	### Heavy Corruption

	Orig: Cognoscenda virtute circumscripta est scientia, quae ad experientiam pertinet et ad rationem.
	OCR: C0gn0fccndauirtutccircurnfcriptacftfcientia:quacadcxpcricntiarnpcrtinct&adrationcrn«
	^ ^^^^ ^ ^^ ^^^ ^ ^^^^ ^ ^^^ ^ ^ ^^ ^ ^ ^ ^^^^

	To use CaputEmendatoris, you can load it via the Transformers library:

	```python
	import torch
	from transformers import AutoModel, AutoTokenizer

	device = "cuda" if torch.cuda.is_available() else "cpu"
	model = AutoModel.from_pretrained("aimgo/CaputEmendatoris", trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
	tokenizer = AutoTokenizer.from_pretrained("aimgo/Emendator")
	model.eval()

	text = "quandoquidcrn natura anirni rnortalis habctur."
	enc = tokenizer(text, return_tensors="pt").to(device)

	# detect errors at each byte
	with torch.no_grad():
	    probs = model.detect(enc["input_ids"], enc["attention_mask"])

	# byte probability -> character
	byte_probs = probs[0][:-1].cpu().tolist()
	char_probs = []
	byte_idx = 0
	for c in text:
	    n = len(c.encode("utf-8"))
	    end = byte_idx + n
	    # score a character by the highest probability among its UTF-8 bytes;
	    # fall back to 0.0 if the byte probabilities run out
	    char_probs.append(max(byte_probs[byte_idx:end]) if end <= len(byte_probs) else 0.0)
	    byte_idx = end

	output = char_probs
	```
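
The per-character probabilities can be turned into a marker line like the ones in the corruption examples. This is a minimal sketch under stated assumptions: `marker_line` is a hypothetical helper, the 0.25 default follows the threshold reported above, and the probabilities below are made up for illustration rather than produced by the model.

```python
def marker_line(char_probs, threshold=0.25):
    """Build a string with '^' under each flagged character and a
    space under every other character."""
    return "".join("^" if p >= threshold else " " for p in char_probs)


# Made-up probabilities for illustration only, not model output.
sample_text = "anirni"
sample_probs = [0.01, 0.03, 0.02, 0.80, 0.70, 0.05]

print(sample_text)
print(marker_line(sample_probs))
```

The markers only line up visually in a monospaced font and when each character occupies one column.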

	If you use this in your work, please cite:
	```
	@misc{mccarthy2026Emendator,
	  author       = {McCarthy, A. M.},
	  title        = {{Emendator}: Latin OCR Artifact Correction},
	  year         = {2026},
	  howpublished = {\url{https://huggingface.co/aimgo/Emendator}},
	  note         = {Model}
	}
	```