larngear-antocr-crnn-th (v40)
Compact CRNN + CTC line recognizer for printed/typed Thai documents with Latin (English / European-accent) co-script. ~33 MB, 254-character charset. Reads one text-line crop and returns one string. Tuned for clean documents (forms, government reports, financial statements, academic calendars/timetables).
This repo is artifacts only, with no inference code. It pairs with the
recognizer in larngear_AntOCR (the _CRNN architecture + input normalization +
CTC decode that these weights are shape-locked to), used in production by
larngear-docling.
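For orientation, CTC decoding in its simplest (greedy, best-path) form collapses repeated per-frame class ids and then drops the blank. The sketch below is not the larngear_AntOCR implementation. It assumes the blank is class 0 and that class i maps to chars[i - 1], neither of which this README confirms; `ctc_greedy_decode` and `BLANK` are illustrative names.

```python
BLANK = 0  # assumption: blank is class 0; the real recognizer may place it elsewhere


def ctc_greedy_decode(frame_ids, chars):
    """Collapse repeated ids, drop blanks, map remaining ids to characters.

    frame_ids: per-timestep argmax class ids, e.g. [1, 1, 0, 2].
    chars: the charset list; class i (i >= 1) is assumed to be chars[i - 1].
    """
    out = []
    prev = None
    for idx in frame_ids:
        # Emit only when the id changes and is not the blank.
        if idx != prev and idx != BLANK:
            out.append(chars[idx - 1])
        prev = idx
    return "".join(out)
```

A blank between two identical ids is what lets a doubled character survive: `[1, 0, 1]` decodes to two characters, while `[1, 1]` decodes to one.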
Files
| File | What |
|---|---|
| best.pt | CRNN weights: plain state_dict, load with torch.load(weights_only=True) |
| classes.json | charset, {"chars": [...]}: 254 chars (Thai + Latin + accents + digits + punctuation) |
The two are a versioned pair: len(chars) + 1 (the +1 is the CTC blank) is
the model's output dimension. A mismatched pair decodes to garbage, so always
pull both from the same revision.
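A cheap guard against a mismatched pair is to compare len(chars) + 1 with the model's output dimension before decoding anything. The sketch below assumes only the classes.json layout described above; `load_chars`, `expected_output_dim`, and `check_pair` are hypothetical helper names, and how you read the output dimension off the loaded weights depends on the _CRNN definition, so it is taken as a plain integer here.

```python
import json


def load_chars(classes_path):
    """Read the charset list from a classes.json of the form {"chars": [...]}."""
    with open(classes_path, encoding="utf-8") as f:
        return json.load(f)["chars"]


def expected_output_dim(chars):
    """One output class per character, plus one for the CTC blank."""
    return len(chars) + 1


def check_pair(chars, model_output_dim):
    """Raise if the charset and the model's output dimension disagree."""
    expected = expected_output_dim(chars)
    if model_output_dim != expected:
        raise ValueError(
            f"charset implies {expected} output classes, "
            f"model has {model_output_dim}; files are from different revisions?"
        )
    return expected
```

For this release the expected value is 254 + 1 = 255.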
Accuracy
| Benchmark | CER |
|---|---|
| AntOCR real-PDF line bench (3502 lines) | ~4.3% |
| larngear-docling control corpus: region-aligned text CER (~20 pages, born-digital Thai PDFs) | 2.34% |
Out of scope (not trained for it): scene text, handwriting.
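For readers comparing their own numbers against the CER figures above: CER is character-level edit distance divided by reference length. The sketch below is one common way to compute it, not the benchmark script used for this table. It assumes micro-averaging over all lines and counts Unicode code points (so a Thai base consonant and its vowel or tone mark count as separate characters); the benchmarks here may normalize text differently.

```python
def levenshtein(ref, hyp):
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion from ref
                cur[j - 1] + 1,          # insertion into ref
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def cer(refs, hyps):
    """Micro-averaged CER: total edits over total reference characters."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of lines")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_edits = sum(levenshtein(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars
```

Micro-averaging weights long lines more heavily than a per-line mean would, which matters when a corpus mixes short table cells with full paragraphs.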
Usage

    from huggingface_hub import hf_hub_download
    from antocr.core import CRNNLineRecognizer  # from larngear_AntOCR

    repo = "jsaksrisuwan/larngear_antocr_weight"
    weights = hf_hub_download(repo, "best.pt", revision="main")
    classes = hf_hub_download(repo, "classes.json", revision="main")

    rec = CRNNLineRecognizer(weights=weights, classes=classes)
    texts = rec.recognize_batch([line_crop0, line_crop1, ...])  # grayscale np.uint8 crops

Line segmentation (finding the line crops) is the consumer's job:
larngear-docling does layout detection + line-crop before calling this.
Pin revision to a tag or commit SHA for reproducible deploys.
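One way to keep the pin honest is to route both downloads through a single helper, so the weights and the charset can never come from different revisions. This is a minimal sketch: `fetch_pair` is a hypothetical helper, the `download` parameter exists only so the function can be exercised without network access, and the revision string is a placeholder you would replace with a real tag or commit SHA from the repo.

```python
def fetch_pair(repo, revision, download=None):
    """Download best.pt and classes.json from the same revision.

    download: callable with the shape of huggingface_hub.hf_hub_download
    (repo_id, filename, revision=...). Defaults to the real function.
    """
    if download is None:
        # Imported lazily so the helper can be tested without the library.
        from huggingface_hub import hf_hub_download as download
    weights = download(repo, "best.pt", revision=revision)
    classes = download(repo, "classes.json", revision=revision)
    return weights, classes


# Placeholder only: substitute a tag or commit SHA that exists in the repo.
PINNED_REVISION = "<tag-or-commit-sha>"
```

Calling `fetch_pair("jsaksrisuwan/larngear_antocr_weight", PINNED_REVISION)` then yields the two local paths to pass to CRNNLineRecognizer.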