# TachiwinOCR 1.5 🦡

## For the Indigenous Languages of Mexico

This is a PaddleOCR-VL fine-tune specialized in the 68 indigenous languages of Mexico and their diverse character and glyph repertoires, a world first for technology access and linguistic rights.
## Inference

You can run inference with either the PaddleOCR pipeline or the `transformers` library.
### Option A: Using PaddleOCR

```python
from paddleocr import PaddleOCRVL

# Load the fine-tuned model
pipeline = PaddleOCRVL(
    vl_rec_model_name="tachiwin/Tachiwin-OCR-1.5",
    vl_rec_model_dir=path_to_tachiwin_downloaded_model,
)

# Predict on an image
output = pipeline.predict("test.png")
for res in output:
    res.print()
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")
```
### Option B: Using Transformers

```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL = "tachiwin/Tachiwin-OCR-1.5"
image_path = "my_image.png"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

image = Image.open(image_path).convert("RGB")

# Load the model and its processor
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True)

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "OCR:"},
    ]}
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(DEVICE)

outputs = model.generate(**inputs, max_new_tokens=1024, min_new_tokens=1)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(generated_text)
```
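Note that decoding the full `outputs` tensor returns the prompt text along with the recognized text. A minimal sketch of the usual trimming step, shown on stand-in token lists rather than real tensors:

```python
# Assumption: generate() for a causal LM returns the prompt tokens followed
# by the newly generated ones, so the continuation starts at the prompt length.
prompt_ids = [101, 102, 103]            # stand-in for inputs["input_ids"][0].tolist()
output_ids = [101, 102, 103, 7, 8, 9]   # stand-in for outputs[0].tolist()
new_token_ids = output_ids[len(prompt_ids):]
print(new_token_ids)  # [7, 8, 9]
```

With real tensors the equivalent is slicing before decoding, e.g. `processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)`.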
## 📊 Benchmark Results
Tachiwin-OCR 1.5 was evaluated against the base PaddleOCR-VL 1.5 model using a diverse subset of Indigenous language samples. The fine-tuning results demonstrate dramatic improvements in both character and word recognition accuracy — far surpassing the gains seen in version 1.0.
### Summary Metrics
| Metric | Base Model (Raw) | Tachiwin-OCR 1.5 (Fine-tuned) | Improvement |
|---|---|---|---|
| Character Error Rate (CER) | 17.65% | 2.03% | 88.5% (Relative Reduction) |
| Word Error Rate (WER) | 38.59% | 3.60% | 90.7% (Relative Reduction) |
| OCR Accuracy (1 − CER) | 82.35% | 97.97% | +15.61pp (Absolute) |
| Word Accuracy (1 − WER) | 61.41% | 96.40% | +34.99pp (Absolute) |
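For reference, CER and WER are standard edit-distance metrics. A minimal sketch (not the project's actual evaluation code) using a plain Levenshtein distance over characters and whitespace-split words:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (works on strings or lists).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    # Character Error Rate: edits per reference character.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    # Word Error Rate: edits per reference word.
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)

print(cer("tachiwin", "tachiwim"))  # 0.125 (1 edit over 8 characters)
```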
### Version Comparison: 1.0 → 1.5
| Metric | Tachiwin-OCR v1.0 | Tachiwin-OCR v1.5 | Δ Change |
|---|---|---|---|
| CER | 6.80% | 2.03% | −4.77pp |
| WER | 17.36% | 3.60% | −13.76pp |
| Accuracy (1 − CER) | 93.20% | 97.97% | +4.77pp |
| Word Accuracy (1 − WER) | 82.64% | 96.40% | +13.76pp |
| Relative CER Reduction | 10.4% | 88.5% | +78.1pp |
| Relative WER Reduction | 31.0% | 90.7% | +59.7pp |
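The relative-reduction figures in both tables follow directly from the raw and fine-tuned error rates:

```python
def relative_reduction(raw, finetuned):
    # Fraction of the original error eliminated by fine-tuning, in percent.
    return (raw - finetuned) / raw * 100

# v1.5 summary numbers from the tables above
print(round(relative_reduction(17.65, 2.03), 1))  # 88.5 (CER)
print(round(relative_reduction(38.59, 3.60), 1))  # 90.7 (WER)
```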
### Detailed Comparison — v1.5 Sample Results
Results across 21 language samples. Languages with tonal or complex diacritic systems show the most dramatic improvements:
| # | Language Code | Raw CER | FT CER | Raw WER | FT WER | CER Improvement (pp) |
|---|---|---|---|---|---|---|
| 0 | zpo (Zapotec) | 0.24% | 0.00% | 1.12% | 0.00% | +0.24 |
| 1 | maz (Central Mazahua) | 0.41% | 0.00% | 2.27% | 0.00% | +0.41 |
| 2 | zao (Zapotec) | 6.18% | 3.49% | 23.61% | 12.50% | +2.69 |
| 3 | mat (Matlatzinca) | 6.51% | 0.00% | 42.55% | 0.00% | +6.51 |
| 4 | amu (Amuzgo) | 85.52% | 0.00% | 89.13% | 0.00% | +85.52 |
| 5 | mxp (Mixe) | 15.91% | 11.87% | 54.90% | 9.80% | +4.04 |
| 6 | yaq (Yaqui) | 1.82% | 0.00% | 3.12% | 0.00% | +1.82 |
| 7 | poe (Popoloca) | 6.78% | 3.39% | 62.50% | 12.50% | +3.39 |
| 8 | zpc (Zapotec) | 9.43% | 2.05% | 42.11% | 13.16% | +7.38 |
| 9 | sei (Seri) | 1.89% | 0.00% | 10.61% | 0.00% | +1.89 |
| 10 | lac (Lacandon) | 9.80% | 0.00% | 42.31% | 0.00% | +9.80 |
| 11 | zao (Zapotec) | 93.01% | 0.00% | 100.00% | 0.00% | +93.01 |
| 12 | mxt (Mixtec) | 6.70% | 0.00% | 19.18% | 0.00% | +6.70 |
| 13 | huv (Huave) | 1.41% | 0.00% | 10.34% | 0.00% | +1.41 |
| 14 | tee (Huehuetla Tepehua) | 3.03% | 0.00% | 17.33% | 0.00% | +3.03 |
| 15 | tzh (Tzeltal) | 2.67% | 0.00% | 15.91% | 0.00% | +2.67 |
| 16 | mto (Totontepec Mixe) | 93.12% | 32.47% | 100.00% | 39.71% | +60.65 |
| 17 | amu (Amuzgo) | 14.96% | 2.36% | 52.46% | 1.64% | +12.60 |
| 18 | mih (Chayuco Mixtec) | 3.76% | 0.00% | 9.52% | 0.00% | +3.76 |
| 19 | zpm (Zapotec) | 6.98% | 0.00% | 32.73% | 0.00% | +6.98 |
| 20 | toc (Totonac) | 11.32% | 0.00% | 57.14% | 0.00% | +11.32 |
| — | AVERAGE | 17.65% | 2.03% | 38.59% | 3.60% | +15.61 |
### Key Findings

- **Unprecedented Accuracy Gains:** 14 of 21 languages reached a fine-tuned CER of 0.00%, i.e. perfect character-level recognition on those samples, a result not seen in v1.0.
- **Hardest Cases Tackled:** Languages such as Amuzgo (`amu`) and Zapotec (`zao`, sample 11) started with CERs of 85–93% and were reduced to zero after fine-tuning, improvements of over 85 and 93 percentage points respectively.
- **Remaining Challenges:** `mto` (Totontepec Mixe) remains the most difficult language in the set, with a fine-tuned CER of 32.47%. That is still a 65% relative improvement over its raw baseline, but it indicates further work is needed for highly complex orthographies.
- **Word-Level Leap:** WER dropped from 38.59% to just 3.60%, a 34.99 percentage point absolute improvement compared to only 7.81pp in v1.0, demonstrating a qualitative leap in the model's ability to reconstruct full word forms in these language families.
- **Robustness:** The model continues to show high resilience against the synthetic distortions applied during the data generation phase.

Tachiwin (from the Totonac word for "language") is dedicated to bridging the digital divide for the indigenous languages of Mexico through AI technology.
**Developed by:** Tachiwin

**License:** apache-2.0

**Fine-tuned from model:** PaddlePaddle/PaddleOCR-VL-1.5
This paddleocr_vl model was trained 2x faster with Unsloth
### Evaluation Results

Self-reported on the Tachiwin Multilingual OCR LLM dataset:

- Character Error Rate (CER): 2.03
- Word Error Rate (WER): 3.60
- OCR Accuracy (1 − CER): 97.97
- Word Accuracy (1 − WER): 96.40
