TachiwinOCR 1.5 🦡

for the Indigenous Languages of Mexico

This is a PaddleOCR-VL fine-tune specialized in the 68 indigenous languages of Mexico and their diverse repertoire of characters and glyphs, a world first in technology access and linguistic rights.

Inference

You can perform inference using the PaddleOCR pipeline or the transformers library.

Option A: Using PaddleOCR

from paddleocr import PaddleOCRVL

# Local directory containing the downloaded model weights
# (e.g. obtained with huggingface_hub.snapshot_download("tachiwin/Tachiwin-OCR-1.5"))
path_to_tachiwin_downloaded_model = "path/to/Tachiwin-OCR-1.5"

# Load the fine-tuned model
pipeline = PaddleOCRVL(
    vl_rec_model_name="tachiwin/Tachiwin-OCR-1.5",
    vl_rec_model_dir=path_to_tachiwin_downloaded_model,
)

# Predict on an image
output = pipeline.predict("test.png")

for res in output:
    res.print()
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")

Option B: Using Transformers

from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL = "tachiwin/Tachiwin-OCR-1.5"
image_path = "my_image.png"

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

image = Image.open(image_path).convert("RGB")

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True)

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "OCR:"},
    ]}
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(DEVICE)

outputs = model.generate(**inputs, max_new_tokens=1024, min_new_tokens=1)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]

print(generated_text)
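
Note that for decoder-only models, `generate` typically returns the prompt ids followed by the newly generated ids, so the decoded string above may echo the prompt text. A minimal sketch of slicing off the prompt before decoding, with dummy tensors standing in for the real `inputs["input_ids"]` and `outputs`:

```python
import torch

# Dummy stand-ins: in the snippet above, inputs["input_ids"] holds the tokenized
# prompt, and model.generate returns prompt + newly generated ids.
input_ids = torch.tensor([[101, 102, 103]])         # prompt, shape (1, 3)
outputs = torch.tensor([[101, 102, 103, 7, 8, 9]])  # prompt + generated ids

new_tokens = outputs[:, input_ids.shape[1]:]        # keep only the generated part
print(new_tokens.tolist())  # [[7, 8, 9]]
```

With the real tensors, pass `new_tokens` (instead of `outputs`) to `processor.batch_decode` to obtain only the OCR text.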

📊 Benchmark Results

Tachiwin-OCR 1.5 was evaluated against the base PaddleOCR-VL 1.5 model using a diverse subset of Indigenous language samples. The fine-tuning results demonstrate dramatic improvements in both character and word recognition accuracy — far surpassing the gains seen in version 1.0.

Summary Metrics

| Metric | Base Model (Raw) | Tachiwin-OCR 1.5 (Fine-tuned) | Improvement |
|---|---|---|---|
| Character Error Rate (CER) | 17.65% | 2.03% | 88.5% (relative reduction) |
| Word Error Rate (WER) | 38.59% | 3.60% | 90.7% (relative reduction) |
| OCR Accuracy (1 − CER) | 82.35% | 97.97% | +15.61pp (absolute) |
| Word Accuracy (1 − WER) | 61.41% | 96.40% | +34.99pp (absolute) |
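
For reference, CER and WER are edit-distance metrics: the Levenshtein distance between prediction and reference, normalized by reference length, computed at the character and word level respectively. A minimal sketch of the standard definitions (not necessarily the exact evaluation script used for these tables):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: character edits / reference length."""
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word Error Rate: word edits / reference word count."""
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())

print(cer("tachiwin", "tachiwim"))  # 0.125 (1 substitution over 8 characters)
```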

Version Comparison: 1.0 → 1.5

| Metric | Tachiwin-OCR v1.0 | Tachiwin-OCR v1.5 | Δ Change |
|---|---|---|---|
| CER | 6.80% | 2.03% | −4.77pp |
| WER | 17.36% | 3.60% | −13.76pp |
| Accuracy (1 − CER) | 93.20% | 97.97% | +4.77pp |
| Word Accuracy (1 − WER) | 82.64% | 96.40% | +13.76pp |
| Relative CER Reduction | 10.4% | 88.5% | +78.1pp |
| Relative WER Reduction | 31.0% | 90.7% | +59.7pp |
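
The relative-reduction figures follow directly from the raw and fine-tuned error rates: the fraction of the raw error eliminated by fine-tuning. Assuming that standard formula, the v1.5 figures can be reproduced as:

```python
def relative_reduction(raw, fine_tuned):
    """Percentage of the raw error eliminated after fine-tuning."""
    return (raw - fine_tuned) / raw * 100

# v1.5 summary-table figures
print(round(relative_reduction(17.65, 2.03), 1))  # CER: 88.5
print(round(relative_reduction(38.59, 3.60), 1))  # WER: 90.7
```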

Detailed Comparison — v1.5 Sample Results

Results across 21 language samples. Languages with tonal or complex diacritic systems show the most dramatic improvements:

| # | Language (Code) | Raw CER | FT CER | Raw WER | FT WER | CER Improvement |
|---|---|---|---|---|---|---|
| 0 | zpo (Zapotec) | 0.24% | 0.00% | 1.12% | 0.00% | +0.24% |
| 1 | maz (Central Mazahua) | 0.41% | 0.00% | 2.27% | 0.00% | +0.41% |
| 2 | zao (Zapotec) | 6.18% | 3.49% | 23.61% | 12.50% | +2.69% |
| 3 | mat (Matlatzinca) | 6.51% | 0.00% | 42.55% | 0.00% | +6.51% |
| 4 | amu (Amuzgo) | 85.52% | 0.00% | 89.13% | 0.00% | +85.52% |
| 5 | mxp (Mixe) | 15.91% | 11.87% | 54.90% | 9.80% | +4.04% |
| 6 | yaq (Yaqui) | 1.82% | 0.00% | 3.12% | 0.00% | +1.82% |
| 7 | poe (Popoloca) | 6.78% | 3.39% | 62.50% | 12.50% | +3.39% |
| 8 | zpc (Zapotec) | 9.43% | 2.05% | 42.11% | 13.16% | +7.38% |
| 9 | sei (Seri) | 1.89% | 0.00% | 10.61% | 0.00% | +1.89% |
| 10 | lac (Lacandon) | 9.80% | 0.00% | 42.31% | 0.00% | +9.80% |
| 11 | zao (Zapotec) | 93.01% | 0.00% | 100.00% | 0.00% | +93.01% |
| 12 | mxt (Mixtec) | 6.70% | 0.00% | 19.18% | 0.00% | +6.70% |
| 13 | huv (San Marcos Huistepec Zapotec) | 1.41% | 0.00% | 10.34% | 0.00% | +1.41% |
| 14 | tee (Huehuetla Tepehua) | 3.03% | 0.00% | 17.33% | 0.00% | +3.03% |
| 15 | tzh (Tzeltal) | 2.67% | 0.00% | 15.91% | 0.00% | +2.67% |
| 16 | mto (Totontepec Mixe) | 93.12% | 32.47% | 100.00% | 39.71% | +60.65% |
| 17 | amu (Amuzgo) | 14.96% | 2.36% | 52.46% | 1.64% | +12.60% |
| 18 | mih (Chayuco Mixtec) | 3.76% | 0.00% | 9.52% | 0.00% | +3.76% |
| 19 | zpm (Mixtec) | 6.98% | 0.00% | 32.73% | 0.00% | +6.98% |
| 20 | toc (Tojolabal) | 11.32% | 0.00% | 57.14% | 0.00% | +11.32% |
| | AVERAGE | 17.65% | 2.03% | 38.59% | 3.60% | +15.61% |

Key Findings

  • Unprecedented Accuracy Gains: 14 out of 21 languages achieved a fine-tuned CER of 0.00%, meaning perfect character-level recognition on those samples — a result not seen in v1.0.

  • Hardest Cases Tackled: Languages like Amuzgo (amu) and Zapotec (zao, sample 11) started with CERs above 85–93% and were reduced to zero after fine-tuning, representing improvements of over 85 and 93 percentage points respectively.

  • Remaining Challenges: mto (Totontepec Mixe) remains the most difficult language in the set, with a fine-tuned CER of 32.47% — still a 65% relative improvement over its raw baseline, but indicating further work is needed for highly complex orthographies.

  • Word-Level Leap: WER dropped from 38.59% to just 3.60%, a 34.99 percentage point absolute improvement, compared to only 7.81pp in v1.0, demonstrating a qualitative leap in the model's ability to reconstruct full word forms in these language families.

  • Robustness: The model continues to show high resilience against the synthetic distortions applied during the data generation phase.

Tachiwin (from Totonac, "language") is dedicated to bridging the digital divide for the indigenous languages of Mexico through AI technology.

  • Developed by: Tachiwin

  • License: apache-2.0

  • Fine-tuned from model: PaddlePaddle/PaddleOCR-VL-1.5

This paddleocr_vl model was trained 2x faster with Unsloth

Model size: 1.0B params (Safetensors, BF16)


Evaluation results

  • Character Error Rate (CER) on Tachiwin Multilingual OCR LLM
    self-reported
    2.030
  • Word Error Rate (WER) on Tachiwin Multilingual OCR LLM
    self-reported
    3.600
  • OCR Accuracy (1 - CER) on Tachiwin Multilingual OCR LLM
    self-reported
    97.970
  • Word Accuracy (1 - WER) on Tachiwin Multilingual OCR LLM
    self-reported
    96.400