TachiwinOCR 1.5 🦡

for the Indigenous Languages of Mexico

This is a PaddleOCR-VL fine-tune specialized in the 68 indigenous languages of Mexico and their diverse repertoire of characters and glyphs, a world first in technology access and linguistic rights.

Inference

You can perform inference using the PaddleOCR pipeline or the transformers library.

Option A: Using PaddleOCR

from paddleocr import PaddleOCRVL

# Local directory containing the downloaded model weights
# (e.g. obtained with huggingface_hub.snapshot_download("tachiwin/Tachiwin-OCR-1.5"))
path_to_tachiwin_downloaded_model = "path/to/Tachiwin-OCR-1.5"

# Load the fine-tuned model
pipeline = PaddleOCRVL(
    vl_rec_model_name="tachiwin/Tachiwin-OCR-1.5",
    vl_rec_model_dir=path_to_tachiwin_downloaded_model,
)

# Predict on an image
output = pipeline.predict("test.png")

for res in output:
    res.print()
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")

Option B: Using Transformers

from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL = "tachiwin/Tachiwin-OCR-1.5"
image_path = "my_image.png"

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

image = Image.open(image_path).convert("RGB")

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True)

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "OCR:"},
    ]}
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(DEVICE)

outputs = model.generate(**inputs, max_new_tokens=1024, min_new_tokens=1)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]

print(generated_text)
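
Note that for decoder-only models, `generate` typically returns the prompt ids followed by the newly generated ids, so the decoded string above may echo the prompt text. A minimal sketch of slicing off the prompt before decoding, with dummy tensors standing in for the real `inputs["input_ids"]` and `outputs`:

```python
import torch

# Dummy stand-ins: in the snippet above, inputs["input_ids"] holds the tokenized
# prompt, and model.generate returns prompt + newly generated ids.
input_ids = torch.tensor([[101, 102, 103]])         # prompt, shape (1, 3)
outputs = torch.tensor([[101, 102, 103, 7, 8, 9]])  # prompt + generated ids

new_tokens = outputs[:, input_ids.shape[1]:]        # keep only the generated part
print(new_tokens.tolist())  # [[7, 8, 9]]
```

With the real tensors, pass `new_tokens` (instead of `outputs`) to `processor.batch_decode` to obtain only the OCR text.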

📊 Benchmark Results

Tachiwin-OCR 1.5 was evaluated against the base PaddleOCR-VL 1.5 model using a diverse subset of Indigenous language samples. The fine-tuning results demonstrate dramatic improvements in both character and word recognition accuracy — far surpassing the gains seen in version 1.0.

Summary Metrics

| Metric | Base Model (Raw) | Tachiwin-OCR 1.5 (Fine-tuned) | Improvement |
|---|---|---|---|
| Character Error Rate (CER) | 17.65% | 2.03% | 88.5% (relative reduction) |
| Word Error Rate (WER) | 38.59% | 3.60% | 90.7% (relative reduction) |
| OCR Accuracy (1 − CER) | 82.35% | 97.97% | +15.61pp (absolute) |
| Word Accuracy (1 − WER) | 61.41% | 96.40% | +34.99pp (absolute) |
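
For reference, CER and WER are edit-distance metrics: the Levenshtein distance between prediction and reference, normalized by reference length, computed at the character and word level respectively. A minimal sketch of the standard definitions (not necessarily the exact evaluation script used for these tables):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: character edits / reference length."""
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word Error Rate: word edits / reference word count."""
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())

print(cer("tachiwin", "tachiwim"))  # 0.125 (1 substitution over 8 characters)
```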

Version Comparison: 1.0 → 1.5

| Metric | Tachiwin-OCR v1.0 | Tachiwin-OCR v1.5 | Δ Change |
|---|---|---|---|
| CER | 6.80% | 2.03% | −4.77pp |
| WER | 17.36% | 3.60% | −13.76pp |
| Accuracy (1 − CER) | 93.20% | 97.97% | +4.77pp |
| Word Accuracy (1 − WER) | 82.64% | 96.40% | +13.76pp |
| Relative CER Reduction | 10.4% | 88.5% | +78.1pp |
| Relative WER Reduction | 31.0% | 90.7% | +59.7pp |
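
The relative-reduction figures follow directly from the raw and fine-tuned error rates: the fraction of the raw error eliminated by fine-tuning. Assuming that standard formula, the v1.5 figures can be reproduced as:

```python
def relative_reduction(raw, fine_tuned):
    """Percentage of the raw error eliminated after fine-tuning."""
    return (raw - fine_tuned) / raw * 100

# v1.5 summary-table figures
print(round(relative_reduction(17.65, 2.03), 1))  # CER: 88.5
print(round(relative_reduction(38.59, 3.60), 1))  # WER: 90.7
```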

Detailed Comparison — v1.5 Sample Results

Results across 21 language samples. Languages with tonal or complex diacritic systems show the most dramatic improvements:

| # | Language (Code) | Raw CER | FT CER | Raw WER | FT WER | CER Improvement |
|---|---|---|---|---|---|---|
| 0 | zpo (Zapotec) | 0.24% | 0.00% | 1.12% | 0.00% | +0.24% |
| 1 | maz (Central Mazahua) | 0.41% | 0.00% | 2.27% | 0.00% | +0.41% |
| 2 | zao (Zapotec) | 6.18% | 3.49% | 23.61% | 12.50% | +2.69% |
| 3 | mat (Matlatzinca) | 6.51% | 0.00% | 42.55% | 0.00% | +6.51% |
| 4 | amu (Amuzgo) | 85.52% | 0.00% | 89.13% | 0.00% | +85.52% |
| 5 | mxp (Mixe) | 15.91% | 11.87% | 54.90% | 9.80% | +4.04% |
| 6 | yaq (Yaqui) | 1.82% | 0.00% | 3.12% | 0.00% | +1.82% |
| 7 | poe (Popoloca) | 6.78% | 3.39% | 62.50% | 12.50% | +3.39% |
| 8 | zpc (Zapotec) | 9.43% | 2.05% | 42.11% | 13.16% | +7.38% |
| 9 | sei (Seri) | 1.89% | 0.00% | 10.61% | 0.00% | +1.89% |
| 10 | lac (Lacandon) | 9.80% | 0.00% | 42.31% | 0.00% | +9.80% |
| 11 | zao (Zapotec) | 93.01% | 0.00% | 100.00% | 0.00% | +93.01% |
| 12 | mxt (Mixtec) | 6.70% | 0.00% | 19.18% | 0.00% | +6.70% |
| 13 | huv (San Marcos Huistepec Zapotec) | 1.41% | 0.00% | 10.34% | 0.00% | +1.41% |
| 14 | tee (Huehuetla Tepehua) | 3.03% | 0.00% | 17.33% | 0.00% | +3.03% |
| 15 | tzh (Tzeltal) | 2.67% | 0.00% | 15.91% | 0.00% | +2.67% |
| 16 | mto (Totontepec Mixe) | 93.12% | 32.47% | 100.00% | 39.71% | +60.65% |
| 17 | amu (Amuzgo) | 14.96% | 2.36% | 52.46% | 1.64% | +12.60% |
| 18 | mih (Chayuco Mixtec) | 3.76% | 0.00% | 9.52% | 0.00% | +3.76% |
| 19 | zpm (Mixtec) | 6.98% | 0.00% | 32.73% | 0.00% | +6.98% |
| 20 | toc (Tojolabal) | 11.32% | 0.00% | 57.14% | 0.00% | +11.32% |
| | AVERAGE | 17.65% | 2.03% | 38.59% | 3.60% | +15.61% |

Key Findings

  • Unprecedented Accuracy Gains: 14 out of 21 languages achieved a fine-tuned CER of 0.00%, meaning perfect character-level recognition on those samples — a result not seen in v1.0.

  • Hardest Cases Tackled: Languages like Amuzgo (amu) and Zapotec (zao, sample 11) started with CERs above 85–93% and were reduced to zero after fine-tuning, representing improvements of over 85 and 93 percentage points respectively.

  • Remaining Challenges: mto (Totontepec Mixe) remains the most difficult language in the set, with a fine-tuned CER of 32.47% — still a 65% relative improvement over its raw baseline, but indicating further work is needed for highly complex orthographies.

  • Word-Level Leap: WER dropped from 38.59% to just 3.60%, a 34.99 percentage point absolute improvement, compared to only 7.81pp in v1.0, demonstrating a qualitative leap in the model's ability to reconstruct full word forms in these language families.

  • Robustness: The model continues to show high resilience against the synthetic distortions applied during the data generation phase.

Tachiwin (from Totonac, "language") is dedicated to bridging the digital divide for the indigenous languages of Mexico through AI technology.

  • Developed by: Tachiwin

  • License: apache-2.0

  • Fine-tuned from model: PaddlePaddle/PaddleOCR-VL-1.5

This paddleocr_vl model was trained 2x faster with Unsloth

Model size: 1.0B params (Safetensors, BF16)


Evaluation results

  • Character Error Rate (CER) on Tachiwin Multilingual OCR LLM
    self-reported
    2.030
  • Word Error Rate (WER) on Tachiwin Multilingual OCR LLM
    self-reported
    3.600
  • OCR Accuracy (1 - CER) on Tachiwin Multilingual OCR LLM
    self-reported
    97.970
  • Word Accuracy (1 - WER) on Tachiwin Multilingual OCR LLM
    self-reported
    96.400