tachiwin/PaddleOCR-VL-Tachiwin-BF16

ljcamargo committed on Dec 23, 2025

Commit c13d70e · verified · 1 Parent(s): cef1f9d

Update README.md

Files changed (1)

README.md +33 -1

@@ -15,7 +15,9 @@ datasets:
 ---
 # TachiwinOCR
-*for the Indigenous Languages of Mexico*
+**for the Indigenous Languages of Mexico**
+_16 bits precision_
 This is a PaddleOCR-VL Finetune specialized in the 68 indigenous languages of Mexico and their diverse character and glyph repertoire making a world first in tech access and linguistic rights
@@ -81,6 +83,36 @@ generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
 print(generated_text)
 ```
+---
+## 📊 Benchmark Results
+Tachiwin-OCR was evaluated against the base PaddleOCR-VL model on a diverse subset of Indigenous language samples. Fine-tuning lowered both the character error rate and the word error rate, as summarized below.
+### Summary Metrics
+| Metric | Base Model (Raw) | Tachiwin-OCR (Fine-tuned) | Improvement |
+| :--- | :---: | :---: | :---: |
+| **Character Error Rate (CER)** | 7.59% | 6.80% | **10.4% (Relative Reduction)** |
+| **Word Error Rate (WER)** | 25.17% | 17.36% | **7.81 points (Absolute Reduction)** |
+| **OCR Accuracy (1 - CER)** | 92.41% | 93.20% | **+0.79 points (Absolute)** |
+### Detailed Comparison (Sample)
+A subset of the evaluation results across different languages, where tonal languages are the most improved by this fine-tuning:
+| Language | Raw CER | FT CER | Raw WER | FT WER | CER Improvement (Absolute) |
+| :--- | :---: | :---: | :---: | :---: | :---: |
+| `stp` (Tepehuán) | 10.95% | 0.00% | 43.55% | 0.00% | +10.95% |
+| `maz` (Central Mazahua) | 3.29% | 0.41% | 9.09% | 0.00% | +2.88% |
+| `chj` (Ojitlán Chinantec) | 16.97% | 2.21% | 52.78% | 9.72% | +14.76% |
+| `maa` (Tecóatl Mazatec) | 86.70% | 8.49% | 105.08% | 10.17% | +78.21% |
+### Key Findings
+- **High Accuracy Gains:** In tonal languages like Tepehuán (`stp`) and Mazatec (`maa`), fine-tuning reduced the CER from substantial levels to zero or single digits (10.95% to 0.00% for `stp`, 86.70% to 8.49% for `maa`).
+- **Robustness:** The model shows high resilience against synthetic distortions implemented during the data generation phase.
+- **Word-Level Performance:** The drop in Word Error Rate (WER) from 25.17% to 17.36%, a relative reduction of about 31%, highlights the model's improved capability in contextualizing character sequences specific to these language families.
 **Tachiwin** (from Totonac - "Language") is dedicated to bridging
 the digital divide for indigenous languages of Mexico through AI technology.
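
The README change above reports CER and WER without showing how they are computed. The sketch below illustrates the conventional definitions (edit distance divided by reference length) and the difference between an absolute and a relative reduction. It is a minimal illustration, not the evaluation script used for this model: the function names are hypothetical, and it assumes whitespace tokenization for words and a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the reference
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference string."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate; assumes whitespace tokenization (an assumption here)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def absolute_reduction(base, tuned):
    """Difference in percentage points between two error rates."""
    return base - tuned


def relative_reduction(base, tuned):
    """Reduction expressed as a fraction of the base error rate."""
    return (base - tuned) / base


# Figures taken from the Summary Metrics table in the diff.
print(round(absolute_reduction(25.17, 17.36), 2))      # 7.81
print(round(100 * relative_reduction(7.59, 6.80), 1))  # 10.4
```

Because the denominator is the reference length, an error rate can exceed 100% when the hypothesis contains many insertions, which is consistent with the 105.08% raw WER listed for `maa`.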