ljcamargo committed on
Commit
c13d70e
verified
1 Parent(s): cef1f9d

Update README.md

Files changed (1)
  1. README.md +33 -1
README.md CHANGED
@@ -15,7 +15,9 @@ datasets:
  ---
 
  # TachiwinOCR
- *for the Indigenous Languages of Mexico*
 
  This is a PaddleOCR-VL Finetune specialized in the 68 indigenous languages of Mexico and their diverse character and glyph repertoire making a world first in tech access and linguistic rights
 
@@ -81,6 +83,36 @@ generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
  print(generated_text)
  ```
 
  **Tachiwin** (from Totonac - "Language") is dedicated to bridging
  the digital divide for indigenous languages of Mexico through AI technology.
 
 
  ---
 
  # TachiwinOCR
+ **for the Indigenous Languages of Mexico**
+
+ _16-bit precision_
 
  This is a PaddleOCR-VL Finetune specialized in the 68 indigenous languages of Mexico and their diverse character and glyph repertoire making a world first in tech access and linguistic rights
 
 
  print(generated_text)
  ```
 
+ ---
+
+ ## 📊 Benchmark Results
+
+ Tachiwin-OCR was evaluated against the base PaddleOCR-VL model on a diverse subset of Indigenous-language samples. Fine-tuning reduced the Character Error Rate from 7.59% to 6.80% and the Word Error Rate from 25.17% to 17.36%.
+
+ ### Summary Metrics
+
+ | Metric | Base Model (Raw) | Tachiwin-OCR (Fine-tuned) | Improvement |
+ | :--- | :---: | :---: | :---: |
+ | **Character Error Rate (CER)** | 7.59% | 6.80% | **10.4% (Relative Reduction)** |
+ | **Word Error Rate (WER)** | 25.17% | 17.36% | **7.81 points (Absolute Reduction)** |
+ | **OCR Accuracy (1 - CER)** | 92.41% | 93.20% | **+0.79 points (Absolute)** |
+
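The Improvement column mixes relative and absolute figures, so it can help to recompute both from the reported error rates. The following is a minimal Python sketch that uses only the four values in the table; the helper names `absolute_reduction` and `relative_reduction` are illustrative and are not part of Tachiwin-OCR or PaddleOCR.

```python
def absolute_reduction(base: float, tuned: float) -> float:
    """Error-rate drop in percentage points."""
    return base - tuned


def relative_reduction(base: float, tuned: float) -> float:
    """Error-rate drop as a percentage of the base error rate."""
    return (base - tuned) / base * 100.0


# Summary metrics copied from the table, in percent: (base model, fine-tuned).
summary = {"CER": (7.59, 6.80), "WER": (25.17, 17.36)}

for name, (base, tuned) in summary.items():
    print(
        f"{name}: {absolute_reduction(base, tuned):.2f} points absolute, "
        f"{relative_reduction(base, tuned):.1f}% relative"
    )
# CER: 0.79 points absolute, 10.4% relative
# WER: 7.81 points absolute, 31.0% relative
```

Reporting both forms side by side avoids comparing a relative CER figure against an absolute WER figure.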
+ ### Detailed Comparison (Sample)
+
+ A subset of the per-language evaluation results; tonal languages show the largest gains from fine-tuning. The last column is the absolute CER reduction (Raw CER minus FT CER) in percentage points (pp):
+
+ | Language | Raw CER | FT CER | Raw WER | FT WER | Absolute CER Reduction |
+ | :--- | :---: | :---: | :---: | :---: | :---: |
+ | `stp` (Tepehuán) | 10.95% | 0.00% | 43.55% | 0.00% | 10.95 pp |
+ | `maz` (Central Mazahua) | 3.29% | 0.41% | 9.09% | 0.00% | 2.88 pp |
+ | `chj` (Ojitlán Chinantec) | 16.97% | 2.21% | 52.78% | 9.72% | 14.76 pp |
+ | `maa` (Tecóatl Mazatec) | 86.70% | 8.49% | 105.08% | 10.17% | 78.21 pp |
+
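A Raw WER above 100% (as reported for `maa`) is possible under the usual definition, where the error rate is the edit distance divided by the reference length, so a hypothesis with many extra words can exceed 100%. The sketch below assumes that standard Levenshtein-based definition; the README does not state which evaluation tool or text normalization was used, and the sample strings are invented for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate on whitespace tokens; assumes a non-empty reference."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


# Invented strings, not taken from the evaluation set.
reference = "one two"
hypothesis = "on e tw o"
print(f"CER = {cer(reference, hypothesis):.0%}, WER = {wer(reference, hypothesis):.0%}")
# CER = 29%, WER = 200%
```

Two stray spaces cost only two character edits but break both words, which is one way WER can rise far above CER on the same sample.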
+ ### Key Findings
+ - **High Accuracy Gains:** In tonal languages such as Tepehuán (`stp`) and Mazatec (`maa`), fine-tuning cut the CER from 10.95% to 0.00% and from 86.70% to 8.49%, respectively.
+ - **Robustness:** The model is resilient to the synthetic distortions applied during the data generation phase.
+ - **Word-Level Performance:** The drop in Word Error Rate (WER) from 25.17% to 17.36%, a 31.0% relative reduction, highlights the model's improved ability to contextualize character sequences specific to these language families.
+
  **Tachiwin** (from Totonac - "Language") is dedicated to bridging
  the digital divide for indigenous languages of Mexico through AI technology.