---
license: apache-2.0
language:
- he
- bn
- gu
- kn
- ml
tags:
- ocr
- text-recognition
- paddleocr
- mnn
pipeline_tag: image-to-text
---
# PP-OCRv6 fine-tuned recognizers for Hebrew + Indic
These are two fine-tunes of the PP-OCRv6 'small' recognizer: one for Hebrew and one for Indic scripts (Bengali, Gujarati, Kannada, Malayalam). Both also recognize Latin.
The Hebrew model does not handle niqqud (vowel points).
Both models were trained exclusively on synthetic data. Evaluation was limited to 3 pictures, on which the results were better than Tesseract's.
- Input strip height is **48** pixels; the output is already softmax-normalized, so the per-character confidence is the max probability.
- Glyphs are emitted in visual (left-to-right) order, so Hebrew output needs reversal logic to recover logical order.
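
The two notes above can be sketched in Python. This is a rough illustration, not the code used in translator-rs. It assumes the recognizer has a CTC head with the blank at class 0 and a charset list that excludes the blank (the usual PaddleOCR layout, not confirmed here). The names `ctc_greedy_decode` and `visual_to_logical` are hypothetical, and the reversal is a simplification of the Unicode bidirectional algorithm.

```python
from typing import List, Tuple

BLANK = 0  # assumption: CTC blank is class 0, charset starts at class 1


def ctc_greedy_decode(
    probs: List[List[float]], charset: List[str]
) -> List[Tuple[str, float]]:
    """Greedy CTC decode over rows of softmax probabilities.

    Returns (char, confidence) pairs, where confidence is the max
    probability at the step on which the character is first emitted.
    No extra softmax is applied, since the model output already is one.
    """
    out = []
    prev = BLANK
    for step in probs:
        idx = max(range(len(step)), key=step.__getitem__)
        # Skip blanks and collapse consecutive repeats of the same class.
        if idx != BLANK and idx != prev:
            out.append((charset[idx - 1], step[idx]))
        prev = idx
    return out


def is_hebrew(ch: str) -> bool:
    """True for code points in the Unicode Hebrew block."""
    return "\u0590" <= ch <= "\u05FF"


def is_ltr_run_char(ch: str) -> bool:
    """ASCII letters and digits keep their left-to-right order."""
    return ch.isascii() and ch.isalnum()


def visual_to_logical(text: str) -> str:
    """Convert visual-order output to logical order (simplified).

    Lines without Hebrew are returned unchanged. Otherwise the whole
    line is reversed, then each run of ASCII letters/digits is flipped
    back so Latin words and numbers stay readable.
    """
    if not any(is_hebrew(c) for c in text):
        return text
    chars = list(reversed(text))
    i = 0
    while i < len(chars):
        if is_ltr_run_char(chars[i]):
            j = i
            while j < len(chars) and is_ltr_run_char(chars[j]):
                j += 1
            chars[i:j] = reversed(chars[i:j])
            i = j
        else:
            i += 1
    return "".join(chars)
```

The run-flipping step handles a single Latin word or number inside a Hebrew line. Several consecutive Latin words end up in swapped word order, so mixed text is better served by a full Unicode bidi implementation.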
## Training code
`scripts/rec_model/` in [translator-rs](https://github.com/DavidVentura/translator-rs).
## License
The models are fine-tunes of PP-OCRv6 (Apache-2.0). The synthetic training data was rendered with mixed-license
fonts (Culmus, Google Fonts OFL, SIL) using text from the Leipzig corpora.