---
license: apache-2.0
language:
- he
- bn
- gu
- kn
- ml
tags:
- ocr
- text-recognition
- paddleocr
- mnn
pipeline_tag: image-to-text
---
# PP-OCRv6 fine-tuned recognizers for Hebrew + Indic
This repository contains two fine-tunes of the PP-OCRv6 'small' recognizer: one for Hebrew, and one for Bengali, Gujarati, Kannada, and Malayalam. Both also recognize Latin script.
The Hebrew model does not recognize niqqud (vowel points).
The models were trained exclusively on synthetic data. Evaluation was limited to 3 pictures, on which they did better than Tesseract.
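- Input strip height is **48** pixels; the output is already softmaxed, so the per-character confidence is the maximum probability at that timestep.
- Glyphs are emitted in visual (left-to-right) order, so Hebrew output needs reversal logic to recover logical reading order.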
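
The two notes above can be sketched in Python as follows. This is a minimal illustration, not the code used by translator-rs. It assumes the model output is a `(T, C)` table of per-timestep probabilities decoded with greedy CTC, and that the blank class sits at index 0 (as in stock PaddleOCR dictionaries; check this against the dictionary shipped with the model). The names `ctc_greedy_decode`, `visual_to_logical_hebrew`, and `charset` are hypothetical. The reversal is deliberately naive: it flips the whole line and then restores ASCII letter and digit runs, which is not a full implementation of the Unicode bidirectional algorithm.

```python
import re

BLANK = 0  # assumption: the CTC blank class is index 0


def ctc_greedy_decode(probs, charset):
    """Greedy CTC decode of already-softmaxed output.

    probs:   sequence of T rows, each holding C probabilities.
    charset: list of C - 1 characters; class i maps to charset[i - 1].
    Returns the decoded text and one confidence per emitted character.
    """
    chars, scores = [], []
    prev = BLANK
    for row in probs:
        # Most likely class at this timestep and its probability.
        idx = max(range(len(row)), key=row.__getitem__)
        # Emit only when the class is not blank and differs from the previous step.
        if idx != BLANK and idx != prev:
            chars.append(charset[idx - 1])
            scores.append(float(row[idx]))
        prev = idx
    return "".join(chars), scores


# Runs of ASCII letters and digits, which read left-to-right.
_LTR_RUN = re.compile(r"[0-9A-Za-z]+")


def visual_to_logical_hebrew(text):
    """Naive visual-to-logical conversion for a single Hebrew line.

    Reverses the whole line, then re-reverses each ASCII run so that
    embedded Latin words and numbers keep their original order.
    """
    reversed_line = text[::-1]
    return _LTR_RUN.sub(lambda m: m.group(0)[::-1], reversed_line)
```

For mixed-direction text with punctuation or nested runs, a library implementing the Unicode bidirectional algorithm is likely a better fit than this heuristic.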
## Training code
The training code lives in `scripts/rec_model/` in [translator-rs](https://github.com/DavidVentura/translator-rs).
## License
The models are fine-tunes of PP-OCRv6 (Apache-2.0). The synthetic training data was rendered with mixed-license
fonts (Culmus, Google Fonts OFL, SIL) over text from the Leipzig corpora.