Duplicate from davidventura/translator-ppocr-rec
metadata
license: apache-2.0
language:
  - he
  - bn
  - gu
  - kn
  - ml
tags:
  - ocr
  - text-recognition
  - paddleocr
  - mnn
pipeline_tag: image-to-text

PP-OCRv6 fine-tuned recognizers for Hebrew + Indic

Two fine-tunes of PP-OCRv6 'small': one for Hebrew, and one for Bengali, Gujarati, Kannada, and Malayalam. Both also recognize Latin script.

The Hebrew model does not recognize niqqud (vowel points).

Trained exclusively on synthetic data. Evaluated against only 3 pictures, where the results were better than Tesseract's.

  • Input strip height is 48 pixels; the output is already softmax-normalized, so per-character confidence is the maximum probability at that step.
  • Glyphs are emitted in visual (left-to-right) order, so Hebrew output needs reversal logic to recover logical order.
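
The two notes above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in translator-rs: the function names, the `charset` list, and the choice of index 0 as the CTC blank are assumptions, and the reversal step is a simplification that only handles Hebrew mixed with ASCII letters and digits rather than the full Unicode bidirectional algorithm.

```python
import re

BLANK = 0  # assumption: the CTC blank class sits at index 0

def ctc_greedy_decode(probs, charset):
    """Greedy CTC decode over rows of softmax probabilities.

    probs: one row of class probabilities per time step (already softmax).
    charset: hypothetical list where charset[i] is the character for class i + 1.
    Returns the decoded text and one confidence per emitted character.
    """
    chars, confs = [], []
    prev = BLANK
    for row in probs:
        idx = max(range(len(row)), key=row.__getitem__)  # argmax of the row
        # Emit only on a non-blank class that differs from the previous step.
        if idx != BLANK and idx != prev:
            chars.append(charset[idx - 1])
            confs.append(row[idx])  # confidence = max prob at the first step of the run
        prev = idx
    return "".join(chars), confs

# Runs of ASCII letters/digits, optionally joined by simple separators.
_LTR_RUN = re.compile(r"[A-Za-z0-9]+(?:[ .,:/\-]+[A-Za-z0-9]+)*")

def visual_to_logical(visual):
    """Turn a visually ordered Hebrew line into logical order (simplified).

    Reverses the whole line, then flips embedded Latin/digit runs back so
    they still read left to right.
    """
    reversed_line = visual[::-1]
    return _LTR_RUN.sub(lambda m: m.group(0)[::-1], reversed_line)
```

A caller would typically run `ctc_greedy_decode` on the model output for one text strip and apply `visual_to_logical` only to lines from the Hebrew model. Lines containing brackets or other mirrored punctuation would need a proper bidi implementation.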

Training code

The training code lives in scripts/rec_model/ in translator-rs.

License

The models are fine-tunes of PP-OCRv6 (Apache-2.0). The synthetic training data was rendered with mixed-license fonts (Culmus, Google Fonts OFL, SIL) over text from the Leipzig corpora.