Duplicated from davidventura/translator-ppocr-rec

	---
	license: apache-2.0
	language:
	- he
	- bn
	- gu
	- kn
	- ml
	tags:
	- ocr
	- text-recognition
	- paddleocr
	- mnn
	pipeline_tag: image-to-text
	---

	# PP-OCRv6 fine-tuned recognizers for Hebrew + Indic

	Two fine-tunes of the PP-OCRv6 'small' recognizer: one for Hebrew and one for Bengali, Gujarati, Kannada, and Malayalam. Both also recognize Latin script.

	The Hebrew model does not recognize niqqud (vowel points).

	The models were trained exclusively on synthetic data and evaluated against 3 pictures, on which they did better than Tesseract.

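	- Input strip height is 48 px. The output is already softmax-normalized, so the per-character confidence is the maximum probability at that time step.
	- Glyphs are emitted in visual (left-to-right) order, so Hebrew output needs reversal logic to recover logical order.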
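The two notes above can be sketched in plain Python. This is a minimal illustration, not the code used in translator-rs. It assumes the recognizer is decoded CTC-style with the blank class at index 0 (the usual PaddleOCR layout, not confirmed by this card), and the names `greedy_decode`, `visual_to_logical`, and `charset` are hypothetical. The reordering step is a simplification that only keeps ASCII letter/digit runs left-to-right; it is not the full Unicode Bidirectional Algorithm.

```python
BLANK = 0  # assumption: class 0 is the CTC blank


def greedy_decode(probs, charset):
    """Greedy CTC-style decode of softmax output.

    probs:   list of T rows, each a list of C probabilities (already softmax).
    charset: list of C - 1 characters; class i maps to charset[i - 1].
    Returns the decoded text and one confidence (max prob) per emitted char.
    """
    chars, confs = [], []
    prev = BLANK
    for row in probs:
        idx = max(range(len(row)), key=row.__getitem__)  # argmax
        # Emit only when the class changes and is not the blank.
        if idx != BLANK and idx != prev:
            chars.append(charset[idx - 1])
            confs.append(row[idx])
        prev = idx
    return "".join(chars), confs


def is_hebrew(ch):
    """True for code points in the Unicode Hebrew block."""
    return "\u0590" <= ch <= "\u05ff"


def visual_to_logical(text):
    """Turn visual (left-to-right) glyph order into logical order.

    Lines without Hebrew are returned unchanged. Otherwise the line is
    reversed, and runs of ASCII letters/digits are flipped back so that
    Latin words and numbers still read left-to-right.
    """
    if not any(is_hebrew(c) for c in text):
        return text
    out, run = [], []
    for ch in reversed(text):
        if ch.isascii() and ch.isalnum():
            run.append(ch)
        else:
            out.extend(reversed(run))
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)
```

A caller would run `greedy_decode` on the model output for one strip and then pass the text through `visual_to_logical`. Mixed lines with punctuation or brackets would need proper bidi handling.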

	## Training code
	The training code lives in `scripts/rec_model/` in [translator-rs](https://github.com/DavidVentura/translator-rs).

	## License
	The models are fine-tunes of PP-OCRv6 (Apache-2.0). The synthetic training data was rendered with mixed-license
	fonts (Culmus, Google Fonts OFL, SIL) using text from the Leipzig corpora.