Arabic OCR Readability Benchmark
Last run: June 8, 2026.
Benchmark file: test_pdfs/arabic-reader-5-page-test.pdf
Scoring uses the app's assess_text_quality and speech-readiness metrics: Arabic word count, common Arabic word hits, one-letter fragment ratio, low-information line ratio, placeholder ratio, and total quality score. A higher score is better, and a quality label of good is preferred over warning.
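As a rough illustration of what two of these metrics measure, the sketch below counts Arabic words and computes a one-letter fragment ratio. It is not the app's assess_text_quality: the function names, the Unicode ranges, and the exact definitions are assumptions made for illustration only.

```python
import re

# Assumption: an "Arabic word" is a run of characters from U+0621-U+064A
# (letters) and U+064B-U+0652 (combining vowel marks).
ARABIC_WORD = re.compile(r"[\u0621-\u0652]+")
DIACRITICS = re.compile(r"[\u064B-\u0652]")


def arabic_words(text: str) -> list[str]:
    """Return every Arabic-script token found in the text."""
    return ARABIC_WORD.findall(text)


def one_letter_fragment_ratio(text: str) -> float:
    """Share of Arabic tokens that are at most one letter once diacritics are removed."""
    words = arabic_words(text)
    if not words:
        return 0.0
    fragments = [w for w in words if len(DIACRITICS.sub("", w)) <= 1]
    return len(fragments) / len(words)
```

A high ratio suggests the OCR engine split words into isolated letters, which is what makes text hard to read aloud.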
Result
Recommended OCR:
OCR_ENGINE=tesseract
OCR_RENDER_ZOOM=2
TESSERACT_PSM=4
This setting produced the highest-scoring, most readable output on the 5-page sample (score 11919.05, quality good) in 37.30 seconds, which keeps it practical for full-book jobs.
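One way the three variables could be read and turned into a Tesseract command line is sketched below. The variable names come from this report; the OcrSettings class, its defaults, and the tesseract_args helper are hypothetical and are not taken from the app. The render zoom is assumed to apply when the PDF page is rasterised, before this command runs.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class OcrSettings:
    engine: str
    render_zoom: float
    psm: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "OcrSettings":
        # Defaults mirror the recommended values in this report (an assumption).
        psm = int(env.get("TESSERACT_PSM", "4"))
        if not 0 <= psm <= 13:  # Tesseract defines page segmentation modes 0-13
            raise ValueError(f"TESSERACT_PSM out of range: {psm}")
        return cls(
            engine=env.get("OCR_ENGINE", "tesseract"),
            render_zoom=float(env.get("OCR_RENDER_ZOOM", "2")),
            psm=psm,
        )

    def tesseract_args(self, image_path: str) -> list[str]:
        """Command line for one rendered page image: Arabic model, text to stdout."""
        return ["tesseract", image_path, "stdout", "-l", "ara", "--psm", str(self.psm)]
```

Passing an explicit mapping instead of os.environ makes the settings easy to check without touching the real environment.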
Top 3 tested OCR settings:
1. Tesseract Arabic - Best readable: OCR_ENGINE=tesseract OCR_RENDER_ZOOM=2 TESSERACT_PSM=4
2. Tesseract Arabic - Faster readable: OCR_ENGINE=tesseract-fast OCR_RENDER_ZOOM=1.5 TESSERACT_PSM=6
3. PaddleOCR Arabic - Faster fallback: OCR_ENGINE=paddleocr
| OCR setting | Pages | Seconds | Quality | Score | Arabic words | Fragment line ratio | Extraction |
|---|---|---|---|---|---|---|---|
| Tesseract 2x PSM 4 | 5 | 37.30 | good | 11919.05 | 3120 | 0.0433 | tesseract@2x-psm4 |
| Tesseract default PSM 6 | 5 | 28.88 | good | 11510.50 | 3284 | 0.0166 | tesseract@1.5x-psm6 |
| PaddleOCR Arabic | 5 | 106.91 | warning | 8105.80 | 2251 | 0.3133 | paddleocr |
| Auto fallback | 5 | 104.47 | warning | 8105.80 | 2251 | 0.3133 | paddleocr |
| EasyOCR mode | 5 | 102.39 | warning | 8105.80 | 2251 | 0.3133 | paddleocr |
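The ordering rule used in this report (good before warning, then higher score) can be reproduced on the rows of the table. This is only a sketch: the tuple layout and the rank_runs helper are illustrative and are not part of the app.

```python
# (setting, quality, score) copied from the 5-page table.
RUNS = [
    ("Tesseract 2x PSM 4", "good", 11919.05),
    ("Tesseract default PSM 6", "good", 11510.50),
    ("PaddleOCR Arabic", "warning", 8105.80),
    ("Auto fallback", "warning", 8105.80),
    ("EasyOCR mode", "warning", 8105.80),
]

QUALITY_RANK = {"good": 0, "warning": 1}  # lower rank sorts first


def rank_runs(runs):
    """Sort by quality label first, then by descending score."""
    return sorted(runs, key=lambda r: (QUALITY_RANK[r[1]], -r[2]))


best = rank_runs(RUNS)[0]
print(best[0])  # Tesseract 2x PSM 4
```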
The slower comparison modes were tested on the 1-page sample because the full 5-page comparison exceeded the 10-minute run window. Both modes selected the same underlying winner, tesseract@2x-psm4, but took roughly 4.5 to 4.7 minutes for one page:
| OCR setting | Pages | Seconds | Quality | Score | Arabic words | Extraction |
|---|---|---|---|---|---|---|
| Arabic OCR comparison | 1 | 280.76 | good | 3565.85 | 719 | arabic:tesseract@2x-psm4 |
| Maximum Arabic OCR | 1 | 268.47 | good | 3565.85 | 719 | arabic-max:tesseract@2x-psm4 |
Interpretation
The arabic and arabic-max modes are useful short-sample diagnostics because they can compare installed OCR engines and pick the cleanest text. They are not the right default for long PDFs on the current free worker: they take more than four minutes per page and selected Tesseract anyway.
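To put the per-page cost in numbers, the timings from the two tables can be converted to seconds per page and projected onto a longer document. The 300-page book length is a hypothetical figure chosen only for illustration, and the projection assumes that run time scales linearly with page count.

```python
# (label, pages, seconds) copied from the tables in this report.
TIMINGS = [
    ("tesseract@2x-psm4", 5, 37.30),
    ("tesseract@1.5x-psm6", 5, 28.88),
    ("paddleocr", 5, 106.91),
    ("arabic comparison", 1, 280.76),
    ("arabic-max", 1, 268.47),
]

BOOK_PAGES = 300  # hypothetical book length, not from the benchmark


def seconds_per_page(pages: int, seconds: float) -> float:
    return seconds / pages


def projected_minutes(pages: int, seconds: float, total_pages: int = BOOK_PAGES) -> float:
    """Linear extrapolation; real jobs may not scale linearly."""
    return seconds_per_page(pages, seconds) * total_pages / 60


for label, pages, seconds in TIMINGS:
    print(f"{label}: {seconds_per_page(pages, seconds):.2f} s/page, "
          f"~{projected_minutes(pages, seconds):.0f} min for {BOOK_PAGES} pages")
```

Under that assumption the recommended setting works out to 7.46 s/page, against about 280 s/page for the comparison mode.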
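PaddleOCR is available and works, but on this book sample it returned many low-information lines and more fragmented Arabic text: a fragment line ratio of 0.3133 against 0.0433 for Tesseract 2x PSM 4, and 2251 Arabic words against 3120. It remains a fallback, not the recommendation.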
The live/default website setting should therefore be option 1, Tesseract Arabic - Best readable.