# Arabic OCR Readability Benchmark

Last run: June 8, 2026.

Benchmark file: `test_pdfs/arabic-reader-5-page-test.pdf`

Scoring uses the app's `assess_text_quality` and speech-readiness metrics: Arabic word count, common Arabic word hits, one-letter fragment ratio, low-information line ratio, placeholder ratio, and total quality score. Higher score is better; `good` is preferred over `warning`.

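
As a rough illustration of two of the scoring metrics (Arabic word count and one-letter fragment ratio), the sketch below shows one way they could be computed. It is not the app's `assess_text_quality`: the function names, the Unicode range used to recognize Arabic words, and the exact ratio definition are all assumptions made for this example.

```python
import re

# Assumption: an "Arabic word" is a run of characters from the basic Arabic
# block, letters U+0621..U+064A plus combining marks U+064B..U+0652.
ARABIC_WORD = re.compile(r"[\u0621-\u0652]+")


def arabic_words(text: str) -> list[str]:
    """Return every run of Arabic characters found in the text."""
    return ARABIC_WORD.findall(text)


def one_letter_fragment_ratio(text: str) -> float:
    """Share of Arabic words that are a single character long.

    A high value usually means the OCR engine split words into
    isolated letters. Returns 0.0 when the text has no Arabic words.
    """
    words = arabic_words(text)
    if not words:
        return 0.0
    fragments = sum(1 for word in words if len(word) == 1)
    return fragments / len(words)
```

A real scorer would likely also weight common-word hits and placeholder characters; this sketch only covers the two simplest counts.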
## Result

Recommended OCR:

```text
OCR_ENGINE=tesseract
OCR_RENDER_ZOOM=2
TESSERACT_PSM=4
```

This setting produced the highest quality score on the 5-page sample (11919.05, in 37.30 seconds) while staying practical for full-book jobs.

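
How the app consumes these variables is not shown in this report, so the following is only a minimal sketch. The names `OcrSettings`, `load_settings`, and `tesseract_command` are hypothetical, and the defaults simply mirror the recommended setting. The `-l ara` and `--psm` options are standard Tesseract command-line options.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrSettings:
    engine: str
    render_zoom: float  # scale factor applied when rendering a PDF page to an image
    psm: int            # Tesseract page segmentation mode


def load_settings(env=None) -> OcrSettings:
    """Read the three settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return OcrSettings(
        engine=env.get("OCR_ENGINE", "tesseract"),
        render_zoom=float(env.get("OCR_RENDER_ZOOM", "2")),
        psm=int(env.get("TESSERACT_PSM", "4")),
    )


def tesseract_command(image_path: str, settings: OcrSettings) -> list[str]:
    """Build a Tesseract CLI call for one rendered page image.

    "stdout" makes Tesseract print the recognized text instead of
    writing an output file. The render zoom is not passed here because
    it applies earlier, when the PDF page is rasterized.
    """
    return ["tesseract", image_path, "stdout", "-l", "ara", "--psm", str(settings.psm)]
```

The resulting list can be passed to `subprocess.run` on a machine where Tesseract and its Arabic language data are installed.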
Top 3 tested OCR settings:

1. `Tesseract Arabic - Best readable`: `OCR_ENGINE=tesseract OCR_RENDER_ZOOM=2 TESSERACT_PSM=4`
2. `Tesseract Arabic - Faster readable`: `OCR_ENGINE=tesseract-fast OCR_RENDER_ZOOM=1.5 TESSERACT_PSM=6`
3. `PaddleOCR Arabic - Faster fallback`: `OCR_ENGINE=paddleocr`

| OCR setting | Pages | Seconds | Quality | Score | Arabic words | Fragment line ratio | Extraction |
| --- | ---: | ---: | --- | ---: | ---: | ---: | --- |
| Tesseract 2x PSM 4 | 5 | 37.30 | good | 11919.05 | 3120 | 0.0433 | `tesseract@2x-psm4` |
| Tesseract default PSM 6 | 5 | 28.88 | good | 11510.50 | 3284 | 0.0166 | `tesseract@1.5x-psm6` |
| PaddleOCR Arabic | 5 | 106.91 | warning | 8105.80 | 2251 | 0.3133 | `paddleocr` |
| Auto fallback | 5 | 104.47 | warning | 8105.80 | 2251 | 0.3133 | `paddleocr` |
| EasyOCR mode | 5 | 102.39 | warning | 8105.80 | 2251 | 0.3133 | `paddleocr` |

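
The ranking rule from the scoring description (prefer `good` over `warning`, then the higher score) can be applied to the 5-page results with a small sort. The tuple layout, the `rank` helper, and the final tie-break on run time are assumptions made for illustration; the numbers are copied from the table.

```python
# (setting, seconds, quality, score) copied from the 5-page results table.
RESULTS = [
    ("Tesseract 2x PSM 4", 37.30, "good", 11919.05),
    ("Tesseract default PSM 6", 28.88, "good", 11510.50),
    ("PaddleOCR Arabic", 106.91, "warning", 8105.80),
    ("Auto fallback", 104.47, "warning", 8105.80),
    ("EasyOCR mode", 102.39, "warning", 8105.80),
]

# Lower rank sorts first, so "good" beats "warning".
QUALITY_RANK = {"good": 0, "warning": 1}


def rank(results):
    """Sort by quality label, then descending score, then ascending seconds."""
    return sorted(results, key=lambda r: (QUALITY_RANK[r[2]], -r[3], r[1]))


ranked = rank(RESULTS)
print(ranked[0][0])  # prints "Tesseract 2x PSM 4"
```

The three `warning` rows share one score because all three runs ended up on the `paddleocr` extraction path, so only the run time separates them.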
The slower comparison modes were tested on the 1-page sample because the full 5-page comparison exceeded the 10-minute run window. Both selected the same underlying winner, `tesseract@2x-psm4`, but each took roughly 4.5 to 4.7 minutes for one page:

| OCR setting | Pages | Seconds | Quality | Score | Arabic words | Extraction |
| --- | ---: | ---: | --- | ---: | ---: | --- |
| Arabic OCR comparison | 1 | 280.76 | good | 3565.85 | 719 | `arabic:tesseract@2x-psm4` |
| Maximum Arabic OCR | 1 | 268.47 | good | 3565.85 | 719 | `arabic-max:tesseract@2x-psm4` |

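
To put the per-page cost in perspective, the measured times can be extrapolated to a full book. This is a back-of-the-envelope sketch: the 300-page book length is hypothetical, and it assumes run time scales linearly with page count, which the benchmark does not verify.

```python
def seconds_per_page(total_seconds: float, pages: int) -> float:
    """Average OCR time per page for one benchmark run."""
    return total_seconds / pages


def projected_hours(total_seconds: float, pages: int, book_pages: int) -> float:
    """Linear extrapolation of a benchmark run to a book of book_pages pages."""
    return seconds_per_page(total_seconds, pages) * book_pages / 3600


BOOK_PAGES = 300  # hypothetical book length, not from the benchmark

tesseract_hours = projected_hours(37.30, 5, BOOK_PAGES)    # Tesseract 2x PSM 4
comparison_hours = projected_hours(280.76, 1, BOOK_PAGES)  # Arabic OCR comparison

print(f"Tesseract 2x PSM 4:    {tesseract_hours:.2f} h")
print(f"Arabic OCR comparison: {comparison_hours:.2f} h")
```

Under these assumptions the recommended setting would finish in well under an hour, while the comparison mode would need close to a full day.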
## Interpretation

`arabic` and `arabic-max` are useful short-sample diagnostics because they compare the installed OCR engines and pick the cleanest text. They are not the right default for long PDFs on the current free worker: they spend minutes per page and selected Tesseract anyway.

PaddleOCR is available and works, but on this book sample it returned many low-information lines and more fragmented Arabic text. It remains a fallback, not the recommendation.

The live/default website setting should therefore be option 1, `Tesseract Arabic - Best readable`.