
Arabic OCR Readability Benchmark

Last run: June 8, 2026.

Benchmark file: test_pdfs/arabic-reader-5-page-test.pdf

Scoring uses the app's assess_text_quality and speech-readiness metrics: Arabic word count, common Arabic word hits, one-letter fragment ratio, low-information line ratio, placeholder ratio, and total quality score. Higher score is better; good is preferred over warning.
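As a rough illustration of two of these metrics, the sketch below counts Arabic words and computes a one-letter fragment ratio. It is not the app's assess_text_quality: the name `sketch_metrics`, the Unicode range used to detect Arabic, and the definition of a fragment as an Arabic token of length one are all assumptions made for illustration.

```python
import re

# Assumption: an "Arabic word" is a run of characters from the basic Arabic
# Unicode block (U+0600 to U+06FF). The app's real definition may differ.
ARABIC_TOKEN = re.compile(r"[\u0600-\u06FF]+")


def sketch_metrics(text: str) -> dict:
    """Return two illustrative readability metrics for OCR output."""
    tokens = ARABIC_TOKEN.findall(text)
    # Assumption: a one-letter fragment is an Arabic token of length 1.
    fragments = [t for t in tokens if len(t) == 1]
    ratio = len(fragments) / len(tokens) if tokens else 0.0
    return {
        "arabic_words": len(tokens),
        "one_letter_fragment_ratio": round(ratio, 4),
    }


# Four Arabic tokens, one of which is a stray single letter.
sample = "هذا كتاب ب جديد"
print(sketch_metrics(sample))
```

A lower fragment ratio suggests fewer stray letters for a speech engine to stumble over, which is presumably why the score penalizes it.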

Result

Recommended OCR:

OCR_ENGINE=tesseract
OCR_RENDER_ZOOM=2
TESSERACT_PSM=4

This setting produced the highest quality score on the 5-page sample (11919.05) and finished in 37.30 seconds, which keeps it practical for full-book jobs.
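One way a worker could read these three variables is sketched below. The variable names come from this benchmark; the `OcrSettings` class, the `load_settings` helper, and the fallback defaults are hypothetical, and the worker's actual loading code may look different. The command builder relies only on the Tesseract CLI's `-l` and `--psm` options with the `ara` language pack.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrSettings:
    engine: str
    render_zoom: float
    psm: int


def load_settings(env=None) -> OcrSettings:
    """Read OCR settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: defaults mirror the values recommended by this benchmark.
    return OcrSettings(
        engine=env.get("OCR_ENGINE", "tesseract"),
        render_zoom=float(env.get("OCR_RENDER_ZOOM", "2")),
        psm=int(env.get("TESSERACT_PSM", "4")),
    )


def tesseract_command(image_path: str, settings: OcrSettings) -> list:
    """Build a Tesseract CLI call that writes recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", "ara", "--psm", str(settings.psm)]


settings = load_settings(
    {"OCR_ENGINE": "tesseract", "OCR_RENDER_ZOOM": "2", "TESSERACT_PSM": "4"}
)
print(tesseract_command("page-1.png", settings))
```

The zoom value does not appear in the command because it presumably applies earlier, when each PDF page is rendered to an image before recognition.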

Top 3 tested OCR settings:

  1. Tesseract Arabic - Best readable: OCR_ENGINE=tesseract OCR_RENDER_ZOOM=2 TESSERACT_PSM=4
  2. Tesseract Arabic - Faster readable: OCR_ENGINE=tesseract-fast OCR_RENDER_ZOOM=1.5 TESSERACT_PSM=6
  3. PaddleOCR Arabic - Faster fallback: OCR_ENGINE=paddleocr

Results on the 5-page benchmark file:

| OCR setting | Pages | Seconds | Quality | Score | Arabic words | Fragment line ratio | Extraction |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Tesseract 2x PSM 4 | 5 | 37.30 | good | 11919.05 | 3120 | 0.0433 | tesseract@2x-psm4 |
| Tesseract default PSM 6 | 5 | 28.88 | good | 11510.50 | 3284 | 0.0166 | tesseract@1.5x-psm6 |
| PaddleOCR Arabic | 5 | 106.91 | warning | 8105.80 | 2251 | 0.3133 | paddleocr |
| Auto fallback | 5 | 104.47 | warning | 8105.80 | 2251 | 0.3133 | paddleocr |
| EasyOCR mode | 5 | 102.39 | warning | 8105.80 | 2251 | 0.3133 | paddleocr |
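
The ordering rule from the scoring description (good is preferred over warning, then the higher score wins) can be applied to the 5-page rows as in the sketch below. The figures are copied from the results; the `Run` tuple, the `rank` function, and the tie-break on run time are illustrative assumptions, not part of the app.

```python
from typing import NamedTuple


class Run(NamedTuple):
    setting: str
    seconds: float
    quality: str
    score: float


# 5-page results copied from the benchmark.
RUNS = [
    Run("Tesseract 2x PSM 4", 37.30, "good", 11919.05),
    Run("Tesseract default PSM 6", 28.88, "good", 11510.50),
    Run("PaddleOCR Arabic", 106.91, "warning", 8105.80),
    Run("Auto fallback", 104.47, "warning", 8105.80),
    Run("EasyOCR mode", 102.39, "warning", 8105.80),
]

# Assumption: only the two quality labels seen in this benchmark are ranked.
QUALITY_RANK = {"good": 0, "warning": 1}


def rank(runs):
    """Sort by quality label, then descending score, then run time."""
    return sorted(runs, key=lambda r: (QUALITY_RANK[r.quality], -r.score, r.seconds))


for run in rank(RUNS):
    print(f"{run.setting}: {run.quality}, {run.score:.2f}, {run.seconds / 5:.2f} s/page")
```

Under this ordering the two Tesseract rows come first, and the three rows that fell through to paddleocr differ only in run time.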

The slower comparison modes were tested on the 1-page sample because the full 5-page comparison exceeded the 10-minute run window. Both modes selected the same underlying winner, tesseract@2x-psm4, but each took roughly 4.5 to 4.7 minutes for one page:

| OCR setting | Pages | Seconds | Quality | Score | Arabic words | Extraction |
| --- | --- | --- | --- | --- | --- | --- |
| Arabic OCR comparison | 1 | 280.76 | good | 3565.85 | 719 | arabic:tesseract@2x-psm4 |
| Maximum Arabic OCR | 1 | 268.47 | good | 3565.85 | 719 | arabic-max:tesseract@2x-psm4 |

Interpretation

arabic and arabic-max are useful short-sample diagnostics because they can compare installed OCR engines and pick the cleanest text. They are not the right default for long PDFs on the current free worker because they spend minutes per page and selected Tesseract anyway.
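
To make the cost difference concrete, the measured times can be extrapolated linearly to a longer document. The 300-page length is a hypothetical figure chosen for illustration, and linear scaling is an assumption; real jobs will vary with page content.

```python
# Measured (seconds, pages) pairs from the benchmark results.
MEASURED = {
    "tesseract@2x-psm4": (37.30, 5),
    "tesseract@1.5x-psm6": (28.88, 5),
    "arabic": (280.76, 1),
    "arabic-max": (268.47, 1),
}


def estimate_minutes(seconds: float, pages: int, target_pages: int) -> float:
    """Linearly extrapolate a measured run to target_pages, in minutes."""
    return seconds / pages * target_pages / 60


BOOK_PAGES = 300  # hypothetical book length, not taken from the benchmark

for name, (seconds, pages) in MEASURED.items():
    print(f"{name}: {estimate_minutes(seconds, pages, BOOK_PAGES):.0f} min")
```

Under these assumptions the recommended setting needs about 37 minutes for 300 pages, while the two comparison modes would each need more than 22 hours.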

PaddleOCR is available and works, but on this book sample it returned many low-information lines and more fragmented Arabic text. It remains a fallback, not the recommendation.

The live/default website setting should therefore be option 1, "Tesseract Arabic - Best readable".