# Recommended Free Arabic PDF To Audio Stack
This is the compact decision report generated from the current research watchlist.
## Use Now
| Layer | Recommendation | Why |
| --- | --- | --- |
| Embedded PDFs | PyMuPDF text extraction first | It is free, fast, and avoids OCR errors when the PDF already contains usable Arabic text. |
| Scanned PDFs | `OCR_ENGINE=tesseract OCR_RENDER_ZOOM=2 TESSERACT_PSM=4` | It produced the most readable text on the 5-page Arabic OCR benchmark while staying much faster than the comparison modes. |
| Default voice | SILMA TTS | Arabic-focused Fusha/MSA voice with normalization and tashkeel options. |
| Download/storage | Worker-local retained audio files | Free by default and avoids Vercel's 4.5 MB function payload limit; Hugging Face free CPU disk is 50 GB but non-persistent, so downloads are short-lived. |
| Hosted shape | Vercel shell plus Docker worker via `WORKER_BASE_URL` | Vercel serves the lightweight website, while the worker handles large PDFs, OCR, and TTS on free CPU Space hardware, provided the job size is reasonable. |
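
The first two rows describe a text-first route with an OCR fallback. The sketch below is one way that decision and the recommended OCR settings could be expressed. The environment variable names and their values come from the table; the function names, the Arabic-ratio thresholds, and the `ara` language code passed to Tesseract are assumptions made for illustration, not the worker's actual code.

```python
import os
import re

# Basic Arabic block (U+0600-U+06FF). A simplification: presentation forms are ignored.
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")


def has_usable_arabic_text(page_text, min_arabic_chars=50, min_arabic_ratio=0.5):
    """Return True when embedded text looks usable, so OCR can be skipped.

    Both thresholds are hypothetical defaults, not values taken from the worker.
    """
    visible = [ch for ch in page_text if not ch.isspace()]
    if not visible:
        return False
    arabic = sum(1 for ch in visible if ARABIC_CHAR.match(ch))
    return arabic >= min_arabic_chars and arabic / len(visible) >= min_arabic_ratio


def ocr_settings(env=None):
    """Read the OCR variables named in the table, defaulting to the recommended values."""
    env = os.environ if env is None else env
    zoom = float(env.get("OCR_RENDER_ZOOM", "2"))
    return {
        "engine": env.get("OCR_ENGINE", "tesseract"),
        "zoom": zoom,
        # PDF user space is 72 points per inch, so zoom 2 renders at 144 DPI.
        "dpi": int(72 * zoom),
        "psm": int(env.get("TESSERACT_PSM", "4")),
    }


def tesseract_command(image_path, settings):
    """Build a Tesseract CLI call that prints recognized Arabic text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", "ara", "--psm", str(settings["psm"])]
```

In a real worker the page text would likely come from PyMuPDF's `page.get_text()`, and the page image from `page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))`; only pages where `has_usable_arabic_text` returns `False` would be rendered and passed to the OCR command.
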
## Install First On A Stronger Worker
| Candidate | Type | Why | Next Step |
| --- | --- | --- | --- |
| QARI-OCR 0.4 | ocr | Directly trained for Arabic OCR on Islamic books and Arabic manuscripts. | Install the sidecar or build the worker with INSTALL_QARI_OCR=1, then benchmark against arabic-max/arabic/arabic-qwen-ocr/katib-ocr/paddleocr/tesseract on the 5-page sample. |
| PaddleOCR-VL-1.6 | ocr | Fresh Apache-2.0 PaddleOCR document parser release with a June 2026 paper signal; the model card claims SOTA document parsing/text performance and the license file is Apache-2.0, but Arabic-book quality still needs same-page scoring. | Build with INSTALL_PADDLEOCR_VL=1 only if the smaller Arabic OCR stack proves not clean enough, then benchmark the same 5-page Arabic sample before any full-book run. |
| KATIB 0.8B | ocr | Fine-tuned specifically for Arabic OCR, including printed and handwritten text, while being much smaller than QARI 4B. | Install the sidecar or build the worker with INSTALL_KATIB_OCR=1, then benchmark against arabic-max/arabic/arabic-qwen-ocr/qari-ocr/paddleocr/tesseract on the 5-page sample. |
| Arabic-GLM-OCR-v2 | ocr | Recent Arabic OCR model card claims strong Arabic document extraction and noise reduction; it is wired as an optional sidecar so it can be scored against QARI/KATIB/Arabic-Qwen/Baseer on the target book pages. | Install the sidecar or build the worker with INSTALL_ARABIC_GLM_OCR=1, then benchmark it on the same 5-page sample before any full-book run. |
| Arabic-Qwen3.5-OCR-v4 | ocr | Recent Arabic OCR model card claims Arabic printed, handwritten, classical, and diacritic handling in a smaller 0.9B model. | Install the sidecar or build the worker with INSTALL_ARABIC_QWEN_OCR=1, then benchmark against arabic-max/arabic/katib-ocr/qari-ocr/paddleocr/tesseract on the 5-page sample. |
| Tawkeed OCR | ocr | Arabic-first OCR model forked from QARI-OCR v0.3 and fine-tuned for Arabic documents, handwriting, and scene text; useful to test when QARI 0.4 is too heavy or when edge-style deployment matters. | Install the sidecar or build the worker with INSTALL_TAWKEED_OCR=1, then benchmark against QARI 0.4, KATIB, Arabic-Qwen, Baseer, and Tesseract on the same 5-page sample. |
| Baseer OCR V1.0 | ocr | Arabic-specific VLM OCR for complex legal documents, multi-column layouts, stamps, tables, and handwritten/printed Arabic. | Install the sidecar or build the worker with INSTALL_BASEER_OCR=1, then benchmark against arabic-max/arabic/arabic-qwen-ocr/katib-ocr/qari-ocr/paddleocr/tesseract on the 5-page sample. |
| Habibi-TTS MSA | tts | Arabic-specific 2026 TTS family worth comparing against SILMA on MSA passages. | Install the optional sidecar and listen against the same cleaned OCR sample. |
| Supertonic 3 | tts | Supertonic 3 supports Arabic, runs locally with ONNX on CPU, and is much smaller than GPU-class multilingual voices, making it a practical free benchmark voice for long-book workers. | Install the sidecar with scripts/setup_supertonic.ps1 or build with INSTALL_SUPERTONIC=1, then benchmark it against SILMA/Habibi on the same cleaned Arabic text. |
## Benchmark Before Promoting
These are promising free/open candidates, but they should not replace the default stack until they win on the same 5-page Arabic sample and same cleaned TTS text.
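
Same-page scoring can be made repeatable with a simple text metric. The sketch below uses character error rate (CER) against a human-corrected reference transcript of each sample page. CER is an assumption chosen here for illustration; this report does not say which metric the benchmark uses, and the function names are hypothetical.

```python
import unicodedata


def normalize(text):
    """NFC-normalize and collapse whitespace so layout differences are not scored."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edits needed to turn the OCR output into the reference, per reference character."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def rank_candidates(reference_pages, outputs_by_engine):
    """Mean CER per engine over the same pages, lowest (best) first."""
    scores = {}
    for engine, pages in outputs_by_engine.items():
        rates = [character_error_rate(r, h) for r, h in zip(reference_pages, pages)]
        scores[engine] = sum(rates) / len(rates)
    return sorted(scores.items(), key=lambda item: item[1])
```

A low CER does not guarantee correct reading order or comfortable listening, so a score like this would complement human review of the same pages, not replace it.
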
### OCR
| Candidate | License | Benchmark Step Before Promotion |
| --- | --- | --- |
| QARI-OCR 0.4 GGUF | Apache-2.0 via QARI 0.4 model card; confirm GGUF packaging metadata before production | Benchmark the GGUF package externally on the same exported Arabic pages against the wired QARI sidecar, KATIB, Arabic-Qwen, Baseer, PaddleOCR, and Tesseract before considering a llama.cpp-style worker path. |
| oi-OCR | Apache-2.0 | Export the same selected Arabic page images and compare its Markdown/text output against QARI/KATIB/Arabic-Qwen/PaddleOCR/Tesseract before considering any wiring. |
| NuExtract3 | Apache-2.0 | Use document-to-Markdown/content mode on the exported page images and score the resulting Arabic text against QARI/KATIB/Arabic-Qwen/Baseer/PaddleOCR/Tesseract before promotion. |
| Qianfan-OCR | Apache-2.0 | Benchmark externally only if QARI/KATIB/Arabic-Qwen/Baseer/PaddleOCR prove not clean enough; score it on the same exported Arabic book pages before considering any worker wiring. |
| Chandra OCR 2 | Apache-2.0 code; modified OpenRAIL-M model weights | Benchmark externally on the same exported page images for hard layouts, tables, forms, or mixed-language pages; keep QARI/KATIB/Arabic-Qwen/Baseer first for Arabic books unless Chandra wins same-page scoring and its license and runtime fit the deployment. |
| dots.ocr | MIT | Run externally on the same exported Arabic page images and score the resulting text against QARI/KATIB/Arabic-Qwen/Baseer/PaddleOCR/Tesseract before considering any worker wiring. |
| olmOCR Arabic LoRA v2 | Apache-2.0 adapter; confirm base model license/runtime before production | Run externally on the same exported full-page manuscript images and compare against Ketaba, QARI, HAFITH/Glimpse line workflows, Kraken/eScriptorium, and the wired Arabic OCR baseline before considering any sidecar work. |
| Arabic Large Nougat | GPL-3.0 | Run externally on the same exported Arabic book page images and compare Markdown/text output against QARI, KATIB, Arabic-Qwen, Baseer, PaddleOCR, Tesseract, and the other external OCR benchmarks before considering any separate license-aware workflow. |
| DocTR Arabic FAST/PARSEQ | Apache-2.0 detector; recognition card lacks clear metadata, confirm before production | Benchmark externally on the same exported Arabic page images and promote only if the recognition model license is confirmed and it beats PaddleOCR/Tesseract/EasyOCR on book text ordering and word preservation. |
| Kraken/eScriptorium Arabic script | Apache-2.0 engine; model license depends on selected Kraken model | Export the same selected page images, run Kraken/eScriptorium with an Arabic-script recognition model or line-cropped workflow, then score the resulting text against the wired Arabic OCR stack before considering any sidecar work. |
| Kairawan/Qalamus manuscript OCR | free web service; engine/package license not established | Use only as an external comparison when the source PDF is manuscript-like; do not wire it into the app unless a reusable open engine, API terms, privacy story, and same-page scoring beat QARI/KATIB/Kraken/HAFITH on the selected sample. |
| GLM-OCR Arabic/French documents | check model card/base license before production use | Benchmark externally for administrative/form-like Arabic PDFs and compare against Arabic-GLM-OCR-v2, QARI, KATIB, Baseer, PaddleOCR, and Tesseract before wiring. |
### TTS
| Candidate | License | Benchmark Step Before Promotion |
| --- | --- | --- |
| Mishkala Tashkeel | Apache-2.0 | Benchmark on the same cleaned speech sample before wiring. Promote only if listening tests improve pronunciation without changing meaning or adding distracting/incorrect harakat. |
| Tashkeel-350M | Apache-2.0 | Export the same cleaned Arabic TTS sample, create a Tashkeel-350M diacritized copy, synthesize plain/Mishkala/Tashkeel-350M with the same voice, and score meaning preservation plus long-listen comfort. |
| Mushkil | Apache-2.0 | Export the same cleaned Arabic TTS sample, create a Mushkil-diacritized copy, synthesize plain/Mishkala/Tashkeel-350M/Mushkil with the same voice, and score meaning preservation plus long-listen comfort. |
| Thaka KSAA-2026 speech diacritization | CC BY 4.0 paper; implementation/model license not established | Track for released code/weights or a permissive checkpoint. Until then, keep website preprocessing limited to same-sample Mishkala/Tashkeel-350M/Mushkil listening tests and meaning-preservation scoring. |
| 3arab-TTS 500M | Apache-2.0 | Export the same cleaned Arabic text used for SILMA/Habibi, then compare base and VoiceDesign variants for audiobook comfort, stability, and long-form pacing. |
| KaniTTS Arabic | model card says Apache-2.0, but Hugging Face metadata reports lfm1.0; confirm before production | Export the same cleaned Arabic sample used for SILMA/Habibi, then benchmark naturalness, skipped words, pacing, runtime, and license fit before considering app wiring. |
| Emirati VITS Male | Apache-2.0 | Benchmark only when the target PDF benefits from Emirati/Gulf pronunciation; keep SILMA/Habibi ahead for MSA books unless listening tests say otherwise. |
| VoxCPM2 | Apache-2.0 | Benchmark externally with the same cleaned Arabic sample before deciding whether it is worth integrating. |
| Voxtral TTS | CC-BY-NC-4.0 | Benchmark only as a personal/non-commercial strong-worker comparison using the same cleaned Arabic sample; do not wire it as the default public/free website voice. |
| OmniVoice | Apache-2.0 | Export the same cleaned Arabic text used for SILMA/Habibi and compare Arabic naturalness, speed, and setup complexity before wiring it into the app. |
| OmniVoice Arabic LoRA | Apache-2.0 | Benchmark only after the base OmniVoice command is working, using the exact same cleaned Arabic sample and reference audio. |
| Arabic-text-to-speech OmniVoice | Apache-2.0 | Export the same cleaned Arabic sample used for SILMA/Habibi and compare naturalness, skipped words, repetition, runtime, and setup complexity before any app wiring. |
| Lahgtna OmniVoice v2 | license not declared on model card | Benchmark externally only when dialect pronunciation matters, confirm licensing before production, and keep SILMA/Habibi ahead for MSA books until listening tests prove otherwise. |
| TADA multilingual TTS | Llama 3.2 license | Export the same cleaned Arabic sample and benchmark with language='ar' only if the Llama 3.2 license is acceptable; keep SILMA/Habibi ahead for the permissive default. |
| Lahgtna Chatterbox | MIT | Export the same cleaned Arabic text and listen for repetition/stability before considering app wiring. |
| NAMAA-Saudi-TTS | MIT | Benchmark only when Saudi dialect pronunciation fits the target PDF; keep SILMA/Habibi first for MSA books and compare against Saudi Arabic Qwen3-TTS and Emirati voices before wiring. |
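
For the diacritizer rows (Mishkala, Tashkeel-350M, Mushkil), part of "meaning preservation" can be checked mechanically before any listening test: removing the harakat from the diacritized copy should give back the original words. The sketch below is a minimal version of that check. The function names are hypothetical, and it cannot tell whether the added harakat are correct, only whether base letters were changed.

```python
import re

# Harakat from fathatan through sukun (U+064B-U+0652) plus superscript alef (U+0670).
HARAKAT = re.compile("[\u064B-\u0652\u0670]")


def strip_harakat(text):
    """Remove Arabic diacritic marks while keeping the base letters."""
    return HARAKAT.sub("", text)


def changed_word_indexes(plain, diacritized):
    """Indexes of words whose base letters differ after removing harakat.

    If the word counts differ, every position is flagged for human review.
    """
    before = strip_harakat(plain).split()
    after = strip_harakat(diacritized).split()
    if len(before) != len(after):
        return list(range(max(len(before), len(after))))
    return [i for i, (x, y) in enumerate(zip(before, after)) if x != y]


def letters_preserved(plain, diacritized):
    """True when the diacritizer only added marks and left every word intact."""
    return not changed_word_indexes(plain, diacritized)
```

A diacritized copy that fails this check has altered the text itself and would need review before it is synthesized for comparison.
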
### Current Voice Priority
Use SILMA first for the practical free Arabic audiobook voice. On a stronger worker, benchmark Habibi MSA and OmniVoice next. Keep KaniTTS benchmark-only until the `lfm1.0` Hugging Face license metadata is reconciled with the model-card Apache-2.0 text.
## Promotion Rule
Promote a model only when all of these are true:
1. It is free for the intended personal/family use.
2. Its license is acceptable for the deployment.
3. It beats the current stack on the same selected Arabic pages or same cleaned Arabic voice sample.
4. It preserves Arabic reading order, words, and pronunciation better than the default.
5. Its runtime is acceptable for the target worker.
6. The generated JSON score passes `scripts/model_promotion_gate.py` after human review.
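
The six conditions above can be read as an all-or-nothing gate over a score file. The sketch below illustrates that idea only; it is not the contents of the repository's gate script, and every JSON field name in it is a hypothetical stand-in for whatever schema the real score uses.

```python
import json

# Hypothetical field names, one per condition in the promotion rule.
REQUIRED_CHECKS = (
    "free_for_intended_use",
    "license_acceptable",
    "beats_current_stack",
    "preserves_order_words_pronunciation",
    "runtime_acceptable",
    "human_reviewed",
)


def failed_checks(score):
    """Names of checks that are missing or not explicitly true."""
    return [name for name in REQUIRED_CHECKS if score.get(name) is not True]


def may_promote(score_json):
    """A candidate passes only when every required check is true."""
    return not failed_checks(json.loads(score_json))
```

Treating a missing field as a failure keeps the gate conservative: a candidate with an unresolved license question, for example, stays benchmark-only until someone records an explicit answer.
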
Current practical default: PyMuPDF -> `tesseract@2x-psm4` OCR -> SILMA TTS -> downloadable worker audio.