---
language:
- ar
tags:
- ocr
- arabic
- manuscript
- document-understanding
- rtmdet
- siglip2
- qwen3
pipeline_tag: image-to-text
license: apache-2.0
---

# HAFITH – حافظ · Arabic Manuscript OCR

OCR pipeline for Arabic historical manuscripts. Given a manuscript image, it:

1. **Detects text regions** (main body vs. margin) → YOLO
2. **Segments individual lines** → RTMDet instance segmentation
3. **Recognises text per line** → SigLIP2 NaFlex + Qwen3-0.6B (Prefix-LM)
4. **Corrects OCR errors** → Gemini LLM (optional; requires an API key)

---

## Model Files

| File | Description | Size |
|---|---|---|
| `lines.pth` | RTMDet-m line segmentation weights | 242 MB |
| `regions.pt` | YOLO region detection weights | 117 MB |
| `ocr/model.pt` | SigLIP2 + Qwen3-0.6B OCR weights | 3.9 GB |
| `ocr/qwen_tokenizer/` | Qwen3 tokenizer files | – |
| `ocr/siglip_processor/` | SigLIP2 image processor config | – |
| `rtmdet_lines.py` | RTMDet model config | – |

---
## Architecture

```
Input image
     │
     ├─► YOLO (regions.pt)
     │     └─ Bounding boxes: main text body vs. margin
     │
     ├─► RTMDet (lines.pth + rtmdet_lines.py)
     │     └─ Instance segmentation masks → line polygons (reading order)
     │
     └─► Per-line crops
           └─► SigLIP2 NaFlex encoder → Linear(1152→1024) → Qwen3-0.6B decoder
                 └─ Arabic text string per line
```
The OCR model is a custom Prefix-LM: visual patch embeddings from SigLIP2 are
projected into Qwen3's input embedding space and prepended as a visual prefix,
followed by a BOS anchor token. The decoder then autoregressively generates the
Arabic text tokens.
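A minimal sketch of that wiring, assuming placeholder values for everything except the Linear(1152→1024) projection: the single stand-in decoder layer, the small vocabulary, and the absent prefix attention mask make this an illustration, not the shipped model.

```python
import torch
import torch.nn as nn

class VisualPrefixLM(nn.Module):
    """Toy sketch of the Prefix-LM wiring described above.

    Real model: SigLIP2 NaFlex encoder + Qwen3-0.6B decoder. Here the
    decoder is a single stand-in layer and vocab_size is a placeholder;
    only the Linear(1152 -> 1024) projection mirrors the model card.
    """

    def __init__(self, vis_dim=1152, txt_dim=1024, vocab_size=1000):
        super().__init__()
        self.proj = nn.Linear(vis_dim, txt_dim)       # Linear(1152 -> 1024)
        self.tok_emb = nn.Embedding(vocab_size, txt_dim)
        self.decoder = nn.TransformerEncoderLayer(    # stand-in for the Qwen3 stack
            d_model=txt_dim, nhead=8, batch_first=True
        )
        self.lm_head = nn.Linear(txt_dim, vocab_size)

    def forward(self, patch_embeds, input_ids):
        vis = self.proj(patch_embeds)                 # (B, P, 1024) visual prefix
        txt = self.tok_emb(input_ids)                 # (B, T, 1024), BOS anchor first
        hidden = self.decoder(torch.cat([vis, txt], dim=1))
        return self.lm_head(hidden[:, vis.size(1):])  # logits for the text positions

model = VisualPrefixLM()
logits = model(torch.randn(2, 16, 1152), torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 1000])
```

A real Prefix-LM also attends bidirectionally over the visual prefix while staying causal over the text tokens; that masking is omitted here for brevity.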

---

## Requirements

```bash
pip install torch torchvision transformers ultralytics opencv-python-headless \
    Pillow numpy google-genai huggingface_hub

# mmcv must be built from source (no pre-built wheel for torch 2.9 + CUDA 12.8)
git clone --depth=1 --branch v2.1.0 https://github.com/open-mmlab/mmcv.git /opt/mmcv
cd /opt/mmcv && MMCV_WITH_OPS=1 pip install -e . --no-build-isolation
pip install mmdet mmengine
```

---
## Quick Start

```python
from huggingface_hub import snapshot_download

# Download all model files
model_dir = snapshot_download("mdnaseif/hafith-models")
```

Then run the full pipeline – see [`inference.py`](inference.py).

---

## Full Pipeline Inference

```python
import sys
sys.path.insert(0, "path/to/hafith_mvp/app")  # add app/ to the Python path

from pipeline import (
    load_lines_model, load_regions_model,
    load_ocr,
    segment, detect_regions, classify_lines_by_region,
    get_line_images, recognise_lines_batch,
)

MODELS_DIR = "path/to/models"  # local snapshot_download() output

# 1. Load models (one-time, ~30–90s on first run)
lines_model = load_lines_model(
    config_path=f"{MODELS_DIR}/rtmdet_lines.py",
    checkpoint_path=f"{MODELS_DIR}/lines.pth",
    device="cuda",
)
regions_model = load_regions_model(f"{MODELS_DIR}/regions.pt")
ocr_model, processor, tokenizer = load_ocr(f"{MODELS_DIR}/ocr", device="cuda")

# 2. Segment lines
image_bgr, polygons = segment(lines_model, "manuscript.jpg")

# 3. Classify main text vs. margin
region_polys, _ = detect_regions(regions_model, "manuscript.jpg")
main_idx, margin_idx, _ = classify_lines_by_region(polygons, region_polys)

# 4. Crop line images
line_images = get_line_images(image_bgr, polygons)

# 5. OCR – process in reading order (main body first, then margin)
reading_order = list(main_idx) + list(margin_idx)
ordered_images = [line_images[i] for i in reading_order]

texts = recognise_lines_batch(
    ocr_model, processor, tokenizer,
    ordered_images,
    device="cuda",
    max_patches=512,
    max_len=64,
    batch_size=8,
)

# 6. Print results
for i, text in enumerate(texts, start=1):
    print(f"Line {i}: {text}")

full_text = "\n".join(texts)
print("\n--- Full transcription ---")
print(full_text)
```

---

## OCR Model Only (no segmentation)

If you already have cropped line images:

```python
from PIL import Image
from pipeline.ocr import load_ocr, recognise_lines_batch

ocr_model, processor, tokenizer = load_ocr("path/to/models/ocr", device="cuda")

# Single line
line_img = Image.open("line.jpg")
texts = recognise_lines_batch(
    ocr_model, processor, tokenizer,
    [line_img],
    device="cuda",
    max_patches=512,
    max_len=64,
    batch_size=1,
)
print(texts[0])
```
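Cropped line images can arrive in any PIL mode. A small normalisation helper avoids mode and size surprises before batching; note this is a hypothetical safeguard (`to_rgb_line` and its width cap are not part of the pipeline, and the NaFlex processor does its own patch-based resizing):

```python
from PIL import Image

def to_rgb_line(img: Image.Image, max_width: int = 2048) -> Image.Image:
    """Normalise a line crop before OCR: force RGB and cap extreme widths.

    Hypothetical helper, not part of the HAFITH pipeline; it only guards
    against grayscale/CMYK inputs and pathologically wide crops.
    """
    if img.mode != "RGB":
        img = img.convert("RGB")
    if img.width > max_width:
        ratio = max_width / img.width
        img = img.resize((max_width, max(1, round(img.height * ratio))))
    return img

line = to_rgb_line(Image.new("L", (4096, 64)))
print(line.mode, line.size)  # RGB (2048, 32)
```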

---

## Optional: AI Post-Correction (Gemini)

```python
import os
os.environ["GEMINI_API_KEY"] = "your-key"

from pipeline.correction import init_local_llm, correct_full_text_local

corrector = init_local_llm("gemini-2.0-flash")
corrected = correct_full_text_local(corrector, texts)
```

---
## Citation

```bibtex
@misc{hafith2025,
  title  = {HAFITH: Arabic Manuscript OCR Pipeline},
  author = {mdnaseif},
  year   = {2025},
  url    = {https://huggingface.co/mdnaseif/hafith-models}
}
```