---
language:
- ar
tags:
- ocr
- arabic
- manuscript
- document-understanding
- rtmdet
- siglip2
- qwen3
pipeline_tag: image-to-text
license: apache-2.0
---

# HAFITH — حافظ · Arabic Manuscript OCR

OCR pipeline for Arabic historical manuscripts. Given a manuscript image, it:

1. **Detects text regions** (main body vs. margin) — YOLO
2. **Segments individual lines** — RTMDet instance segmentation
3. **Recognises text per line** — SigLIP2 NaFlex + Qwen3-0.6B (Prefix-LM)
4. **Corrects OCR errors** — Gemini LLM (optional, requires an API key)

---

## Model Files

| File | Description | Size |
|---|---|---|
| `lines.pth` | RTMDet-m line segmentation weights | 242 MB |
| `regions.pt` | YOLO region detection weights | 117 MB |
| `ocr/model.pt` | SigLIP2 + Qwen3-0.6B OCR weights | 3.9 GB |
| `ocr/qwen_tokenizer/` | Qwen3 tokenizer files | — |
| `ocr/siglip_processor/` | SigLIP2 image processor config | — |
| `rtmdet_lines.py` | RTMDet model config | — |

---

## Architecture

```
Input image
│
├─► YOLO (regions.pt)
│   └─ Bounding boxes: main text body vs. margin
│
├─► RTMDet (lines.pth + rtmdet_lines.py)
│   └─ Instance segmentation masks → line polygons (reading order)
│
└─► Per-line crops
    └─► SigLIP2 NaFlex encoder → Linear(1152→1024) → Qwen3-0.6B decoder
        └─ Arabic text string per line
```

The OCR model is a custom Prefix-LM: visual patch embeddings from SigLIP2 are projected into Qwen3's input embedding space and prepended as a visual prefix, followed by a BOS anchor token. The decoder then autoregressively generates Arabic text tokens.

---

## Requirements

```bash
pip install torch torchvision transformers ultralytics opencv-python-headless \
    Pillow numpy google-genai huggingface_hub

# mmcv must be built from source (no pre-built wheel for torch 2.9 + CUDA 12.8)
git clone --depth=1 --branch v2.1.0 https://github.com/open-mmlab/mmcv.git /opt/mmcv
cd /opt/mmcv && MMCV_WITH_OPS=1 pip install -e . \
    --no-build-isolation
pip install mmdet mmengine
```

---

## Quick Start

```python
from huggingface_hub import snapshot_download

# Download all model files
model_dir = snapshot_download("mdnaseif/hafith-models")
```

Then run full pipeline inference — see [`inference.py`](inference.py).

---

## Full Pipeline Inference

```python
import sys

sys.path.insert(0, "path/to/hafith_mvp/app")  # add app/ to the Python path

from pipeline import (
    load_lines_model,
    load_regions_model,
    load_ocr,
    segment,
    detect_regions,
    classify_lines_by_region,
    get_line_images,
    recognise_lines_batch,
)

MODELS_DIR = "path/to/models"  # local snapshot_download() output

# 1. Load models (one-time, ~30–90 s on first run)
lines_model = load_lines_model(
    config_path=f"{MODELS_DIR}/rtmdet_lines.py",
    checkpoint_path=f"{MODELS_DIR}/lines.pth",
    device="cuda",
)
regions_model = load_regions_model(f"{MODELS_DIR}/regions.pt")
ocr_model, processor, tokenizer = load_ocr(f"{MODELS_DIR}/ocr", device="cuda")

# 2. Segment lines
image_bgr, polygons = segment(lines_model, "manuscript.jpg")

# 3. Classify main text vs. margin
region_polys, _ = detect_regions(regions_model, "manuscript.jpg")
main_idx, margin_idx, _ = classify_lines_by_region(polygons, region_polys)

# 4. Crop line images
line_images = get_line_images(image_bgr, polygons)

# 5. OCR — process in reading order (main body first, then margin)
reading_order = list(main_idx) + list(margin_idx)
ordered_images = [line_images[i] for i in reading_order]
texts = recognise_lines_batch(
    ocr_model, processor, tokenizer, ordered_images,
    device="cuda", max_patches=512, max_len=64, batch_size=8,
)

# 6.
# Print results
for i, (idx, text) in enumerate(zip(reading_order, texts)):
    print(f"Line {i+1}: {text}")

full_text = "\n".join(texts)
print("\n--- Full transcription ---")
print(full_text)
```

---

## OCR Model Only (no segmentation)

If you already have cropped line images:

```python
from PIL import Image

from pipeline.ocr import load_ocr, recognise_lines_batch

ocr_model, processor, tokenizer = load_ocr("path/to/models/ocr", device="cuda")

# Single line
line_img = Image.open("line.jpg")
texts = recognise_lines_batch(
    ocr_model, processor, tokenizer, [line_img],
    device="cuda", max_patches=512, max_len=64, batch_size=1,
)
print(texts[0])
```

---

## Optional: AI Post-Correction (Gemini)

```python
import os

os.environ["GEMINI_API_KEY"] = "your-key"

from pipeline.correction import init_local_llm, correct_full_text_local

corrector = init_local_llm("gemini-2.0-flash")
corrected = correct_full_text_local(corrector, texts)
```

---

## Citation

```bibtex
@misc{hafith2025,
  title  = {HAFITH: Arabic Manuscript OCR Pipeline},
  author = {mdnaseif},
  year   = {2025},
  url    = {https://huggingface.co/mdnaseif/hafith-models}
}
```
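---

## Appendix: Prefix-LM Shape Sketch

The Prefix-LM wiring described in the Architecture section can be illustrated at the shape level: SigLIP2 patch embeddings (width 1152) pass through a linear projection into the decoder's embedding space (width 1024) and are followed by a BOS anchor token. The sketch below is a toy illustration with random stand-in weights, not the actual model code; `W_proj`, `bos_embed`, and `build_visual_prefix` are hypothetical names.

```python
import numpy as np

# Toy stand-ins for the real weights; dimensions match the card:
# SigLIP2 hidden size 1152, Qwen3-0.6B hidden size 1024.
SIGLIP_DIM, QWEN_DIM = 1152, 1024

rng = np.random.default_rng(0)
W_proj = rng.standard_normal((SIGLIP_DIM, QWEN_DIM)) * 0.02  # stand-in for Linear(1152→1024)
bos_embed = rng.standard_normal((1, QWEN_DIM)) * 0.02        # stand-in for the BOS anchor embedding

def build_visual_prefix(patch_embeds: np.ndarray) -> np.ndarray:
    """Project SigLIP2 patch embeddings into the decoder's embedding
    space and append the BOS anchor embedding, producing the visual
    prefix the decoder conditions on before emitting text tokens."""
    projected = patch_embeds @ W_proj              # (num_patches, 1024)
    return np.concatenate([projected, bos_embed])  # (num_patches + 1, 1024)

# Five fake image patches -> prefix of length six (patches + BOS anchor)
prefix = build_visual_prefix(rng.standard_normal((5, SIGLIP_DIM)))
print(prefix.shape)  # (6, 1024)
```

In the real model, the decoder then generates Arabic text tokens autoregressively while attending to this prefix.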