---
language:
- ar
tags:
- ocr
- arabic
- manuscript
- document-understanding
- rtmdet
- siglip2
- qwen3
pipeline_tag: image-to-text
license: apache-2.0
---

# HAFITH — حافظ · Arabic Manuscript OCR

An OCR pipeline for Arabic historical manuscripts. Given a manuscript image, it:
1. **Detects text regions** (main body vs. margin) — YOLO
2. **Segments individual lines** — RTMDet instance segmentation
3. **Recognises text per line** — SigLIP2 NaFlex + Qwen3-0.6B (Prefix-LM)
4. **Corrects OCR errors** — Gemini LLM (optional, requires API key)

---

## Model Files

| File | Description | Size |
|---|---|---|
| `lines.pth` | RTMDet-m line segmentation weights | 242 MB |
| `regions.pt` | YOLO region detection weights | 117 MB |
| `ocr/model.pt` | SigLIP2 + Qwen3-0.6B OCR weights | 3.9 GB |
| `ocr/qwen_tokenizer/` | Qwen3 tokenizer files | — |
| `ocr/siglip_processor/` | SigLIP2 image processor config | — |
| `rtmdet_lines.py` | RTMDet model config | — |

---

## Architecture

```
Input image
    │
    ├─► YOLO (regions.pt)
    │       └─ Bounding boxes: main text body vs. margin
    │
    ├─► RTMDet (lines.pth + rtmdet_lines.py)
    │       └─ Instance segmentation masks → line polygons (reading order)
    │
    └─► Per-line crops
            └─► SigLIP2 NaFlex encoder → Linear(1152→1024) → Qwen3-0.6B decoder
                        └─ Arabic text string per line
```

The OCR model is a custom Prefix-LM: SigLIP2 patch embeddings are linearly
projected (1152→1024) into Qwen3's input embedding space and prepended to the
decoder input as a visual prefix, followed by a BOS anchor token. The decoder
then generates the Arabic text tokens autoregressively.
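
To make the sequence layout concrete, here is a minimal pure-Python sketch of the positions and attention mask for a visual prefix, one BOS anchor, and the text tokens. The function name `prefix_lm_mask` is hypothetical, and bidirectional attention inside the visual prefix is an assumption about a typical Prefix-LM; this card does not state how the released checkpoint masks attention.

```python
def prefix_lm_mask(num_patches, num_text_tokens):
    """Return mask[i][j] == True if position i may attend to position j.

    Assumed layout: [patch_0 .. patch_{P-1}, BOS, text_0 .. text_{T-1}]
    """
    prefix_len = num_patches                    # visual prefix positions
    total = prefix_len + 1 + num_text_tokens    # +1 for the BOS anchor
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            if i < prefix_len:
                # ASSUMPTION: prefix positions attend to the whole prefix
                row.append(j < prefix_len)
            else:
                # BOS and text positions: full prefix plus causal history
                row.append(j <= i)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # 3 visual patches, BOS, 2 text tokens -> 6 x 6 mask
    for row in prefix_lm_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

In the real model each patch position would hold a projected SigLIP2 embedding and each text position a Qwen3 token embedding; the sketch only shows which positions can see which.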

---

## Requirements

```bash
pip install torch torchvision transformers ultralytics opencv-python-headless \
            Pillow numpy google-genai huggingface_hub

# mmcv must be built from source (no pre-built wheel for torch 2.9 + CUDA 12.8)
git clone --depth=1 --branch v2.1.0 https://github.com/open-mmlab/mmcv.git /opt/mmcv
cd /opt/mmcv && MMCV_WITH_OPS=1 pip install -e . --no-build-isolation
pip install mmdet mmengine
```
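
After installing, a quick check that every dependency can be located may save a failed first run. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical, and the list holds import names (for example `cv2` for `opencv-python-headless`), not pip names.

```python
import importlib.util

# Import names for the packages installed above (not their pip names).
REQUIRED_MODULES = [
    "torch", "torchvision", "transformers", "ultralytics", "cv2",
    "PIL", "numpy", "google.genai", "huggingface_hub",
    "mmcv", "mmdet", "mmengine",
]


def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is absent
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing:", missing_modules(REQUIRED_MODULES) or "none")
```

This only confirms that the modules are findable; it does not verify that mmcv was built with its compiled ops.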

---

## Quick Start

```python
from huggingface_hub import snapshot_download

# Download all model files
model_dir = snapshot_download("mdnaseif/hafith-models")
```

Then run the full pipeline; see [`inference.py`](inference.py) or the Full Pipeline Inference section below.
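
Before running inference, it can help to confirm that the download contains everything listed under Model Files. This is a minimal sketch: the helper `missing_model_files` is hypothetical, and the expected paths are copied from the table, assuming the repository layout matches it.

```python
from pathlib import Path

# Relative paths taken from the Model Files table.
EXPECTED = [
    "lines.pth",
    "regions.pt",
    "rtmdet_lines.py",
    "ocr/model.pt",
    "ocr/qwen_tokenizer",
    "ocr/siglip_processor",
]


def missing_model_files(model_dir):
    """Return the expected files or directories absent from `model_dir`."""
    root = Path(model_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Call it with the path returned by `snapshot_download`, for example `missing_model_files(model_dir)`; an empty list means every expected entry is present.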

---

## Full Pipeline Inference

```python
import sys
sys.path.insert(0, "path/to/hafith_mvp/app")   # add app/ to Python path

from pipeline import (
    load_lines_model, load_regions_model,
    load_ocr,
    segment, detect_regions, classify_lines_by_region,
    get_line_images, recognise_lines_batch,
)

MODELS_DIR = "path/to/models"   # local snapshot_download() output

# 1. Load models (one-time, ~30–90s on first run)
lines_model = load_lines_model(
    config_path=f"{MODELS_DIR}/rtmdet_lines.py",
    checkpoint_path=f"{MODELS_DIR}/lines.pth",
    device="cuda",
)
regions_model = load_regions_model(f"{MODELS_DIR}/regions.pt")
ocr_model, processor, tokenizer = load_ocr(f"{MODELS_DIR}/ocr", device="cuda")

# 2. Segment lines
image_bgr, polygons = segment(lines_model, "manuscript.jpg")

# 3. Classify main text vs. margin
region_polys, _ = detect_regions(regions_model, "manuscript.jpg")
main_idx, margin_idx, _ = classify_lines_by_region(polygons, region_polys)

# 4. Crop line images
line_images = get_line_images(image_bgr, polygons)

# 5. OCR — process in reading order (main body first, then margin)
reading_order = list(main_idx) + list(margin_idx)
ordered_images = [line_images[i] for i in reading_order]

texts = recognise_lines_batch(
    ocr_model, processor, tokenizer,
    ordered_images,
    device="cuda",
    max_patches=512,
    max_len=64,
    batch_size=8,
)

# 6. Print results
for i, text in enumerate(texts, start=1):
    print(f"Line {i}: {text}")

full_text = "\n".join(texts)
print("\n--- Full transcription ---")
print(full_text)
```
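
To keep the region labels alongside the text, the results can be written to JSON. The sketch below is one possible layout, not part of the pipeline: `build_transcript` and `save_transcript` are hypothetical helpers, the record fields are illustrative, and it assumes `main_idx` holds integer indices into `polygons`, as the reading-order code above suggests.

```python
import json


def build_transcript(reading_order, texts, main_idx):
    """Pair each recognised line with its original index and region label."""
    main = set(main_idx)
    return [
        {
            "position": pos,       # 1-based position in reading order
            "line_index": idx,     # index into the original polygons list
            "region": "main" if idx in main else "margin",
            "text": text,
        }
        for pos, (idx, text) in enumerate(zip(reading_order, texts), start=1)
    ]


def save_transcript(records, path):
    """Write records as UTF-8 JSON, keeping Arabic text human-readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

For example, `save_transcript(build_transcript(reading_order, texts, main_idx), "manuscript.json")`. Passing `ensure_ascii=False` stops `json.dump` from escaping the Arabic characters as `\uXXXX` sequences.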

---

## OCR Model Only (no segmentation)

If you already have cropped line images:

```python
import sys
sys.path.insert(0, "path/to/hafith_mvp/app")   # add app/ to Python path

from PIL import Image
from pipeline.ocr import load_ocr, recognise_lines_batch

ocr_model, processor, tokenizer = load_ocr("path/to/models/ocr", device="cuda")

# Single line (convert to RGB in case the crop is grayscale or has an alpha channel)
line_img = Image.open("line.jpg").convert("RGB")
texts = recognise_lines_batch(
    ocr_model, processor, tokenizer,
    [line_img],
    device="cuda",
    max_patches=512,
    max_len=64,
    batch_size=1,
)
print(texts[0])
```

---

## Optional: AI Post-Correction (Gemini)

```python
import os
os.environ["GEMINI_API_KEY"] = "your-key"

from pipeline.correction import init_local_llm, correct_full_text_local

corrector = init_local_llm("gemini-2.0-flash")
# `texts` is the list of line strings returned by recognise_lines_batch
corrected = correct_full_text_local(corrector, texts)
```

---

## Citation

```bibtex
@misc{hafith2025,
  title  = {HAFITH: Arabic Manuscript OCR Pipeline},
  author = {mdnaseif},
  year   = {2025},
  url    = {https://huggingface.co/mdnaseif/hafith-models}
}
```