---
language:
- ar
tags:
- ocr
- arabic
- manuscript
- document-understanding
- rtmdet
- siglip2
- qwen3
pipeline_tag: image-to-text
license: apache-2.0
---
# HAFITH · حافظ · Arabic Manuscript OCR
OCR pipeline for Arabic historical manuscripts. Given a manuscript image, it:
1. **Detects text regions** (main body vs. margin) with YOLO
2. **Segments individual lines** with RTMDet instance segmentation
3. **Recognises text per line** with SigLIP2 NaFlex + Qwen3-0.6B (Prefix-LM)
4. **Corrects OCR errors** with a Gemini LLM (optional, requires an API key)
---
## Model Files
| File | Description | Size |
|---|---|---|
| `lines.pth` | RTMDet-m line segmentation weights | 242 MB |
| `regions.pt` | YOLO region detection weights | 117 MB |
| `ocr/model.pt` | SigLIP2 + Qwen3-0.6B OCR weights | 3.9 GB |
| `ocr/qwen_tokenizer/` | Qwen3 tokenizer files | – |
| `ocr/siglip_processor/` | SigLIP2 image processor config | – |
| `rtmdet_lines.py` | RTMDet model config | – |
---
## Architecture
```
Input image
│
├─► YOLO (regions.pt)
│     └─ Bounding boxes: main text body vs. margin
│
├─► RTMDet (lines.pth + rtmdet_lines.py)
│     └─ Instance segmentation masks → line polygons (reading order)
│
└─► Per-line crops
      └─► SigLIP2 NaFlex encoder → Linear(1152→1024) → Qwen3-0.6B decoder
            └─ Arabic text string per line
```
The OCR model is a custom Prefix-LM: visual patch embeddings from SigLIP2 are
projected (Linear 1152→1024) into Qwen3's input embedding space and prepended
as a visual prefix, followed by a BOS anchor token. The decoder then
autoregressively generates Arabic text tokens.
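The prefix assembly can be illustrated with a shape-level sketch in plain NumPy. The patch count (196) and text length (5) are made-up illustrative numbers, and the random arrays stand in for the model's learned embeddings and projection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 196 visual patches from SigLIP2 (dim 1152),
# projected to Qwen3's hidden size (1024).
patches = rng.standard_normal((196, 1152))       # SigLIP2 patch embeddings
W_proj = rng.standard_normal((1152, 1024)) * 0.01
visual_prefix = patches @ W_proj                 # (196, 1024)

bos = rng.standard_normal((1, 1024))             # BOS anchor embedding
text = rng.standard_normal((5, 1024))            # embeddings of 5 text tokens

# Prefix-LM input sequence: [visual prefix | BOS | text tokens]
inputs = np.concatenate([visual_prefix, bos, text], axis=0)
print(inputs.shape)  # (202, 1024)
```

The decoder attends over the whole visual prefix bidirectionally while text tokens are generated causally, which is what the Prefix-LM framing refers to.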
---
## Requirements
```bash
pip install torch torchvision transformers ultralytics opencv-python-headless \
Pillow numpy google-genai huggingface_hub
# mmcv must be built from source (no pre-built wheel for torch 2.9 + CUDA 12.8)
git clone --depth=1 --branch v2.1.0 https://github.com/open-mmlab/mmcv.git /opt/mmcv
cd /opt/mmcv && MMCV_WITH_OPS=1 pip install -e . --no-build-isolation
pip install mmdet mmengine
```
---
## Quick Start
```python
from huggingface_hub import snapshot_download
# Download all model files
model_dir = snapshot_download("mdnaseif/hafith-models")
```
Then run the full pipeline; see [`inference.py`](inference.py).
---
## Full Pipeline Inference
```python
import sys
sys.path.insert(0, "path/to/hafith_mvp/app") # add app/ to Python path
from pipeline import (
load_lines_model, load_regions_model,
load_ocr,
segment, detect_regions, classify_lines_by_region,
get_line_images, recognise_lines_batch,
)
MODELS_DIR = "path/to/models" # local snapshot_download() output
# 1. Load models (one-time, ~30–90s on first run)
lines_model = load_lines_model(
config_path=f"{MODELS_DIR}/rtmdet_lines.py",
checkpoint_path=f"{MODELS_DIR}/lines.pth",
device="cuda",
)
regions_model = load_regions_model(f"{MODELS_DIR}/regions.pt")
ocr_model, processor, tokenizer = load_ocr(f"{MODELS_DIR}/ocr", device="cuda")
# 2. Segment lines
image_bgr, polygons = segment(lines_model, "manuscript.jpg")
# 3. Classify main text vs. margin
region_polys, _ = detect_regions(regions_model, "manuscript.jpg")
main_idx, margin_idx, _ = classify_lines_by_region(polygons, region_polys)
# 4. Crop line images
line_images = get_line_images(image_bgr, polygons)
# 5. OCR β€” process in reading order (main body first, then margin)
reading_order = list(main_idx) + list(margin_idx)
ordered_images = [line_images[i] for i in reading_order]
texts = recognise_lines_batch(
ocr_model, processor, tokenizer,
ordered_images,
device="cuda",
max_patches=512,
max_len=64,
batch_size=8,
)
# 6. Print results
for i, text in enumerate(texts, start=1):
    print(f"Line {i}: {text}")
full_text = "\n".join(texts)
print("\n--- Full transcription ---")
print(full_text)
```
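Step 3 above assigns each line polygon to the main body or the margin. A minimal sketch of the underlying idea, using centroid-in-box tests against a hypothetical axis-aligned main-text box (the actual `classify_lines_by_region` may use mask or IoU overlap instead):

```python
def centroid(poly):
    # Mean of the polygon vertices as a cheap centroid approximation
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    return sum(xs) / len(xs), sum(ys) / len(ys)

def classify_lines(line_polys, main_box):
    """Split line indices into (main, margin) by centroid location.

    main_box is (x0, y0, x1, y1) for the main text region.
    """
    x0, y0, x1, y1 = main_box
    main, margin = [], []
    for i, poly in enumerate(line_polys):
        cx, cy = centroid(poly)
        (main if x0 <= cx <= x1 and y0 <= cy <= y1 else margin).append(i)
    return main, margin

# Two lines inside the main box, one vertical gloss out in the margin
lines = [
    [(120, 100), (480, 100), (480, 130), (120, 130)],
    [(120, 160), (480, 160), (480, 190), (120, 190)],
    [(520, 100), (600, 100), (600, 300), (520, 300)],
]
print(classify_lines(lines, (100, 80, 500, 700)))  # ([0, 1], [2])
```

Concatenating the two index lists, main first, gives the reading order used in step 5.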
---
## OCR Model Only (no segmentation)
If you already have cropped line images:
```python
from PIL import Image
from pipeline.ocr import load_ocr, recognise_lines_batch
ocr_model, processor, tokenizer = load_ocr("path/to/models/ocr", device="cuda")
# Single line
line_img = Image.open("line.jpg")
texts = recognise_lines_batch(
ocr_model, processor, tokenizer,
[line_img],
device="cuda",
max_patches=512,
max_len=64,
batch_size=1,
)
print(texts[0])
```
---
## Optional: AI Post-Correction (Gemini)
```python
import os
os.environ["GEMINI_API_KEY"] = "your-key"
from pipeline.correction import init_local_llm, correct_full_text_local
corrector = init_local_llm("gemini-2.0-flash")
corrected = correct_full_text_local(corrector, texts)
```
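Since correction is optional, a small guard (a sketch wrapping the functions above) keeps the pipeline usable when no API key is set, falling back to the raw OCR output:

```python
import os

def maybe_correct(texts):
    """Run Gemini post-correction only when GEMINI_API_KEY is set."""
    if not os.environ.get("GEMINI_API_KEY"):
        return texts  # no key: return raw OCR output unchanged
    from pipeline.correction import init_local_llm, correct_full_text_local
    corrector = init_local_llm("gemini-2.0-flash")
    return correct_full_text_local(corrector, texts)
```

`maybe_correct(texts)` then slots in after step 5 of the full pipeline.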
---
## Citation
```bibtex
@misc{hafith2025,
title = {HAFITH: Arabic Manuscript OCR Pipeline},
author = {mdnaseif},
year = {2025},
url = {https://huggingface.co/mdnaseif/hafith-models}
}
```