---
language:
- ar
tags:
- ocr
- arabic
- manuscript
- document-understanding
- rtmdet
- siglip2
- qwen3
pipeline_tag: image-to-text
license: apache-2.0
---

# HAFITH - حافظ · Arabic Manuscript OCR

OCR pipeline for Arabic historical manuscripts. Given a manuscript image it:
1. **Detects text regions** (main body vs. margin) with YOLO
2. **Segments individual lines** with RTMDet instance segmentation
3. **Recognises text per line** with SigLIP2 NaFlex + Qwen3-0.6B (Prefix-LM)
4. **Corrects OCR errors** with a Gemini LLM (optional, requires an API key)

---

## Model Files

| File | Description | Size |
|---|---|---|
| `lines.pth` | RTMDet-m line segmentation weights | 242 MB |
| `regions.pt` | YOLO region detection weights | 117 MB |
| `ocr/model.pt` | SigLIP2 + Qwen3-0.6B OCR weights | 3.9 GB |
| `ocr/qwen_tokenizer/` | Qwen3 tokenizer files | - |
| `ocr/siglip_processor/` | SigLIP2 image processor config | - |
| `rtmdet_lines.py` | RTMDet model config | - |

---

## Architecture

```
Input image
    │
    ├─► YOLO (regions.pt)
    │       └─ Bounding boxes: main text body vs. margin
    │
    ├─► RTMDet (lines.pth + rtmdet_lines.py)
    │       └─ Instance segmentation masks → line polygons (reading order)
    │
    └─► Per-line crops
            └─► SigLIP2 NaFlex encoder → Linear(1152→1024) → Qwen3-0.6B decoder
                        └─ Arabic text string per line
```

The OCR model is a custom Prefix-LM: visual patch embeddings from SigLIP2 are
prepended as a visual prefix to Qwen3's input embedding space, followed by a
BOS anchor token. The decoder autoregressively generates Arabic text tokens.
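
As a rough illustration of the sequence layout described above, the sketch below lists the positions the decoder sees (visual prefix, then BOS, then text tokens) and builds a conventional prefix-LM attention mask. It is pure Python and does not touch the real model. The function names, the tiny sizes, and the bidirectional-prefix mask are assumptions for illustration only; this card does not state which mask HAFITH actually uses.

```python
def build_layout(num_patches, num_text_tokens):
    """Label each decoder position: visual prefix, then BOS, then text."""
    return ["patch"] * num_patches + ["bos"] + ["text"] * num_text_tokens


def prefix_lm_mask(num_patches, seq_len):
    """mask[i][j] is True when position i may attend to position j.

    Assumed convention: every position sees the whole visual prefix;
    outside the prefix, attention is causal (j <= i).
    """
    return [
        [j < num_patches or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy sizes: 3 visual patches, 2 generated text tokens.
layout = build_layout(num_patches=3, num_text_tokens=2)
mask = prefix_lm_mask(num_patches=3, seq_len=len(layout))

# Print one row per position; "x" marks an allowed attention target.
for label, row in zip(layout, mask):
    print(f"{label:>5}", "".join("x" if ok else "." for ok in row))
```

In the real model the prefix length varies per line image, since NaFlex produces a variable number of patches up to the `max_patches` limit passed at inference.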

---

## Requirements

```bash
pip install torch torchvision transformers ultralytics opencv-python-headless \
            Pillow numpy google-genai huggingface_hub

# mmcv must be built from source (no pre-built wheel for torch 2.9 + CUDA 12.8)
git clone --depth=1 --branch v2.1.0 https://github.com/open-mmlab/mmcv.git /opt/mmcv
cd /opt/mmcv && MMCV_WITH_OPS=1 pip install -e . --no-build-isolation
pip install mmdet mmengine
```

---

## Quick Start

```python
from huggingface_hub import snapshot_download

# Download all model files
model_dir = snapshot_download("mdnaseif/hafith-models")
```

Then run the full pipeline; see [`inference.py`](inference.py).
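
Before loading anything, it can help to confirm that the download is complete. The following is a minimal sketch that checks for the entries listed in the Model Files table. `missing_model_files` and `EXPECTED_FILES` are hypothetical names introduced here and are not part of the HAFITH code.

```python
from pathlib import Path

# Relative paths taken from the Model Files table above.
EXPECTED_FILES = [
    "lines.pth",
    "regions.pt",
    "ocr/model.pt",
    "ocr/qwen_tokenizer",
    "ocr/siglip_processor",
    "rtmdet_lines.py",
]


def missing_model_files(model_dir):
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).exists()]
```

Calling `missing_model_files(model_dir)` on the `snapshot_download()` output should return an empty list when every file is present.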

---

## Full Pipeline Inference

```python
import sys
sys.path.insert(0, "path/to/hafith_mvp/app")   # add app/ to Python path

from pipeline import (
    load_lines_model, load_regions_model,
    load_ocr,
    segment, detect_regions, classify_lines_by_region,
    get_line_images, recognise_lines_batch,
)

MODELS_DIR = "path/to/models"   # local snapshot_download() output

# 1. Load models (one-time, ~30-90 s on first run)
lines_model = load_lines_model(
    config_path=f"{MODELS_DIR}/rtmdet_lines.py",
    checkpoint_path=f"{MODELS_DIR}/lines.pth",
    device="cuda",
)
regions_model = load_regions_model(f"{MODELS_DIR}/regions.pt")
ocr_model, processor, tokenizer = load_ocr(f"{MODELS_DIR}/ocr", device="cuda")

# 2. Segment lines
image_bgr, polygons = segment(lines_model, "manuscript.jpg")

# 3. Classify main text vs. margin
region_polys, _ = detect_regions(regions_model, "manuscript.jpg")
main_idx, margin_idx, _ = classify_lines_by_region(polygons, region_polys)

# 4. Crop line images
line_images = get_line_images(image_bgr, polygons)

# 5. OCR: process in reading order (main body first, then margin)
reading_order = list(main_idx) + list(margin_idx)
ordered_images = [line_images[i] for i in reading_order]

texts = recognise_lines_batch(
    ocr_model, processor, tokenizer,
    ordered_images,
    device="cuda",
    max_patches=512,
    max_len=64,
    batch_size=8,
)

# 6. Print results
for i, text in enumerate(texts, start=1):
    print(f"Line {i}: {text}")

full_text = "\n".join(texts)
print("\n--- Full transcription ---")
print(full_text)
```
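
Because the OCR step runs on main-body lines first and margin lines second, the flat `texts` list can be split back into labelled sections. The sketch below is one possible way to do that. `format_by_region` is a hypothetical helper, not part of the pipeline, and it assumes `texts` follows the `list(main_idx) + list(margin_idx)` order built in step 5.

```python
def format_by_region(texts, main_idx, margin_idx):
    """Split OCR output into labelled main-body and margin sections.

    Assumes texts is ordered as all main-body lines followed by all
    margin lines, one string per line.
    """
    n_main, n_margin = len(main_idx), len(margin_idx)
    if len(texts) != n_main + n_margin:
        raise ValueError("texts must hold one entry per main and margin line")

    sections = [("Main text", texts[:n_main]), ("Margin", texts[n_main:])]
    parts = []
    for title, lines in sections:
        if lines:  # skip empty sections, e.g. a page with no margin notes
            parts.append(f"[{title}]\n" + "\n".join(lines))
    return "\n\n".join(parts)
```

For example, `print(format_by_region(texts, main_idx, margin_idx))` prints the body text under `[Main text]` and any marginalia under `[Margin]`.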

---

## OCR Model Only (no segmentation)

If you already have cropped line images:

```python
from PIL import Image
from pipeline.ocr import load_ocr, recognise_lines_batch

ocr_model, processor, tokenizer = load_ocr("path/to/models/ocr", device="cuda")

# Single line
line_img = Image.open("line.jpg")
texts = recognise_lines_batch(
    ocr_model, processor, tokenizer,
    [line_img],
    device="cuda",
    max_patches=512,
    max_len=64,
    batch_size=1,
)
print(texts[0])
```

---

## Optional: AI Post-Correction (Gemini)

```python
import os
os.environ["GEMINI_API_KEY"] = "your-key"

from pipeline.correction import init_local_llm, correct_full_text_local

corrector = init_local_llm("gemini-2.0-flash")
corrected = correct_full_text_local(corrector, texts)
```
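
Since correction is optional, a caller may want to skip it when no key is configured. The following is a minimal sketch of such a guard. `correct_if_configured` is a hypothetical wrapper; it takes the two functions imported above as arguments so that it makes no assumption about their internals or return types.

```python
import os


def correct_if_configured(texts, init_fn, correct_fn, model_name="gemini-2.0-flash"):
    """Run post-correction only when GEMINI_API_KEY is set.

    Without a key, the raw OCR output is returned unchanged.
    """
    if not os.environ.get("GEMINI_API_KEY"):
        return texts
    corrector = init_fn(model_name)
    return correct_fn(corrector, texts)
```

A call would look like `correct_if_configured(texts, init_local_llm, correct_full_text_local)`.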

---

## Citation

```bibtex
@misc{hafith2025,
  title  = {HAFITH: Arabic Manuscript OCR Pipeline},
  author = {mdnaseif},
  year   = {2025},
  url    = {https://huggingface.co/mdnaseif/hafith-models}
}
```