---
language:
- ar
tags:
- ocr
- arabic
- manuscript
- document-understanding
- rtmdet
- siglip2
- qwen3
pipeline_tag: image-to-text
license: apache-2.0
---

# HAFITH – حافظ · Arabic Manuscript OCR

OCR pipeline for Arabic historical manuscripts. Given a manuscript image, it:

1. **Detects text regions** (main body vs. margin) → YOLO
2. **Segments individual lines** → RTMDet instance segmentation
3. **Recognises text per line** → SigLIP2 NaFlex + Qwen3-0.6B (Prefix-LM)
4. **Corrects OCR errors** → Gemini LLM (optional; requires an API key)

---

## Model Files

| File | Description | Size |
|---|---|---|
| `lines.pth` | RTMDet-m line segmentation weights | 242 MB |
| `regions.pt` | YOLO region detection weights | 117 MB |
| `ocr/model.pt` | SigLIP2 + Qwen3-0.6B OCR weights | 3.9 GB |
| `ocr/qwen_tokenizer/` | Qwen3 tokenizer files | – |
| `ocr/siglip_processor/` | SigLIP2 image processor config | – |
| `rtmdet_lines.py` | RTMDet model config | – |

---
## Architecture

```
Input image
     │
     ├─► YOLO (regions.pt)
     │     └─ Bounding boxes: main text body vs. margin
     │
     ├─► RTMDet (lines.pth + rtmdet_lines.py)
     │     └─ Instance segmentation masks → line polygons (reading order)
     │
     └─► Per-line crops
           └─► SigLIP2 NaFlex encoder → Linear(1152→1024) → Qwen3-0.6B decoder
                 └─ Arabic text string per line
```
The OCR model is a custom Prefix-LM: visual patch embeddings from SigLIP2 are
projected into Qwen3's input embedding space and prepended as a visual prefix,
followed by a BOS anchor token. The decoder then autoregressively generates the
Arabic text tokens.
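A minimal sketch of that wiring, assuming placeholder values for everything except the Linear(1152→1024) projection: the single stand-in decoder layer, the small vocabulary, and the absent prefix attention mask make this an illustration, not the shipped model.

```python
import torch
import torch.nn as nn

class VisualPrefixLM(nn.Module):
    """Toy sketch of the Prefix-LM wiring described above.

    Real model: SigLIP2 NaFlex encoder + Qwen3-0.6B decoder. Here the
    decoder is a single stand-in layer and vocab_size is a placeholder;
    only the Linear(1152 -> 1024) projection mirrors the model card.
    """

    def __init__(self, vis_dim=1152, txt_dim=1024, vocab_size=1000):
        super().__init__()
        self.proj = nn.Linear(vis_dim, txt_dim)       # Linear(1152 -> 1024)
        self.tok_emb = nn.Embedding(vocab_size, txt_dim)
        self.decoder = nn.TransformerEncoderLayer(    # stand-in for the Qwen3 stack
            d_model=txt_dim, nhead=8, batch_first=True
        )
        self.lm_head = nn.Linear(txt_dim, vocab_size)

    def forward(self, patch_embeds, input_ids):
        vis = self.proj(patch_embeds)                 # (B, P, 1024) visual prefix
        txt = self.tok_emb(input_ids)                 # (B, T, 1024), BOS anchor first
        hidden = self.decoder(torch.cat([vis, txt], dim=1))
        return self.lm_head(hidden[:, vis.size(1):])  # logits for the text positions

model = VisualPrefixLM()
logits = model(torch.randn(2, 16, 1152), torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 1000])
```

A real Prefix-LM also attends bidirectionally over the visual prefix while staying causal over the text tokens; that masking is omitted here for brevity.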

---

## Requirements

```bash
pip install torch torchvision transformers ultralytics opencv-python-headless \
    Pillow numpy google-genai huggingface_hub

# mmcv must be built from source (no pre-built wheel for torch 2.9 + CUDA 12.8)
git clone --depth=1 --branch v2.1.0 https://github.com/open-mmlab/mmcv.git /opt/mmcv
cd /opt/mmcv && MMCV_WITH_OPS=1 pip install -e . --no-build-isolation
pip install mmdet mmengine
```

---
## Quick Start

```python
from huggingface_hub import snapshot_download

# Download all model files
model_dir = snapshot_download("mdnaseif/hafith-models")
```

Then run the full pipeline – see [`inference.py`](inference.py).

---

## Full Pipeline Inference

```python
import sys
sys.path.insert(0, "path/to/hafith_mvp/app")  # add app/ to the Python path

from pipeline import (
    load_lines_model, load_regions_model,
    load_ocr,
    segment, detect_regions, classify_lines_by_region,
    get_line_images, recognise_lines_batch,
)

MODELS_DIR = "path/to/models"  # local snapshot_download() output

# 1. Load models (one-time, ~30–90s on first run)
lines_model = load_lines_model(
    config_path=f"{MODELS_DIR}/rtmdet_lines.py",
    checkpoint_path=f"{MODELS_DIR}/lines.pth",
    device="cuda",
)
regions_model = load_regions_model(f"{MODELS_DIR}/regions.pt")
ocr_model, processor, tokenizer = load_ocr(f"{MODELS_DIR}/ocr", device="cuda")

# 2. Segment lines
image_bgr, polygons = segment(lines_model, "manuscript.jpg")

# 3. Classify main text vs. margin
region_polys, _ = detect_regions(regions_model, "manuscript.jpg")
main_idx, margin_idx, _ = classify_lines_by_region(polygons, region_polys)

# 4. Crop line images
line_images = get_line_images(image_bgr, polygons)

# 5. OCR – process in reading order (main body first, then margin)
reading_order = list(main_idx) + list(margin_idx)
ordered_images = [line_images[i] for i in reading_order]

texts = recognise_lines_batch(
    ocr_model, processor, tokenizer,
    ordered_images,
    device="cuda",
    max_patches=512,
    max_len=64,
    batch_size=8,
)

# 6. Print results
for i, text in enumerate(texts, start=1):
    print(f"Line {i}: {text}")

full_text = "\n".join(texts)
print("\n--- Full transcription ---")
print(full_text)
```

---

## OCR Model Only (no segmentation)

If you already have cropped line images:

```python
from PIL import Image
from pipeline.ocr import load_ocr, recognise_lines_batch

ocr_model, processor, tokenizer = load_ocr("path/to/models/ocr", device="cuda")

# Single line
line_img = Image.open("line.jpg")
texts = recognise_lines_batch(
    ocr_model, processor, tokenizer,
    [line_img],
    device="cuda",
    max_patches=512,
    max_len=64,
    batch_size=1,
)
print(texts[0])
```
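Cropped line images can arrive in any PIL mode. A small normalisation helper avoids mode and size surprises before batching; note this is a hypothetical safeguard (`to_rgb_line` and its width cap are not part of the pipeline, and the NaFlex processor does its own patch-based resizing):

```python
from PIL import Image

def to_rgb_line(img: Image.Image, max_width: int = 2048) -> Image.Image:
    """Normalise a line crop before OCR: force RGB and cap extreme widths.

    Hypothetical helper, not part of the HAFITH pipeline; it only guards
    against grayscale/CMYK inputs and pathologically wide crops.
    """
    if img.mode != "RGB":
        img = img.convert("RGB")
    if img.width > max_width:
        ratio = max_width / img.width
        img = img.resize((max_width, max(1, round(img.height * ratio))))
    return img

line = to_rgb_line(Image.new("L", (4096, 64)))
print(line.mode, line.size)  # RGB (2048, 32)
```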

---

## Optional: AI Post-Correction (Gemini)

```python
import os
os.environ["GEMINI_API_KEY"] = "your-key"

from pipeline.correction import init_local_llm, correct_full_text_local

corrector = init_local_llm("gemini-2.0-flash")
corrected = correct_full_text_local(corrector, texts)
```

---
## Citation

```bibtex
@misc{hafith2025,
  title  = {HAFITH: Arabic Manuscript OCR Pipeline},
  author = {mdnaseif},
  year   = {2025},
  url    = {https://huggingface.co/mdnaseif/hafith-models}
}
```