	---
	language:
	- km
	- en
	tags:
	- ocr
	- text-detection
	- text-recognition
	- khmer
	- yolo
	- crnn
	- ctc
	- pipeline
	- pytorch
	license: mit
	---

	# mini-kh-OCR — Khmer & English Document OCR Pipeline

	An end-to-end OCR pipeline that combines two models to detect, classify, and recognise Khmer and English text from document images.

	```
	Input Image
	      │
	      ▼
	┌─────────────────────────────┐
	│  Text Detection             │  phonsobon/mini-text-detection (YOLO11n)
	│  → subject / reference /    │
	│    content bounding boxes   │
	└─────────────┬───────────────┘
	              │  crop each region
	              ▼
	┌─────────────────────────────┐
	│  Text Recognition           │  phonsobon/mini-ocr (CRNN + CTC)
	│  → Khmer & English text     │
	└─────────────┬───────────────┘
	              │
	              ▼
	      Structured output
	      grouped by class
	```
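
The flow in the diagram can be sketched in a few lines of glue code. This is only an illustration under stated assumptions, not the code in `mini_kh_ocr.py`: `run_two_stage`, `fake_detect`, `fake_crop`, and `fake_recognise` are hypothetical names, and the stand-in stages return canned values so the sketch runs without model weights.

```python
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]      # (x1, y1, x2, y2) in pixels
Detection = Tuple[str, Box, float]   # (class name, box, confidence)


def run_two_stage(
    image: Any,
    detect: Callable[[Any], List[Detection]],
    crop: Callable[[Any, Box], Any],
    recognise: Callable[[Any], str],
) -> Dict[str, Any]:
    """Hypothetical glue code: detect regions, crop them, read each crop."""
    result: Dict[str, Any] = {"subject": [], "reference": [], "content": [], "regions": []}
    # Process regions top to bottom so the per-class lists follow reading order.
    for cls, box, conf in sorted(detect(image), key=lambda d: d[1][1]):
        text = recognise(crop(image, box))
        result[cls].append(text)
        result["regions"].append({
            "class": cls,
            "conf": conf,
            "box": dict(zip(("x1", "y1", "x2", "y2"), box)),
            "text": text,
        })
    return result


# Stand-in stages (assumptions for illustration only).
def fake_detect(image):
    return [("content", (10, 90, 600, 120), 0.93), ("subject", (10, 5, 320, 40), 0.91)]


def fake_crop(image, box):
    return box  # a real crop would return an image patch


def fake_recognise(patch):
    return f"text@y={patch[1]}"


demo = run_two_stage("page.jpg", fake_detect, fake_crop, fake_recognise)
print([r["class"] for r in demo["regions"]])  # ['subject', 'content']
```

Passing the stages in as callables keeps the sketch independent of any particular detector or recogniser API.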

	---

	## Detection Classes

	| ID | Class | Khmer | Description |
	|----|-------|-------|-------------|
	| `0` | `subject` | កម្មវត្ថុ | Title or subject heading |
	| `1` | `reference` | យោង | Reference or citation |
	| `2` | `content` | អត្ថបទ | Main body / paragraph text |
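
If you work with raw detector class IDs, a small lookup built from the table may be handy. The names `CLASS_NAMES`, `KHMER_LABELS`, and `describe_class` are hypothetical helpers for this sketch; the pipeline output itself already uses class names rather than IDs.

```python
# Class IDs and labels copied from the table above.
CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}
KHMER_LABELS = {"subject": "កម្មវត្ថុ", "reference": "យោង", "content": "អត្ថបទ"}


def describe_class(class_id: int) -> str:
    """Return 'name (Khmer label)' for a detector class ID."""
    name = CLASS_NAMES[class_id]
    return f"{name} ({KHMER_LABELS[name]})"


for class_id in sorted(CLASS_NAMES):
    print(class_id, describe_class(class_id))
```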

	---

	## Models Used

	| Role | Repository |
	|------|-----------|
	| Text Detection | [phonsobon/mini-text-detection](https://huggingface.co/phonsobon/mini-text-detection) |
	| Text Recognition | [phonsobon/mini-ocr](https://huggingface.co/phonsobon/mini-ocr) |

	---

	## Files

	| File | Description |
	|------|-------------|
	| `mini_kh_ocr.py` | Pipeline script defining the `MiniKhOCR` class; download and import it |

	---

	## Installation

	```bash
	pip install torch torchvision ultralytics huggingface_hub pillow numpy
	```

	---

	## Quick Start

	```python
	import importlib.util

	from huggingface_hub import hf_hub_download

	# Download the pipeline script from the Hub
	pipeline_path = hf_hub_download(
	    repo_id="phonsobon/mini-kh-OCR",
	    filename="mini_kh_ocr.py",
	)

	# Import the downloaded script as a module
	spec = importlib.util.spec_from_file_location("mini_kh_ocr", pipeline_path)
	mod = importlib.util.module_from_spec(spec)
	spec.loader.exec_module(mod)

	MiniKhOCR = mod.MiniKhOCR

	# ── Load pipeline ─────────────────────────────────────────────────────────────
	ocr = MiniKhOCR()

	# ── Run on an image ───────────────────────────────────────────────────────────
	result = ocr("your_document.jpg")
	```

	---

	## Output Format

	`result` is a dictionary with the following structure:

	```python
	{
	    "subject": ["កម្មវត្ថុ: សំណើរសុំច្បាប់"],  # កម្មវត្ថុ: subject/heading texts
	    "reference": ["យោង: លេខ ០០១/២៤"],        # យោង: reference texts
	    "content": ["អត្ថបទ...", "..."],          # អត្ថបទ: body paragraph texts

	    "regions": [  # all detections sorted top → bottom
	        {
	            "class": "subject",
	            "conf": 0.91,
	            "box": {"x1": 10, "y1": 5, "x2": 320, "y2": 40},
	            "text": "កម្មវត្ថុ: សំណើរសុំច្បាប់",
	        },
	        {
	            "class": "reference",
	            "conf": 0.87,
	            "box": {"x1": 10, "y1": 50, "x2": 200, "y2": 75},
	            "text": "យោង: លេខ ០០១/២៤",
	        },
	        ...
	    ]
	}
	```
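
Because every region carries its class, box, and text, the result can be re-ordered or filtered without rerunning the models. The sketch below joins region texts in reading order; `reading_order_text` is a hypothetical helper, and the sample values are made up to match the documented shape.

```python
from typing import Any, Dict, List, Optional


def reading_order_text(result: Dict[str, Any], classes: Optional[List[str]] = None) -> str:
    """Join region texts top to bottom, then left to right, optionally filtered by class."""
    regions = result["regions"]
    if classes is not None:
        regions = [r for r in regions if r["class"] in classes]
    ordered = sorted(regions, key=lambda r: (r["box"]["y1"], r["box"]["x1"]))
    return "\n".join(r["text"] for r in ordered)


# Sample data in the documented shape (values invented for the example).
sample = {
    "subject": ["Subject line"],
    "reference": ["Ref 001"],
    "content": [],
    "regions": [
        {"class": "reference", "conf": 0.87,
         "box": {"x1": 10, "y1": 50, "x2": 200, "y2": 75}, "text": "Ref 001"},
        {"class": "subject", "conf": 0.91,
         "box": {"x1": 10, "y1": 5, "x2": 320, "y2": 40}, "text": "Subject line"},
    ],
}

print(reading_order_text(sample))
print(reading_order_text(sample, classes=["reference"]))
```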

	---

	## Usage Examples

	### Access text by class

	```python
	result = ocr("document.jpg")

	print("=== SUBJECT ===")
	for text in result["subject"]:
	    print(text)

	print("=== REFERENCE ===")
	for text in result["reference"]:
	    print(text)

	print("=== CONTENT ===")
	for text in result["content"]:
	    print(text)
	```

	### Format as a structured document

	```python
	document = ocr.to_document(result)
	print(document)

	# Output:
	# [SUBJECT]
	# កម្មវត្ថុ: សំណើរសុំច្បាប់
	#
	# [REFERENCE]
	# យោង: លេខ ០០១/២៤
	#
	# [CONTENT]
	# អត្ថបទដំបូង
	# អត្ថបទទីពីរ
	```

	### Verbose mode — print each region as it is processed

	```python
	result = ocr("document.jpg", verbose=True)

	# [subject] (10,5)→(320,40) conf=0.91 → 'កម្មវត្ថុ: សំណើរសុំច្បាប់'
	# [reference] (10,50)→(200,75) conf=0.87 → 'យោង: លេខ ០០១/២៤'
	# [content] (10,90)→(600,120) conf=0.93 → 'អត្ថបទដំបូង'
	```

	### Get cropped images alongside text

	```python
	result = ocr("document.jpg", return_crops=True)

	for region in result["regions"]:
	    print(region["class"], "→", region["text"])
	    region["crop"].show()  # PIL Image of the cropped region
	```

	### Batch processing

	```python
	import os

	folder = "path/to/documents/"
	all_results = {}

	for fname in os.listdir(folder):
	    if fname.lower().endswith((".jpg", ".jpeg", ".png")):
	        path = os.path.join(folder, fname)
	        result = ocr(path)
	        all_results[fname] = {
	            "subject": result["subject"],
	            "reference": result["reference"],
	            "content": result["content"],
	        }
	        print(f"✅ {fname} — {len(result['regions'])} regions detected")
	```

	### Export to JSON

	```python
	import json

	result = ocr("document.jpg")

	# Drop PIL crops if present (return_crops=True); they are not JSON-serialisable
	exportable = {
	    "subject": result["subject"],
	    "reference": result["reference"],
	    "content": result["content"],
	    "regions": [
	        {k: v for k, v in r.items() if k != "crop"}
	        for r in result["regions"]
	    ],
	}

	# ensure_ascii=False keeps Khmer text readable instead of \uXXXX escapes
	with open("output.json", "w", encoding="utf-8") as f:
	    json.dump(exportable, f, ensure_ascii=False, indent=2)
	```

	---

	## Configuration

	```python
	ocr = MiniKhOCR(
	    det_conf=0.25,   # lower → more detections, higher → fewer but more confident
	    det_iou=0.45,    # NMS IoU threshold
	    det_imgsz=640,   # detection image size
	    device="auto",   # "auto" | "cuda" | "cpu"
	)
	```
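
To build intuition for `det_iou`: in standard non-maximum suppression, a lower-confidence box is dropped when its intersection over union (IoU) with a kept box exceeds the threshold. The sketch below computes IoU for boxes in the same `x1`/`y1`/`x2`/`y2` format as the output. The `iou` helper is hypothetical, and how the detector applies the threshold internally is not shown in this README.

```python
from typing import Dict

Box = Dict[str, float]  # same keys as the "box" entries in the output


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a["x2"], b["x2"]) - max(a["x1"], b["x1"]))
    inter_h = max(0.0, min(a["y2"], b["y2"]) - max(a["y1"], b["y1"]))
    inter = inter_w * inter_h
    area_a = (a["x2"] - a["x1"]) * (a["y2"] - a["y1"])
    area_b = (b["x2"] - b["x1"]) * (b["y2"] - b["y1"])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


left = {"x1": 0, "y1": 0, "x2": 10, "y2": 10}
right = {"x1": 5, "y1": 0, "x2": 15, "y2": 10}

overlap = iou(left, right)
print(round(overlap, 3))  # 0.333
print(overlap > 0.45)     # False: under standard NMS with det_iou=0.45, both boxes survive
```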

	---

	## Limitations

	- Designed for document-style images (printed text, clear layout).
	- Text recognition works best on single-line crops; tall content regions that span several lines may be read as one merged line.
	- Handwritten text is not supported.
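
One possible workaround for the multi-line limitation is to split a tall crop into line bands before recognition, using a horizontal projection profile (the amount of "ink" per pixel row). This is a sketch of a generic technique, not something the pipeline is documented to do. `find_line_bands` and its parameters are hypothetical, and Khmer subscripts and above/below vowel signs may need a larger `min_height` or extra padding around each band.

```python
from typing import List, Tuple


def find_line_bands(row_ink: List[int], min_ink: int = 1, min_height: int = 2) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges whose ink count stays at or above min_ink.

    row_ink[y] is the number of dark pixels in row y; bottom is exclusive.
    Bands shorter than min_height rows are discarded as noise.
    """
    bands: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(row_ink):
        if ink >= min_ink and start is None:
            start = y                      # entering a text line
        elif ink < min_ink and start is not None:
            if y - start >= min_height:
                bands.append((start, y))   # leaving a text line
            start = None
    # Close a band that runs to the last row.
    if start is not None and len(row_ink) - start >= min_height:
        bands.append((start, len(row_ink)))
    return bands


# Toy profile: two text lines separated by three blank rows.
profile = [0, 0, 5, 7, 6, 0, 0, 0, 4, 8, 3, 0]
print(find_line_bands(profile))  # [(2, 5), (8, 11)]
```

With `return_crops=True`, a profile for a region could be computed as `(np.asarray(region["crop"].convert("L")) < 128).sum(axis=1)`, assuming dark text on a light background. Each band can then be cropped with `region["crop"].crop((0, top, width, bottom))` and passed to the recogniser separately.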

	---

	## License

	MIT