---
language:
- km
- en
tags:
- ocr
- text-detection
- text-recognition
- khmer
- yolo
- crnn
- ctc
- pipeline
- pytorch
license: mit
---

# mini-kh-OCR — Khmer & English Document OCR Pipeline

An end-to-end OCR pipeline that combines two models to **detect**, **classify**, and **recognise** Khmer and English text from document images.

```
Input Image
     │
     ▼
┌─────────────────────────────┐
│  Text Detection             │  phonsobon/mini-text-detection (YOLO11n)
│  → subject / reference /    │
│    content bounding boxes   │
└─────────────┬───────────────┘
              │  crop each region
              ▼
┌─────────────────────────────┐
│  Text Recognition           │  phonsobon/mini-ocr (CRNN + CTC)
│  → Khmer & English text     │
└─────────────┬───────────────┘
              │
              ▼
  Structured output grouped by class
```

---

## Detection Classes

| ID | Class | Khmer | Description |
|----|-------|-------|-------------|
| `0` | `subject` | កម្មវត្ថុ | Title or subject heading |
| `1` | `reference` | យោង | Reference or citation |
| `2` | `content` | អត្ថបទ | Main body / paragraph text |

---

## Models Used

| Role | Repository |
|------|-----------|
| Text Detection | [phonsobon/mini-text-detection](https://huggingface.co/phonsobon/mini-text-detection) |
| Text Recognition | [phonsobon/mini-ocr](https://huggingface.co/phonsobon/mini-ocr) |

---

## Files

| File | Description |
|------|-------------|
| `mini_kh_ocr.py` | Pipeline class — load and import this |

---

## Installation

```bash
pip install torch torchvision ultralytics huggingface_hub pillow numpy
```

---

## Quick Start

```python
from huggingface_hub import hf_hub_download

# Download pipeline script
pipeline_path = hf_hub_download(
    repo_id="phonsobon/mini-kh-OCR",
    filename="mini_kh_ocr.py",
)

import importlib.util, sys
spec = importlib.util.spec_from_file_location("mini_kh_ocr", pipeline_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
MiniKhOCR = mod.MiniKhOCR

# ── Load pipeline ─────────────────────────────────────────────────────────────
ocr = MiniKhOCR()

# ── Run on an image ───────────────────────────────────────────────────────────
result = ocr("your_document.jpg")
```

---

## Output Format

`result` is a dictionary with the following structure:

```python
{
    "subject":   ["កម្មវត្ថុ: សំណើរសុំច្បាប់"],   # កម្មវត្ថុ — subject/heading texts
    "reference": ["យោង: លេខ ០០១/២៤"],            # យោង — reference texts
    "content":   ["អត្ថបទ...", "..."],            # អត្ថបទ — body paragraph texts
    "regions": [                                  # all detections sorted top → bottom
        {
            "class": "subject",
            "conf":  0.91,
            "box":   {"x1": 10, "y1": 5, "x2": 320, "y2": 40},
            "text":  "កម្មវត្ថុ: សំណើរសុំច្បាប់",
        },
        {
            "class": "reference",
            "conf":  0.87,
            "box":   {"x1": 10, "y1": 50, "x2": 200, "y2": 75},
            "text":  "យោង: លេខ ០០១/២៤",
        },
        ...
    ]
}
```

---

## Usage Examples

### Access text by class

```python
result = ocr("document.jpg")

print("=== SUBJECT ===")
for text in result["subject"]:
    print(text)

print("=== REFERENCE ===")
for text in result["reference"]:
    print(text)

print("=== CONTENT ===")
for text in result["content"]:
    print(text)
```

### Format as a structured document

```python
document = ocr.to_document(result)
print(document)

# Output:
# [SUBJECT]
# កម្មវត្ថុ: សំណើរសុំច្បាប់
#
# [REFERENCE]
# យោង: លេខ ០០១/២៤
#
# [CONTENT]
# អត្ថបទដំបូង
# អត្ថបទទីពីរ
```

### Verbose mode — print each region as it is processed

```python
result = ocr("document.jpg", verbose=True)

# [subject] (10,5)→(320,40) conf=0.91 → 'កម្មវត្ថុ: សំណើរសុំច្បាប់'
# [reference] (10,50)→(200,75) conf=0.87 → 'យោង: លេខ ០០១/២៤'
# [content] (10,90)→(600,120) conf=0.93 → 'អត្ថបទដំបូង'
```

### Get cropped images alongside text

```python
result = ocr("document.jpg", return_crops=True)

for region in result["regions"]:
    print(region["class"], "→", region["text"])
    region["crop"].show()  # PIL Image of the cropped region
```

### Batch processing

```python
import os

folder = "path/to/documents/"
all_results = {}

for fname in os.listdir(folder):
    if fname.lower().endswith((".jpg", ".jpeg", ".png")):
        path = os.path.join(folder, fname)
        result = ocr(path)
        all_results[fname] = {
            "subject": result["subject"],
            "reference": result["reference"],
            "content": result["content"],
        }
        print(f"✅ {fname} — {len(result['regions'])} regions detected")
```

### Export to JSON

```python
import json

result = ocr("document.jpg")

# Remove PIL crops before serialising (not JSON-serialisable)
exportable = {
    "subject": result["subject"],
    "reference": result["reference"],
    "content": result["content"],
    "regions": [
        {k: v for k, v in r.items() if k != "crop"}
        for r in result["regions"]
    ],
}

with open("output.json", "w", encoding="utf-8") as f:
    json.dump(exportable, f, ensure_ascii=False, indent=2)
```

---

## Configuration

```python
ocr = MiniKhOCR(
    det_conf  = 0.25,    # lower → more detections, higher → fewer but more confident
    det_iou   = 0.45,    # NMS IoU threshold
    det_imgsz = 640,     # detection image size
    device    = "auto",  # "auto" | "cuda" | "cpu"
)
```

---

## Limitations

- Designed for **document-style images** (printed text, clear layout).
- Text recognition works best on **single-line crops** — very tall content regions spanning multiple lines may merge lines together.
- Handwritten text is not supported.

---

## License

MIT
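## Filtering Regions by Confidence (sketch)

Each entry in `result["regions"]` carries a `conf` score, so low-confidence detections can be dropped after the fact without re-running the pipeline with a different `det_conf`. The block below is a minimal sketch, assuming the result dictionary has exactly the shape shown under Output Format. `filter_by_confidence` and `min_conf` are hypothetical names and are not part of `mini_kh_ocr.py`, and the sample data is hand-written for illustration rather than real pipeline output.

```python
def filter_by_confidence(result, min_conf=0.5):
    """Return a new result dict keeping only regions with conf >= min_conf.

    Hypothetical helper: assumes the documented layout, i.e. a "regions"
    list of dicts that each have "class", "conf", "box" and "text" keys.
    """
    kept = [r for r in result["regions"] if r["conf"] >= min_conf]

    # Rebuild the per-class text lists from the surviving regions,
    # preserving the order of the original "regions" list.
    filtered = {"subject": [], "reference": [], "content": []}
    for region in kept:
        if region["class"] in filtered:
            filtered[region["class"]].append(region["text"])

    filtered["regions"] = kept
    return filtered


# Hand-written sample in the documented shape (not real pipeline output).
sample = {
    "subject": ["Subject line"],
    "reference": ["Reference line"],
    "content": ["Body line"],
    "regions": [
        {"class": "subject", "conf": 0.91,
         "box": {"x1": 10, "y1": 5, "x2": 320, "y2": 40},
         "text": "Subject line"},
        {"class": "reference", "conf": 0.30,
         "box": {"x1": 10, "y1": 50, "x2": 200, "y2": 75},
         "text": "Reference line"},
        {"class": "content", "conf": 0.93,
         "box": {"x1": 10, "y1": 90, "x2": 600, "y2": 120},
         "text": "Body line"},
    ],
}

confident = filter_by_confidence(sample, min_conf=0.5)
print(confident["reference"])     # the 0.30 reference region is dropped
print(len(confident["regions"]))  # two regions remain
```

The helper returns a new dictionary and leaves the input untouched, so the same `result` can be filtered at several thresholds for comparison.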