| --- |
| language: |
| - km |
| - en |
| tags: |
| - ocr |
| - text-detection |
| - text-recognition |
| - khmer |
| - yolo |
| - crnn |
| - ctc |
| - pipeline |
| - pytorch |
| license: mit |
| --- |
| |
| # mini-kh-OCR: Khmer & English Document OCR Pipeline |
|
|
| An end-to-end OCR pipeline that combines two models to **detect**, **classify**, and **recognise** Khmer and English text from document images. |
|
|
| ``` |
|           Input Image |
|                │ |
|                ▼ |
| ┌─────────────────────────────┐ |
| │ Text Detection              │  phonsobon/mini-text-detection (YOLO11n) |
| │ → subject / reference /     │ |
| │   content bounding boxes    │ |
| └──────────────┬──────────────┘ |
|                │  crop each region |
|                ▼ |
| ┌─────────────────────────────┐ |
| │ Text Recognition            │  phonsobon/mini-ocr (CRNN + CTC) |
| │ → Khmer & English text      │ |
| └──────────────┬──────────────┘ |
|                │ |
|                ▼ |
|        Structured output |
|        grouped by class |
| ``` |
|
|
| --- |
|
|
| ## Detection Classes |
|
|
| | ID | Class | Khmer | Description | |
| |----|-------|-------|-------------| |
| | `0` | `subject` | កម្មវត្ថុ | Title or subject heading | |
| | `1` | `reference` | លេខ | Reference or citation | |
| | `2` | `content` | អត្ថបទ | Main body / paragraph text | |
|
|
| --- |
|
|
| ## Models Used |
|
|
| | Role | Repository | |
| |------|-----------| |
| | Text Detection | [phonsobon/mini-text-detection](https://huggingface.co/phonsobon/mini-text-detection) | |
| | Text Recognition | [phonsobon/mini-ocr](https://huggingface.co/phonsobon/mini-ocr) | |
|
|
| --- |
|
|
| ## Files |
|
|
| | File | Description | |
| |------|-------------| |
| | `mini_kh_ocr.py` | Pipeline class (download this file and import it) | |
|
|
| --- |
|
|
| ## Installation |
|
|
| ```bash |
| pip install torch torchvision ultralytics huggingface_hub pillow numpy |
| ``` |
|
|
| --- |
|
|
| ## Quick Start |
|
|
| ```python |
| import importlib.util |
| |
| from huggingface_hub import hf_hub_download |
| |
| # Download the pipeline script from the Hub |
| pipeline_path = hf_hub_download( |
|     repo_id="phonsobon/mini-kh-OCR", |
|     filename="mini_kh_ocr.py", |
| ) |
| |
| # Import the downloaded script as a module |
| spec = importlib.util.spec_from_file_location("mini_kh_ocr", pipeline_path) |
| mod = importlib.util.module_from_spec(spec) |
| spec.loader.exec_module(mod) |
| |
| MiniKhOCR = mod.MiniKhOCR |
| |
| # ── Load pipeline ── |
| ocr = MiniKhOCR() |
| |
| # ── Run on an image ── |
| result = ocr("your_document.jpg") |
| ``` |
|
|
| --- |
|
|
| ## Output Format |
|
|
| `result` is a dictionary with the following structure: |
|
|
| ```python |
| { |
|     "subject": ["កម្មវត្ថុ: ..."],        # កម្មវត្ថុ: subject/heading texts |
|     "reference": ["លេខ: ...០០១/២៤"],     # លេខ: reference texts |
|     "content": ["អត្ថបទ...", "..."],      # អត្ថបទ: body paragraph texts |
| |
|     "regions": [                          # all detections, sorted top → bottom |
|         { |
|             "class": "subject", |
|             "conf": 0.91, |
|             "box": {"x1": 10, "y1": 5, "x2": 320, "y2": 40}, |
|             "text": "កម្មវត្ថុ: ...", |
|         }, |
|         { |
|             "class": "reference", |
|             "conf": 0.87, |
|             "box": {"x1": 10, "y1": 50, "x2": 200, "y2": 75}, |
|             "text": "លេខ: ...០០១/២៤", |
|         }, |
|         ... |
|     ] |
| } |
| ``` |
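
Because every entry in `regions` carries its class and confidence, the result is easy to summarise. The sketch below counts regions per class and averages their confidence. `summarise_regions` is a hypothetical helper, not part of `mini_kh_ocr.py`, and the sample dictionary is hand-written to mimic the structure above rather than real model output.

```python
from collections import defaultdict


def summarise_regions(result):
    """Return {class_name: (count, mean_conf)} for a result dict (hypothetical helper)."""
    confs = defaultdict(list)
    for region in result["regions"]:
        confs[region["class"]].append(region["conf"])
    return {name: (len(vals), sum(vals) / len(vals)) for name, vals in confs.items()}


# Hand-written sample that only mimics the documented structure
sample = {
    "regions": [
        {"class": "subject", "conf": 0.9, "text": "..."},
        {"class": "content", "conf": 0.7, "text": "..."},
        {"class": "content", "conf": 0.9, "text": "..."},
    ]
}

summary = summarise_regions(sample)
for name, (count, mean_conf) in summary.items():
    print(f"{name}: {count} region(s), mean conf {mean_conf:.2f}")
```

Classes with no detections are simply absent from the summary, so use `summary.get(name)` when a class may be missing.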
|
|
| --- |
|
|
| ## Usage Examples |
|
|
| ### Access text by class |
|
|
| ```python |
| result = ocr("document.jpg") |
| |
| print("=== SUBJECT ===") |
| for text in result["subject"]: |
|     print(text) |
| |
| print("=== REFERENCE ===") |
| for text in result["reference"]: |
|     print(text) |
| |
| print("=== CONTENT ===") |
| for text in result["content"]: |
|     print(text) |
| ``` |
|
|
| ### Format as a structured document |
|
|
| ```python |
| document = ocr.to_document(result) |
| print(document) |
| |
| # Output: |
| # [SUBJECT] |
| # កម្មវត្ថុ: ... |
| # |
| # [REFERENCE] |
| # លេខ: ...០០១/២៤ |
| # |
| # [CONTENT] |
| # អត្ថបទ... |
| # អត្ថបទ... |
| ``` |
|
|
| ### Verbose mode: print each region as it is processed |
|
|
| ```python |
| result = ocr("document.jpg", verbose=True) |
| |
| # [subject] (10,5)→(320,40) conf=0.91 → 'កម្មវត្ថុ: ...' |
| # [reference] (10,50)→(200,75) conf=0.87 → 'លេខ: ...០០១/២៤' |
| # [content] (10,90)→(600,120) conf=0.93 → 'អត្ថបទ...' |
| ``` |
|
|
| ### Get cropped images alongside text |
|
|
| ```python |
| result = ocr("document.jpg", return_crops=True) |
| |
| for region in result["regions"]: |
|     print(region["class"], "→", region["text"]) |
|     region["crop"].show()  # PIL Image of the cropped region |
| ``` |
|
|
| ### Batch processing |
|
|
| ```python |
| import os |
| |
| folder = "path/to/documents/" |
| all_results = {} |
| |
| for fname in os.listdir(folder): |
|     if fname.lower().endswith((".jpg", ".jpeg", ".png")): |
|         path = os.path.join(folder, fname) |
|         result = ocr(path) |
|         all_results[fname] = { |
|             "subject": result["subject"], |
|             "reference": result["reference"], |
|             "content": result["content"], |
|         } |
|         print(f"✅ {fname} → {len(result['regions'])} regions detected") |
| ``` |
|
|
| ### Export to JSON |
|
|
| ```python |
| import json |
| |
| result = ocr("document.jpg") |
| |
| # Remove PIL crops before serialising (not JSON-serialisable) |
| exportable = { |
|     "subject": result["subject"], |
|     "reference": result["reference"], |
|     "content": result["content"], |
|     "regions": [ |
|         {k: v for k, v in r.items() if k != "crop"} |
|         for r in result["regions"] |
|     ], |
| } |
| |
| with open("output.json", "w", encoding="utf-8") as f: |
|     json.dump(exportable, f, ensure_ascii=False, indent=2) |
| ``` |
|
|
| --- |
|
|
| ## Configuration |
|
|
| ```python |
| ocr = MiniKhOCR( |
|     det_conf=0.25,   # lower → more detections, higher → fewer but more confident |
|     det_iou=0.45,    # NMS IoU threshold |
|     det_imgsz=640,   # detection image size |
|     device="auto",   # "auto" | "cuda" | "cpu" |
| ) |
| ``` |
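
One way to explore the `det_conf` trade-off without re-running detection is to run once with a low threshold and filter the result afterwards. The helper below is a sketch under that assumption. `filter_by_conf` is a hypothetical name, not part of the pipeline, and it relies only on the documented `regions` fields (`class`, `conf`, `text`). The sample data is hand-written.

```python
CLASSES = ("subject", "reference", "content")


def filter_by_conf(result, min_conf):
    """Keep regions with conf >= min_conf and rebuild the per-class text lists."""
    kept = [r for r in result["regions"] if r["conf"] >= min_conf]
    filtered = {
        name: [r["text"] for r in kept if r["class"] == name] for name in CLASSES
    }
    filtered["regions"] = kept  # original order (top to bottom) is preserved
    return filtered


# Hand-written sample standing in for a run with a low det_conf
sample = {
    "regions": [
        {"class": "subject", "conf": 0.91, "text": "title"},
        {"class": "reference", "conf": 0.30, "text": "ref"},
        {"class": "content", "conf": 0.93, "text": "body"},
    ]
}

strict = filter_by_conf(sample, min_conf=0.5)
print(strict["subject"], strict["reference"], strict["content"])
```

Filtering after the fact approximates a higher `det_conf`, but it cannot recover detections that the original run already discarded.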
|
|
| --- |
|
|
| ## Limitations |
|
|
| - Designed for **document-style images** (printed text, clear layout). |
| - Text recognition works best on **single-line crops**. Very tall content regions that span multiple lines may have their lines merged together. |
| - Handwritten text is not supported. |
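
Since recognition expects single-line crops, it can help to flag regions that look taller than one line before trusting their text. The check below is a rough heuristic sketched for illustration, not something the pipeline does. `looks_multiline` and the `max_line_height` default are assumptions that would need tuning against real documents and image resolutions.

```python
def box_height(box):
    """Height in pixels of a box given as {"x1", "y1", "x2", "y2"}."""
    return box["y2"] - box["y1"]


def looks_multiline(region, max_line_height=60):
    """Flag a region whose box is taller than an assumed single-line height."""
    return box_height(region["box"]) > max_line_height


# Hand-written sample regions in the documented format
regions = [
    {"class": "subject", "box": {"x1": 10, "y1": 5, "x2": 320, "y2": 40}, "text": "..."},
    {"class": "content", "box": {"x1": 10, "y1": 90, "x2": 600, "y2": 260}, "text": "..."},
]

flagged = [r["class"] for r in regions if looks_multiline(r)]
print(flagged)  # ['content']
```

Flagged regions could be reviewed manually or split into line-sized crops before recognition. The fixed pixel threshold is the crudest possible choice; a threshold relative to the typical region height on the page may work better.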
|
|
| --- |
|
|
| ## License |
|
|
| MIT |
|
|