---
language:
  - km
  - en
tags:
  - ocr
  - text-detection
  - text-recognition
  - khmer
  - yolo
  - crnn
  - ctc
  - pipeline
  - pytorch
license: mit
---

# mini-kh-OCR — Khmer & English Document OCR Pipeline

An end-to-end OCR pipeline that chains a YOLO11n detector and a CRNN + CTC recogniser to **detect** text regions, **classify** each one as subject, reference, or content, and **recognise** the Khmer and English text inside it.

```
Input Image
    │
    ▼
┌─────────────────────────────┐
│  Text Detection             │  phonsobon/mini-text-detection (YOLO11n)
│  → subject / reference /   │
│    content bounding boxes   │
└─────────────┬───────────────┘
              │  crop each region
              ▼
┌─────────────────────────────┐
│  Text Recognition           │  phonsobon/mini-ocr (CRNN + CTC)
│  → Khmer & English text     │
└─────────────┬───────────────┘
              │
              ▼
     Structured output
     grouped by class
```
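
The flow in the diagram can be sketched in a few lines. This is an illustration only, not the code in `mini_kh_ocr.py`: `run_pipeline` and the `detect`, `crop`, and `recognise` callables are hypothetical names, and the left-to-right tie-break in the sort is an assumption. Stand-in stages replace the two models so the sketch runs without any weights.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels

# Class IDs as listed in the Detection Classes table.
CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}


def run_pipeline(
    image: object,
    detect: Callable[[object], List[Tuple[int, float, Box]]],
    crop: Callable[[object, Box], object],
    recognise: Callable[[object], str],
) -> Dict[str, list]:
    """Detect regions, crop each one, recognise it, and group text by class."""
    result: Dict[str, list] = {name: [] for name in CLASS_NAMES.values()}
    result["regions"] = []

    # Sort top to bottom by y1; ties broken left to right by x1 (assumption).
    detections = sorted(detect(image), key=lambda d: (d[2][1], d[2][0]))

    for class_id, conf, box in detections:
        name = CLASS_NAMES[class_id]
        text = recognise(crop(image, box))
        result[name].append(text)
        x1, y1, x2, y2 = box
        result["regions"].append({
            "class": name,
            "conf": conf,
            "box": {"x1": x1, "y1": y1, "x2": x2, "y2": y2},
            "text": text,
        })
    return result


# Hypothetical stand-ins for the detector, the cropper, and the recogniser.
def fake_detect(image: object) -> List[Tuple[int, float, Box]]:
    return [(2, 0.93, (10, 90, 600, 120)), (0, 0.91, (10, 5, 320, 40))]


def fake_crop(image: object, box: Box) -> Box:
    return box


def fake_recognise(region: Box) -> str:
    return f"text@{region[1]}"


demo = run_pipeline(None, fake_detect, fake_crop, fake_recognise)
print([r["class"] for r in demo["regions"]])  # ['subject', 'content']
```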

---

## Detection Classes

| ID | Class | Khmer | Description |
|----|-------|-------|-------------|
| `0` | `subject` | កម្មវត្ថុ | Title or subject heading |
| `1` | `reference` | យោង | Reference or citation |
| `2` | `content` | អត្ថបទ | Main body / paragraph text |

---

## Models Used

| Role | Repository |
|------|-----------|
| Text Detection | [phonsobon/mini-text-detection](https://huggingface.co/phonsobon/mini-text-detection) |
| Text Recognition | [phonsobon/mini-ocr](https://huggingface.co/phonsobon/mini-ocr) |
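
The recognition model is described as CRNN + CTC. How `phonsobon/mini-ocr` decodes its output is not documented here, so the following is only a generic sketch of greedy CTC decoding: take the most likely label per time step, collapse consecutive repeats, then drop the blank label. The function name, the toy character set, and the blank index of `0` are all assumptions made for illustration.

```python
from typing import List, Sequence


def ctc_greedy_decode(
    frame_ids: Sequence[int],
    charset: Sequence[str],
    blank: int = 0,
) -> str:
    """Collapse consecutive repeated labels, then remove blanks."""
    chars: List[str] = []
    previous = blank
    for idx in frame_ids:
        # Emit a character only when the label changes and is not the blank.
        if idx != previous and idx != blank:
            chars.append(charset[idx])
        previous = idx
    return "".join(chars)


# Toy character set (hypothetical); index 0 is reserved for the CTC blank.
charset = ["<blank>", "ក", "ម", "a", "b"]

# Per-frame argmax labels; the blank between the two 1s keeps both characters.
decoded = ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 0, 3, 3, 4], charset)
print(decoded)  # កកមab
```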

---

## Files

| File | Description |
|------|-------------|
| `mini_kh_ocr.py` | Defines the `MiniKhOCR` pipeline class; download this file and import it |

---

## Installation

```bash
pip install torch torchvision ultralytics huggingface_hub pillow numpy
```

---

## Quick Start

```python
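import importlib.util
import sys

from huggingface_hub import hf_hub_download

# Download the pipeline script from the Hub (cached locally after the first call)
pipeline_path = hf_hub_download(
    repo_id="phonsobon/mini-kh-OCR",
    filename="mini_kh_ocr.py",
)

# Import the downloaded file as a module named "mini_kh_ocr"
spec = importlib.util.spec_from_file_location("mini_kh_ocr", pipeline_path)
mod = importlib.util.module_from_spec(spec)
sys.modules["mini_kh_ocr"] = mod  # register before executing, as in the importlib recipe
spec.loader.exec_module(mod)

MiniKhOCR = mod.MiniKhOCR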

# ── Load pipeline ─────────────────────────────────────────────────────────────
ocr = MiniKhOCR()

# ── Run on an image ───────────────────────────────────────────────────────────
result = ocr("your_document.jpg")
```

---

## Output Format

`result` is a dictionary with the following structure:

```python
{
    "subject":   ["កម្មវត្ថុ: សំណើរសុំច្បាប់"],   # កម្មវត្ថុ — subject/heading texts
    "reference": ["យោង: លេខ ០០១/២៤"],           # យោង — reference texts
    "content":   ["អត្ថបទ...", "..."],            # អត្ថបទ — body paragraph texts

    "regions": [                                  # all detections sorted top → bottom
        {
            "class": "subject",
            "conf":  0.91,
            "box":   {"x1": 10, "y1": 5, "x2": 320, "y2": 40},
            "text":  "កម្មវត្ថុ: សំណើរសុំច្បាប់",
        },
        {
            "class": "reference",
            "conf":  0.87,
            "box":   {"x1": 10, "y1": 50, "x2": 200, "y2": 75},
            "text":  "យោង: លេខ ០០១/២៤",
        },
        ...
    ]
}
```
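
Because `regions` keeps the per-detection confidence, low-confidence text can be filtered out after the fact. The helper below is a small sketch written against the structure shown above; `confident_text` and the `0.9` cut-off are illustrative choices, not part of the pipeline.

```python
from typing import Dict, List


def confident_text(result: Dict[str, list], min_conf: float = 0.9) -> List[str]:
    """Return region texts at or above min_conf, in the order of result['regions']."""
    return [r["text"] for r in result["regions"] if r["conf"] >= min_conf]


# Sample shaped like the documented output (values copied from the example above).
sample = {
    "subject": ["កម្មវត្ថុ: សំណើរសុំច្បាប់"],
    "reference": ["យោង: លេខ ០០១/២៤"],
    "content": [],
    "regions": [
        {
            "class": "subject",
            "conf": 0.91,
            "box": {"x1": 10, "y1": 5, "x2": 320, "y2": 40},
            "text": "កម្មវត្ថុ: សំណើរសុំច្បាប់",
        },
        {
            "class": "reference",
            "conf": 0.87,
            "box": {"x1": 10, "y1": 50, "x2": 200, "y2": 75},
            "text": "យោង: លេខ ០០១/២៤",
        },
    ],
}

kept = confident_text(sample)
print(kept)  # only the subject line passes the 0.9 cut-off
```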

---

## Usage Examples

### Access text by class

```python
result = ocr("document.jpg")

print("=== SUBJECT ===")
for text in result["subject"]:
    print(text)

print("=== REFERENCE ===")
for text in result["reference"]:
    print(text)

print("=== CONTENT ===")
for text in result["content"]:
    print(text)
```

### Format as a structured document

```python
document = ocr.to_document(result)
print(document)

# Output:
# [SUBJECT]
# កម្មវត្ថុ: សំណើរសុំច្បាប់
#
# [REFERENCE]
# យោង: លេខ ០០១/២៤
#
# [CONTENT]
# អត្ថបទដំបូង
# អត្ថបទទីពីរ
```

### Verbose mode — print each region as it is processed

```python
result = ocr("document.jpg", verbose=True)

# [subject]   (10,5)→(320,40)   conf=0.91  →  'កម្មវត្ថុ: សំណើរសុំច្បាប់'
# [reference] (10,50)→(200,75)  conf=0.87  →  'យោង: លេខ ០០១/២៤'
# [content]   (10,90)→(600,120) conf=0.93  →  'អត្ថបទដំបូង'
```

### Get cropped images alongside text

```python
result = ocr("document.jpg", return_crops=True)

for region in result["regions"]:
    print(region["class"], "→", region["text"])
    region["crop"].show()   # PIL Image of the cropped region
```

### Batch processing

```python
import os

folder = "path/to/documents/"
all_results = {}

for fname in os.listdir(folder):
    if fname.lower().endswith((".jpg", ".jpeg", ".png")):
        path = os.path.join(folder, fname)
        result = ocr(path)
        all_results[fname] = {
            "subject":   result["subject"],
            "reference": result["reference"],
            "content":   result["content"],
        }
        print(f"✅ {fname} — {len(result['regions'])} regions detected")
```

### Export to JSON

```python
import json

result = ocr("document.jpg")

# Drop any PIL crops (added by return_crops=True); they are not JSON-serialisable
exportable = {
    "subject":   result["subject"],
    "reference": result["reference"],
    "content":   result["content"],
    "regions": [
        {k: v for k, v in r.items() if k != "crop"}
        for r in result["regions"]
    ],
}

with open("output.json", "w", encoding="utf-8") as f:
    json.dump(exportable, f, ensure_ascii=False, indent=2)
```

---

## Configuration

```python
ocr = MiniKhOCR(
    det_conf  = 0.25,   # lower → more detections, higher → fewer but more confident
    det_iou   = 0.45,   # NMS IoU threshold
    det_imgsz = 640,    # input image size used for detection, in pixels
    device    = "auto", # "auto" | "cuda" | "cpu"
)
```
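
To make `det_iou` concrete: non-maximum suppression compares overlapping detections by their intersection over union (IoU) and, in the usual formulation, drops the lower-confidence box when the IoU exceeds the threshold. The sketch below computes IoU for two boxes in the `box` format from the output above. `box_iou` is a hypothetical helper, and the assumption that `det_iou` is applied in this standard way is not confirmed by the pipeline source.

```python
from typing import Dict

Box = Dict[str, float]  # keys: x1, y1, x2, y2


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a["x2"], b["x2"]) - max(a["x1"], b["x1"]))
    inter_h = max(0.0, min(a["y2"], b["y2"]) - max(a["y1"], b["y1"]))
    inter = inter_w * inter_h

    area_a = (a["x2"] - a["x1"]) * (a["y2"] - a["y1"])
    area_b = (b["x2"] - b["x1"]) * (b["y2"] - b["y1"])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


first = {"x1": 0, "y1": 0, "x2": 100, "y2": 40}
second = {"x1": 50, "y1": 0, "x2": 150, "y2": 40}

iou = box_iou(first, second)
print(f"{iou:.3f}")  # 0.333
print(iou > 0.45)    # False
```

With the default `det_iou = 0.45`, these two half-overlapping boxes fall below the threshold, so standard NMS would keep both.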

---

## Limitations

- Designed for **document-style images** (printed text, clear layout).
- Text recognition works best on **single-line crops**; a tall content region that spans several lines may have its lines merged together in the recognised text.
- Handwritten text is not supported.
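
One possible workaround for multi-line regions, which is not part of this pipeline, is to split a crop into lines before recognition using a horizontal projection profile: rows that contain no ink mark the gaps between lines. The sketch below works on a binarised crop given as a list of rows of 0/1 values; `find_text_lines` is a hypothetical helper. Treat it as a heuristic, since Khmer subscript and superscript marks can narrow or bridge the gaps between lines.

```python
from typing import List, Tuple


def find_text_lines(binary: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink."""
    lines: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                  # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y))   # an empty row closes the current line
            start = None
    if start is not None:
        lines.append((start, len(binary)))  # line runs to the bottom edge
    return lines


# Tiny binarised region: 1 = ink, 0 = background.
region = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

line_ranges = find_text_lines(region)
print(line_ranges)  # [(1, 3), (4, 5)]
```

Each `(top, bottom)` range could then be cut out of a PIL crop with `crop.crop((0, top, crop.width, bottom))` and recognised separately.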

---

## License

MIT