---
language:
- km
- en
tags:
- ocr
- text-detection
- text-recognition
- khmer
- yolo
- crnn
- ctc
- pipeline
- pytorch
license: mit
---
# mini-kh-OCR: Khmer & English Document OCR Pipeline
An end-to-end OCR pipeline that combines two models to **detect**, **classify**, and **recognise** Khmer and English text from document images.
```
         Input Image
              │
              ▼
┌─────────────────────────────┐
│  Text Detection             │  phonsobon/mini-text-detection (YOLO11n)
│  → subject / reference /    │
│    content bounding boxes   │
└─────────────┬───────────────┘
              │  crop each region
              ▼
┌─────────────────────────────┐
│  Text Recognition           │  phonsobon/mini-ocr (CRNN + CTC)
│  → Khmer & English text     │
└─────────────┬───────────────┘
              │
              ▼
      Structured output
      grouped by class
```
---
## Detection Classes
| ID | Class | Khmer | Description |
|----|-------|-------|-------------|
| `0` | `subject` | កម្មវត្ថុ | Title or subject heading |
| `1` | `reference` | យោង | Reference or citation |
| `2` | `content` | អត្ថបទ | Main body / paragraph text |
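
When working with raw detector output, the class IDs in the table can be mirrored in code. The sketch below is illustrative only: `CLASS_NAMES` and `group_by_class` are hypothetical helpers that are not part of `mini_kh_ocr.py`, and the `(class_id, text)` input format is an assumption made for the example.

```python
# Hypothetical mapping that mirrors the Detection Classes table.
CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}


def group_by_class(detections):
    """Group (class_id, text) pairs into a dict keyed by class name."""
    grouped = {name: [] for name in CLASS_NAMES.values()}
    for class_id, text in detections:
        grouped[CLASS_NAMES[class_id]].append(text)
    return grouped


# Assumed input format: one (class_id, recognised_text) pair per region.
sample = [(0, "Leave request"), (2, "First paragraph"), (2, "Second paragraph")]
grouped = group_by_class(sample)
print(grouped)
```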
---
## Models Used
| Role | Repository |
|------|-----------|
| Text Detection | [phonsobon/mini-text-detection](https://huggingface.co/phonsobon/mini-text-detection) |
| Text Recognition | [phonsobon/mini-ocr](https://huggingface.co/phonsobon/mini-ocr) |
---
## Files
| File | Description |
|------|-------------|
| `mini_kh_ocr.py` | Pipeline class; download this file and import `MiniKhOCR` from it |
---
## Installation
```bash
pip install torch torchvision ultralytics huggingface_hub pillow numpy
```
---
## Quick Start
```python
import importlib.util

from huggingface_hub import hf_hub_download

# Download the pipeline script from the Hub
pipeline_path = hf_hub_download(
    repo_id="phonsobon/mini-kh-OCR",
    filename="mini_kh_ocr.py",
)

# Import the downloaded file as a module
spec = importlib.util.spec_from_file_location("mini_kh_ocr", pipeline_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
MiniKhOCR = mod.MiniKhOCR

# Load the pipeline
ocr = MiniKhOCR()

# Run on an image
result = ocr("your_document.jpg")
```
---
## Output Format
`result` is a dictionary with the following structure:
```python
{
    "subject": ["កម្មវត្ថុ: សំណើរសុំច្បាប់"],  # កម្មវត្ថុ: subject/heading texts
    "reference": ["យោង: លេខ ០០១/២៤"],  # យោង: reference texts
    "content": ["អត្ថបទ...", "..."],  # អត្ថបទ: body paragraph texts
    "regions": [  # all detections sorted top → bottom
        {
            "class": "subject",
            "conf": 0.91,
            "box": {"x1": 10, "y1": 5, "x2": 320, "y2": 40},
            "text": "កម្មវត្ថុ: សំណើរសុំច្បាប់",
        },
        {
            "class": "reference",
            "conf": 0.87,
            "box": {"x1": 10, "y1": 50, "x2": 200, "y2": 75},
            "text": "យោង: លេខ ០០១/២៤",
        },
        ...
    ]
}
```
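Because `regions` carries the class, confidence, and box of every detection, the per-class lists can be rebuilt from it, for example after discarding low-confidence regions. The following is a minimal sketch that operates on a plain dictionary shaped like the one above. `filter_regions` and its `min_conf` parameter are hypothetical names and are not part of the pipeline.

```python
def filter_regions(result, min_conf=0.5):
    """Return a new result dict keeping only regions with conf >= min_conf.

    Surviving regions are sorted top to bottom by the y1 coordinate of
    their box, and the per-class text lists are rebuilt from them.
    """
    kept = sorted(
        (r for r in result["regions"] if r["conf"] >= min_conf),
        key=lambda r: r["box"]["y1"],
    )
    filtered = {"subject": [], "reference": [], "content": [], "regions": kept}
    for region in kept:
        filtered[region["class"]].append(region["text"])
    return filtered


# Hand-written stand-in for a pipeline result (values are illustrative only).
example = {
    "subject": ["Title"], "reference": ["Ref 001"], "content": [],
    "regions": [
        {"class": "reference", "conf": 0.40,
         "box": {"x1": 10, "y1": 50, "x2": 200, "y2": 75}, "text": "Ref 001"},
        {"class": "subject", "conf": 0.91,
         "box": {"x1": 10, "y1": 5, "x2": 320, "y2": 40}, "text": "Title"},
    ],
}
strict = filter_regions(example, min_conf=0.5)
print(strict["subject"], strict["reference"])
```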
---
## Usage Examples
### Access text by class
```python
result = ocr("document.jpg")

print("=== SUBJECT ===")
for text in result["subject"]:
    print(text)

print("=== REFERENCE ===")
for text in result["reference"]:
    print(text)

print("=== CONTENT ===")
for text in result["content"]:
    print(text)
```
### Format as a structured document
```python
document = ocr.to_document(result)
print(document)
# Output:
# [SUBJECT]
# កម្មវត្ថុ: សំណើរសុំច្បាប់
#
# [REFERENCE]
# យោង: លេខ ០០១/២៤
#
# [CONTENT]
# អត្ថបទដំបូង
# អត្ថបទទីពីរ
```
### Verbose mode: print each region as it is processed
```python
result = ocr("document.jpg", verbose=True)
# [subject] (10,5)→(320,40) conf=0.91 → 'កម្មវត្ថុ: សំណើរសុំច្បាប់'
# [reference] (10,50)→(200,75) conf=0.87 → 'យោង: លេខ ០០១/២៤'
# [content] (10,90)→(600,120) conf=0.93 → 'អត្ថបទដំបូង'
```
### Get cropped images alongside text
```python
result = ocr("document.jpg", return_crops=True)
for region in result["regions"]:
    print(region["class"], "→", region["text"])
    region["crop"].show()  # PIL Image of the cropped region
```
### Batch processing
```python
import os

folder = "path/to/documents/"
all_results = {}

for fname in os.listdir(folder):
    if fname.lower().endswith((".jpg", ".jpeg", ".png")):
        path = os.path.join(folder, fname)
        result = ocr(path)
        all_results[fname] = {
            "subject": result["subject"],
            "reference": result["reference"],
            "content": result["content"],
        }
        print(f"✅ {fname}: {len(result['regions'])} regions detected")
```
### Export to JSON
```python
import json

result = ocr("document.jpg")

# Remove PIL crops before serialising (not JSON-serialisable)
exportable = {
    "subject": result["subject"],
    "reference": result["reference"],
    "content": result["content"],
    "regions": [
        {k: v for k, v in r.items() if k != "crop"}
        for r in result["regions"]
    ],
}

with open("output.json", "w", encoding="utf-8") as f:
    json.dump(exportable, f, ensure_ascii=False, indent=2)
```
---
## Configuration
```python
ocr = MiniKhOCR(
    det_conf=0.25,   # lower → more detections, higher → fewer but more confident
    det_iou=0.45,    # NMS IoU threshold
    det_imgsz=640,   # detection image size
    device="auto",   # "auto" | "cuda" | "cpu"
)
```
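`det_iou` is a threshold on the intersection-over-union (IoU) between overlapping boxes during non-maximum suppression. As a rough illustration of that quantity, the sketch below computes IoU for two boxes in the `x1`/`y1`/`x2`/`y2` format shown under Output Format. `box_iou` is a hypothetical standalone helper, not code taken from the pipeline or the detector.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as x1/y1/x2/y2 dicts."""
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0, min(a["x2"], b["x2"]) - max(a["x1"], b["x1"]))
    inter_h = max(0, min(a["y2"], b["y2"]) - max(a["y1"], b["y1"]))
    inter = inter_w * inter_h

    area_a = (a["x2"] - a["x1"]) * (a["y2"] - a["y1"])
    area_b = (b["x2"] - b["x1"]) * (b["y2"] - b["y1"])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


box_a = {"x1": 0, "y1": 0, "x2": 100, "y2": 40}
box_b = {"x1": 50, "y1": 0, "x2": 150, "y2": 40}

iou = box_iou(box_a, box_b)
# In standard NMS, the lower-scoring box is suppressed only when the IoU
# exceeds the threshold, so these two boxes would both survive at 0.45.
print(round(iou, 3), iou > 0.45)
```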
---
## Limitations
- Designed for **document-style images** (printed text, clear layout).
- Text recognition works best on **single-line crops**; very tall content regions that span multiple lines may have their lines merged together.
- Handwritten text is not supported.
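
One possible workaround for the multi-line limitation is to split a tall crop into single-line strips before recognition. The sketch below finds line boundaries with a horizontal projection profile, that is, by checking each pixel row for dark pixels. It is an illustration under stated assumptions and not part of the pipeline: `find_text_lines`, `ink_threshold`, and `min_height` are hypothetical names, and the input is assumed to be a grayscale image (dark text on a light background) given as rows of 0-255 values, such as a nested list or a 2D NumPy array. Stacked Khmer glyphs may need different thresholds, so treat the defaults as placeholders.

```python
def find_text_lines(gray_rows, ink_threshold=128, min_height=2):
    """Return (top, bottom) row ranges that contain dark pixels.

    gray_rows: 2D sequence of grayscale values (0 = black, 255 = white).
    A row counts as text if at least one pixel is darker than ink_threshold.
    `bottom` is exclusive, so gray_rows[top:bottom] is one line strip.
    """
    lines, start = [], None
    for y, row in enumerate(gray_rows):
        has_ink = any(pixel < ink_threshold for pixel in row)
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            if y - start >= min_height:    # ignore specks of noise
                lines.append((start, y))
            start = None
    # Close a line that runs to the bottom edge of the image.
    if start is not None and len(gray_rows) - start >= min_height:
        lines.append((start, len(gray_rows)))
    return lines


# Tiny synthetic "image": two dark bands separated by blank rows.
WHITE, BLACK = [255] * 6, [255, 0, 0, 0, 0, 255]
image = [WHITE, BLACK, BLACK, WHITE, WHITE, BLACK, BLACK, BLACK, WHITE]
line_ranges = find_text_lines(image)
print(line_ranges)  # [(1, 3), (5, 8)]
```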
---
## License
MIT