---
language:
  - km
  - en
tags:
  - ocr
  - crnn
  - ctc
  - khmer
  - text-recognition
  - pytorch
license: mit
---

# mini-ocr — Khmer & English Text Recognition

A lightweight CRNN (CNN + Bi-LSTM) model trained to recognise **Khmer and English text** from image crops.  
A CTC output head lets it transcribe variable-length text without character-level segmentation.
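
To illustrate why no segmentation is needed, here is a minimal sketch of the greedy CTC collapse rule: the model emits one class index per time step, repeated indices are merged, and blanks (index 0) are dropped. The function name `ctc_collapse` and the example frame sequence are made up for illustration; only the blank index of 0 comes from this card.

```python
def ctc_collapse(indices, blank=0):
    """Merge consecutive repeats, then drop blank symbols."""
    out, prev = [], None
    for i in indices:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank class.
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out


# Hypothetical per-frame predictions, with a=1, b=2, ... as class indices.
idx2char = {i + 1: c for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
frames = [0, 8, 8, 0, 5, 12, 12, 0, 12, 15, 15]
print("".join(idx2char[i] for i in ctc_collapse(frames)))  # prints "hello"
```

The blank between the two runs of `12` is what allows a doubled letter to survive the merge step.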

---

## Model Architecture

| Component | Details |
|-----------|---------|
| CNN backbone | 6 × Conv-BN-ReLU blocks with MaxPool |
| Recurrent | 2 × Bi-LSTM (hidden = 256) with a linear bridge |
| Output | CTC linear → `NUM_CHARS + 1` (blank = 0) |
| Input | Greyscale image, height normalised to **32 px**, width variable |
| Vocabulary | 222 characters — lowercase/uppercase Latin, digits, Khmer consonants, vowels, diacritics, punctuation |

---

## Files

| File | Description |
|------|-------------|
| `model.pt` | `state_dict` only; loading it requires the model class definition |
| `model_scripted.pt` | TorchScript version — no class definition needed |
| `vocab.txt` | One character per line; class index = 1-based line number (index 0 is the CTC blank) |
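
As an alternative to hard-coding the character set, the index-to-character map could be built from `vocab.txt`. The sketch below assumes the file layout described in the table (UTF-8, one character per line, 1-based indices); the helper name `load_vocab` is hypothetical.

```python
from pathlib import Path


def load_vocab(path):
    """Return {class_index: character}, with indices starting at 1."""
    text = Path(path).read_text(encoding="utf-8")
    # Split on "\n" only (not str.splitlines or strip), so that a line
    # holding a single space character is preserved as a vocabulary entry.
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a trailing newline
    return {i + 1: ch for i, ch in enumerate(lines)}
```

Index 0 is deliberately absent from the returned map because it is reserved for the CTC blank.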

---

## Quick Start

### Install dependencies

```bash
pip install torch torchvision pillow numpy huggingface_hub
```

### Run inference

```python
import torch
import numpy as np
from PIL import Image
from huggingface_hub import hf_hub_download

TOKENS = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "កខគឃងចឆជឈញដឋឌឍណតថទធនបផពភមយរលវឝឞសហឡអឣឤឥឦឧឩឪឫឬឭឮឯឰឱឲឳ"
    "ាិីឹឺុូួើឿៀេែៃោៅំះៈ៉៊់៌៍៎៏័៑្។៕៖ៗ៘៛៝"
    "០១២៣៤៥៦៧៨៩៳"
    "!@#$%^&*()-_=+[]{};:'\",.<>?/|\\ "
)
# Index 0 is reserved for the CTC blank, so characters start at 1.
idx2char = {i + 1: c for i, c in enumerate(TOKENS)}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

scripted_path = hf_hub_download(repo_id="phonsobon/mini-ocr", filename="model_scripted.pt")
model = torch.jit.load(scripted_path, map_location=device)
model.eval()

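def load_image(path):
    """Load an image as a (1, 1, 32, W) float tensor with values in [0, 1]."""
    img = Image.open(path).convert("L")
    w, h = img.size
    # Keep the aspect ratio; clamp so very narrow crops never get width 0.
    img = img.resize((max(1, int(w / h * 32)), 32))
    img = np.array(img, dtype=np.float32) / 255.0
    return torch.from_numpy(img).unsqueeze(0).unsqueeze(0)

def ctc_decode(logits):
    """Greedy CTC decode: merge repeated indices, then drop blanks (index 0)."""
    # The indexing treats logits as (time, batch, classes); only the first
    # item of the batch is decoded.
    preds = torch.argmax(logits, dim=2)[:, 0].tolist()
    prev, text = -1, []
    for p in preds:
        if p != prev and p != 0:
            text.append(idx2char.get(p, ""))
        prev = p
    return "".join(text)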

img = load_image("your_image.png").to(device)
with torch.no_grad():
    result = ctc_decode(model(img))
print("OCR result:", result)
```

---

## Input Format

- **Single text-line image** (word, phrase, or a short line of text)
- Converted to **greyscale** internally
- Height resized to **32 px**; width scales proportionally
- Values normalised to `[0, 1]`
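
The resize and normalisation rules above can be sketched in plain Python. The helper names `target_size` and `normalise` are hypothetical, and rounding to the nearest pixel with a minimum width of 1 is a choice of this sketch rather than something the card specifies.

```python
def target_size(width, height, target_height=32):
    """Return (new_width, new_height) keeping the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Scale the width by the same factor as the height; never go below 1 px.
    new_width = max(1, round(width * target_height / height))
    return new_width, target_height


def normalise(pixels):
    """Map 8-bit greyscale rows (0..255) to floats in [0, 1]."""
    return [[p / 255.0 for p in row] for row in pixels]


print(target_size(200, 64))  # prints (100, 32)
```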

For full-document OCR, first crop individual text lines, then pass each crop to the model.
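
One simple way to find line crops on a clean, horizontally aligned page is a horizontal projection profile: mark each pixel row that contains ink, then group consecutive marked rows into bands. This is not part of the model. The sketch below assumes dark text on a light background with values already in `[0, 1]`; the function name `find_text_lines` and both thresholds are hypothetical.

```python
def find_text_lines(image, ink_threshold=0.5, min_height=2):
    """Return (top, bottom) row ranges, bottom exclusive, for each text band.

    image is a list of rows, each a list of floats in [0, 1] (0 = black).
    """
    rows_with_ink = [any(px < ink_threshold for px in row) for row in image]
    lines, start = [], None
    for y, has_ink in enumerate(rows_with_ink):
        if has_ink and start is None:
            start = y                      # a new band begins
        elif not has_ink and start is not None:
            if y - start >= min_height:    # ignore specks thinner than min_height
                lines.append((start, y))
            start = None
    # Close a band that runs to the bottom edge of the page.
    if start is not None and len(image) - start >= min_height:
        lines.append((start, len(image)))
    return lines
```

Each `(top, bottom)` pair can then be used to crop `image[top:bottom]` before resizing to 32 px. Skewed scans or tightly spaced lines would likely need a more robust layout-analysis step.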

---

## Training Details

| Setting | Value |
|---------|-------|
| Epochs | 50 |
| Optimizer | Adam, lr = 1e-4 |
| Loss | CTC (`blank = 0`, `zero_infinity = True`) |
| Image height | 32 px |
| Dataset | Synthetic — rendered from a vocabulary text file across multiple fonts with noise augmentation (Gaussian, salt-and-pepper, blur, JPEG compression) |
| Train / Valid / Test split | 80 / 10 / 10 |
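
As an illustration of one of the augmentations listed above, here is a minimal pure-Python sketch of salt-and-pepper noise on a greyscale image with values in `[0, 1]`. The function name, the `amount` parameter, and its default are assumptions; the noise levels actually used in training are not stated in this card.

```python
import random


def salt_and_pepper(image, amount=0.02, seed=None):
    """Return a copy of image with roughly `amount` of pixels set to 0 or 1."""
    rng = random.Random(seed)
    noisy = [row[:] for row in image]  # copy rows so the input is not modified
    for row in noisy:
        for x in range(len(row)):
            r = rng.random()
            if r < amount / 2:
                row[x] = 0.0   # pepper: black pixel
            elif r < amount:
                row[x] = 1.0   # salt: white pixel
    return noisy
```

Passing a fixed `seed` makes the corruption reproducible, which helps when comparing runs.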

---

## Limitations

- Designed for **single text-line crops**, not full documents or paragraphs.
- Accuracy may degrade on handwritten text, because the model was trained on synthetic rendered images.
- Very small fonts (< 10 px rendered height) may be recognised incorrectly.

---

## License

MIT