---
language:
- km
- en
tags:
- ocr
- crnn
- ctc
- khmer
- text-recognition
- pytorch
license: mit
---
# mini-ocr: Khmer & English Text Recognition
A lightweight CRNN (CNN + Bi-LSTM) model trained to recognise **Khmer and English text** from image crops.
A CTC head lets it transcribe variable-length text without character-level segmentation.
---
## Model Architecture
| Component | Details |
|-----------|---------|
| CNN backbone | 6 × Conv-BN-ReLU blocks with MaxPool |
| Recurrent | 2 × Bi-LSTM (hidden = 256) with a linear bridge |
| Output | CTC linear → `NUM_CHARS + 1` (blank = 0) |
| Input | Greyscale image, height normalised to **32 px**, width variable |
| Vocabulary | 222 characters: lowercase/uppercase Latin, digits, Khmer consonants, vowels, diacritics, punctuation |
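
The table does not give channel widths, kernel sizes, or pooling positions, so the following is only a rough sketch of how such a CRNN could be wired up. The class name `CRNNSketch`, the helper `conv_block`, and every channel and pooling choice are assumptions made for illustration; this is not the class that matches `model.pt`, and it will not load that `state_dict`.

```python
import torch
import torch.nn as nn

NUM_CHARS = 222  # vocabulary size from the table above


def conv_block(c_in, c_out, pool=None):
    """One Conv-BN-ReLU block, optionally followed by MaxPool."""
    layers = [
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    ]
    if pool is not None:
        layers.append(nn.MaxPool2d(kernel_size=pool, stride=pool))
    return nn.Sequential(*layers)


class CRNNSketch(nn.Module):
    """Hypothetical CRNN; widths and pooling layout are assumed, not documented."""

    def __init__(self, num_chars=NUM_CHARS, hidden=256):
        super().__init__()
        # Six Conv-BN-ReLU blocks; pooling reduces the height from 32 to 1
        self.cnn = nn.Sequential(
            conv_block(1, 64, pool=(2, 2)),     # H 32 -> 16, W -> W/2
            conv_block(64, 128, pool=(2, 2)),   # H 16 -> 8,  W -> W/4
            conv_block(128, 256),
            conv_block(256, 256, pool=(2, 1)),  # H 8 -> 4
            conv_block(256, 512),
            conv_block(512, 512, pool=(4, 1)),  # H 4 -> 1
        )
        # Two Bi-LSTM layers joined by a linear bridge
        self.rnn1 = nn.LSTM(512, hidden, bidirectional=True)
        self.bridge = nn.Linear(2 * hidden, hidden)
        self.rnn2 = nn.LSTM(hidden, hidden, bidirectional=True)
        # CTC head: one extra class for the blank at index 0
        self.head = nn.Linear(2 * hidden, num_chars + 1)

    def forward(self, x):                      # x: (B, 1, 32, W)
        f = self.cnn(x)                        # (B, 512, 1, W/4)
        f = f.squeeze(2).permute(2, 0, 1)      # (T, B, 512), T = W/4
        f, _ = self.rnn1(f)
        f = self.bridge(f)
        f, _ = self.rnn2(f)
        return self.head(f)                    # (T, B, num_chars + 1)
```

The output layout `(T, B, classes)` is chosen to match the indexing used by `ctc_decode` in the Quick Start code.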
---
## Files
| File | Description |
|------|-------------|
| `model.pt` | `state_dict`; load with the class definition below |
| `model_scripted.pt` | TorchScript version; no class definition needed |
| `vocab.txt` | One character per line, index = line number (1-based) |
---
## Quick Start
### Install dependencies
```bash
pip install torch torchvision pillow numpy huggingface_hub
```
```python
import torch
import numpy as np
from PIL import Image
from huggingface_hub import hf_hub_download
TOKENS = (
"abcdefghijklmnopqrstuvwxyz"
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"0123456789"
"แแแแแแ
แแแแแแแแแแแแแแแแแแแแแแแแแแแ แกแขแฃแคแฅแฆแงแฉแชแซแฌแญแฎแฏแฐแฑแฒแณ"
"แถแทแธแนแบแปแผแฝแพแฟแแแแแแ
แแแแแแแแแแแแแแแแแแแแ"
"แ แกแขแฃแคแฅแฆแงแจแฉแณ"
"!@#$%^&*()-_=+[]{};:'\",.<>?/|\\ "
)
idx2char = {i + 1: c for i, c in enumerate(TOKENS)}
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
scripted_path = hf_hub_download(repo_id="phonsobon/mini-ocr", filename="model_scripted.pt")
model = torch.jit.load(scripted_path, map_location=device)
model.eval()
def load_image(path):
    # Greyscale, height 32 px, width scaled to keep the aspect ratio
    img = Image.open(path).convert("L")
    w, h = img.size
    img = img.resize((max(1, int(w / h * 32)), 32))
    img = np.array(img, dtype=np.float32) / 255.0
    # Add batch and channel dimensions: (1, 1, 32, W)
    return torch.tensor(img).unsqueeze(0).unsqueeze(0)


def ctc_decode(logits):
    # Greedy CTC decoding of the first batch item (logits indexed as T, B, C)
    preds = torch.argmax(logits, dim=2)[:, 0].cpu().tolist()
    prev, text = -1, []
    for p in preds:
        # Collapse repeated indices and drop the blank (index 0)
        if p != prev and p != 0:
            text.append(idx2char.get(p, ""))
        prev = p
    return "".join(text)


img = load_image("your_image.png").to(device)
with torch.no_grad():
    result = ctc_decode(model(img))
print("OCR result:", result)
```
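
Since the repository also ships `vocab.txt` (one character per line, 1-based index), the index-to-character table could be built from that file instead of an inline string. The helper below is a minimal sketch: `load_vocab` is a hypothetical name, and it assumes the file is UTF-8 and that its line order matches the `TOKENS` order used above.

```python
from pathlib import Path


def load_vocab(path):
    """Map 1-based line numbers of a vocab file to characters (0 stays the CTC blank)."""
    # Split on "\n" only, so a line holding a single space survives intact
    lines = Path(path).read_text(encoding="utf-8").split("\n")
    # Tolerate Windows line endings
    lines = [line.rstrip("\r") for line in lines]
    # A trailing newline at end of file produces one empty final entry; drop it
    if lines and lines[-1] == "":
        lines.pop()
    return {i: ch for i, ch in enumerate(lines, start=1)}
```

With the Quick Start imports in place, `idx2char = load_vocab(hf_hub_download(repo_id="phonsobon/mini-ocr", filename="vocab.txt"))` would then stand in for the inline table, provided that assumption about ordering holds.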
---
## Input Format
- **Single text-line image** (word, phrase, or a short line of text)
- Converted to **greyscale** internally
- Height resized to **32 px**; width scales proportionally
- Values normalised to `[0, 1]`
For full-document OCR, first crop individual text lines, then pass each crop to the model.
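
One simple way to obtain line crops is a horizontal projection profile: rows that contain ink are grouped into runs, and each run becomes one crop. The sketch below is not part of this repository. `split_lines`, `ink_threshold`, and `min_height` are hypothetical names, and it assumes a clean, deskewed page with dark text on a light background, already loaded as a 2-D float array in `[0, 1]`.

```python
import numpy as np


def split_lines(page, ink_threshold=0.5, min_height=4):
    """Return (top, bottom) row ranges of text lines in a greyscale page array."""
    # A row "has ink" if any pixel in it is darker than the threshold
    ink_rows = (page < ink_threshold).any(axis=1)
    spans, start = [], None
    for y, has_ink in enumerate(ink_rows):
        if has_ink and start is None:
            start = y                      # a new line begins
        elif not has_ink and start is not None:
            if y - start >= min_height:    # ignore specks thinner than min_height
                spans.append((start, y))
            start = None
    # Close a line that runs to the bottom edge of the page
    if start is not None and len(ink_rows) - start >= min_height:
        spans.append((start, len(ink_rows)))
    return spans
```

Each `(top, bottom)` pair can be used as `page[top:bottom, :]` to produce a crop, which then goes through the same height-32 resize as any other input. Skewed scans or touching lines would need a proper layout-analysis step instead.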
---
## Training Details
| Setting | Value |
|---------|-------|
| Epochs | 50 |
| Optimizer | Adam, lr = 1e-4 |
| Loss | CTC (`blank = 0`, `zero_infinity = True`) |
| Image height | 32 px |
| Dataset | Synthetic: rendered from a vocabulary text file across multiple fonts with noise augmentation (Gaussian, salt-and-pepper, blur, JPEG compression) |
| Train / Valid / Test split | 80 / 10 / 10 |
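
To illustrate how the optimizer and loss settings in the table fit together, here is a minimal sketch of a single training step. It is not the original training script: the `nn.Linear` layer is a placeholder standing in for the CRNN, and the tensor sizes, features, and labels are dummy values chosen only so the snippet runs.

```python
import torch
import torch.nn as nn

T, B, C = 25, 2, 223  # time steps, batch size, NUM_CHARS + 1 (illustrative sizes)

model = nn.Linear(16, C)  # placeholder for the CRNN, assumption for this sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CTCLoss(blank=0, zero_infinity=True)

features = torch.randn(T, B, 16)                 # dummy per-time-step features
targets = torch.tensor([5, 9, 9, 3, 7, 1, 2])    # labels of both samples, concatenated
target_lengths = torch.tensor([4, 3])            # sample 0 has 4 labels, sample 1 has 3
input_lengths = torch.full((B,), T, dtype=torch.long)

# CTCLoss expects log-probabilities shaped (T, B, C); labels must never use index 0
log_probs = model(features).log_softmax(dim=2)
loss = criterion(log_probs, targets, input_lengths, target_lengths)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

`zero_infinity=True` replaces infinite losses, which arise when a target is too long for the available time steps, with zero so that one bad sample does not derail the update.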
---
## Limitations
- Designed for **single text-line crops**, not full documents or paragraphs.
- Performance may degrade on handwritten text (trained on synthetic rendered images).
- Very small fonts (< 10 px rendered height) may produce errors.
---
## License
MIT