---
language:
  - km
  - en
tags:
  - ocr
  - crnn
  - ctc
  - khmer
  - text-recognition
  - pytorch
license: mit
---

# mini-ocr: Khmer & English Text Recognition

A lightweight CRNN (CNN + Bi-LSTM) model trained to recognise **Khmer and English text** from image crops.  
It uses a CTC head, so it handles variable-length text without character-level segmentation.
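
To see why a CTC head needs no segmentation, the collapse rule can be traced on a toy sequence. The sketch below is plain Python and is not the model's decoder; `ctc_collapse` and `toy_vocab` are hypothetical names, and only the blank index of 0 comes from this card.

```python
BLANK = 0  # this card states that index 0 is the CTC blank

def ctc_collapse(frame_ids, blank=BLANK):
    """Merge consecutive repeats, then drop blanks (greedy CTC decoding)."""
    out, prev = [], None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out

# Hypothetical toy alphabet, not the model's real vocabulary.
toy_vocab = {1: "c", 2: "a", 3: "t", 4: "o"}

# Eight per-frame predictions collapse to three characters.
frames = [1, 1, 0, 2, 2, 2, 0, 3]
decoded = "".join(toy_vocab[i] for i in ctc_collapse(frames))
print(decoded)  # cat

# A blank between two identical labels preserves a doubled letter.
doubled = "".join(toy_vocab[i] for i in ctc_collapse([4, 4, 0, 4]))
print(doubled)  # oo
```

Each image column yields one prediction, so the number of frames grows with the image width while the output length is set by the collapse step alone.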

---

## Model Architecture

| Component | Details |
|-----------|---------|
| CNN backbone | 6 × Conv-BN-ReLU blocks with MaxPool |
| Recurrent | 2 × Bi-LSTM (hidden = 256) with a linear bridge |
| Output | CTC linear → `NUM_CHARS + 1` (blank = 0) |
| Input | Greyscale image, height normalised to **32 px**, width variable |
| Vocabulary | 222 characters: lowercase/uppercase Latin, digits, Khmer consonants, vowels, diacritics, punctuation |

---

## Files

| File | Description |
|------|-------------|
| `model.pt` | `state_dict`; loading it requires the model class definition |
| `model_scripted.pt` | TorchScript version; no class definition needed |
| `vocab.txt` | One character per line, index = line number (1-based) |
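
The index convention for `vocab.txt` can be sketched as follows. This assumes the file is UTF-8 with exactly one character per line, as the table describes; `load_vocab` is a hypothetical helper, and the demo writes a small stand-in file rather than reading the real one.

```python
import tempfile
from pathlib import Path

def load_vocab(path):
    """Map 1-based line numbers to characters; index 0 stays free for the CTC blank."""
    text = Path(path).read_text(encoding="utf-8")
    # Split on "\n" only, so a line holding a literal space survives intact.
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a trailing newline
    return {i: ch for i, ch in enumerate(lines, start=1)}

# Demo on a three-line stand-in file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "vocab.txt"
    demo.write_text("a\nb\n \n", encoding="utf-8")
    idx2char = load_vocab(demo)

print(idx2char)  # {1: 'a', 2: 'b', 3: ' '}
```

Avoiding `strip()` matters here because the vocabulary includes the space character, which would otherwise be read back as an empty string.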

---

## Quick Start

### Install dependencies

```bash
pip install torch torchvision pillow numpy huggingface_hub
```

### Run inference

```python
import torch
import numpy as np
from PIL import Image
from huggingface_hub import hf_hub_download

TOKENS = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "แž€แžแž‚แžƒแž„แž…แž†แž‡แžˆแž‰แžŠแž‹แžŒแžแžŽแžแžแž‘แž’แž“แž”แž•แž–แž—แž˜แž™แžšแž›แžœแžแžžแžŸแž แžกแžขแžฃแžคแžฅแžฆแžงแžฉแžชแžซแžฌแžญแžฎแžฏแžฐแžฑแžฒแžณ"
    "แžถแžทแžธแžนแžบแžปแžผแžฝแžพแžฟแŸ€แŸแŸ‚แŸƒแŸ„แŸ…แŸ†แŸ‡แŸˆแŸ‰แŸŠแŸ‹แŸŒแŸแŸŽแŸแŸแŸ‘แŸ’แŸ”แŸ•แŸ–แŸ—แŸ˜แŸ›แŸ"
    "แŸ แŸกแŸขแŸฃแŸคแŸฅแŸฆแŸงแŸจแŸฉแŸณ"
    "!@#$%^&*()-_=+[]{};:'\",.<>?/|\\ "
)
idx2char = {i + 1: c for i, c in enumerate(TOKENS)}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

scripted_path = hf_hub_download(repo_id="phonsobon/mini-ocr", filename="model_scripted.pt")
model = torch.jit.load(scripted_path, map_location=device)
model.eval()

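def load_image(path):
    # Greyscale, height 32 px, width scaled to keep the aspect ratio.
    img = Image.open(path).convert("L")
    w, h = img.size
    img = img.resize((max(1, int(w / h * 32)), 32))  # never resize to zero width
    img = np.array(img, dtype=np.float32) / 255.0  # scale pixels to [0, 1]
    return torch.tensor(img).unsqueeze(0).unsqueeze(0)  # shape (1, 1, 32, W)

def ctc_decode(logits):
    # Greedy CTC decoding; the indexing assumes logits shaped (T, batch, classes).
    preds = torch.argmax(logits, dim=2)[:, 0].tolist()
    prev, text = -1, []
    for p in preds:
        # Skip repeated labels and the blank (index 0).
        if p != prev and p != 0:
            text.append(idx2char.get(p, ""))
        prev = p
    return "".join(text)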

img = load_image("your_image.png").to(device)
with torch.no_grad():
    result = ctc_decode(model(img))
print("OCR result:", result)
```

---

## Input Format

- **Single text-line image** (word, phrase, or a short line of text)
- Converted to **greyscale** internally
- Height resized to **32 px**; width scales proportionally
- Values normalised to `[0, 1]`

For full-document OCR, first crop individual text lines, then pass each crop to the model.
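
The resizing rule above can be sketched as a small helper. `target_size` is a hypothetical name; truncating the scaled width and clamping it to at least one pixel are assumptions made here, since the card only states that width scales proportionally.

```python
def target_size(width, height, target_height=32):
    """Return (new_width, new_height) with the aspect ratio preserved."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Integer arithmetic truncates the scaled width; clamp so it is never zero.
    new_width = max(1, (width * target_height) // height)
    return new_width, target_height

print(target_size(400, 64))  # (200, 32)
print(target_size(90, 24))   # (120, 32)
print(target_size(1, 500))   # (1, 32)
```

The returned tuple is in `(width, height)` order, which is the order `PIL.Image.resize` expects.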

---

## Training Details

| Setting | Value |
|---------|-------|
| Epochs | 50 |
| Optimizer | Adam, lr = 1e-4 |
| Loss | CTC (`blank = 0`, `zero_infinity = True`) |
| Image height | 32 px |
| Dataset | Synthetic: rendered from a vocabulary text file across multiple fonts with noise augmentation (Gaussian, salt-and-pepper, blur, JPEG compression) |
| Train / Valid / Test split | 80 / 10 / 10 |
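
As an illustration of the 80 / 10 / 10 split, here is a minimal sketch in plain Python. The training script is not part of this card, so the shuffling, the seed, the file names, and the name `split_dataset` are assumptions rather than the procedure actually used.

```python
import random

def split_dataset(items, percents=(80, 10, 10), seed=0):
    """Shuffle a copy of items and cut it into train/valid/test lists."""
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = len(shuffled) * percents[0] // 100
    n_valid = len(shuffled) * percents[1] // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder absorbs any rounding
    return train, valid, test

samples = [f"line_{i:04d}.png" for i in range(1000)]  # hypothetical file names
train, valid, test = split_dataset(samples)
print(len(train), len(valid), len(test))  # 800 100 100
```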

---

## Limitations

- Designed for **single text-line crops**, not full documents or paragraphs.
- Accuracy may degrade on handwritten text, because the model was trained on synthetic rendered images.
- Very small fonts (< 10 px rendered height) may produce errors.

---

## License

MIT