---
language:
- km
- en
tags:
- ocr
- crnn
- ctc
- khmer
- text-recognition
- pytorch
license: mit
---
# mini-ocr: Khmer & English Text Recognition
A lightweight CRNN (CNN + Bi-LSTM) model trained to recognise **Khmer and English text** from image crops.
It uses a CTC head, so it handles variable-length text without character-level segmentation.
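
To make the CTC idea concrete, here is a minimal pure-Python sketch of the greedy collapse rule: the network emits one label per time step, consecutive repeats are merged, and blanks (index 0) are dropped. The function name `ctc_collapse` and the toy label sequence are illustrative assumptions, not part of the model's code.

```python
def ctc_collapse(labels, blank=0):
    """Merge consecutive repeats, then drop blanks (greedy CTC decoding rule)."""
    out, prev = [], None
    for label in labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


# Hypothetical per-time-step predictions for an 8-step sequence.
# A blank between two identical labels keeps them as separate characters.
steps = [0, 5, 5, 0, 5, 7, 7, 0]
print(ctc_collapse(steps))  # [5, 5, 7]
```

Because the alignment between time steps and characters is resolved by this rule, no per-character bounding boxes are needed at training or inference time.
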
---
## Model Architecture
| Component | Details |
|-----------|---------|
| CNN backbone | 6 × Conv-BN-ReLU blocks with MaxPool |
| Recurrent | 2 × Bi-LSTM (hidden = 256) with a linear bridge |
| Output | CTC linear → `NUM_CHARS + 1` (blank = 0) |
| Input | Greyscale image, height normalised to **32 px**, width variable |
| Vocabulary | 222 characters: lowercase/uppercase Latin, digits, Khmer consonants, vowels, diacritics, punctuation |
---
## Files
| File | Description |
|------|-------------|
| `model.pt` | `state_dict` weights; loading them requires the model class definition |
| `model_scripted.pt` | TorchScript version; no class definition needed |
| `vocab.txt` | One character per line, index = line number (1-based) |
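
As a rough sketch of how `vocab.txt` maps to class indices, assuming the file is UTF-8 and that index 0 stays reserved for the CTC blank, the lines can be read into a 1-based lookup. The helper name `build_idx2char` is hypothetical; only the trailing newline is stripped, so that a line holding a single space survives.

```python
import io


def build_idx2char(lines):
    """Map 1-based line numbers to characters; index 0 is left for the CTC blank."""
    idx2char = {}
    for line_no, line in enumerate(lines, start=1):
        char = line.rstrip("\r\n")  # strip only the line ending, keep a space intact
        idx2char[line_no] = char
    return idx2char


# Toy stand-in for vocab.txt; a real call might look like:
#   with open("vocab.txt", encoding="utf-8") as f:
#       idx2char = build_idx2char(f)
sample = io.StringIO("a\nb\n \nក\n")
mapping = build_idx2char(sample)
print(mapping)  # {1: 'a', 2: 'b', 3: ' ', 4: 'ក'}
```
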
---
## Quick Start
### Install dependencies
```bash
pip install torch torchvision pillow numpy huggingface_hub
```
```python
import torch
import numpy as np
from PIL import Image
from huggingface_hub import hf_hub_download

# Vocabulary in class-index order (index 0 is the CTC blank, so characters start at 1).
TOKENS = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "កខគឃងចឆជឈញដឋឌឍណតថទធនបផពភមយរលវឝឞសហឡអឣឤឥឦឧឩឪឫឬឭឮឯឰឱឲឳ"
    "ាិីឹឺុូួើឿៀេែៃោៅំះៈ៉៊់៌៍៎៏័៑្។៕៖ៗ៘៛៝"
    "០១២៣៤៥៦៧៨៩៳"
    "!@#$%^&*()-_=+[]{};:'\",.<>?/|\\ "
)
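idx2char = {i + 1: c for i, c in enumerate(TOKENS)}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the TorchScript model from the Hub and load it (no class definition needed).
scripted_path = hf_hub_download(repo_id="phonsobon/mini-ocr", filename="model_scripted.pt")
model = torch.jit.load(scripted_path, map_location=device)
model.eval()


def load_image(path):
    """Load an image as a (1, 1, 32, W) float tensor with values in [0, 1]."""
    img = Image.open(path).convert("L")
    w, h = img.size
    # Fix the height at 32 px and scale the width proportionally (at least 1 px).
    img = img.resize((max(1, int(w / h * 32)), 32))
    img = np.array(img, dtype=np.float32) / 255.0
    return torch.tensor(img).unsqueeze(0).unsqueeze(0)


def ctc_decode(logits):
    """Greedy CTC decoding: collapse repeated indices, then drop blanks (index 0)."""
    # The indexing treats logits as (time, batch, classes) and decodes the first item.
    preds = torch.argmax(logits, dim=2)[:, 0].cpu().tolist()
    prev, text = -1, []
    for p in preds:
        if p != prev and p != 0:
            text.append(idx2char.get(p, ""))
        prev = p
    return "".join(text)


img = load_image("your_image.png").to(device)
with torch.no_grad():
    result = ctc_decode(model(img))
print("OCR result:", result)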
```
---
## Input Format
- **Single text-line image** (word, phrase, or a short line of text)
- Converted to **greyscale** internally
- Height resized to **32 px**; width scales proportionally
- Values normalised to `[0, 1]`
For full-document OCR, first crop individual text lines, then pass each crop to the model.
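
How the lines are cropped is left open. One simple option, sketched here under the assumption of a binarised page (1 = ink, 0 = background) with horizontal text and clear gaps between lines, is a horizontal projection profile: any run of rows containing ink is treated as one text line. The function name `find_text_lines` and the `min_height` parameter are hypothetical, and this is not the method used to build the training data.

```python
def find_text_lines(binary_rows, min_height=2):
    """Return (top, bottom) row ranges that contain ink; bottom is exclusive."""
    spans, start = [], None
    for y, row in enumerate(binary_rows):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            if y - start >= min_height:    # ignore specks thinner than min_height
                spans.append((start, y))
            start = None
    # Close a line that runs to the bottom edge of the page.
    if start is not None and len(binary_rows) - start >= min_height:
        spans.append((start, len(binary_rows)))
    return spans


# Toy 8-row "page": 1 = ink, 0 = background.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
print(find_text_lines(page))  # [(1, 3), (5, 8)]
```

Each `(top, bottom)` span could then be cut out with Pillow's `Image.crop((0, top, width, bottom))` and resized to 32 px height before being passed to the model. Skewed scans or touching lines would need a more robust layout step.
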
---
## Training Details
| Setting | Value |
|---------|-------|
| Epochs | 50 |
| Optimizer | Adam, lr = 1e-4 |
| Loss | CTC (`blank = 0`, `zero_infinity = True`) |
| Image height | 32 px |
| Dataset | Synthetic: rendered from a vocabulary text file across multiple fonts, with noise augmentation (Gaussian, salt-and-pepper, blur, JPEG compression) |
| Train / Valid / Test split | 80 / 10 / 10 |
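
The optimiser and loss settings in the table correspond to standard PyTorch calls. The sketch below shows a single training step under those settings; the `nn.Linear` stand-in for the recogniser, the feature size of 16, and the sequence lengths are placeholders for illustration, not the actual training code.

```python
import torch
import torch.nn as nn

NUM_CHARS = 222   # vocabulary size from the architecture table
T, B = 40, 2      # hypothetical: time steps and batch size

# Stand-in for the recogniser: any module that yields (T, B, NUM_CHARS + 1) logits.
head = nn.Linear(16, NUM_CHARS + 1)
features = torch.randn(T, B, 16)

criterion = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

# Random target labels in 1..NUM_CHARS (0 is reserved for the blank).
targets = torch.randint(1, NUM_CHARS + 1, (B, 10))
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.tensor([10, 7])

# CTCLoss expects log-probabilities shaped (T, B, classes).
log_probs = head(features).log_softmax(dim=2)
loss = criterion(log_probs, targets, input_lengths, target_lengths)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```

`zero_infinity=True` replaces infinite losses, which arise when a target is too long for its input sequence, with zero so that a single bad sample does not derail the update.
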
---
## Limitations
- Designed for **single text-line crops**, not full documents or paragraphs.
- Trained only on synthetic rendered images, so accuracy may degrade on handwritten text.
- Very small fonts (< 10 px rendered height) may produce errors.
---
## License
MIT