---
language:
- km
- en
tags:
- ocr
- crnn
- ctc
- khmer
- text-recognition
- pytorch
license: mit
---

# mini-ocr: Khmer & English Text Recognition

A lightweight CRNN (CNN + Bi-LSTM) model trained to recognise **Khmer and English text** from image crops.
It uses a CTC head, so it handles variable-length text without character-level segmentation.

---

## Model Architecture

| Component | Details |
|-----------|---------|
| CNN backbone | 6 × Conv-BN-ReLU blocks with MaxPool |
| Recurrent | 2 × Bi-LSTM (hidden = 256) with a linear bridge |
| Output | CTC linear → `NUM_CHARS + 1` (blank = 0) |
| Input | Greyscale image, height normalised to **32 px**, width variable |
| Vocabulary | 222 characters: lowercase/uppercase Latin, digits, Khmer consonants, vowels, diacritics, punctuation |
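
The table above can be read as the following PyTorch sketch. It is an illustrative reconstruction, not the released class: the channel widths, kernel sizes, and pooling layout are assumptions chosen so that a 32 px-high input collapses to height 1, and the names `conv_block` and `CRNNSketch` are hypothetical. Its parameter names are not expected to match the keys in `model.pt`.

```python
import torch
import torch.nn as nn

NUM_CHARS = 222  # vocabulary size from the table; index 0 is the CTC blank


def conv_block(c_in, c_out, pool=None):
    """Conv-BN-ReLU, optionally followed by MaxPool (pool = kernel and stride)."""
    layers = [
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    ]
    if pool is not None:
        layers.append(nn.MaxPool2d(kernel_size=pool, stride=pool))
    return nn.Sequential(*layers)


class CRNNSketch(nn.Module):
    def __init__(self, num_chars=NUM_CHARS, hidden=256):
        super().__init__()
        # Assumed layout: height 32 -> 16 -> 8 -> 4 -> 2 -> 1, while the width
        # is only halved twice so enough time steps remain for CTC.
        self.cnn = nn.Sequential(
            conv_block(1, 64, pool=(2, 2)),
            conv_block(64, 128, pool=(2, 2)),
            conv_block(128, 256),
            conv_block(256, 256, pool=(2, 1)),
            conv_block(256, 512, pool=(2, 1)),
            conv_block(512, 512, pool=(2, 1)),
        )
        self.rnn1 = nn.LSTM(512, hidden, bidirectional=True)
        self.bridge = nn.Linear(2 * hidden, hidden)  # linear bridge between LSTMs
        self.rnn2 = nn.LSTM(hidden, hidden, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_chars + 1)  # +1 for the blank

    def forward(self, x):                      # x: (B, 1, 32, W)
        f = self.cnn(x)                        # (B, 512, 1, W / 4)
        f = f.squeeze(2).permute(2, 0, 1)      # (T, B, 512) with T = W / 4
        f, _ = self.rnn1(f)
        f = self.bridge(f)
        f, _ = self.rnn2(f)
        return self.head(f)                    # (T, B, num_chars + 1)


model = CRNNSketch().eval()
with torch.no_grad():
    logits = model(torch.zeros(1, 1, 32, 128))
print(logits.shape)  # torch.Size([32, 1, 223])
```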

---

## Files

| File | Description |
|------|-------------|
| `model.pt` | `state_dict`; loading it requires the original model class definition |
| `model_scripted.pt` | TorchScript version; no class definition needed |
| `vocab.txt` | One character per line, index = line number (1-based) |

---

## Quick Start

### Install dependencies

```bash
pip install torch torchvision pillow numpy huggingface_hub
```

### Run inference

```python
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from PIL import Image

REPO_ID = "phonsobon/mini-ocr"

# vocab.txt holds one character per line; line N (1-based) is class index N.
# Index 0 is reserved for the CTC blank.
vocab_path = hf_hub_download(repo_id=REPO_ID, filename="vocab.txt")
with open(vocab_path, encoding="utf-8") as f:
    # Strip only the line ending so that the space character survives.
    idx2char = {i: line.rstrip("\r\n") for i, line in enumerate(f, start=1)}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The TorchScript export can be loaded without the class definition.
scripted_path = hf_hub_download(repo_id=REPO_ID, filename="model_scripted.pt")
model = torch.jit.load(scripted_path, map_location=device)
model.eval()


def load_image(path):
    """Load an image as a (1, 1, 32, W) float tensor with values in [0, 1]."""
    img = Image.open(path).convert("L")  # greyscale
    w, h = img.size
    new_w = max(1, int(w / h * 32))      # keep the aspect ratio at height 32
    img = img.resize((new_w, 32))
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return torch.from_numpy(arr).unsqueeze(0).unsqueeze(0)


def ctc_decode(logits):
    """Greedy CTC decode: merge repeated indices, then drop blanks (index 0)."""
    # The indexing assumes logits of shape (T, B, C); only the first batch
    # item is decoded.
    preds = torch.argmax(logits, dim=2)[:, 0].tolist()
    prev, text = -1, []
    for p in preds:
        if p != prev and p != 0:
            text.append(idx2char.get(p, ""))
        prev = p
    return "".join(text)


img = load_image("your_image.png").to(device)
with torch.no_grad():
    result = ctc_decode(model(img))
print("OCR result:", result)
```
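
To make the decoding rule in `ctc_decode` concrete, here is a small worked example on a hand-written index sequence. The three-character table `toy_idx2char` and the helper name `collapse` are made up for illustration; only the rule itself (merge consecutive repeats, then drop index 0) mirrors the code above.

```python
def collapse(indices, idx2char, blank=0):
    """Greedy CTC collapse: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for i in indices:
        if i != prev and i != blank:
            out.append(idx2char.get(i, ""))
        prev = i
    return "".join(out)


toy_idx2char = {1: "c", 2: "a", 3: "t"}  # hypothetical 3-character vocabulary

# Per-time-step argmax:  c  c  _  a  a  a  _  t
print(collapse([1, 1, 0, 2, 2, 2, 0, 3], toy_idx2char))  # cat

# A blank between two identical indices keeps both characters.
print(collapse([2, 0, 2], toy_idx2char))  # aa
```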

---

## Input Format

- **Single text-line image** (word, phrase, or a short line of text)
- Converted to **greyscale** during preprocessing
- Height resized to **32 px**; width scales proportionally
- Pixel values normalised to `[0, 1]`

For full-document OCR, first crop individual text lines, then pass each crop to the model.
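
One simple way to obtain line crops from a clean, unskewed scan is a horizontal projection profile: rows that contain ink are grouped into bands, and each band becomes one crop. The sketch below is an assumption about how this step could be done, not part of the model or its repository. The function name `split_lines` and the `ink_threshold` parameter are hypothetical, and pages with skew or touching lines usually need a proper text detector instead. Each resulting crop would still need resizing to 32 px height before it is passed to the model.

```python
import numpy as np


def split_lines(page, ink_threshold=0.5):
    """Return (top, bottom) row ranges of text lines in a greyscale page.

    page: 2-D float array in [0, 1], dark text on a light background.
    """
    has_ink = (page < ink_threshold).any(axis=1)  # True for rows with dark pixels
    bands, start = [], None
    for y, ink in enumerate(has_ink):
        if ink and start is None:
            start = y                      # a text band begins
        elif not ink and start is not None:
            bands.append((start, y))       # band ends at the first empty row
            start = None
    if start is not None:                  # band touching the bottom edge
        bands.append((start, len(has_ink)))
    return bands


# Synthetic page: white background with two dark "lines" of text.
page = np.ones((60, 200), dtype=np.float32)
page[5:20, 10:150] = 0.0
page[35:50, 10:180] = 0.0

bands = split_lines(page)
print(bands)  # [(5, 20), (35, 50)]
crops = [page[top:bottom] for top, bottom in bands]
```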

---

## Training Details

| Setting | Value |
|---------|-------|
| Epochs | 50 |
| Optimizer | Adam, lr = 1e-4 |
| Loss | CTC (`blank = 0`, `zero_infinity = True`) |
| Image height | 32 px |
| Dataset | Synthetic: rendered from a vocabulary text file across multiple fonts with noise augmentation (Gaussian, salt-and-pepper, blur, JPEG compression) |
| Train / Valid / Test split | 80 / 10 / 10 |
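
As a rough illustration of the optimizer and loss settings in the table, the sketch below runs a single training step on random tensors. The tiny linear `model`, the batch shapes, and the random targets are placeholders, not the author's training code; only `Adam` with `lr = 1e-4` and `CTCLoss(blank=0, zero_infinity=True)` come from the table. With `zero_infinity=True`, PyTorch zeroes infinite losses (which mainly occur when an input is too short to be aligned to its target) together with their gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

NUM_CHARS = 222               # vocabulary size; class 0 is reserved for the blank
T, B, FEAT = 40, 4, 64        # time steps, batch size, feature size (arbitrary)

# Placeholder for the CRNN: any module producing (T, B, NUM_CHARS + 1) logits.
model = nn.Linear(FEAT, NUM_CHARS + 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CTCLoss(blank=0, zero_infinity=True)

features = torch.randn(T, B, FEAT)                  # stand-in for real features
targets = torch.randint(1, NUM_CHARS + 1, (B, 10))  # label indices start at 1
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

log_probs = model(features).log_softmax(dim=2)      # CTCLoss expects log-probabilities
loss = criterion(log_probs, targets, input_lengths, target_lengths)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"CTC loss: {loss.item():.3f}")
```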

---

## Limitations

- Designed for **single text-line crops**, not full documents or paragraphs.
- Accuracy may degrade on handwritten text, since the model was trained on synthetic rendered images.
- Very small fonts (< 10 px rendered height) may produce errors.

---

## License

MIT