---
language:
- km
- en
tags:
- ocr
- crnn
- ctc
- khmer
- text-recognition
- pytorch
license: mit
---

# mini-ocr: Khmer & English Text Recognition

A lightweight CRNN (CNN + Bi-LSTM) model trained to recognise **Khmer and English text** from image crops. It uses a CTC head, so it handles variable-length text without character-level segmentation.

---

## Model Architecture

| Component | Details |
|-----------|---------|
| CNN backbone | 6 × Conv-BN-ReLU blocks with MaxPool |
| Recurrent | 2 × Bi-LSTM (hidden = 256) with a linear bridge |
| Output | Linear layer to `NUM_CHARS + 1` classes for CTC (blank index = 0) |
| Input | Greyscale image, height normalised to **32 px**, width variable |
| Vocabulary | 222 characters: lowercase/uppercase Latin, digits, Khmer consonants, vowels, diacritics, punctuation |

---

## Files

| File | Description |
|------|-------------|
| `model.pt` | `state_dict`; loading it requires the model class definition (not shown in this README) |
| `model_scripted.pt` | TorchScript version; no class definition needed |
| `vocab.txt` | One character per line, index = line number (1-based) |

---

## Quick Start

### Install dependencies

```bash
pip install torch torchvision pillow numpy huggingface_hub
```

### Run inference

```python
import torch
import numpy as np
from PIL import Image
from huggingface_hub import hf_hub_download

# Character set. Index 0 is reserved for the CTC blank, so characters start at 1.
TOKENS = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "កខគឃងចឆជឈញដឋឌឍណតថទធនបផពភមយរលវឝឞសហឡអឣឤឥឦឧឩឪឫឬឭឮឯឰឱឲឳ"
    "ាិីឹឺុូួើឿៀេែៃោៅំះៈ៉៊់៌៍៎៏័៑្។៕៖ៗ៘៛៝"
    "០១២៣៤៥៦៧៨៩៳"
    "!@#$%^&*()-_=+[]{};:'\",.<>?/|\\ "
)
idx2char = {i + 1: c for i, c in enumerate(TOKENS)}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the TorchScript model from the Hub and load it.
scripted_path = hf_hub_download(repo_id="phonsobon/mini-ocr", filename="model_scripted.pt")
model = torch.jit.load(scripted_path, map_location=device)
model.eval()


def load_image(path):
    """Load an image as a (1, 1, 32, W) float tensor with values in [0, 1]."""
    img = Image.open(path).convert("L")
    w, h = img.size
    # Keep the aspect ratio; guard against a zero width for very narrow crops.
    img = img.resize((max(1, int(w / h * 32)), 32))
    img = np.array(img, dtype=np.float32) / 255.0
    return torch.tensor(img).unsqueeze(0).unsqueeze(0)


def ctc_decode(logits):
    """Greedy CTC decode: collapse repeats, then drop blanks (index 0)."""
    # The indexing treats logits as (time, batch, classes) and takes the first batch item.
    preds = torch.argmax(logits, dim=2)[:, 0].cpu().tolist()
    prev, text = -1, []
    for p in preds:
        if p != prev and p != 0:
            text.append(idx2char.get(p, ""))
        prev = p
    return "".join(text)


img = load_image("your_image.png").to(device)
with torch.no_grad():
    result = ctc_decode(model(img))
print("OCR result:", result)
```

---

## Input Format

- **Single text-line image** (word, phrase, or a short line of text)
- Converted to **greyscale** internally
- Height resized to **32 px**; width scales proportionally
- Values normalised to `[0, 1]`

For full-document OCR, first crop individual text lines, then pass each crop to the model.

---

## Training Details

| Setting | Value |
|---------|-------|
| Epochs | 50 |
| Optimizer | Adam, lr = 1e-4 |
| Loss | CTC (`blank = 0`, `zero_infinity = True`) |
| Image height | 32 px |
| Dataset | Synthetic: rendered from a vocabulary text file across multiple fonts with noise augmentation (Gaussian, salt-and-pepper, blur, JPEG compression) |
| Train / Valid / Test split | 80 / 10 / 10 |

---

## Limitations

- Designed for **single text-line crops**, not full documents or paragraphs.
- The model was trained only on synthetic rendered images, so accuracy may degrade on handwritten text.
- Very small fonts (< 10 px rendered height) may produce errors.

---

## License

MIT
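
---

## Example: Greedy CTC Decoding

The `ctc_decode` function in Quick Start applies the usual greedy CTC rule: take the best class per time step, collapse consecutive repeats, then drop blanks. The sketch below shows that rule on a hand-written index sequence, without the model. The function name `greedy_ctc_collapse` and the three-character `toy_vocab` are illustrative assumptions, not part of the released code.

```python
def greedy_ctc_collapse(indices, idx2char, blank=0):
    """Collapse consecutive repeats, drop blanks, and map indices to characters."""
    text = []
    prev = None
    for idx in indices:
        # Emit a character only when the index changes and is not the blank.
        if idx != prev and idx != blank:
            text.append(idx2char.get(idx, ""))
        prev = idx
    return "".join(text)


# Hypothetical toy vocabulary (1-based, index 0 is the blank).
toy_vocab = {1: "a", 2: "b", 3: "c"}

# Per-frame argmax indices, as a model might emit them.
frames = [0, 1, 1, 0, 1, 2, 2, 0, 0, 3]
decoded = greedy_ctc_collapse(frames, toy_vocab)
print(decoded)  # aabc
```

The blank between the two runs of `1` is what lets the decoder output a doubled letter; without it, the repeated frames would collapse into a single `a`.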
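
---

## Example: Reading `vocab.txt`

The Files table describes `vocab.txt` as one character per line with a 1-based index. A minimal sketch of building the index-to-character mapping from such a file follows. The helper name `load_vocab` is hypothetical, and the exact file layout (UTF-8, no header line, the space character stored as a line containing a single space) is an assumption; the demo writes its own small temporary file rather than reading the released one.

```python
import os
import tempfile


def load_vocab(path):
    """Map 1-based line numbers to characters; index 0 stays free for the CTC blank."""
    with open(path, encoding="utf-8") as f:
        # Strip only the newline so that a line holding a single space survives.
        chars = [line.rstrip("\n") for line in f]
    return {i: c for i, c in enumerate(chars, start=1)}


# Demo with a small stand-in file (assumed layout, not the released vocab.txt).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "vocab.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("a\nក\n \n")
    idx2char_demo = load_vocab(demo_path)

print(idx2char_demo)
```

Stripping with `rstrip("\n")` rather than `strip()` matters here because the vocabulary includes the space character.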