---
language:
- km
- en
tags:
- ocr
- crnn
- ctc
- khmer
- text-recognition
- pytorch
license: mit
---

# mini-ocr: Khmer & English Text Recognition

A lightweight CRNN (CNN + Bi-LSTM) model trained to recognise Khmer and English text in image crops.
Its CTC head lets it transcribe variable-length text without character-level segmentation.
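
To illustrate why a CTC head removes the need for segmentation, here is a minimal sketch of greedy CTC collapsing in plain Python. The three-character vocabulary and the frame sequence are made up for illustration and are not real model output; only the convention that index 0 is the blank comes from this card.

```python
BLANK = 0  # blank index used by this model's CTC head

def ctc_collapse(indices, idx2char):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks."""
    chars = []
    prev = None
    for idx in indices:
        # A symbol is emitted only on the first frame of a run of identical indices.
        if idx != prev and idx != BLANK:
            chars.append(idx2char[idx])
        prev = idx
    return "".join(chars)

# Hypothetical toy vocabulary and per-frame predictions.
toy_vocab = {1: "c", 2: "a", 3: "t"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
decoded = ctc_collapse(frames, toy_vocab)
print(decoded)  # cat
```

Eleven frames collapse to three characters, so the network never has to decide where one character ends and the next begins. A blank between two identical indices keeps them apart, which is how doubled letters survive the collapse.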

---

## Model Architecture

| Component | Details |
|-----------|---------|
| CNN backbone | 6 × Conv-BN-ReLU blocks with MaxPool |
| Recurrent | 2 × Bi-LSTM (hidden = 256) with a linear bridge |
| Output | CTC linear → `NUM_CHARS + 1` (blank = 0) |
| Input | Greyscale image, height normalised to 32 px, width variable |
| Vocabulary | 222 characters: lowercase/uppercase Latin, digits, Khmer consonants, vowels, diacritics, punctuation |

---

## Files

| File | Description |
|------|-------------|
| `model.pt` | `state_dict`; loading it requires the model class definition |
| `model_scripted.pt` | TorchScript version; loads with `torch.jit.load`, no class definition needed |
| `vocab.txt` | One character per line; a character's index is its 1-based line number (index 0 is the CTC blank) |
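
As a rough sketch of how `vocab.txt` could be turned into lookup tables, the helper below reads one character per line and numbers the lines from 1. The function name `load_vocab` and the tiny demo file are hypothetical; the 1-based indexing follows the table above.

```python
import tempfile
from pathlib import Path

def load_vocab(path):
    """Return (idx2char, char2idx) with 1-based indices; 0 stays free for the CTC blank."""
    lines = Path(path).read_text(encoding="utf-8").split("\n")
    # Drop only the empty string left by a trailing newline. Do not strip()
    # each line: the vocabulary includes a space character.
    if lines and lines[-1] == "":
        lines.pop()
    idx2char = {i: ch for i, ch in enumerate(lines, start=1)}
    char2idx = {ch: i for i, ch in idx2char.items()}
    return idx2char, char2idx

# Demo on a three-line stand-in file (not the real vocab.txt).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "vocab.txt"
    demo_path.write_text("a\nb\n \n", encoding="utf-8")
    idx2char, char2idx = load_vocab(demo_path)

encoded = [char2idx[ch] for ch in "a b"]
print(encoded)  # [1, 3, 2]
```

Building the table from the file rather than from a hard-coded string keeps the mapping in sync with the published weights if the vocabulary is ever updated.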

---

## Quick Start

### Install dependencies

```bash
pip install torch torchvision pillow numpy huggingface_hub
```

### Run inference

```python
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from PIL import Image

# Index 0 is the CTC blank; characters are numbered from 1 in the order below.
TOKENS = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "កខគឃងចឆជឈញដឋឌឍណតថទធនបផពភមយរលវឝឞសហឡអឣឤឥឦឧឩឪឫឬឭឮឯឰឱឲឳ"
    "ាិីឹឺុូួើឿៀេែៃោៅំះៈ៉៊់៌៍៎៏័៑្។៕៖ៗ៘៛៝"
    "០១២៣៤៥៦៧៨៩៳"
    "!@#$%^&*()-_=+[]{};:'\",.<>?/|\\ "
)
idx2char = {i + 1: c for i, c in enumerate(TOKENS)}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the TorchScript export, which needs no class definition.
scripted_path = hf_hub_download(repo_id="phonsobon/mini-ocr", filename="model_scripted.pt")
model = torch.jit.load(scripted_path, map_location=device)
model.eval()


def load_image(path):
    """Load a text-line crop as a (1, 1, 32, W) float tensor in [0, 1]."""
    img = Image.open(path).convert("L")
    w, h = img.size
    # Scale the height to 32 px, keep the aspect ratio, never let the width reach 0.
    img = img.resize((max(1, int(w / h * 32)), 32))
    img = np.array(img, dtype=np.float32) / 255.0
    return torch.tensor(img).unsqueeze(0).unsqueeze(0)


def ctc_decode(logits):
    """Greedy CTC decode for the first image in the batch.

    The indexing assumes the model output is laid out as (time, batch, classes).
    """
    preds = torch.argmax(logits, dim=2)[:, 0].tolist()
    prev, text = -1, []
    for p in preds:
        # Emit a character only when it differs from the previous frame
        # and is not the blank (0).
        if p != prev and p != 0:
            text.append(idx2char.get(p, ""))
        prev = p
    return "".join(text)


img = load_image("your_image.png").to(device)
with torch.no_grad():
    result = ctc_decode(model(img))
print("OCR result:", result)
```

---

## Input Format

- A single text-line image (a word, a phrase, or a short line of text)
- Converted to greyscale during preprocessing
- Height resized to 32 px; width scaled proportionally
- Pixel values normalised to `[0, 1]`

For full-document OCR, first crop the individual text lines, then pass each crop to the model.
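
The resizing rule above comes down to one line of arithmetic. The sketch below works it through on a few sizes; the helper name `resized_size` is hypothetical, and the clamp to a minimum width of 1 px is an added safeguard rather than something this card specifies.

```python
TARGET_HEIGHT = 32  # input height stated in this card

def resized_size(width, height, target_height=TARGET_HEIGHT):
    """Return (new_width, new_height), preserving the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Truncate like the Quick Start code does, but keep at least one pixel column.
    new_width = max(1, int(width / height * target_height))
    return new_width, target_height

print(resized_size(400, 64))  # (200, 32): downscaled by half
print(resized_size(150, 16))  # (300, 32): upscaled by two
print(resized_size(3, 200))   # (1, 32): clamped, too narrow to hold text
```

A crop that is only about 16 px tall gets upscaled by a factor of two, which adds no detail. Checking the original crop height before inference is a cheap way to flag inputs that are likely to be misread.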

---

## Training Details

| Setting | Value |
|---------|-------|
| Epochs | 50 |
| Optimizer | Adam, lr = 1e-4 |
| Loss | CTC (`blank = 0`, `zero_infinity = True`) |
| Image height | 32 px |
| Dataset | Synthetic: rendered from a vocabulary text file across multiple fonts, with noise augmentation (Gaussian, salt-and-pepper, blur, JPEG compression) |
| Train / Valid / Test split | 80 / 10 / 10 % |
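
This card does not say how the 80 / 10 / 10 split was produced. One common way to get a reproducible split is to shuffle with a fixed seed, sketched below; the function name, the seed, and the sample file names are all assumptions made for illustration.

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle samples reproducibly and cut them into train/valid/test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_valid = int(len(items) * ratios[1])
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]  # remainder, so no sample is lost to rounding
    return train, valid, test

# Hypothetical file names standing in for rendered text-line images.
samples = [f"line_{i:04d}.png" for i in range(1000)]
train, valid, test = split_dataset(samples)
print(len(train), len(valid), len(test))  # 800 100 100
```

With synthetic data it can also make sense to split by source text or by font rather than by image, so that the test set measures generalisation instead of recall of strings already seen in training.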

---

## Limitations

- Designed for single text-line crops, not full documents or paragraphs.
- Accuracy may degrade on handwritten text, since the training data consists of synthetic rendered images.
- Very small fonts (< 10 px rendered height) may produce errors.

---

## License

MIT