
	---
	library_name: pytorch
	license: mit
	language:
	- th
	- en
	tags:
	- ocr
	- text-recognition
	- thai-id-card
	- crnn
	- ctc
	- on-device
	- mobile
	- numeric-ocr
	- citizen-id
	pipeline_tag: image-to-text
	---

	# Thai ID Nano OCR — Numeric OCR Reader (SimpleCRNN (MVP))

	> MVP model. Production upgrade: swap to `ppocrv5` variant (same interface,
	> better accuracy). See `config.json` → `architecture_variant` for programmatic detection.

	CTC-based text recognition model for Thai National ID card numeric fields,
	designed for on-device inference at 30fps on mobile.

| Metric | Value |
|--------|-------|
| Architecture | SimpleCRNN (MVP) |
| Variant | `crnn` |
| ExactMatch | 98.6% |
| CharAccuracy | 99.4% |
| Parameters | 3,026,703 |
| Vocab size | 15 |
| Best epoch | 10 |

	## Quick Start

	```python
	from huggingface_hub import hf_hub_download

	model_path = hf_hub_download("chayuto/thai-id-ocr-crnn-numeric-reader", "model.pt")
	vocab_path = hf_hub_download("chayuto/thai-id-ocr-crnn-numeric-reader", "vocab.txt")
	config = hf_hub_download("chayuto/thai-id-ocr-crnn-numeric-reader", "config.json")
	```
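Once the files are downloaded, the vocabulary and config can be read with the standard library. The following is a minimal sketch, assuming `vocab.txt` holds one character per line with the literal token `<space>` standing for a space (as described under Files) and that `config.json` is a flat JSON object containing `architecture_variant`. The helper names `load_vocab` and `load_config` are illustrative and not part of this repository.

```python
import json
from pathlib import Path


def load_vocab(vocab_path):
    """Read vocab.txt into a list of characters (CTC blank is NOT included)."""
    chars = []
    for line in Path(vocab_path).read_text(encoding="utf-8").splitlines():
        if line == "":
            continue  # ignore empty lines such as a trailing newline
        # The file spells the space character as the token "<space>"
        chars.append(" " if line == "<space>" else line)
    return chars


def load_config(config_path):
    """Parse config.json into a dict."""
    return json.loads(Path(config_path).read_text(encoding="utf-8"))
```

With these helpers, `load_config(config)["architecture_variant"]` can be compared against `"crnn"` to detect which variant was downloaded, and `len(load_vocab(vocab_path)) + 1` should equal the vocab size in the table above (the extra entry is the CTC blank).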

	## Architecture

	SimpleCRNN — CNN (4-layer) + BiLSTM (2-layer) + CTC decoder.

	```
	Input: [B, 3, 48, 320] (RGB, normalized to [-1, 1])
	→ CNN: 32→64→128→256 channels, BatchNorm+ReLU, MaxPool(2,2)×3
	→ AdaptiveAvgPool2d((1, None)) → T=40 time steps
	→ BiLSTM: hidden=256, layers=2, dropout=0.1
	→ Linear(512 → 15)
	→ CTC decode (blank=0, collapse repeats)
	Output: Unicode string
	```
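The `T=40` figure follows from the three stride-2 pooling stages. The sketch below traces the feature-map shape using plain arithmetic, assuming each `MaxPool(2,2)` halves both dimensions with floor rounding (PyTorch's default) and that the 3x3 convolutions with `padding=1` leave the spatial size unchanged. The function name `cnn_output_shape` is illustrative.

```python
def cnn_output_shape(height=48, width=320, num_pools=3, channels=256):
    """Return (C, H, W) of the CNN output for one input image."""
    for _ in range(num_pools):
        # MaxPool2d(2, 2) halves each spatial dimension (floor division)
        height //= 2
        width //= 2
    # AdaptiveAvgPool2d((1, None)) collapses height to 1 and keeps the width
    return channels, 1, width


print(cnn_output_shape())  # (256, 1, 40)
```

The remaining width is the sequence length seen by the BiLSTM, so each time step covers an 8-pixel-wide column of the input.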

	## Field Details

	- Zones: `num_id_zone` (13-digit CID), `num_dob_zone` (DD/MM/YYYY)
	- Charset: `0123456789/- .` (14 chars + CTC blank)
	- Post-validation: CID Modulo 11 checksum on digit 13
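The Modulo 11 post-validation can be sketched as follows. This uses the standard Thai citizen ID check: the first 12 digits are weighted 13 down to 2, and the 13th digit must equal `(11 - sum % 11) % 10`. The function name `is_valid_cid` and the decision to strip spaces and hyphens before checking are assumptions for illustration. The pipeline's own Validator may differ.

```python
def is_valid_cid(text):
    """Check a 13-digit Thai citizen ID against its Modulo 11 check digit."""
    # OCR output may contain separators from the charset; drop them first
    digits = [c for c in text if c not in " -"]
    if len(digits) != 13 or not all(c.isdigit() for c in digits):
        return False
    # Weights 13, 12, ..., 2 applied to the first 12 digits
    total = sum(int(c) * w for c, w in zip(digits[:12], range(13, 1, -1)))
    check = (11 - total % 11) % 10
    return check == int(digits[12])


print(is_valid_cid("1-2345-67890-12-1"))  # True
print(is_valid_cid("1234567890123"))      # False
```

The sample number is synthetic, chosen only because its checksum works out. A failed check is a useful signal to re-read the zone on the next frame rather than accept the string.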

	## Input Preprocessing

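```python
import cv2
import numpy as np

def preprocess(img_path, height=48, max_width=320):
    # Note: cv2.imread returns BGR channel order (and None on failure).
    # If the model expects RGB, convert with cv2.cvtColor(img, cv2.COLOR_BGR2RGB).
    img = cv2.imread(img_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {img_path}")
    h, w = img.shape[:2]
    # Resize to the target height; cap the width at max_width
    ratio = height / h
    new_w = max(1, min(int(w * ratio), max_width))
    img = cv2.resize(img, (new_w, height))
    # Pad to max_width with white
    if new_w < max_width:
        pad = np.full((height, max_width - new_w, 3), 255, dtype=np.uint8)
        img = np.concatenate([img, pad], axis=1)
    # Normalize to [-1, 1]
    img = img.astype(np.float32) / 255.0
    img = (img - 0.5) / 0.5
    return np.transpose(img, (2, 0, 1))  # HWC -> CHW
```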

	## CTC Decoding

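```python
def ctc_decode(indices, vocab_chars, blank_idx=0):
    """Greedy CTC collapse: drop blanks and merge consecutive repeats."""
    chars, prev = [], -1
    for idx in indices:
        if idx != blank_idx and idx != prev:
            # Class index i maps to vocab_chars[i - 1] because blank occupies index 0
            if 1 <= idx <= len(vocab_chars):
                chars.append(vocab_chars[idx - 1])
        # Update on every step so a blank between two identical classes separates them
        prev = idx
    return "".join(chars)
```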
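The index sequence passed to the decoder is normally the per-time-step argmax of the model output. Below is a dependency-free sketch of that greedy path for a single sample, assuming `logits` is a sequence of `T` rows with one score per class (for the model output of shape `(T, B, C)`, that is `output[:, b, :].tolist()`). The vocabulary ordering used here is an assumption for the demo only. The real order comes from `vocab.txt`.

```python
def greedy_decode(logits, vocab_chars, blank_idx=0):
    """Argmax each time step, then drop blanks and merge consecutive repeats."""
    chars, prev = [], -1
    for row in logits:
        idx = max(range(len(row)), key=lambda i: row[i])  # argmax over classes
        if idx != blank_idx and idx != prev and 1 <= idx <= len(vocab_chars):
            chars.append(vocab_chars[idx - 1])  # blank occupies class 0
        prev = idx
    return "".join(chars)


# Demo with synthetic one-hot "logits"; this vocabulary ordering is assumed
VOCAB = list("0123456789/- .")

def one_hot(i, n=15):
    return [1.0 if j == i else 0.0 for j in range(n)]

path = [2, 2, 0, 2, 3, 0, 0, 11]
print(greedy_decode([one_hot(i) for i in path], VOCAB))  # 112/
```

The blank between the two runs of class 2 is what lets the decoder emit a doubled character, which matters for ID numbers containing repeated digits.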

	## Loading the Model

```python
import torch
import torch.nn as nn

class SimpleCRNN(nn.Module):
    def __init__(self, num_classes, img_h=48):
        super().__init__()
        # Four conv blocks; the first three halve H and W, the last pools H down to 1
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),
        )
        self.rnn = nn.LSTM(256, 256, num_layers=2, bidirectional=True, batch_first=True, dropout=0.1)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        # (B, 256, 1, W) -> (B, 256, W) -> (B, W, 256)
        features = self.cnn(x).squeeze(2).permute(0, 2, 1)
        rnn_out, _ = self.rnn(features)
        return self.fc(rnn_out).permute(1, 0, 2)  # (T, B, C) for CTC

# num_classes = 14 charset characters + 1 CTC blank
model = SimpleCRNN(num_classes=15)
# model_path comes from hf_hub_download in Quick Start
model.load_state_dict(torch.load(model_path, map_location="cpu"))
model.eval()
```
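The parameter count reported in the metrics table can be cross-checked against the layer sizes in `SimpleCRNN`. The sketch below does the arithmetic by hand, assuming PyTorch's conventions: each conv layer has weights plus a bias, each BatchNorm has a learnable scale and shift, and each LSTM direction has four gates with input weights, recurrent weights, and two bias vectors. The helper names are illustrative.

```python
def conv_bn_params(c_in, c_out, k=3):
    conv = c_in * c_out * k * k + c_out  # kernel weights + bias
    bn = 2 * c_out                       # BatchNorm scale and shift
    return conv + bn


def lstm_params(input_size, hidden, num_layers, bidirectional=True):
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first consume the concatenated directions
        in_size = input_size if layer == 0 else hidden * directions
        per_direction = 4 * hidden * (in_size + hidden) + 2 * 4 * hidden
        total += per_direction * directions
    return total


cnn = sum(conv_bn_params(a, b) for a, b in [(3, 32), (32, 64), (64, 128), (128, 256)])
rnn = lstm_params(256, 256, num_layers=2)
fc = 512 * 15 + 15
total = cnn + rnn + fc

print(cnn, rnn, fc)               # 389376 2629632 7695
print(total)                      # 3026703
print(round(total * 4 / 1e6, 1))  # 12.1 (MB at 4 bytes per float32 parameter)
```

Most of the weights sit in the BiLSTM, and the float32 size lines up with the roughly 12 MB `model.pt` listed under Files.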

	## Pipeline Context

	This model is one of 3 Reader experts in the Thai ID Nano OCR pipeline:

	```
	Camera Frame → YOLO26n Finder (5-class, single pass)
	→ num_id_zone, num_dob_zone → Numeric Reader
	→ text_eng_zone → English Reader
	→ text_thai_zone → Thai Reader
	→ Validator (Mod11 checksum, date logic)
	```

	Total pipeline: <15 MB, 30fps on mobile.
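The routing step in the diagram amounts to a lookup from detected zone label to reader. A minimal sketch, assuming the finder emits the zone names shown above as plain strings. The reader keys (`"numeric"`, `"english"`, `"thai"`) and the function `route_detections` are hypothetical names for illustration, and only the four zones named in the diagram are mapped.

```python
# Zone label -> reader key; reader keys are hypothetical names
ZONE_TO_READER = {
    "num_id_zone": "numeric",
    "num_dob_zone": "numeric",
    "text_eng_zone": "english",
    "text_thai_zone": "thai",
}


def route_detections(zone_labels):
    """Group detected zone labels by the reader that should process them."""
    routed = {}
    for zone in zone_labels:
        reader = ZONE_TO_READER.get(zone)
        if reader is not None:  # labels with no reader are skipped
            routed.setdefault(reader, []).append(zone)
    return routed


print(route_detections(["num_id_zone", "text_thai_zone", "num_dob_zone"]))
# {'numeric': ['num_id_zone', 'num_dob_zone'], 'thai': ['text_thai_zone']}
```

Grouping by reader lets both numeric crops go through this model in a single batch.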

	## Files

| File | Description |
|------|-------------|
| `model.pt` | PyTorch `state_dict` (~12 MB) |
| `vocab.txt` | Character vocabulary, one per line (`<space>` = space). CTC blank is implicit at index 0. |
| `config.json` | Architecture params, training metadata, charset |

	## License

	MIT