---
library_name: transformers
license: mit
datasets:
- Darayut/khmer-document-synthetic-low-res
- Darayut/khmer-scene-text-synthetic-contrast
language:
- km
metrics:
- cer
pipeline_tag: image-to-text
tags:
- transformer
- khmer-ocr
- text-recognition
- crnn
- khmer-text-recognition
---
<div align="center">
  <img src="assets/netra.png" width="60%" alt="Netra Lab" />
</div>

<hr>

<p align="center">
  <a href="https://github.com/netra-ai-lab/Khmer-OCR-CNN-Transformer"><b>GitHub</b></a> |
  <a href="https://huggingface.co/Darayut/khmer-text-recognition"><b>Model Download</b></a> |
  <a href="https://huggingface.co/collections/Darayut/khmer-text-synthetic"><b>Dataset Download</b></a> |
  <a href="https://huggingface.co/spaces/Darayut/Khmer-Text-Recognition"><b>Inference Space</b></a>
</p>

<h2>
  <p align="center">
    <a href="">A Squeeze-and-Excitation Transformer Network for Khmer Optical Character Recognition</a>
  </p>
</h2>

<p align="center">
  <img src="assets/benchmark.png" style="width: 1000px" align="center">
</p>

<p align="center">
  <a href="">Character Error Rate (CER %) on the KHOB, Legal Documents, and Printed Word benchmarks</a>
</p>

## Introduction

This repository contains the implementation, datasets, and evaluation results for the **Squeeze-and-Excitation Transformer Network**, a high-performance Khmer text recognition model with a hybrid architecture that combines **Squeeze-and-Excitation** blocks for feature extraction with a **BiLSTM** layer for context smoothing, designed specifically to handle the complexity and length of Khmer script.
## Overview

Khmer script presents unique challenges for OCR due to its large character set, complex sub-consonant stacking, and variable text line lengths. This project employs an enhanced pipeline that:

1. **Chunks** long text lines into manageable overlapping segments.
2. **Extracts Features** using a **Squeeze-and-Excitation Network** (SE-VGG) that preserves horizontal spatial information.
3. **Encodes** local spatial features using a Transformer Encoder.
4. **Merges** the encoded chunks into a unified sequence.
5. **Smooths Context** using a **BiLSTM** layer to resolve boundary discontinuities between chunks.
6. **Decodes** the final sequence using a Transformer Decoder.
## Datasets

The model was trained entirely on synthetic data and evaluated on real-world datasets.

### Training Data (Synthetic)

We generated **200,000 synthetic images** to ensure robustness against font variations and background noise.

| Dataset Type | Count | Generator / Source | Augmentations |
| :--- | :--- | :--- | :--- |
| **Document Text** | 100,000 | Pillow + Khmer Corpus | Erosion, noise, thinning/thickening, perspective distortion. |
| **Scene Text** | 100,000 | SynthTIGER + Stanford BG | Rotation, blur, noise, realistic backgrounds. |

### Evaluation Data (Real-World + Synthetic)

| Dataset | Type | Size | Description |
| :--- | :--- | :--- | :--- |
| **KHOB** | Real | 325 | Standard benchmark; clean backgrounds but compression artifacts. |
| **Legal Documents** | Real | 227 | High variation in degradation, illumination, and distortion. |
| **Printed Words** | Synthetic | 1,000 | Short, isolated words in 10 different fonts. |

---
## Methodology & Architecture

### 1. Preprocessing: Chunking & Merging

To handle variable-length text lines without aggressive resizing, we employ a "chunk-and-merge" strategy:

* **Resize:** Input images are resized to a fixed height of 48 pixels while maintaining aspect ratio.
* **Chunking:** The image is split into overlapping chunks (size: 48×100 px, overlap: 16 px).
* **Independent Encoding:** Each chunk is processed independently by the CNN and Transformer Encoder to allow parallel batch processing.
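The chunking parameters above (100 px chunks, 16 px overlap) fully determine the chunk boundaries for a resized line. A minimal Python sketch; clamping the final chunk to the right image edge is our assumption for illustration, not a detail taken from the paper:

```python
def chunk_offsets(width, chunk_w=100, overlap=16):
    """Return (start, end) pixel columns of overlapping chunks for a
    48 px high text line image of the given width."""
    stride = chunk_w - overlap  # 84 px between successive chunk starts
    offsets = []
    start = 0
    while True:
        end = start + chunk_w
        if end >= width:
            # Clamp the last chunk to the right edge (assumed strategy).
            offsets.append((max(0, width - chunk_w), width))
            break
        offsets.append((start, end))
        start += stride
    return offsets

# A 300 px wide line is covered by four overlapping 100 px chunks:
print(chunk_offsets(300))  # [(0, 100), (84, 184), (168, 268), (200, 300)]
```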
### 2. Model Architecture: Squeeze-and-Excitation Transformer Network

Our proposed architecture integrates sequence-aware attention and recurrent smoothing to overcome the limitations of standard chunk-based OCR. The model consists of six key modules:



1. **Squeeze-and-Excitation Network (SE-VGG):**
   * A modified VGG backbone with **1D Squeeze-and-Excitation** blocks after convolutional layers **3**, **4**, and **5**.
   * Unlike standard SE, these blocks use **vertical pooling** to refine feature channels while strictly preserving the horizontal width (sequence information).

   

2. **Patch Module:**
   * Projects spatial features into a condensed **384-dimensional** embedding space.
   * Adds local positional encodings to preserve spatial order within chunks.

3. **Transformer Encoder:**
   * Captures contextual relationships among visual tokens within each independent chunk.

4. **Merging Module:**
   * Concatenates the encoded features from all chunks into a single unified sequence.
   * Adds **Global Positional Embeddings** to define the absolute position of tokens across the entire text line.

5. **BiLSTM Context Smoother:**
   * A bidirectional LSTM layer that processes the merged sequence.
   * **Purpose:** Bridges the "context gap" between independent chunks by smoothing boundary discontinuities, ensuring a seamless flow of information across the text line.

   

6. **Transformer Decoder:**
   * Generates the final Khmer character sequence using the globally smoothed context.
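The global positional signal added at the merging stage can be illustrated with the classic sinusoidal encoding; this is a sketch only, since the model card does not state whether the global embeddings are sinusoidal or learned (the 384-dimensional size follows the Patch Module above):

```python
import math

def sinusoidal_positions(seq_len, d_model=384):
    """Classic sinusoidal positional encoding. After merging, token t of
    chunk c receives a single global index into the unified sequence, so
    the same local offset in different chunks gets a distinct encoding."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positions(200)  # one 200-token merged sequence
```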
---

## Training Configuration

* **Epochs:** 100
* **Optimizer:** Adam
* **Loss Function:** Cross-Entropy Loss
* **Learning Rate Schedule:** Staged Cyclic
  * *Epochs 0-15:* Fixed 1e-4 (rapid convergence)
  * *Epochs 16-30:* Cyclic 1e-4 to 1e-5 (stability)
  * *Epochs 31-100:* Cyclic 1e-5 to 1e-6 (fine-tuning)
* **Sampling:** 50,000 images randomly sampled and augmented per epoch.
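The staged schedule above can be sketched as a function of the epoch. The triangular cycle shape and the 5-epoch cycle length are our assumptions for illustration; only the stage boundaries and learning-rate ranges come from the configuration above:

```python
def staged_cyclic_lr(epoch, cycle_len=5):
    """Learning rate for a given epoch under the staged cyclic schedule.
    Triangular cycles and cycle_len=5 are illustrative assumptions."""
    def triangular(lo, hi, t):
        # Sweep from hi down to lo and back up over one cycle.
        phase = (t % cycle_len) / cycle_len  # position in [0, 1)
        return hi - (hi - lo) * (1 - abs(2 * phase - 1))

    if epoch <= 15:   # epochs 0-15: fixed for rapid convergence
        return 1e-4
    if epoch <= 30:   # epochs 16-30: cyclic 1e-4 to 1e-5
        return triangular(1e-5, 1e-4, epoch - 16)
    return triangular(1e-6, 1e-5, epoch - 31)  # epochs 31-100: fine-tuning
```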
---

## Quantitative Analysis

We benchmarked our **proposed model** against VGG-Transformer, ResNet-Transformer, and Tesseract-OCR.

**Character Error Rate (CER %)** - *Lower is better*

TABLE 1: Character Error Rate (CER in %) on the KHOB, Legal Documents, and Printed Word benchmarks

| Model | KHOB | Legal Documents | Printed Word |
| :--- | :--- | :--- | :--- |
| Tesseract-OCR | 6.24 | 24.30 | 8.02 |
| VGG-Transformer | 2.27 | 10.27 | 3.61 |
| ResNet-Transformer | 2.98 | 11.57 | 2.80 |
| Proposed Model | **1.87** | **9.13** | **2.46** |
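CER, the metric reported above, is the character-level Levenshtein distance between the prediction and the ground truth, normalized by the ground-truth length. A minimal reference implementation (not the project's evaluation script):

```python
def cer(prediction, ground_truth):
    """Character Error Rate in percent: edit distance / reference length."""
    m, n = len(prediction), len(ground_truth)
    prev = list(range(n + 1))  # distances for the empty prediction prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if prediction[i - 1] == ground_truth[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return 100.0 * prev[n] / max(n, 1)

print(cer("កម្ពុជា", "កម្ពុជា"))  # 0.0 (exact match)
print(cer("កមពុជា", "កម្ពុជា"))  # one missing COENG sign out of 7 characters
```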
---

## Qualitative Analysis

TABLE 2: Failure cases on KHOB, Legal Documents, and Printed Word



TABLE 3: Predictions of our **proposed model** and all baselines compared with the ground truth. Errors in the predictions are highlighted in red.



**Key Findings:**

* **The proposed model** achieves the highest accuracy on long, continuous text lines (KHOB), demonstrating that the **BiLSTM Context Smoother** effectively resolves the chunk boundary discontinuities that limit standard Transformer baselines.
* On degraded and complex legal documents, **the proposed model** shows superior robustness, significantly outperforming all baselines. We attribute this to the **Squeeze-and-Excitation blocks**, which filter background noise while preserving character-specific features.
* **The proposed model** retains a slight advantage even on short, isolated words, where global context is less critical, outperforming both the ResNet-Transformer and VGG-Transformer baselines.
---

## Inference Usage

You can load this model directly from Hugging Face using the `transformers` library. Because this is a custom architecture, you must set `trust_remote_code=True`.

### 1. Setup

```bash
pip install torch torchvision transformers pillow huggingface_hub
# Download the inference scripts
wget https://huggingface.co/Darayut/khmer-text-recognition/resolve/main/configuration_khmerocr.py
wget https://huggingface.co/Darayut/khmer-text-recognition/resolve/main/inference.py
```

### 2. Run via Command Line

```bash
python inference.py --image "path/to/image.png" --method beam --beam_width 3
```

### 3. Run via Python

```python
from inference import KhmerOCR

# Load the model (weights are downloaded automatically from the Hub)
ocr = KhmerOCR()

# Predict
text = ocr.predict("test_image.jpg", method="beam", beam_width=3)
print(text)
```
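The `method="beam"` option keeps the `beam_width` highest-scoring partial transcriptions at each decoding step instead of committing greedily to one. A toy illustration over a fixed table of per-step token probabilities (the real decoder conditions each step on the tokens generated so far):

```python
import math

def beam_search(step_probs, beam_width=3):
    """Pick the most probable token sequence from per-step distributions.
    step_probs[t] maps each candidate token to its probability at step t."""
    beams = [([], 0.0)]  # (partial sequence, cumulative log-probability)
    for probs in step_probs:
        candidates = [
            (seq + [tok], score + math.log(p))
            for seq, score in beams
            for tok, p in probs.items()
        ]
        # Keep only the beam_width best-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

steps = [{"ក": 0.6, "គ": 0.4}, {"ា": 0.7, "ិ": 0.3}]
print(beam_search(steps))  # ['ក', 'ា']
```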
---

## References

1. **An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale**
   *Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al.*
   ICLR 2021. [arXiv:2010.11929](https://arxiv.org/abs/2010.11929)

2. **TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models**
   *Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.*
   AAAI 2023. [arXiv:2109.10282](https://arxiv.org/abs/2109.10282)

3. **Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition**
   *R. Buoy, M. Iwamura, S. Srun, and K. Kise.*
   IEEE Access, vol. 11, pp. 128044-128060, 2023. [DOI: 10.1109/ACCESS.2023.3332361](https://doi.org/10.1109/ACCESS.2023.3332361)

4. **Stanford Background Dataset**
   *Balraj98.*
   Kaggle, 2018. [Data set]. https://www.kaggle.com/datasets/balraj98/stanford-background-dataset

5. **Khmer OCR Benchmark Dataset (KHOB)**
   *EKYC Solutions.*
   GitHub, 2022. [Data set]. https://github.com/EKYCSolutions/khmer-ocr-benchmark-dataset

6. **Khmer Text Recognition Dataset**
   *H. Em, D. Valy, B. Gosselin, and P. Kong.*
   Kaggle, 2024. [Data set]. https://www.kaggle.com/datasets/emhengly/khmer-text-recognition-dataset

7. **Squeeze-and-Excitation Networks**
   *Jie Hu, Li Shen, and Gang Sun.*
   CVPR 2018. [arXiv:1709.01507](https://arxiv.org/abs/1709.01507)

8. **Bidirectional Recurrent Neural Networks**
   *Mike Schuster and Kuldip K. Paliwal.*
   IEEE Transactions on Signal Processing, 1997. [DOI: 10.1109/78.650093](https://doi.org/10.1109/78.650093)