A Squeeze-and-Excitation Transformer Network for Khmer Optical Character Recognition
Character Error Rate (CER %) on KHOB, Legal Documents, and Printed Word Benchmark
Introduction
This repository contains the implementation, datasets, and evaluation results for the Squeeze-and-Excitation Transformer Network, a high-performance Khmer text recognition model. Its hybrid architecture combines Squeeze-and-Excitation blocks for feature extraction with BiLSTM-based context smoothing, and is designed specifically to handle the complexity and length of Khmer script.
Overview
Khmer script presents unique challenges for OCR due to its large character set, complex sub-consonant stacking, and variable text line lengths. This project employs an enhanced pipeline that:
- Chunks long text lines into manageable overlapping segments.
- Extracts features using a Squeeze-and-Excitation network (SE-VGG) that preserves horizontal spatial information.
- Encodes local spatial features with a Transformer Encoder.
- Merges the encoded chunks into a unified sequence.
- Smooths context with a BiLSTM layer to resolve boundary discontinuities between chunks.
- Decodes the final sequence with a Transformer Decoder.
Datasets
The model was trained entirely on synthetic data and evaluated on both real-world and synthetic datasets.
Training Data (Synthetic)
We generated 200,000 synthetic images to ensure robustness against font variations and background noise.
| Dataset Type | Count | Generator / Source | Augmentations |
|---|---|---|---|
| Document Text | 100,000 | Pillow + Khmer Corpus | Erosion, noise, thinning/thickening, perspective distortion. |
| Scene Text | 100,000 | SynthTIGER + Stanford BG | Rotation, blur, noise, realistic backgrounds. |
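As a rough illustration of the document-text generator, the sketch below renders a text line with Pillow and applies a random subset of the augmentations listed above. This is not the project's generator: the default bitmap font stands in for real Khmer fonts, and the noise and erosion parameters are assumptions.

```python
import random
from PIL import Image, ImageDraw, ImageFilter

def render_line(text, height=48, pad=8):
    """Render a text line on a white background.
    (Default font for the sketch; a real pipeline would load
    Khmer fonts via ImageFont.truetype.)"""
    probe = Image.new("L", (1, 1), 255)
    w = int(ImageDraw.Draw(probe).textlength(text)) + 2 * pad
    img = Image.new("L", (max(w, 1), height), 255)
    ImageDraw.Draw(img).text((pad, pad), text, fill=0)
    return img

def augment(img):
    """Apply a random subset of the listed augmentations."""
    if random.random() < 0.5:   # thinning / thickening via morphology
        img = img.filter(random.choice([ImageFilter.MinFilter(3),
                                        ImageFilter.MaxFilter(3)]))
    if random.random() < 0.5:   # mild blur as degradation
        img = img.filter(ImageFilter.GaussianBlur(0.8))
    px = img.load()             # salt-and-pepper noise
    for _ in range(img.width * img.height // 50):
        x, y = random.randrange(img.width), random.randrange(img.height)
        px[x, y] = random.choice([0, 255])
    return img

sample = augment(render_line("khmer text line"))
```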
Evaluation Data (Real-World + Synthetic)
| Dataset | Type | Size | Description |
|---|---|---|---|
| KHOB | Real | 325 | Standard benchmark, clean backgrounds but compression artifacts. |
| Legal Documents | Real | 227 | High variation in degradation, illumination, and distortion. |
| Printed Words | Synthetic | 1,000 | Short, isolated words in 10 different fonts. |
Methodology & Architecture
1. Preprocessing: Chunking & Merging
To handle variable-length text lines without aggressive resizing, we employ a "Chunk-and-Merge" strategy:
- Resize: Input images are resized to a fixed height of 48 pixels while maintaining aspect ratio.
- Chunking: The image is split into overlapping chunks (Size: 48x100 px, Overlap: 16 px).
- Independent Encoding: Each chunk is processed independently by the CNN and Transformer Encoder to allow for parallel batch processing.
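The chunking step above can be sketched as follows. The 48 px height, 100 px chunk width, and 16 px overlap come from this document; the white padding of the final chunk and the stacked return layout are assumptions:

```python
import numpy as np

def chunk_image(img, chunk_w=100, overlap=16, height=48):
    """Split a (48, W) grayscale line image into overlapping 48x100 chunks.
    The last chunk is right-padded with white so all chunks share one width."""
    assert img.shape[0] == height
    stride = chunk_w - overlap
    chunks, start = [], 0
    while True:
        piece = img[:, start:start + chunk_w]
        if piece.shape[1] < chunk_w:          # pad the final chunk
            pad = np.full((height, chunk_w - piece.shape[1]), 255, img.dtype)
            piece = np.concatenate([piece, pad], axis=1)
        chunks.append(piece)
        if start + chunk_w >= img.shape[1]:
            break
        start += stride
    return np.stack(chunks)                   # (num_chunks, 48, 100)

line = np.full((48, 250), 255, np.uint8)      # a blank 250-px-wide line
batch = chunk_image(line)                     # 3 chunks, each overlapping 16 px
```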
2. Model Architecture: Squeeze-and-Excitation Transformer Network
Our proposed architecture integrates sequence-aware attention and recurrent smoothing to overcome the limitations of standard chunk-based OCR. The model consists of six key modules:
Squeeze-and-Excitation Network (SE-VGG):
- Extracts visual features with a VGG-style convolutional backbone whose Squeeze-and-Excitation blocks recalibrate channel responses while preserving horizontal spatial information.
Patch Module:
- Projects spatial features into a condensed 384-dimensional embedding space.
- Adds local positional encodings to preserve spatial order within chunks.
Transformer Encoder:
- Captures contextual relationships among visual tokens within each independent chunk.
Merging Module:
- Concatenates the encoded features from all chunks into a single unified sequence.
- Adds Global Positional Embeddings to define the absolute position of tokens across the entire text line.
BiLSTM Context Smoother:
- Processes the merged sequence bidirectionally to smooth transitions and resolve boundary discontinuities between chunks.
Transformer Decoder:
- Generates the final Khmer character sequence using the globally smoothed context.
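A minimal PyTorch sketch of how these modules fit together is given below. Only the 48x100 chunk geometry and the 384-dimensional embedding come from this document; the layer counts, channel widths, and head counts are placeholders, and the Transformer Decoder is omitted for brevity:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention as in Hu et al. (2018): squeeze, then excite."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                nn.Linear(ch // r, ch), nn.Sigmoid())
    def forward(self, x):                         # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                    # squeeze: global pooling
        return x * self.fc(s)[:, :, None, None]  # excite: channel reweighting

class ChunkOCR(nn.Module):
    """Sketch of the encode-merge-smooth pipeline (sizes are assumptions)."""
    def __init__(self, d=384):
        super().__init__()
        self.cnn = nn.Sequential(                 # SE-VGG stand-in
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), SEBlock(64),
            nn.MaxPool2d((2, 1)))                 # pool height only
        self.proj = nn.Linear(64 * 24, d)         # patch module -> 384-dim
        enc = nn.TransformerEncoderLayer(d, 8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, 2)
        self.bilstm = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)

    def forward(self, chunks):                    # (B, N, 1, 48, 100)
        B, N = chunks.shape[:2]
        f = self.cnn(chunks.flatten(0, 1))        # (B*N, 64, 24, 100)
        f = f.permute(0, 3, 1, 2).flatten(2)      # one token per column
        z = self.encoder(self.proj(f))            # per-chunk encoding
        z = z.reshape(B, N * z.shape[1], -1)      # merge chunks into one line
        return self.bilstm(z)[0]                  # globally smoothed context

model = ChunkOCR()
out = model(torch.zeros(2, 3, 1, 48, 100))       # 2 lines of 3 chunks each
```

The smoothed output would then feed the Transformer Decoder, which cross-attends over it to emit the character sequence.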
Training Configuration
- Epochs: 100
- Optimizer: Adam
- Loss Function: Cross-Entropy Loss
- Learning Rate Schedule: Staged Cyclic
  - Epochs 0-15: fixed 1e-4 (rapid convergence)
  - Epochs 16-30: cyclic 1e-4 to 1e-5 (stability)
  - Epochs 31-100: cyclic 1e-5 to 1e-6 (fine-tuning)
- Sampling: 50,000 images randomly sampled/augmented per epoch.
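The staged cyclic schedule can be written as a small function. The stage boundaries and rates come from the table above; the cycle length and the cosine shape of the oscillation are assumptions:

```python
import math

def staged_cyclic_lr(epoch, cycle=5):
    """Staged cyclic learning rate (cycle length is an assumption)."""
    if epoch <= 15:
        return 1e-4                               # stage 1: fixed
    hi, lo = (1e-4, 1e-5) if epoch <= 30 else (1e-5, 1e-6)
    t = (epoch % cycle) / cycle                   # position within the cycle
    # cosine oscillation from hi down to lo and back within each cycle
    return lo + 0.5 * (hi - lo) * (1 + math.cos(2 * math.pi * t))
```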
Quantitative Analysis
We benchmarked our proposed model against VGG-Transformer, ResNet-Transformer, and Tesseract-OCR.
Character Error Rate (CER %) - Lower is better
TABLE 1: Character Error Rate (CER in %) results on the KHOB, Legal Documents, and Printed Word
| Model | KHOB | Legal Documents | Printed Word |
|---|---|---|---|
| Tesseract-OCR | 6.24 | 24.30 | 8.02 |
| VGG-Transformer | 2.27 | 10.27 | 3.61 |
| ResNet-Transformer | 2.98 | 11.57 | 2.80 |
| Proposed Model | 1.87 | 9.13 | 2.46 |
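CER, as reported above, is the character-level Levenshtein (edit) distance between prediction and ground truth, divided by the reference length. A minimal sketch:

```python
def cer(pred, target):
    """Character Error Rate (%): edit distance over reference length."""
    m, n = len(pred), len(target)
    dp = list(range(n + 1))                   # dp[j] = D(i, j), rolled by row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # delete from prediction
                        dp[j - 1] + 1,        # insert into prediction
                        prev + (pred[i - 1] != target[j - 1]))  # substitute
            prev = cur
    return 100.0 * dp[n] / max(n, 1)
```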
Qualitative Analysis
TABLE 2: Failure cases on KHOB, Legal Documents, and Printed Word

TABLE 3: Predictions from our proposed model and all baselines compared with the ground truth. Errors in the predictions are highlighted in red.

Key Findings:
- The Proposed Model achieves the highest accuracy on long, continuous text lines (KHOB), demonstrating that the BiLSTM Context Smoother effectively resolves the chunk boundary discontinuities that limit standard Transformer baselines.
- On degraded and complex legal documents, the proposed model demonstrates superior robustness, significantly outperforming all baselines. We attribute this to the Squeeze-and-Excitation blocks, which filter background noise while preserving character-specific features.
- Even on short, isolated words, where global context is less critical, the proposed model retains a slight advantage over both the ResNet-Transformer and VGG-Transformer baselines.
Inference Usage
You can load this model directly from Hugging Face using the `transformers` library. Because this is a custom architecture, you must set `trust_remote_code=True`.
1. Setup
```shell
pip install torch torchvision transformers pillow huggingface_hub

# Download the inference script and its configuration
wget https://huggingface.co/Darayut/khmer-text-recognition/resolve/main/configuration_khmerocr.py
wget https://huggingface.co/Darayut/khmer-text-recognition/resolve/main/inference.py
```
2. Run via Command Line
```shell
python inference.py --image "path/to/image.png" --method beam --beam_width 3
```
3. Run via Python
```python
from inference import KhmerOCR

# Load the model (weights download automatically)
ocr = KhmerOCR()

# Predict
text = ocr.predict("test_image.jpg", method="beam", beam_width=3)
print(text)
```
References
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021. arXiv:2010.11929
- Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models." AAAI 2023. arXiv:2109.10282
- R. Buoy, M. Iwamura, S. Srun, and K. Kise. "Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition." IEEE Access, vol. 11, pp. 128044-128060, 2023. DOI: 10.1109/ACCESS.2023.3332361
- Balraj98. (2018). Stanford background dataset [Data set]. Kaggle. https://www.kaggle.com/datasets/balraj98/stanford-background-dataset
- EKYC Solutions. (2022). Khmer OCR benchmark dataset (KHOB) [Data set]. GitHub. https://github.com/EKYCSolutions/khmer-ocr-benchmark-dataset
- Em, H., Valy, D., Gosselin, B., & Kong, P. (2024). Khmer text recognition dataset [Data set]. Kaggle. https://www.kaggle.com/datasets/emhengly/khmer-text-recognition-dataset
- Jie Hu, Li Shen, and Gang Sun. "Squeeze-and-Excitation Networks." CVPR 2018. arXiv:1709.01507
- Mike Schuster and Kuldip K. Paliwal. "Bidirectional Recurrent Neural Networks." IEEE Transactions on Signal Processing, 1997. DOI: 10.1109/78.650093