# Khmer OCR Recognition Model

🇰🇭 **High-accuracy OCR model for Khmer text recognition using the PaddleOCR framework**

## Model Overview

This CRNN-based OCR model is specifically trained for Khmer (Cambodian) text recognition, achieving **98.45% accuracy** on validation data. The model is optimized for recognizing short text segments (3-5 words) commonly found in documents, signs, and printed materials.
## 🏗️ Model Architecture

- **Framework**: PaddleOCR 2.7+
- **Algorithm**: CRNN (Convolutional Recurrent Neural Network)
- **Backbone**: ResNet34
- **Neck**: SequenceEncoder with RNN (hidden_size: 256)
- **Head**: CTCHead with CTC loss
- **Input Shape**: `[3, 32, 320]` (channels, height, width)
- **Max Text Length**: 25 characters
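The CTCHead above emits a per-timestep score distribution over the character set plus a CTC blank label; decoding collapses repeated labels and drops blanks. A minimal greedy-decoding sketch (illustrative only: `ctc_greedy_decode` and the toy three-letter alphabet are assumptions, not part of PaddleOCR's API):

```python
import numpy as np

def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decode: argmax per timestep, collapse repeats, drop blanks.

    logits: (T, C) array of per-timestep class scores; index `blank` is the
    CTC blank label, indices 1..C-1 map into `charset`.
    """
    best = logits.argmax(axis=1)
    chars = []
    prev = blank
    for idx in best:
        if idx != blank and idx != prev:
            chars.append(charset[idx - 1])
        prev = idx
    return "".join(chars)

# Toy example with a 3-character alphabet (not the real 188-char Khmer dict)
charset = ["a", "b", "c"]
T, C = 6, 4  # 6 timesteps, 3 characters + blank
logits = np.zeros((T, C))
for t, lab in enumerate([1, 1, 0, 2, 2, 3]):  # "a", "a", blank, "b", "b", "c"
    logits[t, lab] = 1.0
print(ctc_greedy_decode(logits, charset))  # -> "abc"
```

The repeated `1` labels collapse to a single "a", which is why CTC models can emit the same character over several timesteps without duplicating it in the output.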
## 📝 Supported Characters

The model recognizes **188 characters**, including:

- **Khmer Consonants**: ក ខ គ ឃ ង ច ឆ ជ ឈ ញ ដ ឋ ឌ ឍ ណ ត ថ ទ ធ ន ប ផ ព ភ ម យ រ ល វ ស ហ ឡ អ
- **Khmer Vowels**: ា ិ ី ឹ ឺ ុ ូ ួ ើ ឿ ៀ េ ែ ៃ ោ ៅ ំ ះ ៈ
- **Khmer Numerals**: ០ ១ ២ ៣ ៤ ៥ ៦ ៧ ៨ ៩
- **Latin Characters**: A-Z, a-z, 0-9
- **Punctuation**: . , ! ? - ( ) [ ] « » • ® etc.
- **Khmer Symbols**: signs and punctuation such as ្ ់ ័ ៉ ៊ ។ ៕ ៖ ៗ (see `khmer_char_dict.txt` for the full set)
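PaddleOCR character dictionaries are plain text files with one character per line, where the line order defines the label index. A small sketch of loading such a dictionary (the five-character demo file is a stand-in, not the real 188-line `khmer_char_dict.txt`):

```python
import os
import tempfile

def load_char_dict(path):
    """Load a PaddleOCR-style dictionary: one character per line.

    Returns (chars, char_to_idx); line order defines the label index.
    """
    with open(path, encoding="utf-8") as f:
        chars = f.read().splitlines()
    char_to_idx = {c: i for i, c in enumerate(chars)}
    return chars, char_to_idx

# Demo with a tiny stand-in dictionary
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("ក\nខ\nគ\nA\n0\n")
    demo_path = f.name

chars, char_to_idx = load_char_dict(demo_path)
os.unlink(demo_path)
print(len(chars), char_to_idx["A"])  # -> 5 3
```

Because the index mapping comes entirely from line order, the dictionary file must never be reordered after training.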
## 🚀 Quick Start

### Installation

```bash
pip install paddlepaddle paddleocr opencv-python
```
### Basic Usage

```python
from paddleocr import PaddleOCR

# Initialize OCR with the custom Khmer recognition model
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',  # base language; the recognition model and dict are overridden below
    rec_model_dir='path/to/model',  # directory containing the inference files
    rec_char_dict_path='khmer_char_dict.txt',
    show_log=False
)

# Process an image
result = ocr.ocr('khmer_text_image.jpg', cls=True)

# Extract results
for res in result:
    if res is None:
        continue
    for line in res:
        text = line[1][0]        # recognized text
        confidence = line[1][1]  # confidence score
        print(f'Text: {text}, Confidence: {confidence:.3f}')
```
### Command Line Usage

```bash
# Download the model files to a directory, then use PaddleOCR's tools:
python tools/infer/predict_rec.py \
    --image_dir="your_khmer_image.png" \
    --rec_model_dir="path/to/model" \
    --rec_char_dict_path="khmer_char_dict.txt"
```
## 📁 Files Included

| File | Size | Description |
|------|------|-------------|
| `inference.pdiparams` | ~106MB | Main model weights |
| `inference.yml` | ~2KB | Model configuration |
| `inference.json` | ~1KB | Model metadata |
| `khmer_char_dict.txt` | ~2KB | Character dictionary (188 characters) |
| `training_config.yml` | ~2KB | Original training configuration |
## 🔧 Training Details

### Dataset Characteristics

- **Text Length**: 3-5 words per image (optimized for short segments)
- **Image Size**: 600×80 pixels (training), resized to 320×32 for inference
- **Font**: KhmerOS TTF
- **Background**: White background with black text
- **Augmentation**: Clean, blurred, noisy, and noise+blur variants
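The four augmentation variants above can be sketched with NumPy alone. The kernel size and noise level here are illustrative guesses, not the values used in training:

```python
import numpy as np

def box_blur(img, k=3):
    """Simple k×k box blur via edge-padded neighborhood averaging."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)

def add_noise(img, sigma=10, seed=0):
    """Additive Gaussian noise, clipped back to the valid pixel range."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float64) + rng.normal(0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Build the four variants from one synthetic grayscale "text" image
clean = np.full((80, 600), 255, dtype=np.uint8)  # white 600×80 canvas
clean[30:50, 100:500] = 0                        # black "text" band
variants = {
    "clean": clean,
    "blur": box_blur(clean),
    "noise": add_noise(clean),
    "noise_blur": box_blur(add_noise(clean)),
}
print({name: v.shape for name, v in variants.items()})
```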
### Training Configuration

- **Epochs**: 30 (best model at epoch 29)
- **Optimizer**: Adam with β₁=0.9, β₂=0.999
- **Learning Rate**: Cosine scheduling (initial: 0.001)
- **Batch Size**: 32
- **Loss Function**: CTC loss
- **Regularization**: L2 (factor: 4e-05)
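The cosine schedule above follows lr_t = 0.5 · lr₀ · (1 + cos(π·t/T)), starting at 0.001 and annealing to zero. A sketch, ignoring any warmup phase the training config may add:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.001):
    """Cosine-annealed learning rate: base_lr at step 0, 0 at total_steps."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

total = 30  # one value per epoch, matching the 30-epoch run
schedule = [cosine_lr(e, total) for e in range(total + 1)]
print(schedule[0], schedule[15])  # -> 0.001 0.0005
```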
## 💡 Usage Tips

### Best Practices

1. **Image Quality**: Use high-contrast images with clear text
2. **Text Length**: Optimal for 3-5 word segments (the model's training focus)
3. **Resolution**: Images should be reasonably sized (not too small)
4. **Preprocessing**: Consider using text detection for full documents
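The sizing advice follows from the model's fixed `[3, 32, 320]` input: recognition crops are scaled to height 32 with aspect ratio preserved, then padded out to width 320. A NumPy-only sketch of that geometry (nearest-neighbour resampling and zero padding for brevity; the real pipeline also normalizes pixel values):

```python
import numpy as np

def resize_norm_pad(img, target_h=32, target_w=320):
    """Resize a grayscale crop to the model's input geometry.

    Scales to height `target_h` preserving aspect ratio, then right-pads
    with zeros to width `target_w` (clamping very wide crops).
    """
    h, w = img.shape
    new_w = min(target_w, max(1, int(round(w * target_h / h))))
    ys = (np.arange(target_h) * h / target_h).astype(int)  # row lookup
    xs = (np.arange(new_w) * w / new_w).astype(int)        # column lookup
    resized = img[ys][:, xs]
    out = np.zeros((target_h, target_w), dtype=img.dtype)
    out[:, :new_w] = resized
    return out

# A 600×80 training-sized crop maps to a 240-wide strip inside 32×320
crop = np.full((80, 600), 200, dtype=np.uint8)
inp = resize_norm_pad(crop)
print(inp.shape)  # -> (32, 320)
```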
### For Long Text Documents

Since this model is optimized for short segments, for full documents:

1. **Use Text Detection**: Combine with PaddleOCR's detection model
2. **Segment Text**: Break long lines into 3-5 word chunks
3. **Post-process**: Combine results from multiple segments
```python
# Example: full-document processing (detection + recognition)
from paddleocr import PaddleOCR

ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',
    det_model_dir='path/to/detection/model',  # add a detection model
    rec_model_dir='path/to/this/model',       # this Khmer recognition model
    rec_char_dict_path='khmer_char_dict.txt'
)

# This will detect text regions AND recognize them
result = ocr.ocr('full_document.jpg', cls=True)
```
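Steps 2 and 3 above can be sketched in plain Python. One caveat: this split assumes space-delimited words, and Khmer script typically omits spaces between words, so real inputs may first need a dedicated Khmer word segmenter:

```python
def chunk_words(text, max_words=5):
    """Split a recognized line into chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

line = "one two three four five six seven eight"
chunks = chunk_words(line)           # step 2: segment into short pieces
print(chunks)                        # -> ['one two three four five', 'six seven eight']
print(" ".join(chunks) == line)      # step 3: recombination round-trips -> True
```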
## 🔄 Model Conversion

This model was exported from PaddlePaddle training format to inference format:

```bash
# Original export command used:
python tools/export_model.py \
    -c pretrainoutput/config.yml \
    -o Global.pretrained_model=pretrainoutput/best_accuracy.pdparams \
       Global.save_inference_dir=pretrainoutput/inference
```
## 🛠️ Requirements

```
paddlepaddle>=2.4.0
paddleocr>=2.7.0
opencv-python>=4.5.0
numpy>=1.19.0
pillow>=8.0.0
```
## 📖 Citation

```bibtex
@misc{khmer-ocr-2025,
  title={Khmer OCR Recognition Model},
  author={[Your Name]},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/[your-username]/khmer-ocr}}
}
```