# Khmer OCR Recognition Model 🇰🇭

**High-accuracy OCR model for Khmer text recognition using the PaddleOCR framework**

## Model Overview

This CRNN-based OCR model is specifically trained for Khmer (Cambodian) text recognition, achieving **98.45% accuracy** on validation data. The model is optimized for recognizing short text segments (3-5 words) commonly found in documents, signs, and printed materials.

## 🏗️ Model Architecture

- **Framework**: PaddleOCR 2.7+
- **Algorithm**: CRNN (Convolutional Recurrent Neural Network)
- **Backbone**: ResNet34
- **Neck**: SequenceEncoder with RNN (hidden_size: 256)
- **Head**: CTCHead with CTC Loss
- **Input Shape**: `[3, 32, 320]` (channels, height, width)
- **Max Text Length**: 25 characters

## 📝 Supported Characters

The model recognizes **188 characters** including:

- **Khmer Consonants**: ក ខ គ ឃ ង ច ឆ ជ ឈ ញ ដ ឋ ឌ ឍ ណ ត ថ ទ ធ ន ប ផ ព ភ ម យ រ ល វ ស ហ ឡ អ
- **Khmer Vowels**: ា ិ ី ឹ ឺ ុ ូ ួ ើ ឿ ៀ េ ែ ៃ ោ ៅ ំ ះ ៈ
- **Khmer Numerals**: ០ ១ ២ ៣ ៤ ៥ ៦ ៧ ៨ ៩
- **Latin Characters**: A-Z, a-z, 0-9
- **Punctuation**: . , ! ? - ( ) [ ] « » ™ ® etc.
- **Khmer Symbols**: ។ ៕ ៖ ៗ ៉ ៊ ់ ៌ ៍ ៏ ័ ្

## 🚀 Quick Start

### Installation

```bash
pip install paddlepaddle paddleocr opencv-python
```

### Basic Usage

```python
from paddleocr import PaddleOCR

# Initialize OCR with the custom Khmer recognition model
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',  # Use Chinese as base language
    rec_model_dir='path/to/model',  # Directory containing inference files
    rec_char_dict_path='khmer_char_dict.txt',
    show_log=False
)

# Process image
result = ocr.ocr('khmer_text_image.jpg', cls=True)

# Extract results: one entry per input image, None when no text was found
for res in result:
    if res is None:
        continue
    for line in res:
        text = line[1][0]  # Recognized text
        confidence = line[1][1]  # Confidence score
        print(f'Text: {text}, Confidence: {confidence:.3f}')
```

### Command Line Usage

```bash
# Download the model files to a directory,
# then run the recognition tool from a clone of the PaddleOCR repository:
python tools/infer/predict_rec.py \
    --image_dir="your_khmer_image.png" \
    --rec_model_dir="path/to/model" \
    --rec_char_dict_path="khmer_char_dict.txt"
```

## 📁 Files Included

| File | Size | Description |
|------|------|-------------|
| `inference.pdiparams` | ~106MB | Main model weights |
| `inference.yml` | ~2KB | Model configuration |
| `inference.json` | ~1KB | Model metadata |
| `khmer_char_dict.txt` | ~2KB | Character dictionary (188 characters) |
| `training_config.yml` | ~2KB | Original training configuration |

## 🔧 Training Details

### Dataset Characteristics

- **Text Length**: 3-5 words per image (optimized for short segments)
- **Image Size**: 600×80 pixels (training), resized to 320×32 for inference
- **Font**: KhmerOS TTF
- **Background**: White background with black text
- **Augmentation**: Clean, blurred, noisy, and noise+blur variants

### Training Configuration

- **Epochs**: 30 (best model at epoch 29)
- **Optimizer**: Adam with β₁=0.9, β₂=0.999
- **Learning Rate**: Cosine scheduling (initial: 0.001)
- **Batch Size**: 32
- **Loss Function**: CTC Loss
- **Regularization**: L2 (factor: 4e-05)

## 💡 Usage Tips

### Best Practices

1. **Image Quality**: Use high-contrast images with clear text; the training data is black text on a white background.
2. **Text Length**: Works best on 3-5 word segments, which is what the model was trained on.
3. **Resolution**: Avoid very small crops. Input is resized to a height of 32 pixels, so text lines much shorter than that carry little detail for the model to use.
4. **Preprocessing**: For full documents, run text detection first and pass the detected regions to this model.

### For Long Text Documents

Since this model is optimized for short segments, for full documents:

1. **Use Text Detection**: Combine with PaddleOCR's detection model
2. **Segment Text**: Break long lines into 3-5 word chunks
3. **Post-process**: Combine results from multiple segments

```python
from paddleocr import PaddleOCR

# Example for full document processing
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',
    det_model_dir='path/to/detection/model',  # Add detection model
    rec_model_dir='path/to/this/model',  # This Khmer recognition model
    rec_char_dict_path='khmer_char_dict.txt'
)

# This will detect text regions AND recognize them
result = ocr.ocr('full_document.jpg', cls=True)
```

## 🔄 Model Conversion

This model was exported from PaddlePaddle training format to inference format:

```bash
# Original export command used:
python tools/export_model.py \
    -c pretrainoutput/config.yml \
    -o Global.pretrained_model=pretrainoutput/best_accuracy.pdparams \
    Global.save_inference_dir=pretrainoutput/inference
```

## 🛠️ Requirements

```
paddlepaddle>=2.4.0
opencv-python>=4.5.0
numpy>=1.19.0
pillow>=8.0.0
```

```bibtex
@misc{khmer-ocr-2025,
  title={Khmer OCR Recognition Model},
  author={[Your Name]},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/[your-username]/khmer-ocr}}
}
```
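
## Appendix: Combining Segment Results

The "Post-process" step under "For Long Text Documents" is not spelled out in this card, so here is one possible sketch rather than the method used by the author. It assumes each page result has the `[box, (text, confidence)]` shape read in the Basic Usage loop, with `box` as four `[x, y]` corner points. The function name `merge_segments` and its parameters `min_confidence`, `line_tolerance`, and `sep` are hypothetical and are not part of PaddleOCR.

```python
def merge_segments(page_result, min_confidence=0.5, line_tolerance=10.0, sep=" "):
    """Join recognized segments of one page into reading-order text lines.

    page_result: list of [box, (text, confidence)] entries (assumed shape),
                 or None when nothing was detected.
    min_confidence: hypothetical cutoff; segments below it are dropped.
    line_tolerance: segments whose top edges differ by at most this many
                    pixels from the first segment of a line share that line.
    sep: string placed between segments on the same line.
    """
    if not page_result:
        return []

    # Reduce each detection to (top, left, text) using its box corners.
    items = []
    for box, (text, confidence) in page_result:
        if confidence < min_confidence:
            continue
        top = min(point[1] for point in box)
        left = min(point[0] for point in box)
        items.append((top, left, text))

    # Visit segments top to bottom, then left to right.
    items.sort(key=lambda item: (item[0], item[1]))

    # Group segments with similar top edges into the same text line.
    lines = []
    for item in items:
        if lines and abs(item[0] - lines[-1][0][0]) <= line_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])

    # Within each line, order segments left to right and join their text.
    return [
        sep.join(text for _, _, text in sorted(line, key=lambda seg: seg[1]))
        for line in lines
    ]


if __name__ == "__main__":
    # Made-up detections for illustration only (not real model output).
    sample = [
        [[[200, 12], [300, 12], [300, 40], [200, 40]], ("B", 0.97)],
        [[[10, 10], [190, 10], [190, 40], [10, 40]], ("A", 0.99)],
        [[[10, 60], [150, 60], [150, 90], [10, 90]], ("C", 0.95)],
        [[[160, 61], [220, 61], [220, 90], [160, 90]], ("D", 0.20)],
    ]
    print(merge_segments(sample))  # ['A B', 'C']
```

With real output this would be called as `merge_segments(result[0])` for the first image. Khmer does not separate words with spaces, so passing `sep=""` may suit segments that were cut from one continuous phrase. The default values for `line_tolerance` and `min_confidence` are placeholders to tune per document.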