# Khmer OCR Recognition Model
🇰🇭 **High-accuracy OCR model for Khmer text recognition using PaddleOCR framework**
## Model Overview
This CRNN-based OCR model is specifically trained for Khmer (Cambodian) text recognition, achieving **98.45% accuracy** on validation data. The model is optimized for recognizing short text segments (3-5 words) commonly found in documents, signs, and printed materials.
## 🏗️ Model Architecture
- **Framework**: PaddleOCR 2.7+
- **Algorithm**: CRNN (Convolutional Recurrent Neural Network)
- **Backbone**: ResNet34
- **Neck**: SequenceEncoder with RNN (hidden_size: 256)
- **Head**: CTCHead with CTC Loss
- **Input Shape**: `[3, 32, 320]` (channels, height, width)
- **Max Text Length**: 25 characters
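
To make the `[3, 32, 320]` input shape concrete, here is a minimal sketch of the kind of preprocessing a CRNN recognizer typically expects: scale the crop to height 32 while keeping its aspect ratio, normalize pixels to `[-1, 1]`, and zero-pad on the right to width 320. The function name `resize_norm_pad` is hypothetical, and the nearest-neighbour resize in plain NumPy is a stand-in for an image library call such as `cv2.resize`; the PaddleOCR pipeline performs its own preprocessing, so this is only illustrative.

```python
import numpy as np

def resize_norm_pad(img, shape=(3, 32, 320)):
    """Resize an HxWx3 uint8 image to the target height (aspect ratio kept),
    scale pixels to [-1, 1], and right-pad with zeros to the target width."""
    c, target_h, target_w = shape
    h, w = img.shape[:2]
    # Width after scaling to the target height, capped at the target width.
    new_w = min(target_w, max(1, int(np.ceil(target_h * w / h))))
    # Nearest-neighbour resize via index arrays (illustrative stand-in only).
    rows = np.arange(target_h) * h // target_h
    cols = np.arange(new_w) * w // new_w
    resized = img[rows][:, cols]
    # HWC uint8 -> CHW float32 in [-1, 1].
    chw = resized.astype(np.float32).transpose(2, 0, 1) / 255.0
    chw = (chw - 0.5) / 0.5
    # Zero-pad on the right so every sample has the same width.
    padded = np.zeros((c, target_h, target_w), dtype=np.float32)
    padded[:, :, :new_w] = chw
    return padded
```

With a 600×80 training-style image, the scaled width is 240, so the remaining 80 columns are padding.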
## 📝 Supported Characters
The model recognizes **188 characters** including:
- **Khmer Consonants**: ក ខ គ ឃ ង ច ឆ ជ ឈ ញ ដ ឋ ឌ ឍ ណ ត ថ ទ ធ ន ប ផ ព ភ ម យ រ ល វ ស ហ ឡ អ
- **Khmer Vowels**: ា ិ ី ឹ ឺ ុ ូ ួ ើ ឿ ៀ េ ែ ៃ ោ ៅ ំ ះ ៈ
- **Khmer Numerals**: ០ ១ ២ ៣ ៤ ៥ ៦ ៧ ៨ ៩
- **Latin Characters**: A-Z, a-z, 0-9
- **Punctuation**: . , ! ? - ( ) [ ] « » ™ ® etc.
- **Khmer Symbols**: ។ ៕ ៖ ៗ ៉ ៊ ់ ៌ ៍ ៏ ័ ្
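
The sketch below illustrates how a character dictionary and a CTC head fit together: the model emits one class index per time step, and greedy decoding collapses consecutive repeats and drops the blank class. It assumes the blank sits at index 0 and that a space character may be appended after the dictionary entries, which mirrors common CTC label decoding but is not confirmed by this model card. Both helper names are hypothetical.

```python
def build_charset(dict_lines, use_space_char=True):
    """Build the index -> character table from dictionary lines.
    Assumption: index 0 is the CTC blank, dictionary entries follow in order."""
    chars = [line.rstrip("\n") for line in dict_lines if line.rstrip("\n")]
    if use_space_char:
        chars.append(" ")
    return ["<blank>"] + chars

def ctc_greedy_decode(indices, charset, blank=0):
    """Collapse consecutive repeated indices, then drop blanks."""
    out = []
    prev = None
    for idx in indices:
        if idx != prev and idx != blank:
            out.append(charset[idx])
        prev = idx
    return "".join(out)

# Toy three-character dictionary; the real file is khmer_char_dict.txt.
charset = build_charset(["ក\n", "ខ\n", "គ\n"])
decoded = ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 0, 3], charset)
```

The blank between the two `1` indices is what lets the decoder emit the same character twice in a row.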
## 🚀 Quick Start
### Installation
```bash
pip install paddlepaddle paddleocr opencv-python
```
### Basic Usage
```python
from paddleocr import PaddleOCR

# Initialize OCR with custom Khmer model
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',  # Use Chinese as base language
    rec_model_dir='path/to/model',  # Directory containing inference files
    rec_char_dict_path='khmer_char_dict.txt',
    show_log=False
)

# Process image
result = ocr.ocr('khmer_text_image.jpg', cls=True)

# Extract results (one entry per input image; None if nothing was detected)
for res in result:
    if res is None:
        continue
    for line in res:
        text = line[1][0]  # Recognized text
        confidence = line[1][1]  # Confidence score
        print(f'Text: {text}, Confidence: {confidence:.3f}')
```
### Command Line Usage
```bash
# Download model files to a directory
# Then use PaddleOCR tools:
python tools/infer/predict_rec.py \
    --image_dir="your_khmer_image.png" \
    --rec_model_dir="path/to/model" \
    --rec_char_dict_path="khmer_char_dict.txt"
```
## 📁 Files Included
| File | Size | Description |
|------|------|-------------|
| `inference.pdiparams` | ~106MB | Main model weights |
| `inference.yml` | ~2KB | Model configuration |
| `inference.json` | ~1KB | Model metadata |
| `khmer_char_dict.txt` | ~2KB | Character dictionary (188 characters) |
| `training_config.yml` | ~2KB | Original training configuration |
## 🔧 Training Details
### Dataset Characteristics
- **Text Length**: 3-5 words per image (optimized for short segments)
- **Image Size**: 600×80 pixels (training), resized to 320×32 for inference
- **Font**: KhmerOS TTF
- **Background**: White background with black text
- **Augmentation**: Clean, blurred, noisy, and noise+blur variants
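
As a rough illustration of the four augmentation variants listed above, the following sketch derives blurred, noisy, and noise+blur copies from a clean grayscale image using only NumPy. The box blur, the Gaussian noise, and every parameter value (`k=3`, `sigma=10.0`) are assumptions chosen for the example; the model card does not state which filters or strengths were actually used.

```python
import numpy as np

def box_blur(img, k=3):
    """Mean filter with a k x k window on a 2-D float array (edges replicated)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def add_noise(img, sigma=10.0, rng=None):
    """Add zero-mean Gaussian noise and clip back to the 0..255 range."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = img + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.float32)

def make_variants(img, rng=None):
    """Return the clean image plus three degraded copies, keyed by variant name."""
    img = img.astype(np.float32)
    noisy = add_noise(img, rng=rng)
    return {
        "clean": img,
        "blurred": box_blur(img),
        "noisy": noisy,
        "noise+blur": box_blur(noisy),
    }
```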
### Training Configuration
- **Epochs**: 30 (best model at epoch 29)
- **Optimizer**: Adam with β₁=0.9, β₂=0.999
- **Learning Rate**: Cosine scheduling (initial: 0.001)
- **Batch Size**: 32
- **Loss Function**: CTC Loss
- **Regularization**: L2 (factor: 4e-05)
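
For intuition about the cosine schedule, this sketch computes the learning rate at a given step using the standard cosine annealing formula, starting from the stated initial value of 0.001. The steps-per-epoch figure is hypothetical (it depends on the dataset size, which is not given here), and the sketch assumes no warmup phase and a floor of zero, neither of which is confirmed by the training details above.

```python
import math

def cosine_lr(step, total_steps, base_lr=0.001, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

EPOCHS = 30
STEPS_PER_EPOCH = 1000  # hypothetical; depends on dataset size at batch size 32
TOTAL_STEPS = EPOCHS * STEPS_PER_EPOCH

lr_start = cosine_lr(0, TOTAL_STEPS)
lr_middle = cosine_lr(TOTAL_STEPS // 2, TOTAL_STEPS)
lr_end = cosine_lr(TOTAL_STEPS, TOTAL_STEPS)
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.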
## 💡 Usage Tips
### Best Practices
1. **Image Quality**: Use high-contrast images with clear text
2. **Text Length**: Optimal for 3-5 word segments (model's training focus)
3. **Resolution**: Inputs are resized to 320×32 for inference, so avoid crops where the text is much smaller than about 32 pixels tall
4. **Preprocessing**: Consider using text detection for full documents
### For Long Text Documents
Because this model is optimized for short segments, process full documents as follows:
1. **Use Text Detection**: Combine with PaddleOCR's detection model
2. **Segment Text**: Break long lines into 3-5 word chunks
3. **Post-process**: Combine results from multiple segments
```python
# Example for full document processing
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',
    det_model_dir='path/to/detection/model',  # Add detection model
    rec_model_dir='path/to/this/model',  # This Khmer recognition model
    rec_char_dict_path='khmer_char_dict.txt'
)
# This will detect text regions AND recognize them
result = ocr.ocr('full_document.jpg', cls=True)
```
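
The post-processing step can be sketched as follows: drop low-confidence segments, group the remaining boxes into lines by their top-left y coordinate, and join each line left to right. The helper `join_recognized_text` is hypothetical, as are the threshold, tolerance, and separator defaults. It assumes each entry has the `[box, (text, confidence)]` layout used in the Basic Usage loop, with `box` given as four `[x, y]` corner points starting at the top-left.

```python
def join_recognized_text(page, min_confidence=0.5, line_tolerance=10, sep=" "):
    """Combine per-segment results for one image into reading-order text.

    page: list of [box, (text, confidence)] entries, or None if nothing was found.
    """
    if not page:
        return ""
    # Keep (top-left y, top-left x, text) for segments above the threshold.
    items = [
        (box[0][1], box[0][0], text)
        for box, (text, conf) in page
        if conf >= min_confidence
    ]
    items.sort()
    # Group segments whose y is close to the first segment of the current line.
    lines = []
    for y, x, text in items:
        if lines and abs(y - lines[-1]["y"]) <= line_tolerance:
            lines[-1]["items"].append((x, text))
        else:
            lines.append({"y": y, "items": [(x, text)]})
    # Within a line, order segments left to right.
    return "\n".join(
        sep.join(text for _, text in sorted(line["items"])) for line in lines
    )
```

Khmer is usually written without spaces between words, so passing `sep=""` may be the better choice when segments were split inside a sentence.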
## 🔄 Model Conversion
This model was exported from PaddlePaddle training format to inference format:
```bash
# Original export command used:
python tools/export_model.py \
    -c pretrainoutput/config.yml \
    -o Global.pretrained_model=pretrainoutput/best_accuracy.pdparams \
    Global.save_inference_dir=pretrainoutput/inference
```
## 🛠️ Requirements
```
paddlepaddle>=2.4.0
opencv-python>=4.5.0
numpy>=1.19.0
pillow>=8.0.0
```
## Citation
```bibtex
@misc{khmer-ocr-2025,
  title={Khmer OCR Recognition Model},
  author={[Your Name]},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/[your-username]/khmer-ocr}}
}
```