# Khmer OCR Recognition Model

🇰🇭 **High-accuracy OCR model for Khmer text recognition, built with the PaddleOCR framework**

## Model Overview

This CRNN-based OCR model is specifically trained for Khmer (Cambodian) text recognition, achieving **98.45% accuracy** on validation data. The model is optimized for recognizing short text segments (3-5 words) commonly found in documents, signs, and printed materials.

## 🏗️ Model Architecture

- **Framework**: PaddleOCR 2.7+
- **Algorithm**: CRNN (Convolutional Recurrent Neural Network)
- **Backbone**: ResNet34
- **Neck**: SequenceEncoder with RNN (hidden_size: 256)
- **Head**: CTCHead with CTC Loss
- **Input Shape**: `[3, 32, 320]` (channels, height, width)
- **Max Text Length**: 25 characters
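
To illustrate what the CTC head's output looks like before it becomes text, here is a minimal sketch of greedy CTC decoding in plain Python. It assumes the blank token sits at index 0 (PaddleOCR's usual convention for CTC label decoding) and uses a tiny hypothetical character list rather than the real dictionary. In practice PaddleOCR performs this step internally.

```python
def ctc_greedy_decode(indices, charset, blank=0):
    """Collapse repeated indices, then drop blanks (greedy CTC decoding)."""
    chars = []
    prev = None
    for idx in indices:
        # Emit a character only when the index changes and is not the blank.
        if idx != prev and idx != blank:
            chars.append(charset[idx])
        prev = idx
    return "".join(chars)


# Hypothetical 4-entry charset; index 0 is reserved for the CTC blank (assumption).
charset = ["<blank>", "ក", "ា", "រ"]
# Per-timestep argmax indices, as a recognizer might emit them.
timesteps = [0, 1, 1, 0, 2, 2, 2, 0, 3, 0]
decoded = ctc_greedy_decode(timesteps, charset)
print(decoded)
```

A blank between two identical indices is what allows a genuinely doubled character to survive the collapse step.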

## 📝 Supported Characters

The model recognizes **188 characters** including:

- **Khmer Consonants**: ក ខ គ ឃ ង ច ឆ ជ ឈ ញ ដ ឋ ឌ ឍ ណ ត ថ ទ ធ ន ប ផ ព ភ ម យ រ ល វ ស ហ ឡ អ
- **Khmer Vowels**: ា ិ ី ឹ ឺ ុ ូ ួ ើ ឿ ៀ េ ែ ៃ ោ ៅ ំ ះ ៈ
- **Khmer Numerals**: ០ ១ ២ ៣ ៤ ៥ ៦ ៧ ៨ ៩
- **Latin Characters**: A-Z, a-z, 0-9
- **Punctuation and Symbols**: . , ! ? - ( ) [ ] « » ™ ® etc.
- **Khmer Symbols**: ។ ៕ ៖ ៗ ៉ ៊ ់ ៌ ៍ ៏ ័ ្
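
Before running recognition or preparing labels, it can help to check whether a string contains characters outside the supported set. The sketch below assumes `khmer_char_dict.txt` follows PaddleOCR's usual dictionary layout of one character per line. `load_charset` and `unsupported_chars` are hypothetical helper names, and the demo writes a three-character stand-in file rather than the real dictionary.

```python
import os
import tempfile


def load_charset(path):
    """Read a dictionary file with one character per line (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\r\n") for line in f if line.rstrip("\r\n")}


def unsupported_chars(text, charset):
    """Return the distinct characters in text that the charset lacks, in order."""
    seen = []
    for ch in text:
        if ch not in charset and ch not in seen:
            seen.append(ch)
    return seen


# Demo with a tiny stand-in dictionary (not the real 188-character file).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_dict.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("ក\nា\nរ\n")
    demo_charset = load_charset(demo_path)

missing = unsupported_chars("ការងារ", demo_charset)
print(missing)  # characters the stand-in dictionary does not cover
```

With the real file, `load_charset('khmer_char_dict.txt')` would take the place of the temporary demo file.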

## 🚀 Quick Start

### Installation

```bash
pip install paddlepaddle paddleocr opencv-python
```

### Basic Usage

```python
from paddleocr import PaddleOCR

# Initialize OCR with custom Khmer model
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',  # Use Chinese as base language
    rec_model_dir='path/to/model',  # Directory containing inference files
    rec_char_dict_path='khmer_char_dict.txt',
    show_log=False
)

# Process image
result = ocr.ocr('khmer_text_image.jpg', cls=True)

# Extract results: one entry per input image (None when no text was found)
for res in result:
    if res is None:
        continue
    for box, (text, confidence) in res:
        # box holds the region's corner points; text and confidence come from the recognizer
        print(f'Text: {text}, Confidence: {confidence:.3f}')
```

### Command Line Usage

```bash
# Download the model files to a local directory, then run the recognition
# script from a clone of the PaddleOCR repository:

python tools/infer/predict_rec.py \
    --image_dir="your_khmer_image.png" \
    --rec_model_dir="path/to/model" \
    --rec_char_dict_path="khmer_char_dict.txt"
```

## 📁 Files Included

| File | Size | Description |
|------|------|-------------|
| `inference.pdiparams` | ~106MB | Main model weights |
| `inference.yml` | ~2KB | Model configuration |
| `inference.json` | ~1KB | Model metadata |
| `khmer_char_dict.txt` | ~2KB | Character dictionary (188 characters) |
| `training_config.yml` | ~2KB | Original training configuration |

## 🔧 Training Details

### Dataset Characteristics
- **Text Length**: 3-5 words per image (optimized for short segments)
- **Image Size**: 600×80 pixels (width × height) for training images; inputs are resized to 320×32 for inference
- **Font**: KhmerOS TTF
- **Background**: White background with black text
- **Augmentation**: Clean, blurred, noisy, and noise+blur variants
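
As a worked example of the size change, the sketch below computes the width a 600×80 image would have after being scaled to the model's 32-pixel input height. It assumes the aspect-preserving resize with right-padding that PaddleOCR's recognition preprocessing commonly applies. The function name `resized_width` is hypothetical, and the actual pipeline used for this model may differ.

```python
import math


def resized_width(src_w, src_h, target_h=32, max_w=320):
    """Width after scaling to target_h with the aspect ratio kept, capped at max_w."""
    ratio = src_w / src_h
    return min(max_w, math.ceil(target_h * ratio))


w = resized_width(600, 80)   # 600/80 = 7.5, and 32 * 7.5 = 240
pad = 320 - w                # columns left over for padding under this assumption
print(w, pad)
```

Under this assumption, images much wider than 10:1 would be squeezed horizontally to fit 320 pixels, which is one reason to keep inputs to short segments.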

### Training Configuration
- **Epochs**: 30 (best model at epoch 29)
- **Optimizer**: Adam with β₁=0.9, β₂=0.999
- **Learning Rate**: Cosine scheduling (initial: 0.001)
- **Batch Size**: 32
- **Loss Function**: CTC Loss
- **Regularization**: L2 (factor: 4e-05)
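
To give a feel for how the learning rate falls over the 30 epochs, here is a minimal sketch of cosine decay from the initial rate of 0.001. It assumes the rate is evaluated once per epoch with no warmup. PaddleOCR typically steps the schedule per iteration and may apply warmup, so the exact values used in training could differ.

```python
import math


def cosine_lr(epoch, total_epochs=30, base_lr=0.001):
    """Cosine-annealed learning rate at the start of a given epoch."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))


schedule = [cosine_lr(e) for e in range(30)]
for e in (0, 15, 29):
    print(f"epoch {e:2d}: lr = {schedule[e]:.6f}")
```

The rate starts at the initial value, reaches half of it at the midpoint, and approaches zero in the final epochs.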

## 💡 Usage Tips

### Best Practices
1. **Image Quality**: Use high-contrast images with clear text
2. **Text Length**: Optimal for 3-5 word segments (model's training focus)
3. **Resolution**: Keep text large enough to stay legible after the image is scaled to the model's 32-pixel input height
4. **Preprocessing**: For full documents, run text detection first and recognize each detected region separately

### For Long Text Documents
Because this model is optimized for short segments, process full documents in three steps:

1. **Use Text Detection**: Combine with PaddleOCR's detection model
2. **Segment Text**: Break long lines into 3-5 word chunks
3. **Post-process**: Combine results from multiple segments

```python
# Example for full document processing
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',
    det_model_dir='path/to/detection/model',  # Add detection model
    rec_model_dir='path/to/this/model',       # This Khmer recognition model
    rec_char_dict_path='khmer_char_dict.txt'
)

# This will detect text regions AND recognize them
result = ocr.ocr('full_document.jpg', cls=True)
```
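
For the post-processing step, one possible approach is to group detected regions into lines by vertical position and join the recognized text left to right. The sketch below is plain Python operating on the `[box, (text, confidence)]` entries shown in Basic Usage. `merge_lines`, its `y_tol`, `min_conf`, and `sep` parameters, and the sample data are all hypothetical.

```python
def merge_lines(page, y_tol=10, min_conf=0.5, sep=" "):
    """Group entries into lines by top-left y, then join each line left to right."""
    # Drop low-confidence regions, then order the rest top to bottom.
    kept = [e for e in page if e[1][1] >= min_conf]
    kept.sort(key=lambda e: e[0][0][1])
    lines = []
    for entry in kept:
        y = entry[0][0][1]
        # Same line if the top edge is within y_tol of the line's first box.
        if lines and abs(y - lines[-1][0]) <= y_tol:
            lines[-1][1].append(entry)
        else:
            lines.append((y, [entry]))
    return [
        sep.join(e[1][0] for e in sorted(group, key=lambda e: e[0][0][0]))
        for _, group in lines
    ]


# Hypothetical detections: two boxes on the first line, one on the second,
# plus a low-confidence box that is filtered out.
sample_page = [
    [[[210, 12], [400, 12], [400, 44], [210, 44]], ("ខ្មែរ", 0.97)],
    [[[10, 10], [200, 10], [200, 42], [10, 42]], ("ភាសា", 0.99)],
    [[[10, 60], [180, 60], [180, 92], [10, 92]], ("123", 0.95)],
    [[[300, 61], [330, 61], [330, 93], [300, 93]], ("?", 0.20)],
]
merged = merge_lines(sample_page)
print(merged)
```

With real output, `merge_lines(result[0])` would process the first image's detections. The tolerance and confidence threshold would need tuning for the document layout.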


## 🔄 Model Conversion

This model was exported from PaddlePaddle training format to inference format:

```bash
# Original export command used:
python tools/export_model.py \
    -c pretrainoutput/config.yml \
    -o Global.pretrained_model=pretrainoutput/best_accuracy.pdparams \
    Global.save_inference_dir=pretrainoutput/inference
```

## 🛠️ Requirements

```
paddlepaddle>=2.4.0
paddleocr>=2.7.0
opencv-python>=4.5.0
numpy>=1.19.0
pillow>=8.0.0
```

## 📚 Citation

```bibtex
@misc{khmer-ocr-2025,
  title={Khmer OCR Recognition Model},
  author={[Your Name]},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/[your-username]/khmer-ocr}}
}
```