Khmer OCR Recognition Model

🇰🇭 High-accuracy OCR model for Khmer text recognition using the PaddleOCR framework

Model Overview

This CRNN-based OCR model is specifically trained for Khmer (Cambodian) text recognition, achieving 98.45% accuracy on validation data. The model is optimized for recognizing short text segments (3-5 words) commonly found in documents, signs, and printed materials.

🏗️ Model Architecture

  • Framework: PaddleOCR 2.7+
  • Algorithm: CRNN (Convolutional Recurrent Neural Network)
  • Backbone: ResNet34
  • Neck: SequenceEncoder with RNN (hidden_size: 256)
  • Head: CTCHead with CTC Loss
  • Input Shape: [3, 32, 320] (channels, height, width)
  • Max Text Length: 25 characters
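
The CTCHead emits one class prediction per time step, and the final string is usually obtained by taking the best class at each step, merging consecutive repeats, and dropping the blank symbol. The following is a minimal sketch of that greedy decoding, not this model's exact post-processing. It assumes the blank is index 0 (the usual PaddleOCR convention) and uses a toy three-character alphabet in place of khmer_char_dict.txt.

```python
def ctc_greedy_decode(step_ids, charset, blank=0):
    """Collapse consecutive repeats, then drop blanks (greedy CTC decoding)."""
    chars = []
    prev = None
    for idx in step_ids:
        if idx != prev and idx != blank:
            # Index 0 is assumed to be the blank, so class i maps to charset[i - 1]
            chars.append(charset[idx - 1])
        prev = idx
    return "".join(chars)


# Toy alphabet and per-step argmax indices (hypothetical values for illustration)
toy_charset = ["ក", "ខ", "គ"]
step_ids = [0, 1, 1, 0, 1, 2, 2, 0, 3]
decoded = ctc_greedy_decode(step_ids, toy_charset)
print(decoded)
```

The blank between the two runs of index 1 is what lets the decoder keep a genuinely doubled character instead of merging it.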

📝 Supported Characters

The model recognizes 188 characters including:

  • Khmer Consonants: ក ខ គ ឃ ង ច ឆ ជ ឈ ញ ដ ឋ ឌ ឍ ណ ត ថ ទ ធ ន ប ផ ព ភ ម យ រ ល វ ស ហ ឡ អ
  • Khmer Vowels: ា ិ ី ឹ ឺ ុ ូ ួ ើ ឿ ៀ េ ែ ៃ ោ ៅ ំ ះ ៈ
  • Khmer Numerals: ០ ១ ២ ៣ ៤ ៥ ៦ ៧ ៨ ៩
  • Latin Characters: A-Z, a-z, 0-9
  • Punctuation: . , ! ? - ( ) [ ] « » ™ ® etc.
  • Khmer Symbols: ។ ៕ ៖ ៗ ៉ ៊ ់ ៌ ៍ ៏ ័ ្
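
Because the model can only output characters present in its dictionary, it can help to check expected text against that dictionary before evaluating results. The sketch below assumes khmer_char_dict.txt follows the usual PaddleOCR layout of one character per line. The helper names are hypothetical, and whitespace is skipped because space handling is normally configured separately.

```python
def load_charset(path):
    """Read a dictionary file with one character per line."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.rstrip("\n")}


def unsupported_chars(text, charset):
    """Return the non-whitespace characters in `text` missing from `charset`."""
    return sorted({ch for ch in text if ch not in charset and not ch.isspace()})


# Small in-memory stand-in for the real dictionary
demo_charset = {"ក", "ា", "រ", "A", "1"}
missing = unsupported_chars("ការ A1 ខ", demo_charset)
print(missing)
```

With the real file, the call would look like `unsupported_chars(some_text, load_charset('khmer_char_dict.txt'))`.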

🚀 Quick Start

Installation

pip install paddlepaddle paddleocr opencv-python

Basic Usage

from paddleocr import PaddleOCR
import cv2

# Initialize OCR with custom Khmer model
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',  # Base language for the default detection/angle models; recognition is overridden below
    rec_model_dir='path/to/model',  # Directory containing inference files
    rec_char_dict_path='khmer_char_dict.txt',
    show_log=False
)

# Process image
result = ocr.ocr('khmer_text_image.jpg', cls=True)

# Extract results (one entry per input image)
for res in result:
    if res is None:  # No text found in this image
        continue
    for line in res:
        text = line[1][0]  # Recognized text
        confidence = line[1][1]  # Confidence score
        print(f'Text: {text}, Confidence: {confidence:.3f}')
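
In practice it is common to discard low-confidence lines before using the output. The helper below is a hedged sketch that works on the nested result structure shown above. The function name and the 0.8 threshold are illustrative assumptions, and the mock data stands in for a real `ocr.ocr(...)` return value.

```python
def confident_texts(result, min_confidence=0.8):
    """Keep (text, confidence) pairs at or above `min_confidence`."""
    kept = []
    for res in result:
        if res is None:  # No text found in this image
            continue
        for _box, (text, confidence) in res:
            if confidence >= min_confidence:
                kept.append((text, confidence))
    return kept


# Hand-written stand-in for the value returned by ocr.ocr(...)
mock_result = [
    [
        [[[0, 0], [50, 0], [50, 20], [0, 20]], ("សួស្តី", 0.97)],
        [[[0, 30], [50, 30], [50, 50], [0, 50]], ("???", 0.41)],
    ],
    None,
]
kept = confident_texts(mock_result)
print(kept)
```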

Command Line Usage

# Download the model files to a local directory, then run the
# recognition script from a clone of the PaddleOCR repository:

python tools/infer/predict_rec.py \
    --image_dir="your_khmer_image.png" \
    --rec_model_dir="path/to/model" \
    --rec_char_dict_path="khmer_char_dict.txt"

📁 Files Included

File                  Size     Description
inference.pdiparams   ~106MB   Main model weights
inference.yml         ~2KB     Model configuration
inference.json        ~1KB     Model metadata
khmer_char_dict.txt   ~2KB     Character dictionary (188 characters)
training_config.yml   ~2KB     Original training configuration
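
A quick check that a download is complete can save a confusing load error later. The sketch below looks only for the files listed above that inference would plausibly need. Treating training_config.yml as optional is an assumption, and the helper name is hypothetical.

```python
import tempfile
from pathlib import Path

# Files from the table above; training_config.yml is assumed optional for inference
EXPECTED_FILES = [
    "inference.pdiparams",
    "inference.yml",
    "inference.json",
    "khmer_char_dict.txt",
]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


# Demonstration on a temporary directory that contains only one of the files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "inference.yml").touch()
    missing_files = missing_model_files(tmp)
print(missing_files)
```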

🔧 Training Details

Dataset Characteristics

  • Text Length: 3-5 words per image (optimized for short segments)
  • Image Size: 600×80 pixels (training), resized to 320×32 for inference
  • Font: KhmerOS TTF
  • Background: White background with black text
  • Augmentation: Clean, blurred, noisy, and noise+blur variants
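
The four augmentation variants can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the original data pipeline: the noise level, the 3×3 box blur, and the order of operations in the noise+blur variant are all guesses, and a black bar on a white canvas stands in for rendered KhmerOS text.

```python
import numpy as np


def add_gaussian_noise(img, sigma=15.0, rng=None):
    """Add zero-mean Gaussian noise to a grayscale uint8 image."""
    if rng is None:
        rng = np.random.default_rng(0)  # Fixed seed keeps the example repeatable
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def box_blur(img, k=3):
    """Mean filter (odd k) applied along rows, then along columns."""
    pad = k // 2
    padded = np.pad(img.astype(np.float32), pad, mode="edge")
    kernel = np.ones(k) / k
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def make_variants(img):
    """Return the four variants named in the list above."""
    noisy = add_gaussian_noise(img)
    return {
        "clean": img,
        "blurred": box_blur(img),
        "noisy": noisy,
        "noise+blur": box_blur(noisy),
    }


# 600x80 white canvas with a black bar standing in for rendered text
canvas = np.full((80, 600), 255, dtype=np.uint8)
canvas[30:50, 100:500] = 0
variants = make_variants(canvas)
print({name: v.shape for name, v in variants.items()})
```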

Training Configuration

  • Epochs: 30 (best model at epoch 29)
  • Optimizer: Adam with β₁=0.9, β₂=0.999
  • Learning Rate: Cosine scheduling (initial: 0.001)
  • Batch Size: 32
  • Loss Function: CTC Loss
  • Regularization: L2 (factor: 4e-05)
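
For intuition about the cosine schedule, the learning rate can be computed per epoch with the standard cosine-annealing formula. This sketch assumes a minimum learning rate of 0 and no warmup. PaddleOCR typically steps its schedule per iteration and may add warmup, so the real curve can differ in detail.

```python
import math


def cosine_lr(epoch, total_epochs=30, initial_lr=0.001, min_lr=0.0):
    """Cosine-annealed learning rate at a (possibly fractional) epoch."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Start, midpoint, the best-model epoch, and the end of training
schedule = [cosine_lr(e) for e in (0, 15, 29, 30)]
print(schedule)
```

Under these assumptions the rate at epoch 29 is already close to zero, which fits the best checkpoint appearing near the end of training.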

💡 Usage Tips

Best Practices

  1. Image Quality: Use high-contrast images with clear text
  2. Text Length: Optimal for 3-5 word segments (model's training focus)
  3. Resolution: Avoid very small crops; input is resized to a height of 32 pixels, so text only a few pixels tall will be upscaled and blurry
  4. Preprocessing: Consider using text detection for full documents
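
To see why short segments suit the 320×32 input, it helps to compute how a crop would map onto it. The sketch below assumes the aspect-preserving resize with right padding that PaddleOCR recognition pipelines commonly use. Whether this model's preprocessing matches exactly is an assumption, and the function name is hypothetical.

```python
import math


def rec_input_geometry(width, height, target_h=32, max_w=320):
    """Return (resized_width, right_padding) for an aspect-preserving resize."""
    ratio = width / height
    resized_w = min(max_w, max(1, math.ceil(target_h * ratio)))
    return resized_w, max_w - resized_w


# A 600x80 training-sized crop fits with room to spare
geometry_training = rec_input_geometry(600, 80)
# A much longer line hits the width cap and would be squeezed horizontally
geometry_wide = rec_input_geometry(1600, 80)
print(geometry_training, geometry_wide)
```

Under this assumption, any crop wider than ten times its height is compressed horizontally, which is one reason to split long lines before recognition.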

For Long Text Documents

Because this model is optimized for short segments, process full documents as follows:

  1. Use Text Detection: Combine with PaddleOCR's detection model
  2. Segment Text: Break long lines into 3-5 word chunks
  3. Post-process: Combine results from multiple segments

# Example for full document processing
from paddleocr import PaddleOCR

ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',
    det_model_dir='path/to/detection/model',  # Add detection model
    rec_model_dir='path/to/this/model',       # This Khmer recognition model
    rec_char_dict_path='khmer_char_dict.txt'
)

# This will detect text regions AND recognize them
result = ocr.ocr('full_document.jpg', cls=True)
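
The post-processing step of combining segments can be sketched as a simple reading-order join over the detected boxes. This is one possible approach, not part of PaddleOCR: the helper name, the 10-pixel row tolerance, and the space separator are assumptions. Khmer does not put spaces between words, so an empty separator may suit some layouts better.

```python
def join_in_reading_order(lines, row_tolerance=10, sep=" "):
    """Group boxes into rows by top edge, then read each row left to right."""
    items = sorted(
        (
            (min(p[1] for p in box), min(p[0] for p in box), text)
            for box, (text, _conf) in lines
        ),
        key=lambda item: item[0],
    )
    rows, current, row_y = [], [], None
    for y, x, text in items:
        # Start a new row when the top edge moves beyond the tolerance
        if row_y is not None and abs(y - row_y) > row_tolerance:
            rows.append(current)
            current = []
        if not current:
            row_y = y
        current.append((x, text))
    if current:
        rows.append(current)
    return "\n".join(sep.join(t for _x, t in sorted(row)) for row in rows)


# Hand-written stand-in for result[0] from the call above
mock_lines = [
    [[[120, 12], [200, 12], [200, 40], [120, 40]], ("B", 0.9)],
    [[[10, 10], [100, 10], [100, 40], [10, 40]], ("A", 0.9)],
    [[[10, 60], [100, 60], [100, 90], [10, 90]], ("C", 0.9)],
]
joined = join_in_reading_order(mock_lines)
print(joined)
```

With real output, the call would be `join_in_reading_order(result[0])` after checking that `result[0]` is not None.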

🔄 Model Conversion

This model was exported from PaddlePaddle training format to inference format:

# Original export command used:
python tools/export_model.py \
    -c pretrainoutput/config.yml \
    -o Global.pretrained_model=pretrainoutput/best_accuracy.pdparams \
    Global.save_inference_dir=pretrainoutput/inference

🛠️ Requirements

paddlepaddle>=2.4.0
opencv-python>=4.5.0
numpy>=1.19.0
pillow>=8.0.0

Citation

@misc{khmer-ocr-2025,
  title={Khmer OCR Recognition Model},
  author={[Your Name]},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/[your-username]/khmer-ocr}}
}