# Khmer OCR Recognition Model

🇰🇭 **High-accuracy OCR model for Khmer text recognition using the PaddleOCR framework**

## Model Overview

This CRNN-based OCR model is specifically trained for Khmer (Cambodian) text recognition, achieving **98.45% accuracy** on validation data. The model is optimized for recognizing short text segments (3-5 words) commonly found in documents, signs, and printed materials.
## 🏗️ Model Architecture

- **Framework**: PaddleOCR 2.7+
- **Algorithm**: CRNN (Convolutional Recurrent Neural Network)
- **Backbone**: ResNet34
- **Neck**: SequenceEncoder with RNN (hidden_size: 256)
- **Head**: CTCHead with CTC loss
- **Input Shape**: `[3, 32, 320]` (channels, height, width)
- **Max Text Length**: 25 characters
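The CTCHead above emits a per-timestep score distribution over the character set plus a CTC blank label; decoding collapses repeated labels and drops blanks. A minimal greedy-decoding sketch (illustrative only: `ctc_greedy_decode` and the toy three-letter alphabet are assumptions, not part of PaddleOCR's API):

```python
import numpy as np

def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decode: argmax per timestep, collapse repeats, drop blanks.

    logits: (T, C) array of per-timestep class scores; index `blank` is the
    CTC blank label, indices 1..C-1 map into `charset`.
    """
    best = logits.argmax(axis=1)
    chars = []
    prev = blank
    for idx in best:
        if idx != blank and idx != prev:
            chars.append(charset[idx - 1])
        prev = idx
    return "".join(chars)

# Toy example with a 3-character alphabet (not the real 188-char Khmer dict)
charset = ["a", "b", "c"]
T, C = 6, 4  # 6 timesteps, 3 characters + blank
logits = np.zeros((T, C))
for t, lab in enumerate([1, 1, 0, 2, 2, 3]):  # "a", "a", blank, "b", "b", "c"
    logits[t, lab] = 1.0
print(ctc_greedy_decode(logits, charset))  # -> "abc"
```

The repeated `1` labels collapse to a single "a", which is why CTC models can emit the same character over several timesteps without duplicating it in the output.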
## 📝 Supported Characters

The model recognizes **188 characters**, including:

- **Khmer Consonants**: ក ខ គ ឃ ង ច ឆ ជ ឈ ញ ដ ឋ ឌ ឍ ណ ត ថ ទ ធ ន ប ផ ព ភ ម យ រ ល វ ស ហ ឡ អ
- **Khmer Vowels**: ា ិ ី ឹ ឺ ុ ូ ួ ើ ឿ ៀ េ ែ ៃ ោ ៅ ំ ះ ៈ
- **Khmer Numerals**: ០ ១ ២ ៣ ៤ ៥ ៦ ៧ ៨ ៩
- **Latin Characters**: A-Z, a-z, 0-9
- **Punctuation**: . , ! ? - ( ) [ ] « » • ® etc.
- **Khmer Symbols**: signs and punctuation such as ្ ់ ័ ៉ ៊ ។ ៕ ៖ ៗ (see `khmer_char_dict.txt` for the full set)
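PaddleOCR character dictionaries are plain text files with one character per line, where the line order defines the label index. A small sketch of loading such a dictionary (the five-character demo file is a stand-in, not the real 188-line `khmer_char_dict.txt`):

```python
import os
import tempfile

def load_char_dict(path):
    """Load a PaddleOCR-style dictionary: one character per line.

    Returns (chars, char_to_idx); line order defines the label index.
    """
    with open(path, encoding="utf-8") as f:
        chars = f.read().splitlines()
    char_to_idx = {c: i for i, c in enumerate(chars)}
    return chars, char_to_idx

# Demo with a tiny stand-in dictionary
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("ក\nខ\nគ\nA\n0\n")
    demo_path = f.name

chars, char_to_idx = load_char_dict(demo_path)
os.unlink(demo_path)
print(len(chars), char_to_idx["A"])  # -> 5 3
```

Because the index mapping comes entirely from line order, the dictionary file must never be reordered after training.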
## 🚀 Quick Start

### Installation

```bash
pip install paddlepaddle paddleocr opencv-python
```
### Basic Usage

```python
from paddleocr import PaddleOCR

# Initialize OCR with the custom Khmer recognition model
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',  # base language; the recognition model and dict are overridden below
    rec_model_dir='path/to/model',  # directory containing the inference files
    rec_char_dict_path='khmer_char_dict.txt',
    show_log=False
)

# Process an image
result = ocr.ocr('khmer_text_image.jpg', cls=True)

# Extract results
for res in result:
    if res is None:
        continue
    for line in res:
        text = line[1][0]        # recognized text
        confidence = line[1][1]  # confidence score
        print(f'Text: {text}, Confidence: {confidence:.3f}')
```
### Command Line Usage

```bash
# Download the model files to a directory, then use PaddleOCR's tools:
python tools/infer/predict_rec.py \
    --image_dir="your_khmer_image.png" \
    --rec_model_dir="path/to/model" \
    --rec_char_dict_path="khmer_char_dict.txt"
```
## 📁 Files Included

| File | Size | Description |
|------|------|-------------|
| `inference.pdiparams` | ~106MB | Main model weights |
| `inference.yml` | ~2KB | Model configuration |
| `inference.json` | ~1KB | Model metadata |
| `khmer_char_dict.txt` | ~2KB | Character dictionary (188 characters) |
| `training_config.yml` | ~2KB | Original training configuration |
## 🔧 Training Details

### Dataset Characteristics

- **Text Length**: 3-5 words per image (optimized for short segments)
- **Image Size**: 600×80 pixels (training), resized to 320×32 for inference
- **Font**: KhmerOS TTF
- **Background**: White background with black text
- **Augmentation**: Clean, blurred, noisy, and noise+blur variants
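The four augmentation variants above can be sketched with NumPy alone. The kernel size and noise level here are illustrative guesses, not the values used in training:

```python
import numpy as np

def box_blur(img, k=3):
    """Simple k×k box blur via edge-padded neighborhood averaging."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)

def add_noise(img, sigma=10, seed=0):
    """Additive Gaussian noise, clipped back to the valid pixel range."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float64) + rng.normal(0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Build the four variants from one synthetic grayscale "text" image
clean = np.full((80, 600), 255, dtype=np.uint8)  # white 600×80 canvas
clean[30:50, 100:500] = 0                        # black "text" band
variants = {
    "clean": clean,
    "blur": box_blur(clean),
    "noise": add_noise(clean),
    "noise_blur": box_blur(add_noise(clean)),
}
print({name: v.shape for name, v in variants.items()})
```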
### Training Configuration

- **Epochs**: 30 (best model at epoch 29)
- **Optimizer**: Adam with β₁=0.9, β₂=0.999
- **Learning Rate**: Cosine scheduling (initial: 0.001)
- **Batch Size**: 32
- **Loss Function**: CTC loss
- **Regularization**: L2 (factor: 4e-05)
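The cosine schedule above follows lr_t = 0.5 · lr₀ · (1 + cos(π·t/T)), starting at 0.001 and annealing to zero. A sketch, ignoring any warmup phase the training config may add:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.001):
    """Cosine-annealed learning rate: base_lr at step 0, 0 at total_steps."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

total = 30  # one value per epoch, matching the 30-epoch run
schedule = [cosine_lr(e, total) for e in range(total + 1)]
print(schedule[0], schedule[15])  # -> 0.001 0.0005
```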
## 💡 Usage Tips

### Best Practices

1. **Image Quality**: Use high-contrast images with clear text
2. **Text Length**: Optimal for 3-5 word segments (the model's training focus)
3. **Resolution**: Images should be reasonably sized (not too small)
4. **Preprocessing**: Consider using text detection for full documents
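The sizing advice follows from the model's fixed `[3, 32, 320]` input: recognition crops are scaled to height 32 with aspect ratio preserved, then padded out to width 320. A NumPy-only sketch of that geometry (nearest-neighbour resampling and zero padding for brevity; the real pipeline also normalizes pixel values):

```python
import numpy as np

def resize_norm_pad(img, target_h=32, target_w=320):
    """Resize a grayscale crop to the model's input geometry.

    Scales to height `target_h` preserving aspect ratio, then right-pads
    with zeros to width `target_w` (clamping very wide crops).
    """
    h, w = img.shape
    new_w = min(target_w, max(1, int(round(w * target_h / h))))
    ys = (np.arange(target_h) * h / target_h).astype(int)  # row lookup
    xs = (np.arange(new_w) * w / new_w).astype(int)        # column lookup
    resized = img[ys][:, xs]
    out = np.zeros((target_h, target_w), dtype=img.dtype)
    out[:, :new_w] = resized
    return out

# A 600×80 training-sized crop maps to a 240-wide strip inside 32×320
crop = np.full((80, 600), 200, dtype=np.uint8)
inp = resize_norm_pad(crop)
print(inp.shape)  # -> (32, 320)
```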
### For Long Text Documents

Since this model is optimized for short segments, for full documents:

1. **Use Text Detection**: Combine with PaddleOCR's detection model
2. **Segment Text**: Break long lines into 3-5 word chunks
3. **Post-process**: Combine results from multiple segments
```python
# Example: full-document processing (detection + recognition)
from paddleocr import PaddleOCR

ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',
    det_model_dir='path/to/detection/model',  # add a detection model
    rec_model_dir='path/to/this/model',       # this Khmer recognition model
    rec_char_dict_path='khmer_char_dict.txt'
)

# This will detect text regions AND recognize them
result = ocr.ocr('full_document.jpg', cls=True)
```
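Steps 2 and 3 above can be sketched in plain Python. One caveat: this split assumes space-delimited words, and Khmer script typically omits spaces between words, so real inputs may first need a dedicated Khmer word segmenter:

```python
def chunk_words(text, max_words=5):
    """Split a recognized line into chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

line = "one two three four five six seven eight"
chunks = chunk_words(line)           # step 2: segment into short pieces
print(chunks)                        # -> ['one two three four five', 'six seven eight']
print(" ".join(chunks) == line)      # step 3: recombination round-trips -> True
```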
## 🔄 Model Conversion

This model was exported from PaddlePaddle training format to inference format:

```bash
# Original export command used:
python tools/export_model.py \
    -c pretrainoutput/config.yml \
    -o Global.pretrained_model=pretrainoutput/best_accuracy.pdparams \
       Global.save_inference_dir=pretrainoutput/inference
```
## 🛠️ Requirements

```
paddlepaddle>=2.4.0
paddleocr>=2.7.0
opencv-python>=4.5.0
numpy>=1.19.0
pillow>=8.0.0
```
## 📖 Citation

```bibtex
@misc{khmer-ocr-2025,
  title={Khmer OCR Recognition Model},
  author={[Your Name]},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/[your-username]/khmer-ocr}}
}
```