Gujarati OCR Model (SVTR-LCNet)
This model performs Optical Character Recognition (OCR) for Gujarati text using the SVTR (Scene Text Recognition with a Single Visual Model) architecture with a MultiHead (CTC + SAR) recognition head.
Model Description
- Architecture: SVTR-LCNet with MultiHead (CTC + SAR heads)
- Framework: PaddleOCR
- Language: Gujarati (gu)
- Input Size: [3, 48, 384] (C, H, W)
- Output: Gujarati text sequence
Training Details
Training Data
- Dataset: Gujarati text images
- Characters: 1030 unique Gujarati characters, including:
  - Consonants (ક, ખ, ગ, ઘ, etc.)
  - Vowels (અ, આ, ઇ, ઈ, etc.)
  - Matras (diacritical marks)
  - Gujarati numerals (૦–૯)
  - Special characters
Training Configuration
- Epochs: 120
- Best Epoch: 120
- Training Accuracy: 88.8%
- Norm Edit Distance: 0.977
- Optimizer: Adam with learning rate scheduling
- Image Shape: [3, 48, 384]
Training Results
| Metric | Value |
|---|---|
| Final Accuracy | 0.888 |
| Norm Edit Distance | 0.977 |
| Best Epoch | 120 |
| FPS (eval) | 1248.98 |
Usage
Prerequisites
```bash
pip install paddlepaddle-gpu paddleocr opencv-python numpy
```
Basic Usage
```python
import math

import cv2
import numpy as np
import paddle.inference as paddle_infer

# Load model files (download from this repo)
config = paddle_infer.Config("inference.json", "inference.pdiparams")
config.enable_use_gpu(100, 0)  # 100 MB initial GPU memory pool, device 0
predictor = paddle_infer.create_predictor(config)

# Load character dictionary
with open("gu_dict.txt", "r", encoding="utf-8") as f:
    chars = [line.rstrip() for line in f.readlines() if line.strip()]
char_list = ["<blank>"] + chars + [" "]

# Preprocessing function (matches PaddleOCR training)
def preprocess(img):
    imgC, imgH, imgW = 3, 48, 384
    h, w = img.shape[:2]
    ratio = w / float(h)
    # Resize to height 48, preserving aspect ratio up to the max width
    if math.ceil(imgH * ratio) > imgW:
        resized_w = imgW
    else:
        resized_w = int(math.ceil(imgH * ratio))
    resized_image = cv2.resize(img, (resized_w, imgH))
    resized_image = resized_image.astype("float32")
    # Normalize: (pixel / 255 - 0.5) / 0.5
    resized_image = resized_image.transpose((2, 0, 1)) / 255.0
    resized_image -= 0.5
    resized_image /= 0.5
    # Pad to fixed width
    padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
    padding_im[:, :, 0:resized_w] = resized_image
    return np.expand_dims(padding_im, axis=0)

# CTC greedy decoding: drop blanks (index 0) and merge repeated indices
def ctc_decode(indices, char_list):
    return "".join(
        char_list[idx]
        for i, idx in enumerate(indices)
        if idx != 0 and (i == 0 or idx != indices[i - 1]) and idx < len(char_list)
    )

# Run inference
img = cv2.imread("your_gujarati_text.jpg")
input_tensor = preprocess(img)
input_handle = predictor.get_input_handle("x")
input_handle.copy_from_cpu(input_tensor)
predictor.run()
logits = predictor.get_output_handle("fetch_name_0").copy_to_cpu()
indices = np.argmax(logits, axis=2)[0]

# Decode text
gujarati_text = ctc_decode(indices, char_list)
print(gujarati_text)
```
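To see how the CTC decoding rule behaves, here is a small self-contained check; the index sequence and three-character toy dictionary are made up for illustration, not taken from the real `gu_dict.txt`:

```python
# Minimal CTC greedy-decode demo with a toy dictionary.
# Index 0 is the blank; consecutive repeats of a non-blank index are merged,
# while a blank between two equal indices keeps them as separate characters.
def ctc_decode(indices, char_list):
    return "".join(
        char_list[idx]
        for i, idx in enumerate(indices)
        if idx != 0 and (i == 0 or idx != indices[i - 1]) and idx < len(char_list)
    )

toy_chars = ["<blank>", "ક", "ા", "ર"]
# Raw per-timestep argmax output: blanks and repeats interleaved
raw = [0, 1, 1, 0, 2, 2, 3, 0, 3]
print(ctc_decode(raw, toy_chars))  # કારર (the blank at i=7 separates the two ર)
```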
With Text Detection (Full OCR Pipeline)
```python
import cv2
from paddlex import create_model

# Load detection model
det_model = create_model("PP-OCRv5_server_det")

# Detect text regions
img = cv2.imread("document.jpg")
det_result = list(det_model.predict(img))[0]
boxes = det_result.get("dt_polys", [])

# Process each detected region
for box in boxes:
    # Crop text region
    # ... (crop and rotate box)
    # Recognize text
    input_tensor = preprocess(cropped_region)
    # ... (run inference as shown above)
```
Model Performance
Strengths
- High accuracy (88.8%) on single-word Gujarati text
- Fast inference speed (1248 FPS on GPU)
- Supports all Gujarati characters and diacritics
- Works well with clean, printed text
Limitations
- Best for single-word text: trained on individual words, not full sentences
- Printed text only: may not work well on handwritten text
- Domain-sensitive: performance degrades on image styles that differ significantly from the training data
- Clean images preferred: works best on high-contrast, clear text images
Use Cases
✅ Good for:
- Digitizing printed Gujarati documents
- Single-word Gujarati text recognition
- Clean text images with good contrast
- Gujarati text in books, signs, labels
❌ Not recommended for:
- Handwritten Gujarati text
- Low-quality or blurry images
- Complex document layouts without proper text detection
- Significantly different fonts/styles from training data
Example Results
Input: Clean Gujarati word image
Output: ફાઇબર (fiber)
Accuracy: ✅ Perfect match
Technical Details
Architecture
- Backbone: LCNet (Lightweight CNN)
- Neck: SVTR blocks (transformer-based sequence modeling)
- Head: MultiHead (CTC + SAR)
- CTC: Connectionist Temporal Classification
- SAR: Show, Attend and Read
Input/Output
- Input Shape: (batch_size, 3, 48, 384)
- Input Range: [-1.0, 1.0] (normalized)
- Output Shape: (batch_size, 48, 1032)
- Output: Logits for 1032 classes (blank + 1030 chars + space)
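As a quick sanity check on the stated input range, the normalization used in `preprocess`, (pixel / 255 − 0.5) / 0.5, maps pixel value 0 to −1.0, 127.5 to 0.0, and 255 to 1.0:

```python
import numpy as np

# Apply the model-card normalization to the extremes and midpoint of [0, 255]
pixels = np.array([0.0, 127.5, 255.0], dtype=np.float32)
normalized = (pixels / 255.0 - 0.5) / 0.5
print(normalized)  # [-1.  0.  1.]
```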
Citation
If you use this model, please cite PaddleOCR:
```bibtex
@misc{paddleocr,
  title={PaddleOCR: Awesome multilingual OCR toolkits},
  author={PaddlePaddle Authors},
  howpublished={\url{https://github.com/PaddlePaddle/PaddleOCR}},
  year={2020}
}
```
License
Apache License 2.0
Contact & Support
For issues or questions:
- Open an issue on the PaddleOCR GitHub
- Check PaddleOCR Documentation
Acknowledgments
This model was trained using the PaddleOCR framework. Special thanks to the PaddlePaddle team for their excellent OCR toolkit.