Gujarati OCR Model (SVTR-LCNet)
This model performs Optical Character Recognition (OCR) for Gujarati text using the SVTR (Scene Text Recognition with a Single Visual Model) architecture with a MultiHead (CTC + SAR) recognition head.
Model Description
- Architecture: SVTR-LCNet with MultiHead (CTC + SAR heads)
- Framework: PaddleOCR
- Language: Gujarati (gu)
- Input Size: [3, 48, 384] (C, H, W)
- Output: Gujarati text sequence
Training Details
Training Data
- Dataset: Gujarati text images
- Characters: 1030 unique Gujarati characters, including:
  - Consonants (ક, ખ, ગ, ઘ, etc.)
  - Vowels (અ, આ, ઇ, ઈ, etc.)
  - Matras (diacritical marks)
  - Gujarati numerals (૦–૯)
  - Special characters
Training Configuration
- Epochs: 120
- Best Epoch: 120
- Training Accuracy: 88.8%
- Norm Edit Distance: 0.977
- Optimizer: Adam with learning rate scheduling
- Image Shape: [3, 48, 384]
Training Results
| Metric | Value |
|---|---|
| Final Accuracy | 0.888 |
| Norm Edit Distance | 0.977 |
| Best Epoch | 120 |
| FPS (eval) | 1248.98 |
Usage
Prerequisites
```bash
pip install paddlepaddle-gpu paddleocr opencv-python numpy
```
Basic Usage
```python
import math

import cv2
import numpy as np
import paddle.inference as paddle_infer

# Load model files (download from this repo)
config = paddle_infer.Config("inference.json", "inference.pdiparams")
config.enable_use_gpu(100, 0)  # 100 MB initial GPU memory pool, device 0
predictor = paddle_infer.create_predictor(config)

# Load character dictionary
with open("gu_dict.txt", "r", encoding="utf-8") as f:
    chars = [line.rstrip() for line in f.readlines() if line.strip()]
char_list = ["<blank>"] + chars + [" "]

# Preprocessing function (matches PaddleOCR training)
def preprocess(img):
    imgC, imgH, imgW = 3, 48, 384
    h, w = img.shape[:2]
    ratio = w / float(h)
    # Resize to height 48, preserving aspect ratio up to the max width
    if math.ceil(imgH * ratio) > imgW:
        resized_w = imgW
    else:
        resized_w = int(math.ceil(imgH * ratio))
    resized_image = cv2.resize(img, (resized_w, imgH))
    resized_image = resized_image.astype("float32")
    # Normalize: (pixel / 255 - 0.5) / 0.5
    resized_image = resized_image.transpose((2, 0, 1)) / 255.0
    resized_image -= 0.5
    resized_image /= 0.5
    # Pad to fixed width
    padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
    padding_im[:, :, 0:resized_w] = resized_image
    return np.expand_dims(padding_im, axis=0)

# CTC greedy decoding: drop blanks (index 0) and merge repeated indices
def ctc_decode(indices, char_list):
    return "".join(
        char_list[idx]
        for i, idx in enumerate(indices)
        if idx != 0 and (i == 0 or idx != indices[i - 1]) and idx < len(char_list)
    )

# Run inference
img = cv2.imread("your_gujarati_text.jpg")
input_tensor = preprocess(img)
input_handle = predictor.get_input_handle("x")
input_handle.copy_from_cpu(input_tensor)
predictor.run()
logits = predictor.get_output_handle("fetch_name_0").copy_to_cpu()
indices = np.argmax(logits, axis=2)[0]

# Decode text
gujarati_text = ctc_decode(indices, char_list)
print(gujarati_text)
```
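To see how the CTC decoding rule behaves, here is a small self-contained check; the index sequence and three-character toy dictionary are made up for illustration, not taken from the real `gu_dict.txt`:

```python
# Minimal CTC greedy-decode demo with a toy dictionary.
# Index 0 is the blank; consecutive repeats of a non-blank index are merged,
# while a blank between two equal indices keeps them as separate characters.
def ctc_decode(indices, char_list):
    return "".join(
        char_list[idx]
        for i, idx in enumerate(indices)
        if idx != 0 and (i == 0 or idx != indices[i - 1]) and idx < len(char_list)
    )

toy_chars = ["<blank>", "ક", "ા", "ર"]
# Raw per-timestep argmax output: blanks and repeats interleaved
raw = [0, 1, 1, 0, 2, 2, 3, 0, 3]
print(ctc_decode(raw, toy_chars))  # કારર (the blank at i=7 separates the two ર)
```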
With Text Detection (Full OCR Pipeline)
```python
import cv2
from paddlex import create_model

# Load detection model
det_model = create_model("PP-OCRv5_server_det")

# Detect text regions
img = cv2.imread("document.jpg")
det_result = list(det_model.predict(img))[0]
boxes = det_result.get("dt_polys", [])

# Process each detected region
for box in boxes:
    # Crop text region
    # ... (crop and rotate box)
    # Recognize text
    input_tensor = preprocess(cropped_region)
    # ... (run inference as shown above)
```
Model Performance
Strengths
- High accuracy (88.8%) on single-word Gujarati text
- Fast inference speed (1248 FPS on GPU)
- Supports all Gujarati characters and diacritics
- Works well with clean, printed text
Limitations
- Best for single-word text: trained on individual words, not full sentences
- Printed text only: may not work well on handwritten text
- Domain-sensitive: performance degrades on image styles that differ significantly from the training data
- Clean images preferred: works best on high-contrast, clear text images
Use Cases
✅ Good for:
- Digitizing printed Gujarati documents
- Single-word Gujarati text recognition
- Clean text images with good contrast
- Gujarati text in books, signs, labels
❌ Not recommended for:
- Handwritten Gujarati text
- Low-quality or blurry images
- Complex document layouts without proper text detection
- Significantly different fonts/styles from training data
Example Results
Input: Clean Gujarati word image
Output: ફાઇબર (fiber)
Accuracy: ✅ Perfect match
Technical Details
Architecture
- Backbone: LCNet (Lightweight CNN)
- Neck: SVTR blocks (transformer-based sequence modeling)
- Head: MultiHead (CTC + SAR)
- CTC: Connectionist Temporal Classification
- SAR: Show, Attend and Read
Input/Output
- Input Shape: (batch_size, 3, 48, 384)
- Input Range: [-1.0, 1.0] (normalized)
- Output Shape: (batch_size, 48, 1032)
- Output: Logits for 1032 classes (blank + 1030 chars + space)
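As a quick sanity check on the stated input range, the normalization used in `preprocess`, (pixel / 255 − 0.5) / 0.5, maps pixel value 0 to −1.0, 127.5 to 0.0, and 255 to 1.0:

```python
import numpy as np

# Apply the model-card normalization to the extremes and midpoint of [0, 255]
pixels = np.array([0.0, 127.5, 255.0], dtype=np.float32)
normalized = (pixels / 255.0 - 0.5) / 0.5
print(normalized)  # [-1.  0.  1.]
```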
Citation
If you use this model, please cite PaddleOCR:
```bibtex
@misc{paddleocr,
  title={PaddleOCR: Awesome multilingual OCR toolkits},
  author={PaddlePaddle Authors},
  howpublished={\url{https://github.com/PaddlePaddle/PaddleOCR}},
  year={2020}
}
```
License
Apache License 2.0
Contact & Support
For issues or questions:
- Open an issue on the PaddleOCR GitHub
- Check PaddleOCR Documentation
Acknowledgments
This model was trained using the PaddleOCR framework. Special thanks to the PaddlePaddle team for their excellent OCR toolkit.