ResNet18-CRNN Handwritten Text Recognition

This repository hosts a lightweight Convolutional Recurrent Neural Network (CRNN) with a ResNet18 backbone for handwritten text recognition. The model takes a line-level handwriting image as input and maps it directly to a text string using Connectionist Temporal Classification (CTC) decoding.
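To illustrate the CTC decoding step, here is a minimal sketch of greedy (best-path) decoding. It assumes the blank token has index 0 and that per-frame class indices have already been obtained by taking the argmax of the network output. The function `ctc_greedy_decode` and the toy alphabet are hypothetical and not taken from this repository, which may use a different decoder.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank token


def ctc_greedy_decode(frame_ids, alphabet, blank=BLANK):
    """Collapse repeated frame labels, then drop blanks (best-path decoding)."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(alphabet[i] for i in collapsed if i != blank)


# Toy alphabet: index 0 is reserved for the blank, so it maps to no character.
alphabet = {1: "c", 2: "a", 3: "t", 4: "o"}

# Per-frame argmax indices for a 10-frame sequence.
frames = [0, 1, 1, 0, 2, 2, 2, 0, 3, 3]
decoded = ctc_greedy_decode(frames, alphabet)
print(decoded)
```

A blank between two identical labels keeps them separate, which is how CTC represents doubled letters: `[4, 0, 4]` decodes to two characters, while `[4, 4]` collapses to one.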

Model Architecture Details

The model is a CNN-RNN hybrid built from the following components:

Backbone: ResNet18 (ImageNet pre-trained).
- Modification: To preserve horizontal resolution for the output sequence, the strides in layer3 and layer4 were changed to (2, 1) and (1, 1), respectively, so neither stage downsamples along the width.
- Freezing: conv1, bn1, layer1, and layer2 were frozen during training to retain the pre-trained low-level features.
Sequence Mapping: A custom SeqToMap linear projection compresses the feature maps to 256 dimensions.
Recurrent Layers: A 2-layer Bidirectional LSTM (Hidden Size: 256, Dropout: 0.5) captures contextual character dependencies.
Classifier: Fully connected layer outputting 80 unique character classes (plus the CTC blank token).
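The effect of the stride changes on sequence length can be checked with plain arithmetic. The sketch below traces the feature-map height and width through the standard ResNet18 downsampling points, assuming the (2, 1) and (1, 1) strides apply to layer3 and layer4 in that order and are given as (height, width). The helper names are hypothetical. Only the strided operation of each stage is modelled, since the remaining 3x3 convolutions use stride 1 with padding 1 and keep the size unchanged.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_feature_map(height, width):
    """Follow (H, W) through the downsampling points of the modified ResNet18."""
    # (name, kernel, padding, (stride_h, stride_w)).
    # layer3/layer4 strides follow the description above (assumed order).
    stages = [
        ("conv1", 7, 3, (2, 2)),
        ("maxpool", 3, 1, (2, 2)),
        ("layer1", 3, 1, (1, 1)),
        ("layer2", 3, 1, (2, 2)),
        ("layer3", 3, 1, (2, 1)),
        ("layer4", 3, 1, (1, 1)),
    ]
    shapes = {}
    for name, kernel, padding, (sh, sw) in stages:
        height = conv_out(height, kernel, sh, padding)
        width = conv_out(width, kernel, sw, padding)
        shapes[name] = (height, width)
    return shapes


# Example input: fixed height of 96 px, hypothetical width of 1024 px.
shapes = trace_feature_map(height=96, width=1024)
feat_h, feat_w = shapes["layer4"]
print(shapes)
print("time steps fed to the LSTM:", feat_w)
print("classifier outputs (80 characters + blank):", 80 + 1)
```

Under these assumptions the width is reduced by a factor of 8 instead of the stock ResNet18 factor of 32, so a 1024 px wide line yields 128 time steps. How SeqToMap collapses the remaining height axis before the 256-dimensional projection is not modelled here.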

Training Configuration & Hyperparameters

The model was trained with the following configuration:

Hardware: Multi-GPU training is supported.
Epochs: 40
Batch Size: 128
Optimizer: Adam
Learning Rate: 0.0006 initially, decayed by a CosineAnnealingLR scheduler to a minimum of 0.000003.
Loss Function: CTCLoss (zero_infinity=True).
Input Resolution: Fixed height of 96 px; width is scaled dynamically per image.
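For reference, the learning-rate curve implied by these values can be reproduced with the closed-form cosine annealing expression. This is a sketch that assumes the scheduler is stepped once per epoch with a period of 40 epochs (`T_MAX` below). Neither detail is stated above, so the per-epoch values are illustrative only.

```python
import math

LR_MAX = 6e-4   # initial learning rate from the list above
LR_MIN = 3e-6   # minimum learning rate from the list above
T_MAX = 40      # assumption: the schedule spans all 40 epochs


def cosine_lr(epoch, lr_max=LR_MAX, lr_min=LR_MIN, t_max=T_MAX):
    """Closed-form cosine annealing value after `epoch` scheduler steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 10, 20, 30, 40):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch):.6f}")
```

The rate starts at the initial value, passes the midpoint between the two bounds halfway through, and reaches the minimum at the final step.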

Data Augmentation Pipeline

To ensure robust generalization against varying handwriting styles and scan qualities, the following augmentations were applied during training:

Random Affine: Slight rotations (±2°), translations (2%), and scaling (95%-105%).
Color Jitter: Brightness and contrast variations (±20%).
Gaussian Blur: Applied with a 30% probability to simulate out-of-focus scans.
Normalization: Standard ImageNet mean and standard deviation.
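The ranges above can be made concrete with a small parameter sampler. This sketch uses only the standard library and does not apply anything to an image. It assumes uniform sampling, that the 2% translation is a fraction of the image size along each axis, and that brightness and contrast are multiplicative factors in [0.8, 1.2]. The function names are hypothetical, and the repository's actual transform library is not stated here.

```python
import random

# ImageNet channel statistics used by the Normalization step.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def sample_augmentation(rng):
    """Draw one set of augmentation parameters within the listed ranges."""
    return {
        "rotation_deg": rng.uniform(-2.0, 2.0),
        "translate_frac": (rng.uniform(-0.02, 0.02), rng.uniform(-0.02, 0.02)),
        "scale": rng.uniform(0.95, 1.05),
        "brightness": rng.uniform(0.8, 1.2),
        "contrast": rng.uniform(0.8, 1.2),
        "apply_blur": rng.random() < 0.3,  # blur on roughly 30% of samples
    }


def normalize_pixel(rgb):
    """Normalize one RGB pixel with channel values in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


rng = random.Random(0)  # fixed seed so the example is repeatable
params = sample_augmentation(rng)
print(params)
print(normalize_pixel((1.0, 1.0, 1.0)))  # a white pixel after normalization
```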

Training Metrics & Performance

Final metrics after the 40-epoch training run are summarized below. Test CER ends 4.81 percentage points above train CER (11.48% vs. 6.67%).

Evaluation Summary

Metric	Final Value
Train Loss	0.2461
Test Loss	0.4371
Train Character Error Rate (CER)	6.67%
Test Character Error Rate (CER)	11.48%
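For readers unfamiliar with the metric, CER is commonly computed as the Levenshtein (edit) distance between prediction and ground truth divided by the number of ground-truth characters. The sketch below uses that definition and aggregates over the whole set. Whether this repository aggregates over the corpus or averages per-line CER is not stated, so this is an assumption, and the example strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edit distance / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


refs = ["handwriting", "recognition"]
hyps = ["handwritting", "recogniton"]
print(f"CER = {cer(refs, hyps):.2%}")
```

Each hypothesis differs from its reference by one character, giving 2 edits over 22 reference characters.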

Convergence Curves

[Figure: Loss Performance]

[Figure: Character Error Rate (CER) Convergence]

Qualitative Sample Predictions

The following samples show handwritten line images together with their ground-truth labels, predicted text, and per-sample Character Error Rate (CER).