ResNet18-CRNN Handwritten Text Recognition
This repository hosts a lightweight, high-performance Convolutional Recurrent Neural Network (CRNN) using a ResNet18 backbone for sequence-based handwritten text recognition (OCR). The model is optimized for processing line-level handwriting imagery and maps visual patterns directly to text strings using Connectionist Temporal Classification (CTC) decoding.
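The description above does not say which CTC decoding strategy is used. As a minimal sketch, assuming greedy (best-path) decoding with the blank token at index 0, the per-frame argmax labels can be turned into text by collapsing repeats and then dropping blanks. The `classes` list below is a hypothetical stand-in for the real character set.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, classes, blank=0):
    """Best-path CTC decoding: collapse repeated labels, then drop blanks."""
    # groupby merges runs of identical consecutive labels into one entry
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(classes[label] for label in collapsed if label != blank)


# Hypothetical 3-character alphabet; index 0 is assumed to be the CTC blank.
classes = ["<blank>", "a", "b", "c"]
frames = [0, 1, 1, 0, 1, 2, 2, 0, 3]  # per-frame argmax of the model output
print(ctc_greedy_decode(frames, classes))  # aabc
```

The blank between the two runs of `1` is what allows the doubled letter to survive the collapse step.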
Model Architecture Details
The model leverages a custom CNN-RNN hybrid architecture designed for sequence extraction:
- Backbone: ResNet18 (ImageNet pre-trained).
  - Modification: To preserve spatial sequences, downsampling in `layer3` and `layer4` was modified (strides adjusted to `(2, 1)` and `(1, 1)`).
  - Freezing: `conv1`, `bn1`, `layer1`, and `layer2` were frozen during training to retain low-level feature extraction capabilities.
- Sequence Mapping: A custom `SeqToMap` linear projection compresses the feature maps to 256 dimensions.
- Recurrent Layers: A 2-layer Bidirectional LSTM (Hidden Size: 256, Dropout: 0.5) captures contextual character dependencies.
- Classifier: Fully connected layer outputting 80 unique character classes (plus the CTC blank token).
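To illustrate why the stride change matters, the sketch below traces the feature-map size through the backbone using the standard convolution output-size formula. It assumes the stock ResNet18 kernel and padding values, that strides are given as (height, width), and that `(2, 1)` applies to `layer3` and `(1, 1)` to `layer4`. The 512 px input width is an arbitrary example, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling op along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def backbone_output_hw(height, width, layer3_stride=(2, 1), layer4_stride=(1, 1)):
    """Trace (height, width) through the size-changing stages of ResNet18."""
    # (kernel, padding, (stride_h, stride_w)) per stage; only the first conv
    # of each residual stage strides, so one entry per stage is enough.
    stages = [
        (7, 3, (2, 2)),         # conv1
        (3, 1, (2, 2)),         # maxpool
        (3, 1, (1, 1)),         # layer1
        (3, 1, (2, 2)),         # layer2
        (3, 1, layer3_stride),  # layer3 (assumed modified stride)
        (3, 1, layer4_stride),  # layer4 (assumed modified stride)
    ]
    for kernel, padding, (stride_h, stride_w) in stages:
        height = conv_out(height, kernel, stride_h, padding)
        width = conv_out(width, kernel, stride_w, padding)
    return height, width


print(backbone_output_hw(96, 512))                  # (6, 64) with modified strides
print(backbone_output_hw(96, 512, (2, 2), (2, 2)))  # (3, 16) with stock ResNet18
```

Under these assumptions, a 96 x 512 line image yields 64 horizontal positions for the recurrent layers instead of 16, which leaves more time steps per character for CTC alignment.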
Training Configuration & Hyperparameters
The model was trained with the following configuration:
- Hardware: Multi-GPU supported setup.
- Epochs: 40
- Batch Size: 128
- Optimizer: Adam
- Learning Rate: Initial `0.0006`, utilizing a `CosineAnnealingLR` scheduler decaying to a minimum of `0.000003`.
- Loss Function: `CTCLoss(zero_infinity=True)`.
- Input Resolution: Images scaled dynamically in width, with a fixed height of `96px`.
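The learning-rate schedule listed above can be reproduced with the closed-form cosine annealing formula. This is a sketch that assumes the scheduler is stepped once per epoch with a period of 40 epochs (`t_max=40`), which the configuration does not state explicitly. The function name is hypothetical.

```python
import math


def cosine_annealing_lr(epoch, lr_max=0.0006, lr_min=0.000003, t_max=40):
    """Closed-form cosine annealing: lr_max at epoch 0, lr_min at epoch t_max."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / t_max))


# Print the learning rate at a few points of the assumed 40-epoch schedule.
for epoch in (0, 10, 20, 30, 40):
    print(f"epoch {epoch:2d}: lr = {cosine_annealing_lr(epoch):.6f}")
```

With these values the rate starts at `0.0006`, passes the midpoint `0.0003015` at epoch 20, and reaches `0.000003` at epoch 40.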
Data Augmentation Pipeline
To ensure robust generalization against varying handwriting styles and scan qualities, the following augmentations were applied during training:
- Random Affine: Slight rotations (±2°), translations (2%), and scaling (95%-105%).
- Color Jitter: Brightness and contrast variations (±20%).
- Gaussian Blur: Applied with a 30% probability to simulate out-of-focus scans.
- Normalization: Standard ImageNet mean and standard deviation.
Training Metrics & Performance
After 40 epochs of training, the model reaches a test CER of 11.48% against a train CER of 6.67%. The final loss and error values are summarized below.
Evaluation Summary
| Metric | Final Value |
|---|---|
| Train Loss | 0.2461 |
| Test Loss | 0.4371 |
| Train Character Error Rate (CER) | 6.67% |
| Test Character Error Rate (CER) | 11.48% |
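The table does not state how CER was aggregated. A common definition, sketched here as an assumption, is the total Levenshtein (edit) distance between predictions and ground truth divided by the total number of ground-truth characters. The function names and sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edit distance over total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return errors / total_chars


# One deleted character and one substitution over 11 reference characters
print(f"{cer(['hello world'], ['helo wurld']):.2%}")  # 18.18%
```

Averaging per-line CER values instead of pooling characters would weight short lines more heavily and can give a different number for the same predictions.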
Convergence Curves
[Figure: Loss Performance]
[Figure: Character Error Rate (CER) Convergence]
Qualitative Sample Predictions
The following visual samples display the original target text lines along with their respective ground truth labels, predicted text, and calculated Character Error Rates (CER).





