---
title: CaptchaOCR
emoji: 🔍
colorFrom: green
colorTo: gray
sdk: gradio
sdk_version: 5.43.1
app_file: app.py
pinned: false
short_description: CAPTCHA text recognition using CRNN neural networks
---
# CAPTCHA OCR Project
A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling.
## 🎯 Project Overview
This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses:
- **Synthetic CAPTCHA generation** for training data
- **CRNN (CNN + RNN) architecture** for sequence recognition
- **CTC (Connectionist Temporal Classification)** loss for training
- **PyTorch** with CUDA support for GPU acceleration
## ๐Ÿ—๏ธ Current Status
### โœ… Completed Components
- **Dataset Generation**: Synthetic CAPTCHA creation with train/val/test splits (8k train, 1k val, 1k test)
- **Configuration**: Centralized config with image dimensions and training parameters
- **Vocabulary System**: Character encoding/decoding with CTC blank token support (63 classes)
- **CTC Collate Function**: Proper batching for variable-length sequences
- **CTC Decoding**: Greedy decode for inference
- **PyTorch Dataset Class**: Image loading and preprocessing with proper cv2 resizing
- **CRNN Model**: CNN encoder + BiLSTM + LayerNorm + linear output (working!)
- **Training Loop**: Complete epoch-based training pipeline with validation
- **Metrics & Plotting**: Training/validation loss tracking with automatically generated loss-curve plots
- **Debugging Tools**: Comprehensive logging of logits, predictions, and model health
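As an illustration, the greedy CTC decoding mentioned above (collapse repeats, then drop blanks) can be sketched as follows. This is a simplified sketch, not the project's implementation (which lives in `src/vocab.py`); it assumes blank index 0, matching the 63-class vocabulary (62 characters + blank):

```python
def ctc_greedy_decode(indices, blank=0):
    """Collapse consecutive duplicate indices, then remove blank tokens."""
    decoded = []
    prev = None
    for idx in indices:
        # keep an index only if it differs from its predecessor and is not blank
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded
```

For example, the raw per-timestep argmax sequence `[0, 1, 1, 0, 1, 2, 2, 0]` decodes to `[1, 1, 2]`: the blank between the two `1`s is what allows the repeated character to survive.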
### ✅ What's Working
- **Training Pipeline**: Stable training loop with excellent loss convergence
- **Model Architecture**: CRNN produces correct output shapes (64×batch×63) with H=60, W=256
- **Data Loading**: Proper image preprocessing and CTC batching
- **Full CAPTCHA Recognition**: Model now recognizes complete CAPTCHA sequences
- **Inference Pipeline**: Complete inference script with visualization and accuracy metrics
- **Early Stopping**: Training halts automatically once validation loss plateaus, preventing overfitting
- **High Accuracy**: 75-100% sequence accuracy, 96%+ character accuracy (25+ of 26 characters correct)
### 🎯 Training Status
- **Current**: Epoch 8, excellent convergence achieved
- **Best Model**: Validation loss 0.1782, selected via early stopping
- **Performance**: 75-100% accuracy on fresh CAPTCHAs (varies by run)
### 📊 Training Results
![Training Losses](Metrics/training_losses.png)
![Loss Comparison](Metrics/loss_comparison.png)
**Key Insights:**
- **Rapid convergence**: Loss dropped from 21 → 0.1 within the first 7 epochs
- **No overfitting**: Early stopping halts training before validation loss diverges
- **Stable training**: Val/Train ratio stays healthy throughout training
### ๐Ÿ” Inference Results
![Inference Results](Metrics/inference_results_readme.png)
**Model Performance:**
- **Visual predictions**: Shows actual CAPTCHA images with predicted text
- **High accuracy**: 75-100% overall accuracy on fresh CAPTCHAs
- **Character-level precision**: 96%+ character accuracy (25+ of 26 characters correct)
## ๐Ÿ“ Project Structure
```
CaptchaDetect/
├── Dataset/               # Full dataset (100k images) - for Colab training
├── Dataset_test/          # Test dataset (1k images) - for local development
│   └── captchas/
│       ├── train/         # 80% of data
│       ├── val/           # 10% of data
│       └── test/          # 10% of data
├── src/
│   ├── config.py          # Configuration and hyperparameters
│   ├── vocab.py           # Character vocabulary and CTC encoding/decoding
│   ├── data.py            # Dataset generation script
│   ├── collate.py         # CTC batching function
│   ├── captcha_dataset.py # PyTorch Dataset class
│   ├── model_crnn.py      # CRNN model architecture
│   └── plotting.py        # Training metrics and visualization
├── train.py               # Main training script (✅ working)
├── Metrics/               # Training plots and logs (auto-generated)
├── .gitignore             # Ignores dataset contents, keeps structure
└── README.md              # This file
```
## 🚀 Quick Start
### 1. Environment Setup
```bash
# Install PyTorch with CUDA support (adjust the CUDA version as needed)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
# Install other dependencies (opencv-python for cv2 preprocessing,
# matplotlib for the plots in Metrics/)
pip install captcha pandas pillow opencv-python matplotlib
```
### 2. Generate Training Dataset
```bash
cd src
python data.py
```
This creates 10,000 synthetic CAPTCHAs in `Dataset_test/captchas/` with proper train/val/test splits.
### 3. Start Training
```bash
python train.py
```
This starts the full training pipeline with automatic metrics generation.
### 4. Monitor Progress
Training will show:
- Real-time loss and prediction samples
- Automatic plot generation in `Metrics/` folder
- Comprehensive training logs and summaries
## 🎮 Usage
### Training
```bash
python train.py
```
- **Automatic early stopping** prevents overfitting
- **Real-time metrics** and sample predictions
- **Checkpoint saving** for best model
### Inference
```bash
python inference.py
```
- **Loads best trained model** automatically
- **Generates test CAPTCHAs** for evaluation
- **Shows both overall and character accuracy**
- **Creates visualization plots** in Metrics folder
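The two accuracy figures reported above can be computed as follows. This is a hypothetical sketch of the metrics, not the inference script itself; it assumes predictions and labels are plain strings of equal length:

```python
def exact_match_accuracy(preds, labels):
    """Fraction of CAPTCHAs where the whole predicted string matches."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def char_accuracy(preds, labels):
    """Fraction of individual characters predicted correctly."""
    correct = sum(pc == lc
                  for p, l in zip(preds, labels)
                  for pc, lc in zip(p, l))
    total = sum(len(l) for l in labels)
    return correct / total
```

A single wrong character in one CAPTCHA drops exact-match accuracy for that sample to 0 but barely moves character accuracy, which is why the two numbers differ so much.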
## 🔬 Technical Details
### Model Architecture (CRNN)
The model uses a **CNN + RNN + CTC** architecture specifically designed for sequence recognition:
```mermaid
graph TD
%% Input Layer
A[Input Image<br/>60x256x1] --> B[CNN Encoder<br/>SmallCNN]
%% CNN Subgraph - Top to Bottom
subgraph CNN ["CNN Encoder Layer"]
B --> C[Conv1 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 2x2]
C --> D[Channels: 1 to 64<br/>Spatial: 60x256 to 30x128]
D --> E[Conv2 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 1x2]
E --> F[Channels: 64 to 128<br/>Spatial: 30x128 to 30x64]
F --> G[Residual Block<br/>3x3 Conv + BatchNorm + ReLU<br/>3x3 Conv + BatchNorm<br/>+ Skip Connection]
G --> H[Maintains: 128 channels, 30x64 spatial]
H --> I[Height Pooling<br/>AdaptiveAvgPool2d 1xNone]
I --> J[Squeeze Height<br/>30x64 to 1x64]
J --> K[Permute & Reshape<br/>B,128,1,64 to 64,B,128]
end
%% RNN Subgraph - Top to Bottom
subgraph RNN ["RNN Decoder Layer"]
K --> L[RNN Decoder<br/>2-Layer BiLSTM]
L --> M[Hidden Size: 320 per direction<br/>Total: 640 features]
M --> N[Output: 64,B,640]
end
%% Output Subgraph - Top to Bottom
subgraph OUTPUT ["Output & CTC Layer"]
N --> O[LayerNorm<br/>Stabilize 640D features]
O --> P[Linear Layer<br/>640 to 63 classes]
P --> Q[Output Logits<br/>64,B,63]
Q --> R[CTC Decoding<br/>Remove duplicates & blanks]
R --> S[Final Prediction<br/>Character sequence]
end
%% Styling - Darker colors with black text
classDef inputLayer fill:#ffcc80,stroke:#e65100,stroke-width:3px,color:#000000
classDef cnnLayer fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px,color:#000000
classDef rnnLayer fill:#bbdefb,stroke:#1565c0,stroke-width:3px,color:#000000
classDef outputLayer fill:#f8bbd9,stroke:#ad1457,stroke-width:3px,color:#000000
classDef ctcLayer fill:#d1c4e9,stroke:#512da8,stroke-width:3px,color:#000000
class A inputLayer
class B,C,D,E,F,G,H,I,J,K cnnLayer
class L,M,N rnnLayer
class O,P,Q outputLayer
class R,S ctcLayer
```
#### **Key Design Features**
- **Total Stride**: 4 (256 → 64 timesteps)
- **Height Compression**: 60 → 1 (via pooling)
- **Residual Connections**: Prevents gradient vanishing
- **Bidirectional LSTM**: Captures context from both directions
- **LayerNorm**: Training stability before final classification
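The diagram above can be condensed into a short PyTorch sketch. This is a simplified stand-in for `src/model_crnn.py` (the residual block and weight initialization are omitted for brevity), but it reproduces the shapes in the diagram: (B, 1, 60, 256) in, (64, B, 63) out:

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    def __init__(self, num_classes=63, hidden=320):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2, 2),               # 60x256 -> 30x128
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d((1, 2)),             # 30x128 -> 30x64
            nn.AdaptiveAvgPool2d((1, None)),  # squeeze height: 30x64 -> 1x64
        )
        self.rnn = nn.LSTM(128, hidden, num_layers=2, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden)  # stabilize 640-dim features
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                        # x: (B, 1, 60, 256)
        feat = self.cnn(x)                       # (B, 128, 1, 64)
        feat = feat.squeeze(2).permute(2, 0, 1)  # (T=64, B, 128)
        out, _ = self.rnn(feat)                  # (64, B, 640)
        return self.fc(self.norm(out))           # (64, B, 63) logits for CTC
```

Note how the asymmetric `MaxPool2d((1, 2))` halves the width without shrinking the height further, preserving enough timesteps (64) for CTC to align with the label sequence.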
### Training Optimizations
- **AdamW Optimizer**: lr=3e-4, weight_decay=1e-4
- **Gradient Clipping**: max_norm=1.0 prevents exploding gradients
- **Weight Initialization**: Small uniform weights (-1e-3, 1e-3) for stability
- **Numeric Stability**: AMP disabled during initial training for stability
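A single optimizer step with these settings might look like the sketch below. The model and loss here are placeholders to keep the example self-contained; only the AdamW hyperparameters and the clipping norm come from the list above:

```python
import torch
from torch import nn

model = nn.Linear(10, 63)  # placeholder for the CRNN
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)

def train_step(batch, targets, loss_fn):
    optimizer.zero_grad()
    loss = loss_fn(model(batch), targets)
    loss.backward()
    # clip gradient norm to 1.0 to guard against exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```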
### CTC Training
- **Input**: Images resized to 60×256 (height×width)
- **Output**: Character sequences (a-z, A-Z, 0-9)
- **Loss**: CTCLoss with blank=0, zero_infinity=True
- **Decoding**: Greedy CTC decode with duplicate removal
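Concretely, `torch.nn.CTCLoss` with these settings expects `(T, B, C)` log-probabilities plus per-sample input and target lengths. The sketch below uses random logits and assumed 5-character labels purely to demonstrate the shape conventions:

```python
import torch

T, B, C = 64, 2, 63  # timesteps, batch, classes (62 chars + blank)
log_probs = torch.randn(T, B, C).log_softmax(2)
targets = torch.randint(1, C, (B, 5))                 # labels never use blank (0)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 5, dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

`zero_infinity=True` replaces infinite losses (from targets longer than the input allows) with zero instead of poisoning the gradients.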
### Data Pipeline
- **Images**: Grayscale, normalized to [0,1], proper cv2 resizing
- **Labels**: CSV with filename and text label
- **Batching**: Variable-length sequences with custom CTC collate function
- **Debugging**: Real-time monitoring of logits, blank probability, predictions
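The custom collate step can be sketched as below. This is a hypothetical version of what `src/collate.py` does, assuming each dataset item is an `(image_tensor, label_index_list)` pair: images stack normally, while labels are concatenated into one flat tensor with per-sample lengths, which is the layout `torch.nn.CTCLoss` expects:

```python
import torch

def ctc_collate(batch):
    """batch: list of (image_tensor, label_index_list) pairs."""
    images = torch.stack([img for img, _ in batch])
    targets = torch.cat([torch.as_tensor(lbl, dtype=torch.long)
                         for _, lbl in batch])
    target_lengths = torch.tensor([len(lbl) for _, lbl in batch],
                                  dtype=torch.long)
    return images, targets, target_lengths
```

Passing `collate_fn=ctc_collate` to the `DataLoader` lets batches mix labels of different lengths without padding.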
## 📊 Performance Expectations
### GTX 1650 (4GB VRAM)
- Training time: 3-8 hours for 100k images × 40 epochs
- Batch size: 32
- Memory efficient with H=60, W=256
### Tesla T4 (16GB VRAM)
- Training time: 2-4 hours for 100k images × 40 epochs
- Batch size: 128
- Mixed precision (AMP) enabled
## ๐Ÿ› ๏ธ Development Workflow
1. **Implement Dataset class** - Load and preprocess images
2. **Build CRNN model** - CNN + BiLSTM architecture
3. **Create training loop** - With validation and checkpoints
4. **Add metrics** - CER and accuracy tracking
5. **Test on small dataset** - Verify everything works
6. **Scale to full dataset** - Train on Colab
## ๐Ÿค Contributing
This is a learning project! Feel free to:
- Ask questions about implementation details
- Experiment with different architectures
- Improve the data generation or training pipeline
## 📚 Resources
- [CTC Paper](https://www.cs.toronto.edu/~graves/icml_2006.pdf)
- [CRNN Architecture](https://arxiv.org/abs/1507.05717)
- [PyTorch CTC Tutorial](https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html)
## ๐Ÿ“ License
This project is for educational purposes. Feel free to use and modify as needed.
---
**Happy coding! 🚀**