---
title: CaptchaOCR
emoji: ๐
colorFrom: green
colorTo: gray
sdk: gradio
sdk_version: 5.43.1
app_file: app.py
pinned: false
short_description: CAPTCHA text recognition using CRNN neural networks
---
# CAPTCHA OCR Project

A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling.
## Project Overview

This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses:
- Synthetic CAPTCHA generation for training data
- CRNN (CNN + RNN) architecture for sequence recognition
- CTC (Connectionist Temporal Classification) loss for training
- PyTorch with CUDA support for GPU acceleration
## Current Status

### Completed Components
- Dataset Generation: Synthetic CAPTCHA creation with train/val/test splits (8k train, 1k val)
- Configuration: Centralized config with image dimensions and training parameters
- Vocabulary System: Character encoding/decoding with CTC blank token support (63 classes)
- CTC Collate Function: Proper batching for variable-length sequences
- CTC Decoding: Greedy decode for inference
- PyTorch Dataset Class: Image loading and preprocessing with proper cv2 resizing
- CRNN Model: CNN encoder + BiLSTM + LayerNorm + linear output (working!)
- Training Loop: Complete epoch-based training pipeline with validation
- Metrics & Plotting: Training/validation loss tracking with beautiful visualizations
- Debugging Tools: Comprehensive logging of logits, predictions, and model health
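As a rough sketch of how the vocabulary and greedy decoding pieces fit together: 26 lowercase + 26 uppercase + 10 digits = 62 characters, plus the CTC blank at index 0, gives the 63 classes mentioned above. The identifiers below are illustrative, not necessarily the actual names in `vocab.py`:

```python
import string
import torch

CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits  # 62 chars
BLANK = 0  # CTC blank occupies index 0, giving 63 classes in total
CHAR_TO_IDX = {c: i + 1 for i, c in enumerate(CHARS)}
IDX_TO_CHAR = {i + 1: c for i, c in enumerate(CHARS)}

def encode(text):
    """Map a label string to a list of class indices (1..62)."""
    return [CHAR_TO_IDX[c] for c in text]

def ctc_greedy_decode(logits):
    """Greedy CTC decode: argmax per timestep, collapse repeats, drop blanks.

    logits: (T, num_classes) tensor for a single sample.
    """
    best = logits.argmax(dim=-1).tolist()
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(IDX_TO_CHAR[idx])
        prev = idx
    return "".join(out)
```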
### What's Working
- Training Pipeline: Stable training loop with excellent loss convergence
- Model Architecture: CRNN produces correct output shapes (64×batch×63) with H=60, W=256
- Data Loading: Proper image preprocessing and CTC batching
- Full CAPTCHA Recognition: Model now recognizes complete CAPTCHA sequences
- Inference Pipeline: Complete inference script with visualization and accuracy metrics
- Early Stopping: Enhanced early stopping prevents overfitting automatically
- High Accuracy: 75-100% overall accuracy, 96%+ character accuracy (25/26+ characters correct)
## Training Status
- Current: Epoch 8, excellent convergence achieved
- Best Model: Validation loss 0.1782, early stopping working perfectly
- Performance: 75-100% accuracy on fresh CAPTCHAs (varies by run)
## Training Results
Key Insights:
- Rapid convergence: Loss dropped from 21 to 0.1 in the first 7 epochs
- No overfitting: Enhanced early stopping prevents overfitting
- Stable training: Val/Train ratio stays healthy throughout training
## Inference Results
Model Performance:
- Visual predictions: Shows actual CAPTCHA images with predicted text
- High accuracy: 75-100% overall accuracy on fresh CAPTCHAs
- Character-level precision: 96%+ character accuracy (25/26+ correct)
## Project Structure

```
CaptchaDetect/
├── Dataset/               # Full dataset (100k images) - for Colab training
├── Dataset_test/          # Test dataset (1k images) - for local development
│   └── captchas/
│       ├── train/         # 80% of data
│       ├── val/           # 10% of data
│       └── test/          # 10% of data
├── src/
│   ├── config.py          # Configuration and hyperparameters
│   ├── vocab.py           # Character vocabulary and CTC encoding/decoding
│   ├── data.py            # Dataset generation script
│   ├── collate.py         # CTC batching function
│   ├── captcha_dataset.py # PyTorch Dataset class
│   ├── model_crnn.py      # CRNN model architecture
│   └── plotting.py        # Training metrics and visualization
├── train.py               # Main training script (working!)
├── Metrics/               # Training plots and logs (auto-generated)
├── .gitignore             # Ignores dataset contents, keeps structure
└── README.md              # This file
```
## Quick Start

### 1. Environment Setup

```bash
# Install PyTorch with CUDA support (adjust the CUDA version as needed)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Install other dependencies
pip install captcha pandas pillow
```

### 2. Generate Training Dataset

```bash
cd src
python data.py
```

This creates 10,000 synthetic CAPTCHAs in `Dataset_test/captchas/` with proper train/val/test splits.
### 3. Start Training

```bash
python train.py
```

This starts the full training pipeline with automatic metrics generation.

### 4. Monitor Progress

Training will show:
- Real-time loss and prediction samples
- Automatic plot generation in the `Metrics/` folder
- Comprehensive training logs and summaries
## Usage

### Training

```bash
python train.py
```

- Automatic early stopping prevents overfitting
- Real-time metrics and sample predictions
- Checkpoint saving for the best model
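A patience-based early-stopping helper of the kind described could look like this; the patience and delta values are illustrative, not the project's actual settings:

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=1e-3):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0  # improvement: reset the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```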
### Inference

```bash
python inference.py
```

- Loads the best trained model automatically
- Generates test CAPTCHAs for evaluation
- Shows both overall and character-level accuracy
- Creates visualization plots in the `Metrics/` folder
## Technical Details

### Model Architecture (CRNN)
The model uses a CNN + RNN + CTC architecture specifically designed for sequence recognition:
```mermaid
graph TD
    %% Input Layer
    A[Input Image<br/>60x256x1] --> B[CNN Encoder<br/>SmallCNN]

    %% CNN Subgraph - Top to Bottom
    subgraph CNN ["CNN Encoder Layer"]
        B --> C[Conv1 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 2x2]
        C --> D[Channels: 1 to 64<br/>Spatial: 60x256 to 30x128]
        D --> E[Conv2 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 1x2]
        E --> F[Channels: 64 to 128<br/>Spatial: 30x128 to 30x64]
        F --> G[Residual Block<br/>3x3 Conv + BatchNorm + ReLU<br/>3x3 Conv + BatchNorm<br/>+ Skip Connection]
        G --> H[Maintains: 128 channels, 30x64 spatial]
        H --> I[Height Pooling<br/>AdaptiveAvgPool2d 1xNone]
        I --> J[Squeeze Height<br/>30x64 to 1x64]
        J --> K[Permute & Reshape<br/>B,128,1,64 to 64,B,128]
    end

    %% RNN Subgraph - Top to Bottom
    subgraph RNN ["RNN Decoder Layer"]
        K --> L[RNN Decoder<br/>2-Layer BiLSTM]
        L --> M[Hidden Size: 320 per direction<br/>Total: 640 features]
        M --> N[Output: 64,B,640]
    end

    %% Output Subgraph - Top to Bottom
    subgraph OUTPUT ["Output & CTC Layer"]
        N --> O[LayerNorm<br/>Stabilize 640D features]
        O --> P[Linear Layer<br/>640 to 63 classes]
        P --> Q[Output Logits<br/>64,B,63]
        Q --> R[CTC Decoding<br/>Remove duplicates & blanks]
        R --> S[Final Prediction<br/>Character sequence]
    end

    %% Styling - Darker colors with black text
    classDef inputLayer fill:#ffcc80,stroke:#e65100,stroke-width:3px,color:#000000
    classDef cnnLayer fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px,color:#000000
    classDef rnnLayer fill:#bbdefb,stroke:#1565c0,stroke-width:3px,color:#000000
    classDef outputLayer fill:#f8bbd9,stroke:#ad1457,stroke-width:3px,color:#000000
    classDef ctcLayer fill:#d1c4e9,stroke:#512da8,stroke-width:3px,color:#000000
    class A inputLayer
    class B,C,D,E,F,G,H,I,J,K cnnLayer
    class L,M,N rnnLayer
    class O,P,Q outputLayer
    class R,S ctcLayer
```
### Key Design Features
- Total Stride: 4 (256 → 64 timesteps)
- Height Compression: 60 → 1 (via pooling)
- Residual Connections: Prevents gradient vanishing
- Bidirectional LSTM: Captures context from both directions
- LayerNorm: Training stability before final classification
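The diagram above can be condensed into a small PyTorch sketch that reproduces the stated shapes; for brevity the residual block is omitted, and layer names are illustrative rather than copied from `model_crnn.py`:

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    def __init__(self, num_classes=63):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2, 2),               # 60x256 -> 30x128
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d((1, 2), (1, 2)),     # 30x128 -> 30x64 (width stride only)
            nn.AdaptiveAvgPool2d((1, None)),  # pool height away: 30x64 -> 1x64
        )
        self.rnn = nn.LSTM(128, 320, num_layers=2, bidirectional=True)
        self.norm = nn.LayerNorm(640)         # 2 directions x 320 hidden units
        self.fc = nn.Linear(640, num_classes)

    def forward(self, x):                     # x: (B, 1, 60, 256)
        f = self.cnn(x)                       # (B, 128, 1, 64)
        f = f.squeeze(2).permute(2, 0, 1)     # (T=64, B, 128)
        f, _ = self.rnn(f)                    # (64, B, 640)
        return self.fc(self.norm(f))          # (64, B, 63) logits for CTC
```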
### Training Optimizations
- AdamW Optimizer: lr=3e-4, weight_decay=1e-4
- Gradient Clipping: max_norm=1.0 prevents exploding gradients
- Weight Initialization: Small uniform weights (-1e-3, 1e-3) for stability
- Numeric Stability: AMP disabled during initial training for stability
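Wired together, the optimizer settings listed above look roughly like this; the stand-in `model` and the `train_step` helper are illustrative:

```python
import torch

model = torch.nn.Linear(640, 63)  # stand-in for the full CRNN
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)

# small uniform weight initialization (-1e-3, 1e-3) for stability
torch.nn.init.uniform_(model.weight, -1e-3, 1e-3)
torch.nn.init.zeros_(model.bias)

def train_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # clip gradient norm at 1.0 to prevent exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

# one illustrative step on dummy data
loss = model(torch.randn(4, 640)).pow(2).mean()
train_step(loss)
```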
### CTC Training
- Input: Images resized to 60×256 (height×width)
- Output: Character sequences (a-z, A-Z, 0-9)
- Loss: CTCLoss with blank=0, zero_infinity=True
- Decoding: Greedy CTC decode with duplicate removal
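In PyTorch, the CTC setup described above is typically wired as follows; the batch contents here are dummy values for illustration:

```python
import torch
import torch.nn as nn

T, B, C = 64, 2, 63  # timesteps, batch size, classes (62 chars + blank)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# CTCLoss expects log-probabilities of shape (T, B, C)
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)
targets = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9])    # labels concatenated flat
input_lengths = torch.full((B,), T, dtype=torch.long)  # all 64 timesteps used
target_lengths = torch.tensor([4, 5])                  # per-sample label lengths

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```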
### Data Pipeline
- Images: Grayscale, normalized to [0,1], proper cv2 resizing
- Labels: CSV with filename and text label
- Batching: Variable-length sequences with custom CTC collate function
- Debugging: Real-time monitoring of logits, blank probability, predictions
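A collate function of the kind described might look like this; it assumes each dataset item is an `(image_tensor, label_indices)` pair, which is an assumption about this project's Dataset class rather than a fact from `collate.py`:

```python
import torch

def ctc_collate(batch):
    """Stack fixed-size images; concatenate variable-length targets for CTCLoss."""
    images = torch.stack([img for img, _ in batch])                  # (B, 1, H, W)
    targets = torch.cat([torch.as_tensor(lbl) for _, lbl in batch])  # flat 1-D labels
    target_lengths = torch.tensor([len(lbl) for _, lbl in batch])    # per-sample lengths
    return images, targets, target_lengths
```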
## Performance Expectations

### GTX 1650 (4GB VRAM)

- Training time: 3-8 hours for 100k images × 40 epochs
- Batch size: 32
- Memory efficient with H=60, W=256
### Tesla T4 (16GB VRAM)

- Training time: 2-4 hours for 100k images × 40 epochs
- Batch size: 128
- Mixed precision (AMP) enabled
## Development Workflow

1. Implement Dataset class - load and preprocess images
2. Build CRNN model - CNN + BiLSTM architecture
3. Create training loop - with validation and checkpoints
4. Add metrics - CER and accuracy tracking
5. Test on small dataset - verify everything works
6. Scale to full dataset - train on Colab
## Contributing
This is a learning project! Feel free to:
- Ask questions about implementation details
- Experiment with different architectures
- Improve the data generation or training pipeline
## License
This project is for educational purposes. Feel free to use and modify as needed.
Happy coding!


