metadata

title: CaptchaOCR
emoji: 🔍
colorFrom: green
colorTo: gray
sdk: gradio
sdk_version: 5.43.1
app_file: app.py
pinned: false
short_description: CAPTCHA text recognition using CRNN neural networks

CAPTCHA OCR Project

A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling.

🎯 Project Overview

This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses:

Synthetic CAPTCHA generation for training data
CRNN (CNN + RNN) architecture for sequence recognition
CTC (Connectionist Temporal Classification) loss for training
PyTorch with CUDA support for GPU acceleration

🏗️ Current Status

✅ Completed Components

Dataset Generation: Synthetic CAPTCHA creation with train/val/test splits (8k train, 1k val)
Configuration: Centralized config with image dimensions and training parameters
Vocabulary System: Character encoding/decoding with CTC blank token support (63 classes)
CTC Collate Function: Proper batching for variable-length sequences
CTC Decoding: Greedy decode for inference
PyTorch Dataset Class: Image loading and preprocessing with proper cv2 resizing
CRNN Model: CNN encoder + BiLSTM + LayerNorm + linear output (working!)
Training Loop: Complete epoch-based training pipeline with validation
Metrics & Plotting: Training/validation loss tracking with beautiful visualizations
Debugging Tools: Comprehensive logging of logits, predictions, and model health
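The 63 classes in the vocabulary entry above match 62 alphanumeric characters (a-z, A-Z, 0-9) plus one CTC blank. Below is a minimal sketch of such a mapping. It assumes the blank sits at index 0 (as in the CTC loss settings later in this README) and that characters are ordered lowercase, uppercase, digits. The names are illustrative and may differ from those in src/vocab.py.

```python
import string

# Assumption: index 0 is the CTC blank; characters follow in a-z, A-Z, 0-9 order.
BLANK = 0
CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits  # 62 symbols
CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(CHARS)}
IDX_TO_CHAR = {i: ch for ch, i in CHAR_TO_IDX.items()}
NUM_CLASSES = len(CHARS) + 1  # 62 characters + 1 blank = 63


def encode(text):
    """Map a label string to a list of class indices (never emits the blank)."""
    return [CHAR_TO_IDX[ch] for ch in text]


def decode(indices):
    """Map class indices back to text, skipping the blank index."""
    return "".join(IDX_TO_CHAR[i] for i in indices if i != BLANK)
```

Reserving index 0 for the blank keeps the label indices in the range 1..62, so a target sequence can never be confused with the blank symbol.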

✅ What's Working

Training Pipeline: Stable training loop with excellent loss convergence
Model Architecture: CRNN produces the expected output shape (timesteps × batch × classes = 64 × batch × 63) for inputs with H=60, W=256
Data Loading: Proper image preprocessing and CTC batching
Full CAPTCHA Recognition: Model now recognizes complete CAPTCHA sequences
Inference Pipeline: Complete inference script with visualization and accuracy metrics
Early Stopping: Enhanced early stopping prevents overfitting automatically
High Accuracy: 75-100% whole-CAPTCHA accuracy depending on the run, with 96%+ character accuracy (25 of 26 characters or better)

🎯 Training Status

Current: Epoch 8, excellent convergence achieved
Best Model: Validation loss 0.1782, early stopping working perfectly
Performance: 75-100% accuracy on fresh CAPTCHAs (varies by run)

📊 Training Results

Key Insights:

Rapid convergence: Loss dropped from 21 to 0.1 in the first 7 epochs
No overfitting: Enhanced early stopping prevents overfitting
Stable training: Val/Train ratio stays healthy throughout training
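This README does not spell out how the "enhanced" early stopping works. As a rough illustration only, a common patience-based variant looks like the sketch below. The class name, `patience`, and `min_delta` are assumptions, not values taken from train.py.

```python
class EarlyStopping:
    """Stop when validation loss has not improved by min_delta for `patience` epochs."""

    def __init__(self, patience=5, min_delta=1e-3):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            # Improvement: a training loop would typically save a checkpoint here.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop this would be called once per epoch, for example `if stopper.step(val_loss): break`.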

🔍 Inference Results

Model Performance:

Visual predictions: Shows actual CAPTCHA images with predicted text
High accuracy: 75-100% overall accuracy on fresh CAPTCHAs
Character-level precision: 96%+ character accuracy (25/26+ correct)
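The two figures above measure different things: overall accuracy counts a CAPTCHA as correct only when every character matches, while character accuracy gives partial credit. The sketch below is one way to compute both. It assumes character accuracy is defined as 1 minus the character error rate (edit distance divided by target length), which may differ from what inference.py reports.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def evaluate(predictions, targets):
    """Return (sequence accuracy, character accuracy) for paired lists of strings."""
    exact = sum(p == t for p, t in zip(predictions, targets))
    errors = sum(edit_distance(p, t) for p, t in zip(predictions, targets))
    total_chars = sum(len(t) for t in targets)
    seq_acc = exact / len(targets)
    char_acc = 1.0 - errors / total_chars
    return seq_acc, char_acc
```

A single wrong character lowers the overall accuracy by a full sample, which is why the overall figure swings more between runs than the character figure.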

📁 Project Structure

CaptchaDetect/
├── Dataset/                 # Full dataset (100k images) - for Colab training
├── Dataset_test/           # Test dataset (10k images) - for local development
│   └── captchas/
│       ├── train/          # 80% of data
│       ├── val/            # 10% of data
│       └── test/           # 10% of data
├── src/
│   ├── config.py           # Configuration and hyperparameters
│   ├── vocab.py            # Character vocabulary and CTC encoding/decoding
│   ├── data.py             # Dataset generation script
│   ├── collate.py          # CTC batching function
│   ├── captcha_dataset.py  # PyTorch Dataset class
│   ├── model_crnn.py       # CRNN model architecture
│   └── plotting.py         # Training metrics and visualization
├── train.py                # Main training script (✅ WORKING!)
├── Metrics/                # Training plots and logs (auto-generated)
├── .gitignore              # Ignores dataset contents, keeps structure
└── README.md               # This file

🚀 Quick Start

1. Environment Setup

# Install PyTorch with CUDA support (adjust version as needed)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Install other dependencies (opencv-python provides the cv2 module used for resizing)
pip3 install captcha pandas pillow opencv-python

2. Generate Training Dataset

cd src
python data.py

This creates 10,000 synthetic CAPTCHAs in Dataset_test/captchas/ with proper train/val/test splits.
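As a rough sketch of what such a generation script might do: sample random labels, assign them to an 80/10/10 split, and render each one with the `captcha` package. The function names, the `labels.csv` filename, the label length of 5, and rendering directly at 256×60 are all assumptions made for illustration. The actual src/data.py may differ.

```python
import csv
import random
import string
from pathlib import Path

CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits


def make_manifest(n_samples, label_len=5, seed=0):
    """Return (split, filename, label) rows with an 80/10/10 train/val/test split."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_samples):
        label = "".join(rng.choices(CHARS, k=label_len))
        # Integer arithmetic avoids floating-point edge cases at the split boundaries.
        if i * 10 < n_samples * 8:
            split = "train"
        elif i * 10 < n_samples * 9:
            split = "val"
        else:
            split = "test"
        rows.append((split, f"{i:06d}.png", label))
    return rows


def write_dataset(rows, root):
    """Render each label to root/<split>/<filename> and write one labels.csv per split."""
    from captcha.image import ImageCaptcha  # third-party `captcha` package

    generator = ImageCaptcha(width=256, height=60)
    root = Path(root)
    for split in ("train", "val", "test"):
        (root / split).mkdir(parents=True, exist_ok=True)
        with open(root / split / "labels.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["filename", "label"])
            for row_split, filename, label in rows:
                if row_split == split:
                    generator.write(label, str(root / split / filename))
                    writer.writerow([filename, label])
```

With `make_manifest(10000)` the split sizes come out as 8,000 / 1,000 / 1,000, matching the train and validation counts listed under Completed Components.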

3. Start Training

# Run from the project root (cd .. if you are still in src/)
python train.py

This starts the full training pipeline with automatic metrics generation.

4. Monitor Progress

Training will show:

Real-time loss and prediction samples
Automatic plot generation in Metrics/ folder
Comprehensive training logs and summaries

🎮 Usage

Training

python train.py

Automatic early stopping prevents overfitting
Real-time metrics and sample predictions
Checkpoint saving for best model

Inference

python inference.py

Loads best trained model automatically
Generates test CAPTCHAs for evaluation
Shows both overall and character accuracy
Creates visualization plots in Metrics folder

🔬 Technical Details

Model Architecture (CRNN)

The model uses a CNN + RNN + CTC architecture specifically designed for sequence recognition:

graph TD
    %% Input Layer
    A[Input Image<br/>60x256x1] --> B[CNN Encoder<br/>SmallCNN]
    
    %% CNN Subgraph - Top to Bottom
    subgraph CNN ["CNN Encoder Layer"]
        B --> C[Conv1 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 2x2]
        C --> D[Channels: 1 to 64<br/>Spatial: 60x256 to 30x128]
        
        D --> E[Conv2 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 1x2]
        E --> F[Channels: 64 to 128<br/>Spatial: 30x128 to 30x64]
        
        F --> G[Residual Block<br/>3x3 Conv + BatchNorm + ReLU<br/>3x3 Conv + BatchNorm<br/>+ Skip Connection]
        G --> H[Maintains: 128 channels, 30x64 spatial]
        
        H --> I[Height Pooling<br/>AdaptiveAvgPool2d 1xNone]
        I --> J[Squeeze Height<br/>30x64 to 1x64]
        
        J --> K[Permute & Reshape<br/>B,128,1,64 to 64,B,128]
    end
    
    %% RNN Subgraph - Top to Bottom
    subgraph RNN ["RNN Decoder Layer"]
        K --> L[RNN Decoder<br/>2-Layer BiLSTM]
        L --> M[Hidden Size: 320 per direction<br/>Total: 640 features]
        M --> N[Output: 64,B,640]
    end
    
    %% Output Subgraph - Top to Bottom
    subgraph OUTPUT ["Output & CTC Layer"]
        N --> O[LayerNorm<br/>Stabilize 640D features]
        O --> P[Linear Layer<br/>640 to 63 classes]
        P --> Q[Output Logits<br/>64,B,63]
        
        Q --> R[CTC Decoding<br/>Remove duplicates & blanks]
        R --> S[Final Prediction<br/>Character sequence]
    end
    
    %% Styling - Darker colors with black text
    classDef inputLayer fill:#ffcc80,stroke:#e65100,stroke-width:3px,color:#000000
    classDef cnnLayer fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px,color:#000000
    classDef rnnLayer fill:#bbdefb,stroke:#1565c0,stroke-width:3px,color:#000000
    classDef outputLayer fill:#f8bbd9,stroke:#ad1457,stroke-width:3px,color:#000000
    classDef ctcLayer fill:#d1c4e9,stroke:#512da8,stroke-width:3px,color:#000000
    
    class A inputLayer
    class B,C,D,E,F,G,H,I,J,K cnnLayer
    class L,M,N rnnLayer
    class O,P,Q outputLayer
    class R,S ctcLayer

Key Design Features

Total Stride: 4 (256 → 64 timesteps)
Height Compression: 60 → 1 (via pooling)
Residual Connections: Mitigate vanishing gradients
Bidirectional LSTM: Captures context from both directions
LayerNorm: Training stability before final classification
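The architecture in the diagram can be written as a compact PyTorch module. This is a sketch reconstructed from the diagram, not the contents of src/model_crnn.py. The class names and the ReLU after the skip connection are assumptions.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two 3x3 conv + BatchNorm layers with a skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Assumption: a ReLU follows the addition; the diagram does not say.
        return torch.relu(x + self.body(x))


class CRNN(nn.Module):
    def __init__(self, num_classes=63, hidden=320):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 2)),             # 60x256 -> 30x128
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.MaxPool2d((1, 2)),             # 30x128 -> 30x64 (width only)
            ResidualBlock(128),               # keeps 128 channels, 30x64
            nn.AdaptiveAvgPool2d((1, None)),  # 30x64 -> 1x64
        )
        self.rnn = nn.LSTM(128, hidden, num_layers=2, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                        # x: (B, 1, 60, 256)
        feats = self.cnn(x)                      # (B, 128, 1, 64)
        seq = feats.squeeze(2).permute(2, 0, 1)  # (64, B, 128)
        out, _ = self.rnn(seq)                   # (64, B, 640)
        return self.fc(self.norm(out))           # (64, B, 63)
```

The second pooling layer halves only the width, so the total horizontal stride is 4 and a 256-pixel-wide image yields 64 timesteps, while the height is reduced separately by the adaptive average pool.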

Training Optimizations

AdamW Optimizer: lr=3e-4, weight_decay=1e-4
Gradient Clipping: max_norm=1.0 prevents exploding gradients
Weight Initialization: Small uniform weights (-1e-3, 1e-3) for stability
Numeric Stability: AMP disabled during initial training for stability
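The settings above fit together in a single training step roughly as follows. This is a sketch under stated assumptions: `TinyCTCModel` is a hypothetical stand-in for the CRNN so the example stays short, the small uniform init is applied to the output layer only (the README does not say which layers it covers), and the loss uses the CTC settings listed in the next section. Everything runs in float32, in line with AMP being disabled.

```python
import torch
import torch.nn as nn


class TinyCTCModel(nn.Module):
    """Hypothetical stand-in for the CRNN: same input and output shapes, far fewer layers."""

    def __init__(self, num_classes=63):
        super().__init__()
        self.pool = nn.AvgPool2d((1, 4))    # 256 columns -> 64 timesteps
        self.fc = nn.Linear(60, num_classes)

    def forward(self, x):                   # (B, 1, 60, 256)
        x = self.pool(x).squeeze(1)         # (B, 60, 64)
        return self.fc(x.permute(2, 0, 1))  # (64, B, 63)


def build_training(model):
    """Apply the README's init, optimizer, and loss settings to `model`."""
    # Assumption: only the output layer gets the small uniform init.
    nn.init.uniform_(model.fc.weight, -1e-3, 1e-3)
    nn.init.zeros_(model.fc.bias)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
    criterion = nn.CTCLoss(blank=0, zero_infinity=True)
    return optimizer, criterion


def train_step(model, optimizer, criterion, images, targets, target_lengths):
    """One optimization step with CTC loss and gradient clipping."""
    model.train()
    optimizer.zero_grad()
    logits = model(images)                 # (T, B, C)
    log_probs = logits.log_softmax(dim=2)  # CTCLoss expects log-probabilities
    input_lengths = torch.full((logits.size(1),), logits.size(0), dtype=torch.long)
    loss = criterion(log_probs, targets, input_lengths, target_lengths)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```

Clipping is applied after `backward()` and before `optimizer.step()`, so it rescales the gradients that the optimizer actually consumes.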

CTC Training

Input: Images resized to 60×256 (height×width)
Output: Character sequences (a-z, A-Z, 0-9)
Loss: CTCLoss with blank=0, zero_infinity=True
Decoding: Greedy CTC decode (collapse repeated predictions, then remove blanks)
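Greedy decoding takes the most likely class at each of the 64 timesteps, merges consecutive repeats, and drops blanks. The sketch below is plain Python and works on a list of per-frame class indices. In PyTorch those indices would come from something like `logits.argmax(dim=2)`. The function name is illustrative.

```python
def greedy_ctc_decode(frame_indices, blank=0):
    """Collapse repeated indices, then drop blanks (best-path CTC decoding)."""
    decoded = []
    previous = blank
    for idx in frame_indices:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank:
            decoded.append(idx)
        previous = idx
    return decoded
```

The blank is what lets the model emit doubled characters: two runs of the same index separated by a blank decode to two symbols, while an unbroken run decodes to one.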

Data Pipeline

Images: Grayscale, normalized to [0,1], resized to 60×256 with cv2
Labels: CSV with filename and text label
Batching: Variable-length sequences with custom CTC collate function
Debugging: Real-time monitoring of logits, blank probability, predictions
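Because every image has the same shape but labels can differ in length, a CTC collate function usually stacks the images and concatenates the labels alongside a lengths tensor. The sketch below assumes each dataset item is an `(image_tensor, label_indices)` pair. The function name and return order are illustrative and may not match src/collate.py.

```python
import torch


def ctc_collate(batch):
    """Batch (image, label_indices) pairs into the tensors nn.CTCLoss expects.

    Images share one shape, so they stack. Labels vary in length, so they are
    concatenated into one 1-D tensor with a separate lengths tensor.
    """
    images = torch.stack([img for img, _ in batch])  # (B, 1, H, W)
    target_lengths = torch.tensor([len(lbl) for _, lbl in batch], dtype=torch.long)
    targets = torch.tensor([i for _, lbl in batch for i in lbl], dtype=torch.long)
    return images, targets, target_lengths
```

A function like this would be passed to the loader as `DataLoader(dataset, batch_size=32, collate_fn=ctc_collate)`.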

📊 Performance Expectations

GTX 1650 (4GB VRAM)

Training time: 3-8 hours for 100k images × 40 epochs
Batch size: 32
Memory efficient with H=60, W=256

Tesla T4 (16GB VRAM)

Training time: 2-4 hours for 100k images × 40 epochs
Batch size: 128
Mixed precision (AMP) enabled

🛠️ Development Workflow

Implement Dataset class - Load and preprocess images
Build CRNN model - CNN + BiLSTM architecture
Create training loop - With validation and checkpoints
Add metrics - CER and accuracy tracking
Test on small dataset - Verify everything works
Scale to full dataset - Train on Colab

🤝 Contributing

This is a learning project! Feel free to:

Ask questions about implementation details
Experiment with different architectures
Improve the data generation or training pipeline

📚 Resources

📝 License

This project is for educational purposes. Feel free to use and modify as needed.

Happy coding! 🚀