mohakkapoor4
metadata
title: CaptchaOCR
emoji: ๐Ÿ”
colorFrom: green
colorTo: gray
sdk: gradio
sdk_version: 5.43.1
app_file: app.py
pinned: false
short_description: CAPTCHA text recognition using CRNN neural networks

CAPTCHA OCR Project

A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling.

🎯 Project Overview

This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses:

  • Synthetic CAPTCHA generation for training data
  • CRNN (CNN + RNN) architecture for sequence recognition
  • CTC (Connectionist Temporal Classification) loss for training
  • PyTorch with CUDA support for GPU acceleration

๐Ÿ—๏ธ Current Status

โœ… Completed Components

  • Dataset Generation: Synthetic CAPTCHA creation with train/val/test splits (8k train, 1k val)
  • Configuration: Centralized config with image dimensions and training parameters
  • Vocabulary System: Character encoding/decoding with CTC blank token support (63 classes)
  • CTC Collate Function: Proper batching for variable-length sequences
  • CTC Decoding: Greedy decode for inference
  • PyTorch Dataset Class: Image loading and preprocessing with proper cv2 resizing
  • CRNN Model: CNN encoder + BiLSTM + LayerNorm + linear output (working!)
  • Training Loop: Complete epoch-based training pipeline with validation
  • Metrics & Plotting: Training/validation loss tracking with beautiful visualizations
  • Debugging Tools: Comprehensive logging of logits, predictions, and model health

✅ What's Working

  • Training Pipeline: Stable training loop with excellent loss convergence
  • Model Architecture: CRNN produces correct output shapes (64×batch×63) with H=60, W=256
  • Data Loading: Proper image preprocessing and CTC batching
  • Full CAPTCHA Recognition: Model now recognizes complete CAPTCHA sequences
  • Inference Pipeline: Complete inference script with visualization and accuracy metrics
  • Early Stopping: Enhanced early stopping prevents overfitting automatically
  • High Accuracy: 75-100% whole-sequence accuracy (varies by run), with character accuracy of 96% or higher (at least 25 of 26 characters correct)

🎯 Training Status

  • Current: Epoch 8, excellent convergence achieved
  • Best Model: Validation loss 0.1782, early stopping working perfectly
  • Performance: 75-100% accuracy on fresh CAPTCHAs (varies by run)

📊 Training Results

[Figures: Training Losses; Loss Comparison]

Key Insights:

  • Rapid convergence: Loss dropped from 21 → 0.1 in the first 7 epochs
  • No overfitting observed: Early stopping halts training once validation loss stops improving
  • Stable training: Val/Train ratio stays healthy throughout training
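
The early-stopping behaviour and the Val/Train ratio mentioned above can be illustrated with a minimal sketch. This is not the project's implementation: the class name `EarlyStopping`, the `patience` and `min_delta` parameters, and the loss history are assumptions chosen for illustration (only the endpoints 21 and 0.1782 echo numbers quoted in this README).

```python
class EarlyStopping:
    """Signal a stop once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # hypothetical: epochs to wait without improvement
        self.min_delta = min_delta    # hypothetical: smallest drop that counts as progress
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def val_train_ratio(val_loss, train_loss):
    """Ratio well above 1 suggests the model fits training data better than unseen data."""
    return val_loss / max(train_loss, 1e-12)


# Illustrative validation losses, not a real training log.
history = [21.0, 3.2, 0.9, 0.25, 0.1782, 0.19, 0.20]
stopper = EarlyStopping(patience=2, min_delta=0.001)
stopped_at = None
for epoch, loss in enumerate(history, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

In a real loop the best checkpoint would be saved whenever `step` records an improvement, so stopping late never loses the best model.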

๐Ÿ” Inference Results

Inference Results

Model Performance:

  • Visual predictions: Shows actual CAPTCHA images with predicted text
  • High accuracy: 75-100% overall accuracy on fresh CAPTCHAs
  • Character-level precision: 96%+ character accuracy (25/26+ correct)
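
To make the difference between overall accuracy and character accuracy concrete, here is a small sketch. It assumes that overall accuracy means exact match of the whole sequence and that character accuracy is one minus the character error rate (edit distance divided by total target characters). The project's inference script may define these differently, and the function names and sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def evaluate(predictions, targets):
    """Return sequence-level accuracy, character accuracy and CER for paired lists."""
    exact = sum(p == t for p, t in zip(predictions, targets))
    total_chars = sum(len(t) for t in targets)
    errors = sum(edit_distance(p, t) for p, t in zip(predictions, targets))
    cer = errors / total_chars
    return {
        "sequence_accuracy": exact / len(targets),
        "char_accuracy": 1.0 - cer,
        "cer": cer,
    }


# Made-up example: one character wrong in the last CAPTCHA.
targets = ["aB3x", "Qw9Z", "k7Lm", "Zz01"]
predictions = ["aB3x", "Qw9Z", "k7Lm", "Zz0l"]
metrics = evaluate(predictions, targets)
print(metrics)
```

A single wrong character fails the whole sequence, which is why sequence accuracy (here 3 of 4) can sit well below character accuracy (here 15 of 16).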

๐Ÿ“ Project Structure

CaptchaDetect/
├── Dataset/                # Full dataset (100k images) - for Colab training
├── Dataset_test/           # Test dataset (1k images) - for local development
│   └── captchas/
│       ├── train/          # 80% of data
│       ├── val/            # 10% of data
│       └── test/           # 10% of data
├── src/
│   ├── config.py           # Configuration and hyperparameters
│   ├── vocab.py            # Character vocabulary and CTC encoding/decoding
│   ├── data.py             # Dataset generation script
│   ├── collate.py          # CTC batching function
│   ├── captcha_dataset.py  # PyTorch Dataset class
│   ├── model_crnn.py       # CRNN model architecture
│   └── plotting.py         # Training metrics and visualization
├── train.py                # Main training script (✅ WORKING!)
├── Metrics/                # Training plots and logs (auto-generated)
├── .gitignore              # Ignores dataset contents, keeps structure
└── README.md               # This file

🚀 Quick Start

1. Environment Setup

# Install PyTorch with CUDA support (adjust version as needed)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Install other dependencies
pip install captcha pandas pillow

2. Generate Training Dataset

cd src
python data.py

This creates 10,000 synthetic CAPTCHAs in Dataset_test/captchas/ with proper train/val/test splits.
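
As a rough illustration of what a generation script like this has to do, the sketch below plans random labels, splits them 80/10/10, and builds a filename/label CSV. It is not the contents of `data.py`: the label length range, the file naming scheme, the CSV column names, and all function names are assumptions. The optional `render` helper uses `ImageCaptcha` from the `captcha` package and is defined but not called here.

```python
import csv
import io
import random
import string

# 62 symbols: a-z, A-Z, 0-9 (matches the character set listed in this README).
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits


def make_labels(n, min_len=4, max_len=6, seed=0):
    """Random label strings; the 4-6 length range is an assumption."""
    rng = random.Random(seed)
    return ["".join(rng.choices(ALPHABET, k=rng.randint(min_len, max_len)))
            for _ in range(n)]


def split_labels(labels, train_pct=80, val_pct=10):
    """Split into train/val/test using integer percentages (the rest goes to test)."""
    n_train = len(labels) * train_pct // 100
    n_val = len(labels) * val_pct // 100
    return {
        "train": labels[:n_train],
        "val": labels[n_train:n_train + n_val],
        "test": labels[n_train + n_val:],
    }


def labels_csv(labels):
    """CSV text with one (filename, label) row per image; column names are hypothetical."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["filename", "label"])
    for i, text in enumerate(labels):
        writer.writerow([f"{i:06d}.png", text])
    return buf.getvalue()


def render(text, path, width=256, height=60):
    """Render one CAPTCHA image with the `captcha` package (not called in this sketch)."""
    from captcha.image import ImageCaptcha
    ImageCaptcha(width=width, height=height).write(text, path)


splits = split_labels(make_labels(10_000))
print({name: len(items) for name, items in splits.items()})
```

With 10,000 labels this yields the 8k/1k/1k split quoted earlier in this README.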

3. Start Training

python train.py

Run this from the project root, where train.py lives (use cd .. first if you are still in src/). It starts the full training pipeline with automatic metrics generation.

4. Monitor Progress

Training will show:

  • Real-time loss and prediction samples
  • Automatic plot generation in Metrics/ folder
  • Comprehensive training logs and summaries

🎮 Usage

Training

python train.py
  • Automatic early stopping prevents overfitting
  • Real-time metrics and sample predictions
  • Checkpoint saving for best model

Inference

python inference.py
  • Loads best trained model automatically
  • Generates test CAPTCHAs for evaluation
  • Shows both overall and character accuracy
  • Creates visualization plots in Metrics folder

🔬 Technical Details

Model Architecture (CRNN)

The model uses a CNN + RNN + CTC architecture specifically designed for sequence recognition:

graph TD
    %% Input Layer
    A[Input Image<br/>60x256x1] --> B[CNN Encoder<br/>SmallCNN]
    
    %% CNN Subgraph - Top to Bottom
    subgraph CNN ["CNN Encoder Layer"]
        B --> C[Conv1 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 2x2]
        C --> D[Channels: 1 to 64<br/>Spatial: 60x256 to 30x128]
        
        D --> E[Conv2 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 1x2]
        E --> F[Channels: 64 to 128<br/>Spatial: 30x128 to 30x64]
        
        F --> G[Residual Block<br/>3x3 Conv + BatchNorm + ReLU<br/>3x3 Conv + BatchNorm<br/>+ Skip Connection]
        G --> H[Maintains: 128 channels, 30x64 spatial]
        
        H --> I[Height Pooling<br/>AdaptiveAvgPool2d 1xNone]
        I --> J[Squeeze Height<br/>30x64 to 1x64]
        
        J --> K[Permute & Reshape<br/>B,128,1,64 to 64,B,128]
    end
    
    %% RNN Subgraph - Top to Bottom
    subgraph RNN ["RNN Decoder Layer"]
        K --> L[RNN Decoder<br/>2-Layer BiLSTM]
        L --> M[Hidden Size: 320 per direction<br/>Total: 640 features]
        M --> N[Output: 64,B,640]
    end
    
    %% Output Subgraph - Top to Bottom
    subgraph OUTPUT ["Output & CTC Layer"]
        N --> O[LayerNorm<br/>Stabilize 640D features]
        O --> P[Linear Layer<br/>640 to 63 classes]
        P --> Q[Output Logits<br/>64,B,63]
        
        Q --> R[CTC Decoding<br/>Remove duplicates & blanks]
        R --> S[Final Prediction<br/>Character sequence]
    end
    
    %% Styling - Darker colors with black text
    classDef inputLayer fill:#ffcc80,stroke:#e65100,stroke-width:3px,color:#000000
    classDef cnnLayer fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px,color:#000000
    classDef rnnLayer fill:#bbdefb,stroke:#1565c0,stroke-width:3px,color:#000000
    classDef outputLayer fill:#f8bbd9,stroke:#ad1457,stroke-width:3px,color:#000000
    classDef ctcLayer fill:#d1c4e9,stroke:#512da8,stroke-width:3px,color:#000000
    
    class A inputLayer
    class B,C,D,E,F,G,H,I,J,K cnnLayer
    class L,M,N rnnLayer
    class O,P,Q outputLayer
    class R,S ctcLayer
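
The diagram can be read as roughly the following PyTorch module. This is a sketch reconstructed from the diagram alone, not the contents of `model_crnn.py`: the class names, the ReLU applied after the skip connection, and the padding choices are assumptions.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Two 3x3 conv + BatchNorm layers with a skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Assumption: ReLU after the addition (the diagram does not say).
        return torch.relu(x + self.body(x))


class CRNN(nn.Module):
    def __init__(self, num_classes=63, hidden=320):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 2)),                 # 60x256 -> 30x128
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.MaxPool2d((1, 2)),                 # 30x128 -> 30x64
            ResidualBlock(128),                   # keeps 128 channels, 30x64
            nn.AdaptiveAvgPool2d((1, None)),      # 30x64 -> 1x64
        )
        self.rnn = nn.LSTM(128, hidden, num_layers=2, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, images):                                # (B, 1, 60, 256)
        feats = self.cnn(images)                              # (B, 128, 1, 64)
        seq = feats.squeeze(2).permute(2, 0, 1).contiguous()  # (64, B, 128)
        out, _ = self.rnn(seq)                                # (64, B, 640)
        return self.head(self.norm(out))                      # (64, B, 63)


model = CRNN().eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 1, 60, 256))
print(tuple(logits.shape))  # (timesteps, batch, classes)
```

The time-major output layout (timesteps first) is what `nn.CTCLoss` expects, so no further transpose is needed before the loss.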

Key Design Features

  • Total Horizontal Stride: 4 (256 → 64 timesteps)
  • Height Compression: 60 → 1 (via pooling)
  • Residual Connections: Mitigate vanishing gradients
  • Bidirectional LSTM: Captures context from both directions
  • LayerNorm: Training stability before final classification
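
The stride figures above follow from the two pooling layers in the diagram. The arithmetic is worked through in the sketch below, together with the standard CTC constraint that the number of timesteps must be at least the label length plus the number of adjacent repeated characters. The helper names are hypothetical.

```python
def pooled_shape(height, width, pools):
    """Apply a sequence of (pool_h, pool_w) max-pool factors to an image size."""
    for pool_h, pool_w in pools:
        height //= pool_h
        width //= pool_w
    return height, width


def min_ctc_timesteps(label):
    """CTC needs one frame per character plus a blank between adjacent repeats."""
    repeats = sum(a == b for a, b in zip(label, label[1:]))
    return len(label) + repeats


# MaxPool 2x2 followed by MaxPool 1x2, as in the architecture diagram.
height, width = pooled_shape(60, 256, [(2, 2), (1, 2)])
timesteps = width                # height is later averaged down to 1
total_stride = 256 // timesteps

print(height, width, total_stride)   # 30 64 4
print(min_ctc_timesteps("aabb1"))    # 7
```

With 64 timesteps available, short CAPTCHA labels leave plenty of slack for the blank frames CTC inserts.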

Training Optimizations

  • AdamW Optimizer: lr=3e-4, weight_decay=1e-4
  • Gradient Clipping: max_norm=1.0 prevents exploding gradients
  • Weight Initialization: Small uniform weights (-1e-3, 1e-3) for stability
  • Numeric Stability: AMP disabled during initial training for stability
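
A single optimization step using these settings might look like the sketch below. It is an illustration rather than an excerpt from `train.py`: `TinySeqModel` is a hypothetical stand-in for the CRNN, the small-uniform initialization is applied only to linear layers as an assumption, and the CTC loss settings (blank=0, zero_infinity=True) are the ones listed under CTC Training in this README.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TinySeqModel(nn.Module):
    """Hypothetical stand-in for the CRNN: maps (T, B, F) features to (T, B, classes)."""

    def __init__(self, in_features=8, num_classes=63):
        super().__init__()
        self.proj = nn.Linear(in_features, num_classes)

    def forward(self, x):
        return self.proj(x)


def init_small_uniform(module, bound=1e-3):
    """Small uniform weights in (-bound, bound); which layers get this is an assumption."""
    if isinstance(module, nn.Linear):
        nn.init.uniform_(module.weight, -bound, bound)
        nn.init.zeros_(module.bias)


model = TinySeqModel()
model.apply(init_small_uniform)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
criterion = nn.CTCLoss(blank=0, zero_infinity=True)


def train_step(inputs, targets, target_lengths):
    """One full-precision step (no AMP) with gradient clipping at max_norm=1.0."""
    model.train()
    optimizer.zero_grad()
    log_probs = model(inputs).log_softmax(dim=2)          # (T, B, C)
    timesteps, batch, _ = log_probs.shape
    input_lengths = torch.full((batch,), timesteps, dtype=torch.long)
    loss = criterion(log_probs, targets, input_lengths, target_lengths)
    loss.backward()
    grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item(), float(grad_norm)


inputs = torch.randn(64, 2, 8)                     # random features, 64 timesteps
targets = torch.tensor([1, 2, 3, 4, 5, 6, 7])      # two labels, flattened
target_lengths = torch.tensor([4, 3])
loss, grad_norm = train_step(inputs, targets, target_lengths)
print(loss, grad_norm)
```

`clip_grad_norm_` returns the gradient norm measured before clipping, which is a convenient value to log when watching for exploding gradients.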

CTC Training

  • Input: Images resized to 60×256 (height×width)
  • Output: Character sequences (a-z, A-Z, 0-9)
  • Loss: CTCLoss with blank=0, zero_infinity=True
  • Decoding: Greedy CTC decode with duplicate removal
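
The 63-class vocabulary and greedy decoding described above can be sketched in plain Python. This is not the code in `vocab.py`: the ordering of characters (a-z, then A-Z, then 0-9, starting at index 1) and all names are assumptions. Only blank=0 and the class count come from this README.

```python
import string

BLANK = 0  # CTC blank index, as in CTCLoss(blank=0)
CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(CHARS)}   # assumed ordering
ID_TO_CHAR = {i: c for c, i in CHAR_TO_ID.items()}
NUM_CLASSES = len(CHARS) + 1                            # 62 characters + blank = 63


def encode(text):
    """Map a label string to a list of class indices (never contains the blank)."""
    return [CHAR_TO_ID[c] for c in text]


def argmax_frames(logits):
    """Pick the best class per timestep from a list of per-frame score lists."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def greedy_decode(frame_ids):
    """Collapse consecutive duplicates, then drop blanks."""
    chars = []
    prev = BLANK
    for idx in frame_ids:
        if idx != prev and idx != BLANK:
            chars.append(ID_TO_CHAR[idx])
        prev = idx
    return "".join(chars)


# Frames "a a _ a b b _ _" decode to "aab": the blank separates the two a's.
print(greedy_decode([1, 1, 0, 1, 2, 2, 0, 0]))
```

The blank between the two `a` frames is what lets CTC emit a doubled character. Without it, the repeated frames would collapse into a single `a`.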

Data Pipeline

  • Images: Grayscale, normalized to [0,1], proper cv2 resizing
  • Labels: CSV with filename and text label
  • Batching: Variable-length sequences with custom CTC collate function
  • Debugging: Real-time monitoring of logits, blank probability, predictions
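
The label side of a CTC collate function can be sketched without any framework: each label is encoded, all labels are concatenated into one flat target list, and the per-sample lengths are kept so the loss can split them again. This is an assumption about how `collate.py` is organized. `toy_encode` is a hypothetical lowercase-only encoder used to keep the example self-contained, and in the real pipeline these lists would become tensors.

```python
def toy_encode(text):
    """Hypothetical encoder for the example: 'a' -> 1, 'b' -> 2, ... (0 is the blank)."""
    return [ord(c) - ord("a") + 1 for c in text]


def ctc_collate(batch, encode=toy_encode):
    """batch: list of (image, text) pairs with labels of different lengths.

    Returns the images unchanged, one flat list of target indices,
    and the length of each label.
    """
    images = [image for image, _ in batch]
    encoded = [encode(text) for _, text in batch]
    targets = [idx for seq in encoded for idx in seq]
    target_lengths = [len(seq) for seq in encoded]
    return images, targets, target_lengths


# Placeholder strings stand in for image arrays.
batch = [("img0", "abc"), ("img1", "ba")]
images, targets, target_lengths = ctc_collate(batch)
print(targets, target_lengths)   # [1, 2, 3, 2, 1] [3, 2]
```

Concatenating targets avoids padding the labels. Because all images share the same 60×256 size, only the labels vary in length.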

📊 Performance Expectations

GTX 1650 (4GB VRAM)

  • Training time: 3-8 hours for 100k images × 40 epochs
  • Batch size: 32
  • Memory efficient with H=60, W=256

Tesla T4 (16GB VRAM)

  • Training time: 2-4 hours for 100k images × 40 epochs
  • Batch size: 128
  • Mixed precision (AMP) enabled

๐Ÿ› ๏ธ Development Workflow

  1. Implement Dataset class - Load and preprocess images
  2. Build CRNN model - CNN + BiLSTM architecture
  3. Create training loop - With validation and checkpoints
  4. Add metrics - CER and accuracy tracking
  5. Test on small dataset - Verify everything works
  6. Scale to full dataset - Train on Colab

๐Ÿค Contributing

This is a learning project! Feel free to:

  • Ask questions about implementation details
  • Experiment with different architectures
  • Improve the data generation or training pipeline

๐Ÿ“ License

This project is for educational purposes. Feel free to use and modify as needed.


Happy coding! 🚀