CAPTCHA OCR Project

A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling.

🎯 Project Overview

This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses:

  • Synthetic CAPTCHA generation for training data
  • CRNN (CNN + RNN) architecture for sequence recognition
  • CTC (Connectionist Temporal Classification) loss for training
  • PyTorch with CUDA support for GPU acceleration

๐Ÿ—๏ธ Current Status

โœ… Completed Components

  • Dataset Generation: Synthetic CAPTCHA creation with train/val/test splits
  • Configuration: Centralized config with image dimensions and training parameters
  • Vocabulary System: Character encoding/decoding with CTC blank token support
  • CTC Collate Function: Proper batching for variable-length sequences
  • CTC Decoding: Greedy decode for inference
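The vocabulary and greedy decode steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the contents of src/vocab.py: the names `CHARS`, `encode`, and `greedy_ctc_decode` are hypothetical, and the only details taken from this README are the character set (a-z, A-Z, 0-9) and the blank index of 0.

```python
import string

# Character set as listed in this README: a-z, A-Z, 0-9 (62 symbols).
CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits
BLANK = 0  # index reserved for the CTC blank

# Real characters start at index 1 so they never collide with the blank.
CHAR_TO_IDX = {c: i + 1 for i, c in enumerate(CHARS)}
IDX_TO_CHAR = {i: c for c, i in CHAR_TO_IDX.items()}


def encode(text):
    """Map a label string to a list of class indices (no blanks)."""
    return [CHAR_TO_IDX[c] for c in text]


def greedy_ctc_decode(frame_indices):
    """Collapse consecutive repeats, then drop blanks (best-path decoding)."""
    chars = []
    prev = BLANK
    for idx in frame_indices:
        if idx != prev and idx != BLANK:
            chars.append(IDX_TO_CHAR[idx])
        prev = idx
    return "".join(chars)
```

`frame_indices` is the per-time-step argmax of the model output. A blank between two identical indices is what lets a doubled letter survive the collapse, so `[1, 1, 0, 1, 2, 2, 0]` decodes to `"aab"`.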

🔧 In Progress / Next Steps

  • PyTorch Dataset Class: Image loading and preprocessing
  • CRNN Model: CNN encoder + BiLSTM + linear output
  • Training Loop: Complete training pipeline with validation
  • Metrics: CER (Character Error Rate) and exact match accuracy
  • Inference Pipeline: Model loading and prediction
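Since the metrics are not implemented yet, here is one possible way to compute them in plain Python. It assumes CER is defined as total Levenshtein distance divided by total reference characters, which is a common convention but not something this project has fixed; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(predictions, targets):
    """Total edit distance divided by total reference characters."""
    total_edits = sum(edit_distance(p, t) for p, t in zip(predictions, targets))
    total_chars = sum(len(t) for t in targets)
    return total_edits / max(total_chars, 1)


def exact_match(predictions, targets):
    """Fraction of predictions that equal their target string exactly."""
    if not targets:
        return 0.0
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)
```

For CAPTCHAs, exact match is the stricter number: a single wrong character fails the whole image, while CER only counts it as one edit.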

๐Ÿ“ Project Structure

CaptchaDetect/
โ”œโ”€โ”€ Dataset/                 # Full dataset (100k images) - for Colab training
โ”œโ”€โ”€ Dataset_test/           # Test dataset (1k images) - for local development
โ”‚   โ””โ”€โ”€ captchas/
โ”‚       โ”œโ”€โ”€ train/          # 80% of data
โ”‚       โ”œโ”€โ”€ val/            # 10% of data
โ”‚       โ””โ”€โ”€ test/           # 10% of data
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ config.py           # Configuration and hyperparameters
โ”‚   โ”œโ”€โ”€ vocab.py            # Character vocabulary and CTC encoding
โ”‚   โ”œโ”€โ”€ data.py             # Dataset generation script
โ”‚   โ”œโ”€โ”€ collate.py          # CTC batching function
โ”‚   โ””โ”€โ”€ [model files]       # Coming soon...
โ”œโ”€โ”€ .gitignore              # Ignores dataset contents, keeps structure
โ””โ”€โ”€ README.md               # This file

🚀 Quick Start

1. Environment Setup

# Install PyTorch with CUDA support (adjust version as needed)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Install other dependencies
pip install captcha pandas pillow

2. Generate Test Dataset

cd src
python data.py

This creates 1,000 synthetic CAPTCHAs in Dataset_test/captchas/, split 80/10/10 into train/, val/, and test/.
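To make the 80/10/10 split concrete, the sketch below plans labels for each split without rendering any images. It is not the actual src/data.py: `plan_splits`, the label length of 5, and the fixed seed are assumptions made for illustration.

```python
import random
import string

# Label alphabet from this README: a-z, A-Z, 0-9.
CHARS = string.ascii_letters + string.digits


def plan_splits(n_total, label_len=5, seed=0):
    """Generate random labels and assign them 80/10/10 to train/val/test.

    label_len and seed are hypothetical defaults, not project settings.
    """
    rng = random.Random(seed)
    labels = ["".join(rng.choices(CHARS, k=label_len)) for _ in range(n_total)]
    n_train = n_total * 8 // 10   # integer arithmetic avoids float rounding
    n_val = n_total // 10
    return {
        "train": labels[:n_train],
        "val": labels[n_train:n_train + n_val],
        "test": labels[n_train + n_val:],   # remainder goes to test
    }


splits = plan_splits(1000)
sizes = {name: len(items) for name, items in splits.items()}
```

With 1,000 labels, `sizes` comes out as 800 train, 100 val, and 100 test. Each label would then be rendered to an image and written into the matching folder.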

3. Configuration

Edit src/config.py to adjust:

  • Image dimensions (H=48, W_max=224)
  • Batch sizes (32 for local GTX 1650, 128 for Colab T4)
  • Training parameters
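A centralized config along these lines might look like the following. The values (H=48, W_max=224, batch sizes 32 and 128, 40 epochs, AMP on the T4) come from this README, but the class and field names are hypothetical and may not match src/config.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Field names are illustrative; only the values come from the README.
    img_h: int = 48        # image height (H)
    img_w_max: int = 224   # maximum image width (W_max)
    batch_size: int = 32   # local GTX 1650 setting
    epochs: int = 40       # epoch count quoted for the full run
    use_amp: bool = False  # mixed precision, listed as enabled on the T4


LOCAL = Config()
COLAB = Config(batch_size=128, use_amp=True)
```

Freezing the dataclass keeps a run's settings from being changed by accident, and switching environments is a matter of picking `LOCAL` or `COLAB`.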

🎮 Usage

Local Development (GTX 1650)

  • Use Dataset_test (1k images)
  • Batch size: 32-48
  • Good for rapid iteration and testing

Colab Training (Tesla T4)

  • Use Dataset (100k images)
  • Batch size: 128
  • Expected training time: 2-4 hours for 40 epochs

🔬 Technical Details

Model Architecture

  • CNN Encoder: Reduces image to sequence representation
  • BiLSTM: Processes sequential features
  • Linear Output: Maps to vocabulary size (including blank token)
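The three stages above could be wired together roughly as follows. The model is still listed as in progress, so treat this as a sketch under assumptions: the layer counts, channel widths, hidden size of 128, and the class name `CRNN` are all made up for illustration. Only the overall shape (CNN, then BiLSTM, then a linear layer over 63 classes) follows this README.

```python
import torch
from torch import nn


class CRNN(nn.Module):
    """Illustrative CNN + BiLSTM + linear model for 1x48x224 inputs."""

    def __init__(self, num_classes=63, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48x224 -> 24x112
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 12x56
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                                         # -> 6x56 (width kept)
        )
        # Input size 128 * 6 assumes the 48-pixel input height used above.
        self.rnn = nn.LSTM(128 * 6, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):
        f = self.cnn(x)                                  # (B, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # one feature vector per column
        out, _ = self.rnn(f)                             # (B, W', 2 * hidden)
        return self.fc(out)                              # (B, W', num_classes)


model = CRNN()
with torch.no_grad():
    logits = model(torch.zeros(2, 1, 48, 224))
```

Each of the 56 remaining image columns becomes one time step, so the output has shape (2, 56, 63) for a batch of two. Pooling only along the height in the last stage keeps the sequence long enough for CTC to align every character.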

CTC Training

  • Input: Images resized to 48×224
  • Output: Character sequences (a-z, A-Z, 0-9)
  • Loss: CTCLoss with blank=0
  • Decoding: Greedy CTC decode
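As a sketch of how the loss listed above is typically called, the snippet below feeds random logits to `torch.nn.CTCLoss` with `blank=0`. The tensor sizes (56 time steps, 63 classes) and the target values are assumptions chosen for the example; `zero_infinity=True` is an optional safeguard, not a documented project setting.

```python
import torch
from torch import nn

torch.manual_seed(0)
T, B, C = 56, 2, 63  # time steps, batch size, classes (62 characters + blank)

# Stand-in for model output. CTCLoss expects shape (T, B, C), so a model
# that emits (B, T, C) would need logits.permute(1, 0, 2) first.
logits = torch.randn(T, B, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)  # CTCLoss takes log-probabilities

# Two labels of length 4 and 3, concatenated into one 1-D tensor.
targets = torch.tensor([1, 2, 3, 4, 5, 6, 7], dtype=torch.long)
target_lengths = torch.tensor([4, 3], dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back to the logits
```

Target indices must never be 0 here, because that index is reserved for the blank. Each input length must also be at least as long as its target, plus one step for every repeated adjacent character.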

Data Format

  • Images: Grayscale, normalized tensors
  • Labels: CSV with filename and text label
  • Batching: Variable-length sequences handled by custom collate
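The label side of this data format can be illustrated without PyTorch. The sketch below reads a small in-memory CSV and flattens the encoded labels into the concatenated-targets-plus-lengths form that CTC batching uses. The column names `filename` and `label`, the sample rows, and the helper names are assumptions, not the project's actual files.

```python
import csv
import io
import string

CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits
CHAR_TO_IDX = {c: i + 1 for i, c in enumerate(CHARS)}  # 0 is kept for the CTC blank

# Hypothetical CSV layout; the header names are assumptions.
SAMPLE_CSV = "filename,label\n0001.png,ab\n0002.png,c\n"


def read_labels(fileobj):
    """Return (filename, label) pairs from a CSV with a header row."""
    return [(row["filename"], row["label"]) for row in csv.DictReader(fileobj)]


def flatten_targets(labels):
    """Concatenate encoded labels and record each label's length."""
    encoded = [[CHAR_TO_IDX[c] for c in text] for text in labels]
    targets = [idx for seq in encoded for idx in seq]
    lengths = [len(seq) for seq in encoded]
    return targets, lengths


rows = read_labels(io.StringIO(SAMPLE_CSV))
targets, lengths = flatten_targets([label for _, label in rows])
```

Here `targets` is `[1, 2, 3]` and `lengths` is `[2, 1]`. Keeping lengths alongside one flat list avoids padding labels of different sizes.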

📊 Performance Expectations

GTX 1650 (4GB VRAM)

  • Training time: 3-8 hours for 100k×40 epochs
  • Batch size: 32-48
  • Memory efficient with H=48

Tesla T4 (16GB VRAM)

  • Training time: 2-4 hours for 100k×40 epochs
  • Batch size: 128
  • Mixed precision (AMP) enabled
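These time estimates can be turned into a rough per-step budget. The calculation below assumes that only the 80% training split (80,000 of the 100k images) is seen each epoch, which this README does not state explicitly; the helper names are hypothetical.

```python
import math


def steps_per_epoch(n_images, batch_size):
    """Number of optimizer steps needed to see every image once."""
    return math.ceil(n_images / batch_size)


def seconds_per_step(total_hours, n_images, batch_size, epochs):
    """Average time budget per step implied by a total training time."""
    total_steps = steps_per_epoch(n_images, batch_size) * epochs
    return total_hours * 3600 / total_steps


N_TRAIN = 80_000  # assumption: 80% of the 100k images

t4_steps = steps_per_epoch(N_TRAIN, 128)            # Tesla T4, batch size 128
t4_fast = seconds_per_step(2, N_TRAIN, 128, 40)     # 2-hour estimate
t4_slow = seconds_per_step(4, N_TRAIN, 128, 40)     # 4-hour estimate
```

Under that assumption the T4 run is 625 steps per epoch and 25,000 steps in total, so the 2-4 hour estimate works out to roughly 0.29 to 0.58 seconds per step. Timing a few steps early in a run is a quick check on whether the estimate is realistic.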

๐Ÿ› ๏ธ Development Workflow

  1. Implement Dataset class - Load and preprocess images
  2. Build CRNN model - CNN + BiLSTM architecture
  3. Create training loop - With validation and checkpoints
  4. Add metrics - CER and accuracy tracking
  5. Test on small dataset - Verify everything works
  6. Scale to full dataset - Train on Colab

๐Ÿค Contributing

This is a learning project! Feel free to:

  • Ask questions about implementation details
  • Experiment with different architectures
  • Improve the data generation or training pipeline

📚 Resources

📝 License

This project is for educational purposes. Feel free to use and modify as needed.


Happy coding! 🚀