CAPTCHA OCR Project

A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling.

🎯 Project Overview

This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses:

Synthetic CAPTCHA generation for training data
CRNN (CNN + RNN) architecture for sequence recognition
CTC (Connectionist Temporal Classification) loss for training
PyTorch with CUDA support for GPU acceleration

🏗️ Current Status

✅ Completed Components

Dataset Generation: Synthetic CAPTCHA creation with train/val/test splits
Configuration: Centralized config with image dimensions and training parameters
Vocabulary System: Character encoding/decoding with CTC blank token support
CTC Collate Function: Batches variable-length label sequences for CTC loss
CTC Decoding: Greedy decode for inference

🔧 In Progress / Next Steps

PyTorch Dataset Class: Image loading and preprocessing
CRNN Model: CNN encoder + BiLSTM + linear output
Training Loop: Complete training pipeline with validation
Metrics: CER (Character Error Rate) and exact match accuracy
Inference Pipeline: Model loading and prediction

📁 Project Structure

CaptchaDetect/
├── Dataset/                 # Full dataset (100k images) - for Colab training
├── Dataset_test/           # Test dataset (1k images) - for local development
│   └── captchas/
│       ├── train/          # 80% of data
│       ├── val/            # 10% of data
│       └── test/           # 10% of data
├── src/
│   ├── config.py           # Configuration and hyperparameters
│   ├── vocab.py            # Character vocabulary and CTC encoding
│   ├── data.py             # Dataset generation script
│   ├── collate.py          # CTC batching function
│   └── [model files]       # Coming soon...
├── .gitignore              # Ignores dataset contents, keeps structure
└── README.md               # This file

🚀 Quick Start

1. Environment Setup

# Install PyTorch with CUDA support (adjust version as needed)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Install other dependencies
pip install captcha pandas pillow
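
After installing, it can help to confirm that PyTorch is importable and can see the GPU. The helper below is a minimal sketch; the function name `describe_torch_device` is hypothetical and not part of this repository.

```python
import importlib.util


def describe_torch_device() -> str:
    """Report whether PyTorch is importable and whether CUDA is usable."""
    # find_spec returns None when the package is missing, so this check
    # does not raise on a machine without PyTorch.
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"

    import torch  # imported lazily so the script still runs without it

    if torch.cuda.is_available():
        # Name of the first visible GPU, e.g. the local card or the Colab one.
        return f"cuda: {torch.cuda.get_device_name(0)}"
    return "cpu only"


print(describe_torch_device())
```

If this prints "cpu only" on a machine with an NVIDIA GPU, the installed wheel probably does not match the local CUDA driver.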

2. Generate Test Dataset

cd src
python data.py

This creates 1,000 synthetic CAPTCHAs in Dataset_test/captchas/, split 80/10/10 into train/, val/, and test/.
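
The 80/10/10 split listed in the project structure can be sketched as follows. This is an illustration only, not the logic in src/data.py; `split_indices` and its `seed` parameter are hypothetical names.

```python
import random


def split_indices(n_items: int, seed: int = 0) -> dict[str, list[int]]:
    """Shuffle item indices and split them 80/10/10 into train/val/test."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible

    n_train = n_items * 8 // 10
    n_val = n_items // 10
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],  # remainder goes to test
    }


splits = split_indices(1000)
print({name: len(ids) for name, ids in splits.items()})
# {'train': 800, 'val': 100, 'test': 100}
```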

3. Configuration

Edit src/config.py to adjust:

Image dimensions (H=48, W_max=224)
Batch sizes (32 for local GTX 1650, 128 for Colab T4)
Training parameters
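
For orientation, a centralized config along these lines might look like the sketch below. The class and field names are assumptions made for illustration; the actual contents of src/config.py may differ. Only the values (H=48, W_max=224, batch sizes 32 and 128, 40 epochs) come from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical config container; field names are illustrative."""
    img_height: int = 48      # H
    img_width_max: int = 224  # W_max
    batch_size: int = 32      # local GTX 1650 default
    epochs: int = 40


LOCAL = TrainConfig()
COLAB = TrainConfig(batch_size=128)  # Colab T4 setting

print(LOCAL)
print(COLAB)
```

Making the dataclass frozen keeps a run's settings from being mutated by accident; a second profile is created by overriding only the fields that differ.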

🎮 Usage

Local Development (GTX 1650)

Use Dataset_test (1k images)
Batch size: 32-48
Good for rapid iteration and testing

Colab Training (Tesla T4)

Use Dataset (100k images)
Batch size: 128
Expected training time: 2-4 hours for 40 epochs

🔬 Technical Details

Model Architecture

CNN Encoder: Reduces image to sequence representation
BiLSTM: Processes sequential features
Linear Output: Maps to vocabulary size (including blank token)
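
To make the shapes concrete, the sketch below works out how a 48×224 input could turn into a sequence, and how wide the output layer needs to be. The pooling schedule is an assumption chosen for illustration, since the model is not implemented yet; only the input size and the character set come from this README.

```python
import string

# Vocabulary from this README: a-z, A-Z, 0-9, plus one CTC blank.
NUM_CLASSES = len(string.ascii_letters + string.digits) + 1  # 62 + 1 = 63


def feature_map_size(height: int, width: int,
                     pools: list[tuple[int, int]]) -> tuple[int, int]:
    """Apply (pool_h, pool_w) floor-division downsampling stage by stage."""
    for pool_h, pool_w in pools:
        height, width = height // pool_h, width // pool_w
    return height, width


# Assumed pooling schedule (not from the repository): halve both axes twice,
# then halve only the height so the width keeps enough time steps for CTC.
ASSUMED_POOLS = [(2, 2), (2, 2), (2, 1), (2, 1)]

feat_h, time_steps = feature_map_size(48, 224, ASSUMED_POOLS)
print(feat_h, time_steps, NUM_CLASSES)  # 3 56 63
```

Under these assumptions the BiLSTM would see 56 time steps per image, and the linear layer would emit 63 scores per step.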

CTC Training

Input: Images resized to 48×224
Output: Character sequences (a-z, A-Z, 0-9)
Loss: CTCLoss with blank=0
Decoding: Greedy CTC decode
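
The encoding and greedy decoding steps can be illustrated in plain Python. This is a sketch under the assumptions stated in the comments, not the code in src/vocab.py; the index order of the characters is assumed, and only `blank=0` is taken from this README.

```python
import string

BLANK = 0  # CTC blank index, matching CTCLoss with blank=0
CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits
CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(CHARS)}  # indices 1..62 (assumed order)
IDX_TO_CHAR = {i: ch for ch, i in CHAR_TO_IDX.items()}


def encode(text: str) -> list[int]:
    """Map each character to its vocabulary index."""
    return [CHAR_TO_IDX[ch] for ch in text]


def greedy_decode(frame_ids: list[int]) -> str:
    """Decode per-time-step argmax indices: collapse repeats, then drop blanks."""
    out = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != BLANK:
            out.append(IDX_TO_CHAR[idx])
        prev = idx
    return "".join(out)


a, b = CHAR_TO_IDX["a"], CHAR_TO_IDX["b"]
print(greedy_decode([a, a, BLANK, a, b, b, BLANK]))  # aab
```

The blank between the two runs of `a` is what lets CTC emit a doubled character; without it, the repeated frames would collapse into a single `a`.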

Data Format

Images: Grayscale, normalized tensors
Labels: CSV with filename and text label
Batching: Variable-length sequences handled by custom collate
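
As an illustration of the label side of batching, the sketch below reads a labels CSV and flattens the encoded labels into one list plus per-sample lengths, which is one of the target layouts `torch.nn.CTCLoss` accepts. The CSV header names, the sample rows, and the function names are assumptions; the real src/collate.py also has to handle the image tensors.

```python
import csv
import io
import string

CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits
CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(CHARS)}  # 0 is reserved for blank

# Stand-in for a labels file; header names and rows are made up for the example.
SAMPLE_CSV = "filename,label\n0001.png,aB3x\n0002.png,Zq91k\n"


def read_labels(csv_text: str) -> list[tuple[str, str]]:
    """Parse (filename, label) pairs from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["filename"], row["label"]) for row in reader]


def collate_targets(texts: list[str]) -> tuple[list[int], list[int]]:
    """Encode labels, concatenate them, and record each label's length."""
    encoded = [[CHAR_TO_IDX[ch] for ch in text] for text in texts]
    flat = [idx for label in encoded for idx in label]
    lengths = [len(label) for label in encoded]
    return flat, lengths


rows = read_labels(SAMPLE_CSV)
flat, lengths = collate_targets([text for _, text in rows])
print(lengths)  # [4, 5]
```

In a PyTorch collate function, `flat` and `lengths` would typically be converted to long tensors and returned alongside the stacked image batch.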

📊 Performance Expectations

GTX 1650 (4GB VRAM)

Training time: 3-8 hours for 40 epochs on 100k images
Batch size: 32-48
Memory efficient with H=48

Tesla T4 (16GB VRAM)

Training time: 2-4 hours for 40 epochs on 100k images
Batch size: 128
Mixed precision (AMP) enabled
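
To relate these estimates to optimizer steps, the arithmetic below counts batches per epoch for each batch size. It assumes that 80% of the 100k images form the training split, as in the test dataset layout; the time estimates themselves are not derived here.

```python
import math


def steps_per_epoch(n_train: int, batch_size: int) -> int:
    """Number of batches per epoch, counting a final partial batch."""
    return math.ceil(n_train / batch_size)


N_TRAIN = 100_000 * 8 // 10  # assumed 80% train split of the full dataset
EPOCHS = 40

for name, batch_size in [("GTX 1650", 32), ("Tesla T4", 128)]:
    steps = steps_per_epoch(N_TRAIN, batch_size)
    print(name, steps, steps * EPOCHS)
# GTX 1650 2500 100000
# Tesla T4 625 25000
```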

🛠️ Development Workflow

1. Implement Dataset class - Load and preprocess images
2. Build CRNN model - CNN + BiLSTM architecture
3. Create training loop - With validation and checkpoints
4. Add metrics - CER and accuracy tracking
5. Test on small dataset - Verify everything works
6. Scale to full dataset - Train on Colab
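
The two planned metrics can be prototyped before the model exists. The sketch below computes CER as total Levenshtein edits divided by total reference characters, and exact match as the fraction of fully correct predictions. The function names and sample strings are illustrative, not taken from the repository.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: list[str], hyps: list[str]) -> float:
    """Character Error Rate: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars if total_chars else 0.0


def exact_match(refs: list[str], hyps: list[str]) -> float:
    """Fraction of predictions that equal the reference exactly."""
    if not refs:
        return 0.0
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


refs = ["aB3x", "Zq91k"]
hyps = ["aB3x", "Zq9lk"]  # second prediction confuses '1' with 'l'
print(round(cer(refs, hyps), 3), exact_match(refs, hyps))  # 0.111 0.5
```

CER gives partial credit for nearly correct strings, while exact match reflects whether a CAPTCHA would actually be solved, so tracking both is useful.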

🤝 Contributing

This is a learning project! Feel free to:

Ask questions about implementation details
Experiment with different architectures
Improve the data generation or training pipeline

📚 Resources

📝 License

This project is for educational purposes. Feel free to use and modify as needed.

Happy coding! 🚀