CAPTCHA OCR Project
A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling.
Project Overview
This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses:
- Synthetic CAPTCHA generation for training data
- CRNN (CNN + RNN) architecture for sequence recognition
- CTC (Connectionist Temporal Classification) loss for training
- PyTorch with CUDA support for GPU acceleration
Current Status
Completed Components
- Dataset Generation: Synthetic CAPTCHA creation with train/val/test splits
- Configuration: Centralized config with image dimensions and training parameters
- Vocabulary System: Character encoding/decoding with CTC blank token support
- CTC Collate Function: Proper batching for variable-length sequences
- CTC Decoding: Greedy decode for inference
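As a rough illustration of the vocabulary component, the sketch below maps characters to integer ids and reserves index 0 for the CTC blank. The names `CHARS`, `encode`, and `decode` are hypothetical, and the character set (a-z, A-Z, 0-9) is assumed from the Technical Details section; the actual src/vocab.py may be organized differently.

```python
import string

# Assumed character set: a-z, A-Z, 0-9 (62 symbols).
CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits
BLANK = 0  # index 0 is reserved for the CTC blank token

# Real characters start at 1 so they never collide with the blank.
char_to_id = {c: i + 1 for i, c in enumerate(CHARS)}
id_to_char = {i: c for c, i in char_to_id.items()}
VOCAB_SIZE = len(CHARS) + 1  # +1 for the blank


def encode(text):
    """Convert a label string into a list of integer ids."""
    return [char_to_id[c] for c in text]


def decode(ids):
    """Convert ids back into a string, skipping any blank ids."""
    return "".join(id_to_char[i] for i in ids if i != BLANK)
```

Starting real characters at 1 keeps the blank index consistent with a CTC loss configured with blank=0.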
In Progress / Next Steps
- PyTorch Dataset Class: Image loading and preprocessing
- CRNN Model: CNN encoder + BiLSTM + linear output
- Training Loop: Complete training pipeline with validation
- Metrics: CER (Character Error Rate) and exact match accuracy
- Inference Pipeline: Model loading and prediction
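The two planned metrics can be sketched in plain Python. This is a minimal version that assumes CER is computed as total Levenshtein edit distance divided by total reference length; the function names are illustrative and not part of the repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insert, delete, and substitute each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Total edit distance divided by total reference length."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / max(total_chars, 1)


def exact_match(refs, hyps):
    """Fraction of predictions that equal their reference exactly."""
    return sum(r == h for r, h in zip(refs, hyps)) / max(len(refs), 1)
```

Exact match is the stricter of the two: a single wrong character makes the whole CAPTCHA count as a miss, while CER only charges for that one character.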
Project Structure
CaptchaDetect/
├── Dataset/              # Full dataset (100k images) - for Colab training
├── Dataset_test/         # Test dataset (1k images) - for local development
│   └── captchas/
│       ├── train/        # 80% of data
│       ├── val/          # 10% of data
│       └── test/         # 10% of data
├── src/
│   ├── config.py         # Configuration and hyperparameters
│   ├── vocab.py          # Character vocabulary and CTC encoding
│   ├── data.py           # Dataset generation script
│   ├── collate.py        # CTC batching function
│   └── [model files]     # Coming soon...
├── .gitignore            # Ignores dataset contents, keeps structure
└── README.md             # This file
Quick Start
1. Environment Setup
# Install PyTorch with CUDA support (adjust version as needed)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128
# Install other dependencies
pip install captcha pandas pillow
2. Generate Test Dataset
cd src
python data.py
This creates 1,000 synthetic CAPTCHAs in Dataset_test/captchas/, split 80/10/10 into train/val/test.
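For readers curious how an 80/10/10 split can be produced, here is a minimal sketch using only the standard library. The `split_dataset` helper and the file names are hypothetical; src/data.py may split the data differently.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle items reproducibly and cut them into train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],  # remainder goes to test
    }


filenames = [f"{i:05d}.png" for i in range(1000)]  # hypothetical file names
splits = split_dataset(filenames)
```

Fixing the seed keeps the split stable across runs, so validation numbers stay comparable between experiments.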
3. Configuration
Edit src/config.py to adjust:
- Image dimensions (H=48, W_max=224)
- Batch sizes (32 for local GTX 1650, 128 for Colab T4)
- Training parameters
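A centralized config of this kind could look roughly like the sketch below. The values come from this README, but the class and field names are assumptions made for illustration and may not match src/config.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Values taken from this README; field names are illustrative only.
    img_height: int = 48
    img_max_width: int = 224
    batch_size: int = 32   # local GTX 1650
    epochs: int = 40
    blank_index: int = 0   # CTC blank token


LOCAL = Config()
COLAB = Config(batch_size=128)  # Tesla T4
```

Making the dataclass frozen prevents accidental mutation of hyperparameters in the middle of a run.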
Usage
Local Development (GTX 1650)
- Use Dataset_test (1k images)
- Batch size: 32-48
- Good for rapid iteration and testing
Colab Training (Tesla T4)
- Use Dataset (100k images)
- Batch size: 128
- Expected training time: 2-4 hours for 40 epochs
Technical Details
Model Architecture
- CNN Encoder: Reduces image to sequence representation
- BiLSTM: Processes sequential features
- Linear Output: Maps to vocabulary size (including blank token)
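Since the model is still listed as in progress, the following is only one possible shape for it: a small PyTorch CRNN with an assumed vocabulary size of 63 (62 characters plus blank), two conv/pool stages, and a hidden size of 128. None of these layer choices come from the repository.

```python
import torch
import torch.nn as nn


class CRNN(nn.Module):
    """Small CNN + BiLSTM + linear head; layer sizes are illustrative."""

    def __init__(self, vocab_size=63, img_height=48, hidden=128):
        super().__init__()
        # Two 2x2 poolings reduce height and width by a factor of 4.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat_dim = 64 * (img_height // 4)  # channels * remaining height
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)

    def forward(self, x):
        f = self.cnn(x)                                  # (B, C, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # one feature vector per column
        out, _ = self.rnn(f)                             # (B, T, 2*hidden)
        logits = self.fc(out)                            # (B, T, vocab_size)
        # nn.CTCLoss expects log-probabilities shaped (T, B, vocab_size).
        return logits.permute(1, 0, 2).log_softmax(dim=2)


model = CRNN()
log_probs = model(torch.zeros(2, 1, 48, 224))  # batch of 2 blank grayscale images
```

With these assumed poolings, a 224-pixel-wide image yields 56 time steps, which leaves comfortable room for short CAPTCHA labels under CTC.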
CTC Training
- Input: Images resized to 48×224
- Output: Character sequences (a-z, A-Z, 0-9)
- Loss: CTCLoss with blank=0
- Decoding: Greedy CTC decode
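Greedy CTC decoding takes the most likely id at each time step, collapses consecutive repeats, and then drops blanks. The sketch below shows that rule on a plain list of per-frame ids; the function name is illustrative, and in practice the ids would come from an argmax over the model output.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse repeated ids, then drop blanks (standard greedy CTC rule)."""
    decoded = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


# Per-frame argmax ids for one image.
frames = [1, 1, 0, 1, 2, 2, 0]
result = ctc_greedy_decode(frames)  # [1, 1, 2]
```

The blank between the two runs of 1 is what lets the decoder emit a genuinely repeated character instead of merging it away.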
Data Format
- Images: Grayscale, normalized tensors
- Labels: CSV with filename and text label
- Batching: Variable-length sequences handled by custom collate
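One way such a collate function can work is sketched below, assuming each dataset item is an `(image, target)` pair with the image shaped `(1, H, W)` and the target a 1-D tensor of label ids. This is a hypothetical version written for illustration, not a copy of src/collate.py.

```python
import torch


def ctc_collate(batch):
    """Pad images to the widest in the batch and flatten the targets.

    Each batch item is assumed to be (image, target), where image is a float
    tensor of shape (1, H, W) and target is a 1-D LongTensor of label ids.
    """
    images, targets = zip(*batch)
    max_w = max(img.shape[-1] for img in images)
    padded = torch.zeros(len(images), 1, images[0].shape[1], max_w)
    for i, img in enumerate(images):
        padded[i, :, :, : img.shape[-1]] = img  # right-pad with zeros
    target_lengths = torch.tensor([len(t) for t in targets], dtype=torch.long)
    flat_targets = torch.cat(targets)  # nn.CTCLoss accepts 1-D concatenated targets
    return padded, flat_targets, target_lengths
```

A function like this can be passed to `torch.utils.data.DataLoader` through its `collate_fn` argument.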
Performance Expectations
GTX 1650 (4GB VRAM)
- Training time: 3-8 hours for 100k images × 40 epochs
- Batch size: 32-48
- Memory efficient with H=48
Tesla T4 (16GB VRAM)
- Training time: 2-4 hours for 100k images × 40 epochs
- Batch size: 128
- Mixed precision (AMP) enabled
Development Workflow
- Implement Dataset class - Load and preprocess images
- Build CRNN model - CNN + BiLSTM architecture
- Create training loop - With validation and checkpoints
- Add metrics - CER and accuracy tracking
- Test on small dataset - Verify everything works
- Scale to full dataset - Train on Colab
Contributing
This is a learning project! Feel free to:
- Ask questions about implementation details
- Experiment with different architectures
- Improve the data generation or training pipeline
License
This project is for educational purposes. Feel free to use and modify as needed.
Happy coding!