# CAPTCHA OCR Project
A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling.
## 🎯 Project Overview
This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses:
- **Synthetic CAPTCHA generation** for training data
- **CRNN (CNN + RNN) architecture** for sequence recognition
- **CTC (Connectionist Temporal Classification)** loss for training
- **PyTorch** with CUDA support for GPU acceleration
## 🏗️ Current Status
### ✅ Completed Components
- **Dataset Generation**: Synthetic CAPTCHA creation with train/val/test splits
- **Configuration**: Centralized config with image dimensions and training parameters
- **Vocabulary System**: Character encoding/decoding with CTC blank token support
- **CTC Collate Function**: Proper batching for variable-length sequences
- **CTC Decoding**: Greedy decode for inference
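As an illustration of the vocabulary component, here is a minimal sketch of character encoding with a reserved blank index. It assumes the a-z, A-Z, 0-9 character set and blank index 0 described under Technical Details; the names `CHARS`, `encode`, and `decode` are hypothetical and may not match `src/vocab.py`.

```python
import string

# Assumed character set: a-z, A-Z, 0-9 (62 symbols).
CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits
BLANK = 0  # index reserved for the CTC blank token

# Real characters start at 1 so they never collide with the blank.
CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(CHARS)}
IDX_TO_CHAR = {i: ch for ch, i in CHAR_TO_IDX.items()}
VOCAB_SIZE = len(CHARS) + 1  # +1 for the blank


def encode(text):
    """Map a label string to a list of integer class indices."""
    return [CHAR_TO_IDX[ch] for ch in text]


def decode(indices):
    """Map class indices back to a string, skipping any blanks."""
    return "".join(IDX_TO_CHAR[i] for i in indices if i != BLANK)


print(encode("aB3"))          # [1, 28, 56]
print(decode(encode("aB3")))  # aB3
```

Keeping index 0 free for the blank means the model's output layer needs `VOCAB_SIZE` (63) classes rather than 62.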
### 🔧 In Progress / Next Steps
- **PyTorch Dataset Class**: Image loading and preprocessing
- **CRNN Model**: CNN encoder + BiLSTM + linear output
- **Training Loop**: Complete training pipeline with validation
- **Metrics**: CER (Character Error Rate) and exact match accuracy
- **Inference Pipeline**: Model loading and prediction
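The planned metrics could look roughly like the sketch below. It is not project code: the function names are hypothetical, and CER is taken here to be total Levenshtein distance divided by total reference length, which is one common definition.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert/delete/substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Total edit distance divided by total reference length."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / max(total_chars, 1)


def exact_match(refs, hyps):
    """Fraction of predictions that equal their reference exactly."""
    return sum(r == h for r, h in zip(refs, hyps)) / max(len(refs), 1)
```

For example, `cer(["abcd", "xy"], ["abed", "xy"])` counts one substitution over six reference characters, while `exact_match` on the same inputs is 0.5.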
## 📁 Project Structure
```
CaptchaDetect/
├── Dataset/                 # Full dataset (100k images) - for Colab training
├── Dataset_test/            # Test dataset (1k images) - for local development
│   └── captchas/
│       ├── train/           # 80% of data
│       ├── val/             # 10% of data
│       └── test/            # 10% of data
├── src/
│   ├── config.py            # Configuration and hyperparameters
│   ├── vocab.py             # Character vocabulary and CTC encoding
│   ├── data.py              # Dataset generation script
│   ├── collate.py           # CTC batching function
│   └── [model files]        # Coming soon...
├── .gitignore               # Ignores dataset contents, keeps structure
└── README.md                # This file
```
## 🚀 Quick Start
### 1. Environment Setup
```bash
# Install PyTorch with CUDA support (adjust version as needed)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128
# Install other dependencies
pip install captcha pandas pillow
```
### 2. Generate Test Dataset
```bash
cd src
python data.py
```
This creates 1,000 synthetic CAPTCHAs in `Dataset_test/captchas/`, split 80/10/10 into `train/`, `val/`, and `test/`.
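The 80/10/10 split can be sketched as follows. This is an illustration only, not the contents of `data.py`: it generates random label strings and splits them, leaving out image rendering (which the real script presumably does with the `captcha` package). The function names, the label length of 5, and the fixed seed are all assumptions.

```python
import random
import string

CHARS = string.ascii_letters + string.digits  # assumed charset: a-z, A-Z, 0-9


def make_labels(n, length=5, seed=0):
    """Generate n random label strings (length=5 is a hypothetical default)."""
    rng = random.Random(seed)
    return ["".join(rng.choices(CHARS, k=length)) for _ in range(n)]


def split_labels(labels, train=0.8, val=0.1, seed=0):
    """Shuffle and split labels into train/val/test; test takes the remainder."""
    items = list(labels)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],
    }


splits = split_labels(make_labels(1000))
print({name: len(items) for name, items in splits.items()})
# {'train': 800, 'val': 100, 'test': 100}
```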
### 3. Configuration
Edit `src/config.py` to adjust:
- Image dimensions (H=48, W_max=224)
- Batch sizes (32 for local GTX 1650, 128 for Colab T4)
- Training parameters
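A centralized config along these lines is one way to hold the settings above. The field names, `learning_rate`, and the `Dataset/captchas` path for the full dataset are assumptions for illustration; check `src/config.py` for the real names and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Values stated in this README
    img_height: int = 48
    img_width_max: int = 224
    batch_size: int = 32        # 32 for a local GTX 1650, 128 for a Colab T4
    epochs: int = 40
    # Hypothetical values, not taken from the project
    learning_rate: float = 1e-3
    data_root: str = "Dataset_test/captchas"


local_cfg = Config()
# Assumes the full dataset mirrors the Dataset_test layout.
colab_cfg = Config(batch_size=128, data_root="Dataset/captchas")
```

A frozen dataclass keeps the local and Colab setups as two explicit objects instead of values edited in place between runs.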
## 🎮 Usage
### Local Development (GTX 1650)
- Use `Dataset_test` (1k images)
- Batch size: 32-48
- Good for rapid iteration and testing
### Colab Training (Tesla T4)
- Use `Dataset` (100k images)
- Batch size: 128
- Expected training time: 2-4 hours for 40 epochs
## 🔬 Technical Details
### Model Architecture
- **CNN Encoder**: Reduces image to sequence representation
- **BiLSTM**: Processes sequential features
- **Linear Output**: Maps to vocabulary size (including blank token)
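Since the model is not written yet, the following is only a rough sketch of how the three parts could fit together. Layer counts, channel widths, the hidden size, and the use of a mean over the height axis are all assumptions, not design decisions made by this project.

```python
import torch
import torch.nn as nn


class CRNN(nn.Module):
    def __init__(self, num_classes, hidden=128):
        super().__init__()
        # Two conv blocks; each halves height and width (48x224 -> 12x56).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(64, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                       # x: (B, 1, H, W)
        f = self.cnn(x).mean(dim=2)             # average over height -> (B, 64, W/4)
        f = f.permute(0, 2, 1)                  # (B, T, 64), one time step per column
        out, _ = self.rnn(f)                    # (B, T, 2*hidden)
        return self.fc(out).permute(1, 0, 2)    # (T, B, C), the layout nn.CTCLoss expects


model = CRNN(num_classes=63)                    # 62 characters + 1 blank
logits = model(torch.randn(2, 1, 48, 224))
print(logits.shape)                             # torch.Size([56, 2, 63])
```

With two 2x poolings, a 224-pixel-wide image yields 56 time steps, which comfortably exceeds the label length that CTC needs to align.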
### CTC Training
- **Input**: Images resized to 48×224 (H×W)
- **Output**: Character sequences (a-z, A-Z, 0-9)
- **Loss**: CTCLoss with blank=0
- **Decoding**: Greedy CTC decode
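The loss and decoding steps above can be sketched with random tensors standing in for model output. `nn.CTCLoss` and its `blank` and `zero_infinity` arguments are real PyTorch API; the shapes, the example targets, and the `greedy_decode` helper are assumptions for illustration and may differ from the project's own decoder.

```python
import torch
import torch.nn as nn

T, B, C = 56, 2, 63            # time steps, batch size, classes (62 chars + blank)
torch.manual_seed(0)

# Stand-in for model output; nn.CTCLoss expects log-probabilities shaped (T, B, C).
log_probs = torch.randn(T, B, C).log_softmax(dim=2)

# Targets are concatenated into one 1-D tensor; indices start at 1 because 0 is blank.
targets = torch.tensor([1, 28, 56, 5, 6])        # two labels of lengths 3 and 2
target_lengths = torch.tensor([3, 2])
input_lengths = torch.full((B,), T, dtype=torch.long)

criterion = nn.CTCLoss(blank=0, zero_infinity=True)
loss = criterion(log_probs, targets, input_lengths, target_lengths)


def greedy_decode(log_probs, blank=0):
    """Take the argmax per time step, merge repeats, then drop blanks."""
    best = log_probs.argmax(dim=2).transpose(0, 1).tolist()   # (B, T)
    results = []
    for seq in best:
        out, prev = [], blank
        for idx in seq:
            if idx != prev and idx != blank:
                out.append(idx)
            prev = idx
        results.append(out)
    return results
```

Merging repeats before dropping blanks is what lets CTC emit doubled letters: the path `1, 1, 0, 1` decodes to `[1, 1]`, because the blank separates the two runs.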
### Data Format
- **Images**: Grayscale, normalized tensors
- **Labels**: CSV with filename and text label
- **Batching**: Variable-length sequences handled by custom collate
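A collate function for this format might look like the sketch below. It assumes each dataset item is an `(image, label_indices)` pair, that images narrower than `W_max` are zero-padded on the right, and that targets are concatenated for `nn.CTCLoss`; the actual `src/collate.py` may differ on any of these points.

```python
import torch
import torch.nn.functional as F


def ctc_collate(batch, w_max=224):
    """Collate (image, label_indices) pairs for nn.CTCLoss.

    Each image is a (1, H, W) tensor with W <= w_max; each label is a list of ints.
    """
    images, labels = zip(*batch)
    # Right-pad every image with zeros up to w_max so they can be stacked.
    padded = [F.pad(img, (0, w_max - img.shape[-1])) for img in images]
    widths = torch.tensor([img.shape[-1] for img in images])
    # CTC targets are concatenated; target_lengths records how long each label is.
    targets = torch.tensor([i for lab in labels for i in lab], dtype=torch.long)
    target_lengths = torch.tensor([len(lab) for lab in labels])
    return torch.stack(padded), targets, target_lengths, widths
```

It would be passed to a loader as `DataLoader(dataset, batch_size=32, collate_fn=ctc_collate)`. The returned `widths` can be used to derive per-sample input lengths if padded columns should be excluded from the loss.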
## 📊 Performance Expectations
### GTX 1650 (4GB VRAM)
- Training time: 3-8 hours for 100k images × 40 epochs
- Batch size: 32-48
- Memory efficient with H=48
### Tesla T4 (16GB VRAM)
- Training time: 2-4 hours for 100k images × 40 epochs
- Batch size: 128
- Mixed precision (AMP) enabled
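To sanity-check these estimates against a real run, it helps to convert them into a per-step time budget. The sketch below is plain arithmetic on the numbers above; it assumes all 100k images are seen in every epoch (training on the 80% split alone would mean fewer steps), and the helper name is hypothetical.

```python
import math


def seconds_per_step(n_images, batch_size, epochs, hours):
    """Return (total optimizer steps, average seconds available per step)."""
    steps = math.ceil(n_images / batch_size) * epochs
    return steps, hours * 3600 / steps


for name, batch, (low, high) in [("GTX 1650", 32, (3, 8)), ("Tesla T4", 128, (2, 4))]:
    steps, fast = seconds_per_step(100_000, batch, 40, low)
    _, slow = seconds_per_step(100_000, batch, 40, high)
    print(f"{name}: {steps} steps, {fast:.2f}-{slow:.2f} s/step")
```

If the measured time per step on the 1k test dataset falls well outside the printed range, the full-run estimate should be revised before starting a long training job.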
## 🛠️ Development Workflow
1. **Implement Dataset class** - Load and preprocess images
2. **Build CRNN model** - CNN + BiLSTM architecture
3. **Create training loop** - With validation and checkpoints
4. **Add metrics** - CER and accuracy tracking
5. **Test on small dataset** - Verify everything works
6. **Scale to full dataset** - Train on Colab
## 🤝 Contributing
This is a learning project! Feel free to:
- Ask questions about implementation details
- Experiment with different architectures
- Improve the data generation or training pipeline
## 📚 Resources
- [CTC Paper](https://www.cs.toronto.edu/~graves/icml_2006.pdf)
- [CRNN Architecture](https://arxiv.org/abs/1507.05717)
- [PyTorch `CTCLoss` Documentation](https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html)
## 📝 License
This project is for educational purposes. Feel free to use and modify as needed.
---
**Happy coding! 🚀**