# CAPTCHA OCR Project
A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling.
## 🎯 Project Overview
This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses:
- **Synthetic CAPTCHA generation** for training data
- **CRNN (CNN + RNN) architecture** for sequence recognition
- **CTC (Connectionist Temporal Classification)** loss for training
- **PyTorch** with CUDA support for GPU acceleration
## 🏗️ Current Status
### ✅ Completed Components
- **Dataset Generation**: Synthetic CAPTCHA creation with train/val/test splits
- **Configuration**: Centralized config with image dimensions and training parameters
- **Vocabulary System**: Character encoding/decoding with CTC blank token support
- **CTC Collate Function**: Proper batching for variable-length sequences
- **CTC Decoding**: Greedy decode for inference
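
As an illustration of the vocabulary component, here is a minimal sketch of character encoding and decoding with a reserved blank index. It assumes the a-z, A-Z, 0-9 charset and `blank=0` stated elsewhere in this README; the names (`CHARSET`, `encode`, `decode`) are hypothetical and may not match what `src/vocab.py` actually exposes.

```python
import string

# Assumption: index 0 is reserved for the CTC blank, as the README states.
# All names here are illustrative, not necessarily those used in src/vocab.py.
BLANK_IDX = 0
CHARSET = string.ascii_lowercase + string.ascii_uppercase + string.digits
CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(CHARSET)}  # characters use 1..62
IDX_TO_CHAR = {i: ch for ch, i in CHAR_TO_IDX.items()}
NUM_CLASSES = len(CHARSET) + 1  # 62 characters + 1 blank


def encode(text: str) -> list[int]:
    """Map a label string to class indices (the blank is never emitted)."""
    return [CHAR_TO_IDX[ch] for ch in text]


def decode(indices: list[int]) -> str:
    """Map class indices back to text, skipping any blank indices."""
    return "".join(IDX_TO_CHAR[i] for i in indices if i != BLANK_IDX)
```

Shifting every character index by one keeps index 0 free for the blank, so the model's output layer needs `NUM_CLASSES` units.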
### 🚧 In Progress / Next Steps
- **PyTorch Dataset Class**: Image loading and preprocessing
- **CRNN Model**: CNN encoder + BiLSTM + linear output
- **Training Loop**: Complete training pipeline with validation
- **Metrics**: CER (Character Error Rate) and exact match accuracy
- **Inference Pipeline**: Model loading and prediction
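
The two planned metrics can be sketched in plain Python. This is one common way to define them (CER as total edit distance divided by total reference characters), not necessarily how the project will implement them; the function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: list[str], hyps: list[str]) -> float:
    """Character Error Rate: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / max(total_chars, 1)


def exact_match(refs: list[str], hyps: list[str]) -> float:
    """Fraction of predictions that equal their reference exactly."""
    if not refs:
        return 0.0
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

For CAPTCHAs, exact match is the stricter measure, since a single wrong character fails the whole image, while CER shows how close the near misses are.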
## 📁 Project Structure
```
CaptchaDetect/
├── Dataset/                 # Full dataset (100k images) - for Colab training
├── Dataset_test/            # Test dataset (1k images) - for local development
│   └── captchas/
│       ├── train/           # 80% of data
│       ├── val/             # 10% of data
│       └── test/            # 10% of data
├── src/
│   ├── config.py            # Configuration and hyperparameters
│   ├── vocab.py             # Character vocabulary and CTC encoding
│   ├── data.py              # Dataset generation script
│   ├── collate.py           # CTC batching function
│   └── [model files]        # Coming soon...
├── .gitignore               # Ignores dataset contents, keeps structure
└── README.md                # This file
```
## 🚀 Quick Start
### 1. Environment Setup
```bash
# Install PyTorch with CUDA support (adjust version as needed)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128
# Install other dependencies
pip install captcha pandas pillow
```
### 2. Generate Test Dataset
```bash
cd src
python data.py
```
This creates 1,000 synthetic CAPTCHAs in `Dataset_test/captchas/`, split 80/10/10 into `train/`, `val/`, and `test/`.
### 3. Configuration
Edit `src/config.py` to adjust:
- Image dimensions (H=48, W_max=224)
- Batch sizes (32 for local GTX 1650, 128 for Colab T4)
- Training parameters
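
For orientation, a centralized config along these lines could look like the sketch below. Only the image size, batch sizes, and epoch count come from this README; the field names, the learning rate, and the dataset paths are assumptions for illustration and may differ from the real `src/config.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Values quoted in this README.
    img_height: int = 48        # H
    img_width_max: int = 224    # W_max
    batch_size: int = 32        # local GTX 1650 default
    epochs: int = 40
    # Hypothetical placeholders, not taken from the project.
    learning_rate: float = 1e-3
    data_root: str = "Dataset_test/captchas"


# Two presets mirroring the local and Colab setups described in this README.
LOCAL = Config()
COLAB = Config(batch_size=128, data_root="Dataset/captchas")  # path is assumed
```

A frozen dataclass keeps the settings read-only at runtime, so switching environments means picking a preset instead of mutating globals.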
## 🎮 Usage
### Local Development (GTX 1650)
- Use `Dataset_test` (1k images)
- Batch size: 32-48
- Good for rapid iteration and testing
### Colab Training (Tesla T4)
- Use `Dataset` (100k images)
- Batch size: 128
- Expected training time: 2-4 hours for 40 epochs
## 🔬 Technical Details
### Model Architecture
- **CNN Encoder**: Reduces image to sequence representation
- **BiLSTM**: Processes sequential features
- **Linear Output**: Maps to vocabulary size (including blank token)
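
The three stages above can be sketched as a small PyTorch module. The model is still listed as in progress, so this is only an illustrative layout: the layer sizes, the class name `CRNN`, and the default of 63 output classes (62 characters plus blank) are assumptions. It also assumes a fixed input height of 48.

```python
import torch
from torch import nn


class CRNN(nn.Module):
    """Illustrative CNN + BiLSTM + linear stack; sizes are assumptions."""

    def __init__(self, num_classes: int = 63, hidden: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),         # 48x224 -> 24x112
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),        # 24x112 -> 12x56
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),  # 12x56  -> 6x56
        )
        # Each of the 56 width positions becomes one time step of 128 * 6 features.
        self.rnn = nn.LSTM(128 * 6, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.cnn(x)                                  # (B, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # (B, T=W', C*H')
        out, _ = self.rnn(f)                             # (B, T, 2*hidden)
        return self.fc(out)                              # (B, T, num_classes)
```

Note that `torch.nn.CTCLoss` expects log-probabilities shaped `(T, B, C)`, so the logits would need `log_softmax(2)` and `permute(1, 0, 2)` before the loss.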
### CTC Training
- **Input**: Images resized to 48×224
- **Output**: Character sequences (a-z, A-Z, 0-9)
- **Loss**: CTCLoss with blank=0
- **Decoding**: Greedy CTC decode
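
Greedy CTC decoding takes the most likely class at each time step, collapses consecutive repeats, and then drops blanks. A minimal sketch on a plain list of per-step argmax indices, assuming `blank=0` as above (the function name is hypothetical):

```python
from itertools import groupby

BLANK_IDX = 0  # matches the blank=0 convention stated in this README


def ctc_greedy_decode(best_path: list[int], blank: int = BLANK_IDX) -> list[int]:
    """Collapse consecutive repeats, then remove blanks.

    `best_path` holds the argmax class index for each time step.
    """
    collapsed = [idx for idx, _ in groupby(best_path)]
    return [idx for idx in collapsed if idx != blank]
```

The order matters: collapsing first and removing blanks second is what lets a blank between two identical indices preserve a genuine double letter. For example, `[0, 3, 3, 0, 3, 5, 5, 0]` decodes to `[3, 3, 5]`.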
### Data Format
- **Images**: Grayscale, normalized tensors
- **Labels**: CSV with filename and text label
- **Batching**: Variable-length sequences handled by custom collate
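
A custom collate function for CTC typically concatenates the labels and records their lengths. The sketch below is an assumption about what such a function might look like, not the contents of `src/collate.py`. It also pads image widths to the batch maximum, which only matters if images are not all resized to the same width.

```python
import torch


def ctc_collate(batch):
    """Collate (image, target) pairs for CTC training.

    Assumes each image is a float tensor of shape (1, H, W_i) and each
    target is a 1-D long tensor of class indices.
    """
    images, targets = zip(*batch)
    channels, height = images[0].shape[0], images[0].shape[1]
    max_w = max(img.shape[-1] for img in images)

    # Zero-pad every image on the right up to the widest image in the batch.
    padded = torch.zeros(len(images), channels, height, max_w)
    for i, img in enumerate(images):
        padded[i, :, :, : img.shape[-1]] = img

    # nn.CTCLoss accepts targets concatenated into one 1-D tensor plus lengths.
    target_lengths = torch.tensor([len(t) for t in targets], dtype=torch.long)
    flat_targets = torch.cat(targets)
    return padded, flat_targets, target_lengths
```

It would be passed to a loader as `DataLoader(dataset, batch_size=32, collate_fn=ctc_collate)`.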
## 📊 Performance Expectations
### GTX 1650 (4GB VRAM)
- Training time: 3-8 hours for 100k images × 40 epochs
- Batch size: 32-48
- Memory efficient with H=48
### Tesla T4 (16GB VRAM)
- Training time: 2-4 hours for 100k images × 40 epochs
- Batch size: 128
- Mixed precision (AMP) enabled
## 🛠️ Development Workflow
1. **Implement Dataset class** - Load and preprocess images
2. **Build CRNN model** - CNN + BiLSTM architecture
3. **Create training loop** - With validation and checkpoints
4. **Add metrics** - CER and accuracy tracking
5. **Test on small dataset** - Verify everything works
6. **Scale to full dataset** - Train on Colab
## 🤝 Contributing
This is a learning project! Feel free to:
- Ask questions about implementation details
- Experiment with different architectures
- Improve the data generation or training pipeline
## 📚 Resources
- [CTC Paper](https://www.cs.toronto.edu/~graves/icml_2006.pdf)
- [CRNN Architecture](https://arxiv.org/abs/1507.05717)
- [PyTorch CTCLoss Documentation](https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html)
## 📄 License
This project is for educational purposes. Feel free to use and modify as needed.
---
**Happy coding! 🎉**