| # CAPTCHA OCR Project | |
| A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling. | |
| ## Project Overview | |
| This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses: | |
| - **Synthetic CAPTCHA generation** for training data | |
| - **CRNN (CNN + RNN) architecture** for sequence recognition | |
| - **CTC (Connectionist Temporal Classification)** loss for training | |
| - **PyTorch** with CUDA support for GPU acceleration | |
| ## Current Status | |
| ### Completed Components | |
| - **Dataset Generation**: Synthetic CAPTCHA creation with train/val/test splits | |
| - **Configuration**: Centralized config with image dimensions and training parameters | |
| - **Vocabulary System**: Character encoding/decoding with CTC blank token support | |
| - **CTC Collate Function**: Proper batching for variable-length sequences | |
| - **CTC Decoding**: Greedy decode for inference | |
| ### In Progress / Next Steps | |
| - **PyTorch Dataset Class**: Image loading and preprocessing | |
| - **CRNN Model**: CNN encoder + BiLSTM + linear output | |
| - **Training Loop**: Complete training pipeline with validation | |
| - **Metrics**: CER (Character Error Rate) and exact match accuracy | |
| - **Inference Pipeline**: Model loading and prediction | |
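The metrics step is not implemented yet, so the following is only a minimal sketch of how CER and exact match accuracy could be computed from decoded strings. The function names (`edit_distance`, `cer`, `exact_match`) are hypothetical and not part of the repository; CER is taken here as total Levenshtein edits divided by total reference characters.

```python
def edit_distance(pred: str, ref: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, p_ch in enumerate(pred, start=1):
        curr = [i]
        for j, r_ch in enumerate(ref, start=1):
            cost = 0 if p_ch == r_ch else 1
            curr.append(min(prev[j] + 1,          # delete from pred
                            curr[j - 1] + 1,      # insert into pred
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(preds, refs) -> float:
    """Character Error Rate: total edits / total reference characters."""
    total_edits = sum(edit_distance(p, r) for p, r in zip(preds, refs))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars if total_chars else 0.0


def exact_match(preds, refs) -> float:
    """Fraction of predictions that equal their reference exactly."""
    if not refs:
        return 0.0
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


if __name__ == "__main__":
    predictions = ["abcd", "ab1d"]
    references = ["abcd", "abcd"]
    print("CER:", cer(predictions, references))
    print("Exact match:", exact_match(predictions, references))
```

Exact match is the stricter of the two for CAPTCHAs, since a single wrong character fails the whole image, while CER shows whether the model is getting closer between epochs.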
| ## Project Structure | |
| ``` | |
| CaptchaDetect/ | |
| ├── Dataset/ # Full dataset (100k images) - for Colab training | |
| ├── Dataset_test/ # Test dataset (1k images) - for local development | |
| │   └── captchas/ | |
| │       ├── train/ # 80% of data | |
| │       ├── val/ # 10% of data | |
| │       └── test/ # 10% of data | |
| ├── src/ | |
| │   ├── config.py # Configuration and hyperparameters | |
| │   ├── vocab.py # Character vocabulary and CTC encoding | |
| │   ├── data.py # Dataset generation script | |
| │   ├── collate.py # CTC batching function | |
| │   └── [model files] # Coming soon... | |
| ├── .gitignore # Ignores dataset contents, keeps structure | |
| └── README.md # This file | |
| ``` | |
| ## Quick Start | |
| ### 1. Environment Setup | |
| ```bash | |
| # Install PyTorch with CUDA support (adjust version as needed) | |
| pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128 | |
| # Install other dependencies | |
| pip install captcha pandas pillow | |
| ``` | |
| ### 2. Generate Test Dataset | |
| ```bash | |
| cd src | |
| python data.py | |
| ``` | |
| This creates 1,000 synthetic CAPTCHAs in `Dataset_test/captchas/`, split 80/10/10 into `train/`, `val/`, and `test/`. | |
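How `data.py` performs the split is not shown here, so the following is only a sketch of one reproducible way to get an 80/10/10 partition. The `split_filenames` helper and the numbered `.png` filenames are assumptions for illustration.

```python
import random


def split_filenames(filenames, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle filenames with a fixed seed and cut them into train/val/test."""
    names = sorted(filenames)              # sort first so the result is reproducible
    random.Random(seed).shuffle(names)
    n_train = round(len(names) * train_frac)
    n_val = round(len(names) * val_frac)
    return {
        "train": names[:n_train],
        "val": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],   # whatever remains goes to test
    }


if __name__ == "__main__":
    # Hypothetical filenames; the real generator may name files differently.
    splits = split_filenames([f"{i:04d}.png" for i in range(1000)])
    for name, files in splits.items():
        print(name, len(files))
```

Fixing the seed keeps the validation and test sets identical across runs, which makes metrics comparable between experiments.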
| ### 3. Configuration | |
| Edit `src/config.py` to adjust: | |
| - Image dimensions (H=48, W_max=224) | |
| - Batch sizes (32 for local GTX 1650, 128 for Colab T4) | |
| - Training parameters | |
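The contents of `src/config.py` are not reproduced in this README. As a rough sketch, the values listed above could be grouped as follows; the class and field names (`TrainConfig`, `img_h`, `img_w_max`, and so on) are hypothetical, and only the numbers come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Illustrative config container; field names are assumptions."""
    img_h: int = 48          # image height in pixels
    img_w_max: int = 224     # maximum image width in pixels
    batch_size: int = 32     # default for a local GTX 1650
    epochs: int = 40
    blank_index: int = 0     # CTC blank token index


LOCAL = TrainConfig()                 # local development on Dataset_test
COLAB = TrainConfig(batch_size=128)   # Colab T4 on the full Dataset

if __name__ == "__main__":
    print(LOCAL)
    print(COLAB)
```

A frozen dataclass keeps the two presets from being modified by accident during a run.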
| ## Usage | |
| ### Local Development (GTX 1650) | |
| - Use `Dataset_test` (1k images) | |
| - Batch size: 32-48 | |
| - Good for rapid iteration and testing | |
| ### Colab Training (Tesla T4) | |
| - Use `Dataset` (100k images) | |
| - Batch size: 128 | |
| - Expected training time: 2-4 hours for 40 epochs | |
| ## Technical Details | |
| ### Model Architecture | |
| - **CNN Encoder**: Reduces image to sequence representation | |
| - **BiLSTM**: Processes sequential features | |
| - **Linear Output**: Maps to vocabulary size (including blank token) | |
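The model is still listed as in progress, so the following is a hedged sketch of the three stages above, not the project's implementation. Channel counts, the hidden size, and the pooling layout are all assumptions; the only values taken from this README are the 48×224 grayscale input and the 63 output classes (62 characters plus blank at index 0).

```python
import torch
import torch.nn as nn

NUM_CLASSES = 63  # a-z, A-Z, 0-9 (62 characters) + CTC blank at index 0


class CRNN(nn.Module):
    """Illustrative CRNN: CNN encoder -> BiLSTM -> linear output."""

    def __init__(self, num_classes: int = NUM_CLASSES, hidden: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),        # 48x224 -> 24x112
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),        # 24x112 -> 12x56
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d((2, 1)),   # 12x56 -> 6x56 (halves height only)
        )
        self.rnn = nn.LSTM(128, hidden, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(images)        # (N, 128, H', T)
        feats = feats.mean(dim=2)       # average over height -> (N, 128, T)
        feats = feats.permute(2, 0, 1)  # (T, N, 128), the layout nn.LSTM expects
        out, _ = self.rnn(feats)        # (T, N, 2 * hidden)
        logits = self.fc(out)           # (T, N, num_classes)
        return logits.log_softmax(dim=2)  # nn.CTCLoss expects log-probabilities


if __name__ == "__main__":
    model = CRNN()
    images = torch.randn(4, 1, 48, 224)             # dummy grayscale batch
    log_probs = model(images)                       # (T, N, C)
    targets = torch.randint(1, NUM_CLASSES, (4, 5)) # labels never use index 0
    input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
    target_lengths = torch.full((4,), 5, dtype=torch.long)
    criterion = nn.CTCLoss(blank=0, zero_infinity=True)
    loss = criterion(log_probs, targets, input_lengths, target_lengths)
    print(log_probs.shape, loss.item())
```

With this pooling layout the width shrinks by a factor of 4, so a 224-pixel-wide image yields 56 time steps. CTC needs at least as many time steps as label characters (more when characters repeat), which leaves ample room for short CAPTCHA strings.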
| ### CTC Training | |
| - **Input**: Images resized to 48×224 (height × width) | |
| - **Output**: Character sequences (a-z, A-Z, 0-9) | |
| - **Loss**: CTCLoss with blank=0 | |
| - **Decoding**: Greedy CTC decode | |
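To make the blank convention and greedy decoding concrete, here is a small pure-Python sketch. It assumes the characters are indexed in the order a-z, A-Z, 0-9 starting at 1, with 0 reserved for the blank; the actual ordering in `vocab.py` may differ, and the names `encode` and `greedy_decode` are illustrative.

```python
import string

BLANK = 0
CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits
CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(CHARS)}  # index 0 is the blank
IDX_TO_CHAR = {i: ch for ch, i in CHAR_TO_IDX.items()}


def encode(text: str) -> list:
    """Map a label string to class indices (never produces the blank)."""
    return [CHAR_TO_IDX[ch] for ch in text]


def greedy_decode(best_path) -> str:
    """Collapse consecutive repeats, then drop blanks.

    `best_path` holds the argmax class index at each time step.
    """
    chars = []
    prev = BLANK
    for idx in best_path:
        if idx != prev and idx != BLANK:
            chars.append(IDX_TO_CHAR[idx])
        prev = idx
    return "".join(chars)


if __name__ == "__main__":
    # The blank between the two runs of 1 keeps the repeated "a" from merging.
    path = [1, 1, 0, 1, 2, 2, 0]
    print(greedy_decode(path))  # aab
```

Greedy decoding only follows the single most likely class per time step. Beam search can recover some errors, but for short CAPTCHA strings the greedy path is a common starting point.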
| ### Data Format | |
| - **Images**: Grayscale, normalized tensors | |
| - **Labels**: CSV with filename and text label | |
| - **Batching**: Variable-length sequences handled by custom collate | |
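As an illustration of the label format and of what the collate step has to produce for CTC, the sketch below reads a label CSV and packs the encoded labels into one flat target list plus per-sample lengths, one of the target layouts `torch.nn.CTCLoss` accepts. The column names `filename` and `label`, the sample rows, and the helper names are assumptions. The real `collate.py` also has to batch the image tensors, which is omitted here.

```python
import csv
import io
import string

CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits
CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(CHARS)}  # 0 is reserved for blank


def load_labels(csv_file):
    """Read (filename, label) pairs from an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    return [(row["filename"], row["label"]) for row in reader]


def pack_targets(labels):
    """Concatenate encoded labels into a flat list and record each length."""
    flat, lengths = [], []
    for text in labels:
        ids = [CHAR_TO_IDX[ch] for ch in text]
        flat.extend(ids)
        lengths.append(len(ids))
    return flat, lengths


if __name__ == "__main__":
    # In-memory stand-in for a labels file; rows are made up for illustration.
    sample = io.StringIO("filename,label\n0001.png,aB3\n0002.png,Zx9k\n")
    rows = load_labels(sample)
    targets, target_lengths = pack_targets([label for _, label in rows])
    print(rows)
    print(targets, target_lengths)
```

Packing targets into one flat list avoids padding the labels, and the per-sample lengths tell the loss where each label ends.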
| ## Performance Expectations | |
| ### GTX 1650 (4GB VRAM) | |
| - Training time: 3-8 hours for 100k images × 40 epochs | |
| - Batch size: 32-48 | |
| - Memory efficient with H=48 | |
| ### Tesla T4 (16GB VRAM) | |
| - Training time: 2-4 hours for 100k images × 40 epochs | |
| - Batch size: 128 | |
| - Mixed precision (AMP) enabled | |
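To sanity-check these estimates, it helps to convert them into optimizer steps. The sketch below assumes the full dataset uses the same 80/10/10 split as the test dataset, which gives 80,000 training images; that figure is an assumption, not a measured value. Under it, batch size 128 gives 625 steps per epoch and 25,000 steps over 40 epochs, while batch size 32 gives 2,500 steps per epoch and 100,000 steps in total.

```python
import math

TRAIN_IMAGES = 80_000  # assumption: 80% of the 100k-image Dataset
EPOCHS = 40


def total_steps(num_images: int, batch_size: int, epochs: int):
    """Return (steps per epoch, total steps), counting a partial last batch."""
    per_epoch = math.ceil(num_images / batch_size)
    return per_epoch, per_epoch * epochs


def required_steps_per_second(steps: int, hours: float) -> float:
    """Average throughput needed to finish `steps` within `hours`."""
    return steps / (hours * 3600)


if __name__ == "__main__":
    setups = [("GTX 1650", 32, (3, 8)), ("Tesla T4", 128, (2, 4))]
    for name, batch_size, (fast_h, slow_h) in setups:
        per_epoch, total = total_steps(TRAIN_IMAGES, batch_size, EPOCHS)
        slowest = required_steps_per_second(total, slow_h)
        fastest = required_steps_per_second(total, fast_h)
        print(f"{name}: {per_epoch} steps/epoch, {total} total, "
              f"{slowest:.2f}-{fastest:.2f} steps/s needed")
```

Timing a few hundred steps on the target GPU and comparing against the required rate is a quick way to tell whether the quoted time range is realistic for a given model size.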
| ## Development Workflow | |
| 1. **Implement Dataset class** - Load and preprocess images | |
| 2. **Build CRNN model** - CNN + BiLSTM architecture | |
| 3. **Create training loop** - With validation and checkpoints | |
| 4. **Add metrics** - CER and accuracy tracking | |
| 5. **Test on small dataset** - Verify everything works | |
| 6. **Scale to full dataset** - Train on Colab | |
| ## Contributing | |
| This is a learning project! Feel free to: | |
| - Ask questions about implementation details | |
| - Experiment with different architectures | |
| - Improve the data generation or training pipeline | |
| ## Resources | |
| - [CTC Paper](https://www.cs.toronto.edu/~graves/icml_2006.pdf) | |
| - [CRNN Architecture](https://arxiv.org/abs/1507.05717) | |
| - [PyTorch `CTCLoss` documentation](https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html) | |
| ## License | |
| This project is for educational purposes. Feel free to use and modify as needed. | |
| --- | |
| **Happy coding!** | |