| title: CaptchaOCR | |
| emoji: ๐ | |
| colorFrom: green | |
| colorTo: gray | |
| sdk: gradio | |
| sdk_version: 5.43.1 | |
| app_file: app.py | |
| pinned: false | |
| short_description: CAPTCHA text recognition using CRNN neural networks | |
| # CAPTCHA OCR Project | |
| A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling. | |
| ## Project Overview | |
| This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses: | |
| - **Synthetic CAPTCHA generation** for training data | |
| - **CRNN (CNN + RNN) architecture** for sequence recognition | |
| - **CTC (Connectionist Temporal Classification)** loss for training | |
| - **PyTorch** with CUDA support for GPU acceleration | |
| ## Current Status | |
| ### Completed Components | |
| - **Dataset Generation**: Synthetic CAPTCHA creation with train/val/test splits (8k train, 1k val, 1k test) | |
| - **Configuration**: Centralized config with image dimensions and training parameters | |
| - **Vocabulary System**: Character encoding/decoding with CTC blank token support (63 classes) | |
| - **CTC Collate Function**: Proper batching for variable-length sequences | |
| - **CTC Decoding**: Greedy decode for inference | |
| - **PyTorch Dataset Class**: Image loading and preprocessing with proper cv2 resizing | |
| - **CRNN Model**: CNN encoder + BiLSTM + LayerNorm + linear output (working!) | |
| - **Training Loop**: Complete epoch-based training pipeline with validation | |
| - **Metrics & Plotting**: Training/validation loss tracking with auto-generated plots | |
| - **Debugging Tools**: Comprehensive logging of logits, predictions, and model health | |
| ### What's Working | |
| - **Training Pipeline**: Stable training loop with excellent loss convergence | |
| - **Model Architecture**: CRNN produces correct output shapes (64×batch×63) with H=60, W=256 | |
| - **Data Loading**: Proper image preprocessing and CTC batching | |
| - **Full CAPTCHA Recognition**: Model now recognizes complete CAPTCHA sequences | |
| - **Inference Pipeline**: Complete inference script with visualization and accuracy metrics | |
| - **Early Stopping**: Training stops automatically once validation loss stops improving, which limits overfitting | |
| - **High Accuracy**: 75-100% whole-sequence accuracy depending on the run, with 96%+ character accuracy (25 of 26 characters or better) | |
| ### Training Status | |
| - **Current**: Epoch 8, excellent convergence achieved | |
| - **Best Model**: Validation loss 0.1782, early stopping working perfectly | |
| - **Performance**: 75-100% accuracy on fresh CAPTCHAs (varies by run) | |
| ### Training Results | |
|  | |
|  | |
| **Key Insights:** | |
| - **Rapid convergence**: Loss dropped from about 21 to about 0.1 within the first 7 epochs | |
| - **No visible overfitting**: Early stopping ends training before validation loss drifts away from training loss | |
| - **Stable training**: Val/Train ratio stays healthy throughout training | |
| ### Inference Results | |
|  | |
| **Model Performance:** | |
| - **Visual predictions**: Shows actual CAPTCHA images with predicted text | |
| - **High accuracy**: 75-100% overall accuracy on fresh CAPTCHAs | |
| - **Character-level precision**: 96%+ character accuracy (25/26+ correct) | |
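The two accuracy figures above measure different things: overall accuracy counts a CAPTCHA as correct only if every character matches, while character accuracy gives partial credit. The sketch below shows one way to compute both. It assumes a simple position-wise comparison for character accuracy; the function names are hypothetical, and the project's `inference.py` may compute these differently (for example with an edit-distance-based CER).

```python
def sequence_accuracy(predictions, targets):
    """Fraction of CAPTCHAs whose predicted text matches the label exactly."""
    if not targets:
        return 0.0
    exact = sum(p == t for p, t in zip(predictions, targets))
    return exact / len(targets)


def character_accuracy(predictions, targets):
    """Fraction of label characters matched at the same position.

    Assumption: position-wise comparison. An edit-distance metric would be
    more forgiving when a character is inserted or dropped.
    """
    correct = 0
    total = 0
    for pred, target in zip(predictions, targets):
        total += len(target)
        correct += sum(p == t for p, t in zip(pred, target))
    return correct / total if total else 0.0


# Made-up predictions and labels, only to exercise the functions.
preds = ["aB3dE", "xY9zQ", "Hello", "12345"]
labels = ["aB3dE", "xY9zQ", "Hello", "12346"]
print(sequence_accuracy(preds, labels))   # 3 of 4 sequences match exactly
print(character_accuracy(preds, labels))  # 19 of 20 characters match
```

A single wrong character costs a full sequence, which is why overall accuracy can swing between runs while character accuracy stays high.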
| ## Project Structure | |
| ``` | |
| CaptchaDetect/ | |
| ├── Dataset/ # Full dataset (100k images) - for Colab training | |
| ├── Dataset_test/ # Test dataset (1k images) - for local development | |
| │   └── captchas/ | |
| │       ├── train/ # 80% of data | |
| │       ├── val/ # 10% of data | |
| │       └── test/ # 10% of data | |
| ├── src/ | |
| │   ├── config.py # Configuration and hyperparameters | |
| │   ├── vocab.py # Character vocabulary and CTC encoding/decoding | |
| │   ├── data.py # Dataset generation script | |
| │   ├── collate.py # CTC batching function | |
| │   ├── captcha_dataset.py # PyTorch Dataset class | |
| │   ├── model_crnn.py # CRNN model architecture | |
| │   └── plotting.py # Training metrics and visualization | |
| ├── train.py # Main training script (working) | |
| ├── Metrics/ # Training plots and logs (auto-generated) | |
| ├── .gitignore # Ignores dataset contents, keeps structure | |
| └── README.md # This file | |
| ``` | |
| ## Quick Start | |
| ### 1. Environment Setup | |
| ```bash | |
| # Install PyTorch with CUDA support (adjust version as needed) | |
| pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128 | |
| # Install other dependencies (opencv-python provides the cv2 module used for resizing) | |
| pip install captcha pandas pillow opencv-python | |
| ``` | |
| ### 2. Generate Training Dataset | |
| ```bash | |
| cd src | |
| python data.py | |
| ``` | |
| This creates 10,000 synthetic CAPTCHAs in `Dataset_test/captchas/`, split 80/10/10 into train, val, and test. | |
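As a rough illustration of the split described above, the sketch below generates random labels and divides them 80/10/10. It is not the project's `data.py`: the function names, the fixed label length of 5, and the seed are assumptions made for the example, and rendering the images themselves is left out.

```python
import random
import string

# 62 symbols: a-z, A-Z, 0-9 (the character set listed under CTC Training).
CHARSET = string.ascii_lowercase + string.ascii_uppercase + string.digits


def make_labels(n, length=5, seed=0):
    """Generate n random label strings (length=5 is a hypothetical choice)."""
    rng = random.Random(seed)
    return ["".join(rng.choices(CHARSET, k=length)) for _ in range(n)]


def split_labels(labels, train_frac=0.8, val_frac=0.1):
    """Split labels into train/val/test; test receives the remainder."""
    n_train = round(len(labels) * train_frac)
    n_val = round(len(labels) * val_frac)
    return {
        "train": labels[:n_train],
        "val": labels[n_train:n_train + n_val],
        "test": labels[n_train + n_val:],
    }


splits = split_labels(make_labels(10_000))
print({name: len(items) for name, items in splits.items()})
```

With 10,000 labels this yields 8,000 train, 1,000 val, and 1,000 test entries.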
| ### 3. Start Training | |
| ```bash | |
| # Run from the project root: train.py lives there, not in src/ | |
| python train.py | |
| ``` | |
| This starts the full training pipeline with automatic metrics generation. | |
| ### 4. Monitor Progress | |
| Training will show: | |
| - Real-time loss and prediction samples | |
| - Automatic plot generation in `Metrics/` folder | |
| - Comprehensive training logs and summaries | |
| ## Usage | |
| ### Training | |
| ```bash | |
| python train.py | |
| ``` | |
| - **Automatic early stopping** prevents overfitting | |
| - **Real-time metrics** and sample predictions | |
| - **Checkpoint saving** for best model | |
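Early stopping and best-model checkpointing usually share one piece of bookkeeping: track the lowest validation loss seen so far and count epochs without improvement. The following is a minimal sketch of that idea, not the logic in `train.py`; the class name, `patience`, `min_delta`, and the loss values are all hypothetical.

```python
class EarlyStopping:
    """Track the best validation loss and signal when to stop."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as progress
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Return (is_best, should_stop) for this epoch's validation loss."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


# Synthetic validation losses, not measured results.
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_loss in enumerate([2.0, 1.0, 0.5, 0.6, 0.7, 0.4], start=1):
    is_best, should_stop = stopper.step(val_loss)
    if is_best:
        print(f"epoch {epoch}: new best {val_loss}, a checkpoint would be saved here")
    if should_stop:
        stopped_at = epoch
        print(f"epoch {epoch}: no improvement for {stopper.patience} epochs, stopping")
        break
```

Note the trade-off in this toy run: with `patience=2` the loop stops at epoch 5 and never sees the lower loss at epoch 6, so a small patience saves time but can end training early.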
| ### Inference | |
| ```bash | |
| python inference.py | |
| ``` | |
| - **Loads best trained model** automatically | |
| - **Generates test CAPTCHAs** for evaluation | |
| - **Shows both overall and character accuracy** | |
| - **Creates visualization plots** in Metrics folder | |
| ## Technical Details | |
| ### Model Architecture (CRNN) | |
| The model uses a **CNN + RNN + CTC** architecture specifically designed for sequence recognition: | |
| ```mermaid | |
| graph TD | |
| %% Input Layer | |
| A[Input Image<br/>60x256x1] --> B[CNN Encoder<br/>SmallCNN] | |
| %% CNN Subgraph - Top to Bottom | |
| subgraph CNN ["CNN Encoder Layer"] | |
| B --> C[Conv1 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 2x2] | |
| C --> D[Channels: 1 to 64<br/>Spatial: 60x256 to 30x128] | |
| D --> E[Conv2 Block<br/>3x3 Conv + BatchNorm + ReLU<br/>MaxPool 1x2] | |
| E --> F[Channels: 64 to 128<br/>Spatial: 30x128 to 30x64] | |
| F --> G[Residual Block<br/>3x3 Conv + BatchNorm + ReLU<br/>3x3 Conv + BatchNorm<br/>+ Skip Connection] | |
| G --> H[Maintains: 128 channels, 30x64 spatial] | |
| H --> I[Height Pooling<br/>AdaptiveAvgPool2d 1xNone] | |
| I --> J[Squeeze Height<br/>30x64 to 1x64] | |
| J --> K[Permute & Reshape<br/>B,128,1,64 to 64,B,128] | |
| end | |
| %% RNN Subgraph - Top to Bottom | |
| subgraph RNN ["RNN Decoder Layer"] | |
| K --> L[RNN Decoder<br/>2-Layer BiLSTM] | |
| L --> M[Hidden Size: 320 per direction<br/>Total: 640 features] | |
| M --> N[Output: 64,B,640] | |
| end | |
| %% Output Subgraph - Top to Bottom | |
| subgraph OUTPUT ["Output & CTC Layer"] | |
| N --> O[LayerNorm<br/>Stabilize 640D features] | |
| O --> P[Linear Layer<br/>640 to 63 classes] | |
| P --> Q[Output Logits<br/>64,B,63] | |
| Q --> R[CTC Decoding<br/>Remove duplicates & blanks] | |
| R --> S[Final Prediction<br/>Character sequence] | |
| end | |
| %% Styling - Darker colors with black text | |
| classDef inputLayer fill:#ffcc80,stroke:#e65100,stroke-width:3px,color:#000000 | |
| classDef cnnLayer fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px,color:#000000 | |
| classDef rnnLayer fill:#bbdefb,stroke:#1565c0,stroke-width:3px,color:#000000 | |
| classDef outputLayer fill:#f8bbd9,stroke:#ad1457,stroke-width:3px,color:#000000 | |
| classDef ctcLayer fill:#d1c4e9,stroke:#512da8,stroke-width:3px,color:#000000 | |
| class A inputLayer | |
| class B,C,D,E,F,G,H,I,J,K cnnLayer | |
| class L,M,N rnnLayer | |
| class O,P,Q outputLayer | |
| class R,S ctcLayer | |
| ``` | |
| #### **Key Design Features** | |
| - **Total Horizontal Stride**: 4 (256 → 64 timesteps) | |
| - **Height Compression**: 60 → 1 (via pooling) | |
| - **Residual Connections**: Mitigate vanishing gradients | |
| - **Bidirectional LSTM**: Captures context from both directions | |
| - **LayerNorm**: Training stability before final classification | |
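The diagram and feature list above can be read as the following PyTorch sketch. It follows the stated layer sizes (1 → 64 → 128 channels, 2-layer BiLSTM with 320 hidden units per direction, 63 output classes), but it is not the code in `model_crnn.py`: the class names, `padding=1` on the convolutions, and the ReLU after the skip connection are assumptions.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two 3x3 conv + BatchNorm layers with a skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Assumption: ReLU is applied after adding the skip connection.
        return torch.relu(x + self.body(x))


class CRNNSketch(nn.Module):
    def __init__(self, num_classes=63, hidden=320):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 2)),               # 60x256 -> 30x128
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.MaxPool2d((1, 2)),               # 30x128 -> 30x64
            ResidualBlock(128),                 # keeps 128 channels, 30x64
            nn.AdaptiveAvgPool2d((1, None)),    # 30x64 -> 1x64
        )
        self.rnn = nn.LSTM(128, hidden, num_layers=2, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, images):                      # (B, 1, 60, 256)
        feats = self.cnn(images)                    # (B, 128, 1, 64)
        feats = feats.squeeze(2).permute(2, 0, 1)   # (64, B, 128)
        seq, _ = self.rnn(feats)                    # (64, B, 640)
        return self.fc(self.norm(seq))              # (64, B, 63)


model = CRNNSketch().eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 1, 60, 256))
print(tuple(logits.shape))
```

Pooling only the width in the second block (1x2) keeps 64 timesteps along the horizontal axis, which gives CTC enough frames per character while the height is collapsed separately by the adaptive pool.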
| ### Training Optimizations | |
| - **AdamW Optimizer**: lr=3e-4, weight_decay=1e-4 | |
| - **Gradient Clipping**: max_norm=1.0 prevents exploding gradients | |
| - **Weight Initialization**: Small uniform weights (-1e-3, 1e-3) for stability | |
| - **Numeric Stability**: AMP disabled during initial training for stability | |
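A single optimization step with these settings might look like the sketch below. To stay short it uses a hypothetical linear layer as a stand-in for the CRNN and random tensors as data. Applying the uniform initialization to every parameter, including biases, is an assumption, and the CTC loss settings (`blank=0`, `zero_infinity=True`) are the ones this README lists for training.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the CRNN: 128 features per timestep -> 63 classes.
model = nn.Linear(128, 63)

# Small uniform initialization in (-1e-3, 1e-3).
for param in model.parameters():
    nn.init.uniform_(param, -1e-3, 1e-3)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
criterion = nn.CTCLoss(blank=0, zero_infinity=True)

T, B, label_len = 64, 4, 5                            # timesteps, batch, label length
features = torch.randn(T, B, 128)                     # random stand-in features
targets = torch.randint(1, 63, (B, label_len))        # class 0 is reserved for blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), label_len, dtype=torch.long)

optimizer.zero_grad()
log_probs = model(features).log_softmax(dim=2)        # CTCLoss expects log-probabilities
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
# Rescale gradients so their total norm does not exceed 1.0.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
print(float(loss), float(grad_norm))
```

`clip_grad_norm_` returns the gradient norm measured before clipping, which is a convenient value to log when watching for exploding gradients.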
| ### CTC Training | |
| - **Input**: Images resized to 60×256 (height×width) | |
| - **Output**: Character sequences (a-z, A-Z, 0-9) | |
| - **Loss**: CTCLoss with blank=0, zero_infinity=True | |
| - **Decoding**: Greedy CTC decode with duplicate removal | |
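The 63 classes are the 62 characters plus the CTC blank at index 0. Greedy decoding takes the most likely class at each timestep, collapses consecutive repeats, and drops blanks. The sketch below shows this in plain Python. It is not the project's `vocab.py`: the names and the ordering of characters within the vocabulary are assumptions.

```python
import string

BLANK = 0
CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(CHARS)}   # index 0 is the blank
ID_TO_CHAR = {i: ch for ch, i in CHAR_TO_ID.items()}
NUM_CLASSES = len(CHARS) + 1                             # 62 characters + blank


def encode(text):
    """Map a label string to class indices (never emits the blank)."""
    return [CHAR_TO_ID[ch] for ch in text]


def greedy_ctc_decode(best_path):
    """Collapse repeats, then drop blanks.

    best_path holds the argmax class index for each timestep.
    """
    chars = []
    previous = BLANK
    for idx in best_path:
        if idx != BLANK and idx != previous:
            chars.append(ID_TO_CHAR[idx])
        previous = idx
    return "".join(chars)


a, b = CHAR_TO_ID["a"], CHAR_TO_ID["b"]
path = [BLANK, a, a, BLANK, a, b, b, BLANK]
print(greedy_ctc_decode(path))  # prints "aab"
```

The blank between the two runs of `a` is what lets CTC emit a genuinely doubled character; without it, `a, a, a` would collapse to a single `a`.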
| ### Data Pipeline | |
| - **Images**: Grayscale, normalized to [0,1], proper cv2 resizing | |
| - **Labels**: CSV with filename and text label | |
| - **Batching**: Variable-length sequences with custom CTC collate function | |
| - **Debugging**: Real-time monitoring of logits, blank probability, predictions | |
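Because labels vary in length, the default batching cannot stack them into one tensor. A common approach, sketched below, concatenates all label indices into a single 1-D tensor and passes the individual lengths alongside, which is a layout `torch.nn.CTCLoss` accepts. This is an illustration rather than the contents of `collate.py`; the function name and the `(image, label_ids)` sample format are assumptions.

```python
import torch


def ctc_collate(batch):
    """Batch (image, label_ids) pairs for CTC training.

    Each image is a (1, H, W) float tensor; label_ids is a list of ints.
    """
    images = torch.stack([image for image, _ in batch])          # (B, 1, H, W)
    target_lengths = torch.tensor(
        [len(ids) for _, ids in batch], dtype=torch.long
    )
    # Concatenate all labels; CTCLoss uses target_lengths to split them again.
    targets = torch.tensor(
        [idx for _, ids in batch for idx in ids], dtype=torch.long
    )
    return images, targets, target_lengths


batch = [
    (torch.zeros(1, 60, 256), [1, 2, 3, 4]),
    (torch.ones(1, 60, 256), [5, 6, 7, 8, 9, 10]),
]
images, targets, target_lengths = ctc_collate(batch)
print(images.shape, targets.shape, target_lengths.tolist())
```

A function like this would be passed to `torch.utils.data.DataLoader` through its `collate_fn` argument.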
| ## Performance Expectations | |
| ### GTX 1650 (4GB VRAM) | |
| - Training time: 3-8 hours for 100k images × 40 epochs | |
| - Batch size: 32 | |
| - Memory efficient with H=60, W=256 | |
| ### Tesla T4 (16GB VRAM) | |
| - Training time: 2-4 hours for 100k images × 40 epochs | |
| - Batch size: 128 | |
| - Mixed precision (AMP) enabled | |
| ## Development Workflow | |
| 1. **Implement Dataset class** - Load and preprocess images | |
| 2. **Build CRNN model** - CNN + BiLSTM architecture | |
| 3. **Create training loop** - With validation and checkpoints | |
| 4. **Add metrics** - CER and accuracy tracking | |
| 5. **Test on small dataset** - Verify everything works | |
| 6. **Scale to full dataset** - Train on Colab | |
| ## Contributing | |
| This is a learning project! Feel free to: | |
| - Ask questions about implementation details | |
| - Experiment with different architectures | |
| - Improve the data generation or training pipeline | |
| ## Resources | |
| - [CTC Paper](https://www.cs.toronto.edu/~graves/icml_2006.pdf) | |
| - [CRNN Architecture](https://arxiv.org/abs/1507.05717) | |
| - [PyTorch CTCLoss Documentation](https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html) | |
| ## License | |
| This project is for educational purposes. Feel free to use and modify as needed. | |
| --- | |
| **Happy coding!** | |