# CAPTCHA OCR Project

A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling.

## 🎯 Project Overview

This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses:
- **Synthetic CAPTCHA generation** for training data
- **CRNN (CNN + RNN) architecture** for sequence recognition
- **CTC (Connectionist Temporal Classification)** loss for training
- **PyTorch** with CUDA support for GPU acceleration

## 🏗️ Current Status

### ✅ Completed Components
- **Dataset Generation**: Synthetic CAPTCHA creation with train/val/test splits
- **Configuration**: Centralized config with image dimensions and training parameters
- **Vocabulary System**: Character encoding/decoding with CTC blank token support
- **CTC Collate Function**: Proper batching for variable-length sequences
- **CTC Decoding**: Greedy decode for inference
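
The vocabulary component can be pictured with a minimal sketch like the one below. It assumes the character set is exactly the a-z, A-Z, 0-9 range listed under Technical Details and that index 0 is reserved for the CTC blank. The names `CHARS`, `encode`, and `decode` are illustrative and may not match `src/vocab.py`.

```python
import string

# Assumed character set, taken from the "a-z, A-Z, 0-9" range in this README.
CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits
BLANK = 0  # index reserved for the CTC blank token

# Real characters start at index 1 so that index 0 stays free for the blank.
CHAR_TO_IDX = {ch: i + 1 for i, ch in enumerate(CHARS)}
IDX_TO_CHAR = {i: ch for ch, i in CHAR_TO_IDX.items()}
VOCAB_SIZE = len(CHARS) + 1  # 62 characters plus the blank


def encode(text):
    """Map a label string to a list of integer indices (never the blank)."""
    return [CHAR_TO_IDX[ch] for ch in text]


def decode(indices):
    """Map indices back to a string, skipping any blank tokens."""
    return "".join(IDX_TO_CHAR[i] for i in indices if i != BLANK)
```

For example, `decode(encode("aB3"))` round-trips to `"aB3"`, and `VOCAB_SIZE` is the output dimension the model's final linear layer would need.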

### 🔧 In Progress / Next Steps
- **PyTorch Dataset Class**: Image loading and preprocessing
- **CRNN Model**: CNN encoder + BiLSTM + linear output
- **Training Loop**: Complete training pipeline with validation
- **Metrics**: CER (Character Error Rate) and exact match accuracy
- **Inference Pipeline**: Model loading and prediction
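
Since the metrics are not implemented yet, here is one possible sketch of CER and exact match accuracy in plain Python. It assumes CER is defined as total Levenshtein edit distance divided by total reference length, which is a common convention but not confirmed by this project. All function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(predictions, targets):
    """Character Error Rate: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(p, t) for p, t in zip(predictions, targets))
    total_chars = sum(len(t) for t in targets)
    return total_edits / max(total_chars, 1)


def exact_match(predictions, targets):
    """Fraction of predictions that equal their target string exactly."""
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / max(len(targets), 1)
```

Exact match is the stricter of the two: a single wrong character makes the whole CAPTCHA count as a miss, while CER only charges one edit.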

## 📁 Project Structure

```
CaptchaDetect/
├── Dataset/                # Full dataset (100k images) - for Colab training
├── Dataset_test/           # Test dataset (1k images) - for local development
│   └── captchas/
│       ├── train/          # 80% of data
│       ├── val/            # 10% of data
│       └── test/           # 10% of data
├── src/
│   ├── config.py           # Configuration and hyperparameters
│   ├── vocab.py            # Character vocabulary and CTC encoding
│   ├── data.py             # Dataset generation script
│   ├── collate.py          # CTC batching function
│   └── [model files]       # Coming soon...
├── .gitignore              # Ignores dataset contents, keeps structure
└── README.md               # This file
```

## 🚀 Quick Start

### 1. Environment Setup
```bash
# Install PyTorch with CUDA support (adjust version as needed)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Install other dependencies
pip install captcha pandas pillow
```

### 2. Generate Test Dataset
```bash
cd src
python data.py
```
This creates 1,000 synthetic CAPTCHAs in `Dataset_test/captchas/`, split 80/10/10 into `train/`, `val/`, and `test/`.
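
As a rough illustration of the 80/10/10 split sizes, the sketch below computes how many images land in each folder. It is not taken from `data.py`; the function name and the choice to give any rounding remainder to the test split are assumptions.

```python
def split_counts(total, train_pct=80, val_pct=10):
    """Return (train, val, test) image counts for a percentage split.

    Integer arithmetic avoids float rounding; any remainder goes to test.
    """
    n_train = total * train_pct // 100
    n_val = total * val_pct // 100
    n_test = total - n_train - n_val
    return n_train, n_val, n_test
```

With these assumptions, `split_counts(1000)` gives 800 training, 100 validation, and 100 test images, and `split_counts(100_000)` scales the same ratios to the full dataset.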

### 3. Configuration
Edit `src/config.py` to adjust:
- Image dimensions (H=48, W_max=224)
- Batch sizes (32 for local GTX 1650, 128 for Colab T4)
- Training parameters
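
A centralized config along these lines might look like the following sketch. The values come from this README, but the class name, field names, and the two presets are hypothetical and may differ from what `src/config.py` actually defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Field names are hypothetical; the values are the ones quoted in this README.
    img_height: int = 48               # H
    img_width_max: int = 224           # W_max
    batch_size: int = 32               # default for a local GTX 1650
    epochs: int = 40
    dataset_dir: str = "Dataset_test"  # 1k-image development set
    use_amp: bool = False              # mixed precision


# Two presets matching the local and Colab setups described in this README.
LOCAL = TrainConfig()
COLAB = TrainConfig(batch_size=128, dataset_dir="Dataset", use_amp=True)
```

Freezing the dataclass keeps a run's settings from being mutated by accident, so switching environments means picking a different preset rather than editing values in place.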

## 🎮 Usage

### Local Development (GTX 1650)
- Use `Dataset_test` (1k images)
- Batch size: 32-48
- Good for rapid iteration and testing

### Colab Training (Tesla T4)
- Use `Dataset` (100k images)
- Batch size: 128
- Expected training time: 2-4 hours for 40 epochs

## 🔬 Technical Details

### Model Architecture
- **CNN Encoder**: Reduces image to sequence representation
- **BiLSTM**: Processes sequential features
- **Linear Output**: Maps to vocabulary size (including blank token)
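
The CRNN model is still on the to-do list, so the block below is only a minimal sketch of the three stages named above, assuming 1-channel 48×224 inputs. The layer counts, channel widths, and `hidden_size` are illustrative choices, not the project's final architecture.

```python
import torch
from torch import nn


class CRNN(nn.Module):
    """CNN encoder -> BiLSTM -> linear layer, sized for 48-pixel-high inputs."""

    def __init__(self, num_classes, hidden_size=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # 48x224 -> 24x112
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # 24x112 -> 12x56
        )
        # Each time step sees one feature-map column: 64 channels x 12 rows.
        self.rnn = nn.LSTM(64 * 12, hidden_size,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, images):
        feats = self.cnn(images)                              # (B, 64, 12, W/4)
        b, c, h, w = feats.shape
        seq = feats.permute(0, 3, 1, 2).reshape(b, w, c * h)  # (B, T, 768)
        out, _ = self.rnn(seq)                                # (B, T, 2*hidden)
        logits = self.fc(out)                                 # (B, T, classes)
        return logits.permute(1, 0, 2)                        # (T, B, classes)
```

With a 224-pixel-wide input this yields T = 56 time steps. The output is laid out as `(T, B, classes)`, which is the shape `nn.CTCLoss` expects once `log_softmax` has been applied over the last dimension.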

### CTC Training
- **Input**: Images resized to 48×224
- **Output**: Character sequences (a-z, A-Z, 0-9)
- **Loss**: CTCLoss with blank=0
- **Decoding**: Greedy CTC decode
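
Greedy CTC decoding takes the most likely class at each time step, collapses consecutive repeats, and then drops blanks. The sketch below shows that collapse step on a plain list of per-step indices; the function name is hypothetical, and it assumes blank is index 0 as stated above.

```python
def ctc_greedy_collapse(best_path, blank=0):
    """Collapse a per-time-step argmax path into a label sequence.

    A symbol is kept only if it differs from the previous time step
    and is not the blank, so "a a - a b b -" becomes "a a b".
    """
    decoded = []
    prev = None
    for idx in best_path:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded
```

The blank is what lets a label contain doubled characters: `[1, 1, 0, 1, 2, 2, 0]` decodes to `[1, 1, 2]` because the blank between the two runs of 1 keeps them from merging.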

### Data Format
- **Images**: Grayscale, normalized tensors
- **Labels**: CSV with filename and text label
- **Batching**: Variable-length sequences handled by custom collate
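
The label side of CTC batching can be sketched without any framework. The version below flattens already-encoded labels and records their lengths, which matches the concatenated-targets form that `nn.CTCLoss` accepts. It is an assumption about what `collate.py` does, and it leaves out image stacking and tensor conversion.

```python
def collate_ctc_labels(encoded_labels, num_time_steps):
    """Build flat targets plus length lists for a batch of encoded labels.

    encoded_labels: list of lists of class indices, one list per image.
    num_time_steps: sequence length T the model emits for every image.
    """
    target_lengths = [len(label) for label in encoded_labels]
    # Simplified feasibility check: CTC needs at least one time step per
    # character (more when adjacent characters repeat).
    if max(target_lengths, default=0) > num_time_steps:
        raise ValueError("label is longer than the model's output sequence")
    targets = [idx for label in encoded_labels for idx in label]
    input_lengths = [num_time_steps] * len(encoded_labels)
    return targets, input_lengths, target_lengths
```

In a PyTorch pipeline the three returned lists would be wrapped in integer tensors and passed to the loss alongside the model's log-probabilities.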

## 📊 Performance Expectations

### GTX 1650 (4GB VRAM)
- Training time: 3-8 hours for 100k images × 40 epochs
- Batch size: 32-48
- Memory efficient with H=48

### Tesla T4 (16GB VRAM)
- Training time: 2-4 hours for 100k images × 40 epochs
- Batch size: 128
- Mixed precision (AMP) enabled
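
To see what these time estimates imply per optimizer step, the arithmetic can be sketched as below. It assumes only the 80% training split (80,000 of the 100k images) is iterated each epoch and ignores validation time; both are assumptions, and the helper names are hypothetical.

```python
import math


def total_steps(num_train_images, batch_size, epochs):
    """Number of optimizer steps over a full training run."""
    steps_per_epoch = math.ceil(num_train_images / batch_size)
    return steps_per_epoch * epochs


def seconds_per_step(hours, steps):
    """Average wall-clock seconds per step implied by a total run time."""
    return hours * 3600 / steps


# Assumption: only the 80% train split is iterated each epoch.
TRAIN_IMAGES = 100_000 * 80 // 100

t4_steps = total_steps(TRAIN_IMAGES, batch_size=128, epochs=40)   # 625 * 40
gtx_steps = total_steps(TRAIN_IMAGES, batch_size=32, epochs=40)   # 2500 * 40
```

Under these assumptions the T4 run is 25,000 steps, so 2-4 hours corresponds to roughly 0.29 to 0.58 seconds per step. The GTX 1650 run at batch size 32 is 100,000 steps, so 3-8 hours corresponds to roughly 0.11 to 0.29 seconds per step.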

## 🛠️ Development Workflow

1. **Implement Dataset class** - Load and preprocess images
2. **Build CRNN model** - CNN + BiLSTM architecture
3. **Create training loop** - With validation and checkpoints
4. **Add metrics** - CER and accuracy tracking
5. **Test on small dataset** - Verify everything works
6. **Scale to full dataset** - Train on Colab

## 🤝 Contributing

This is a learning project! Feel free to:
- Ask questions about implementation details
- Experiment with different architectures
- Improve the data generation or training pipeline

## 📚 Resources

- [CTC Paper](https://www.cs.toronto.edu/~graves/icml_2006.pdf)
- [CRNN Architecture](https://arxiv.org/abs/1507.05717)
- [PyTorch CTCLoss Documentation](https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html)

## 📝 License

This project is for educational purposes. Feel free to use and modify as needed.

---

**Happy coding! 🚀**