Commit 6e89f30
Parent(s): ada63c0

checkpoint
Files changed:
- .gitignore                   +2   -12
- Metrics/.gitkeep             +3   -0
- Metrics/training_metrics.txt +3   -0
- README.md                    +59  -28
- src/captcha_dataset.py       +49  -0
- src/config.py                +21  -8
- src/data.py                  +2   -2
- src/model_crnn.py            +80  -0
- src/plotting.py              +107 -0
- src/test.py                  +49  -0
- train.py                     +206 -0
- train_sanity.py              +96  -0
.gitignore
CHANGED

@@ -1,12 +1,4 @@
-
-#!/usr/bin/env bash
-# Create a .gitignore that keeps the Dataset folder but ignores its contents,
-# plus common Python/ML ignores. Run this from your repo root.
-
-set -e
-
-cat > .gitignore << 'EOF'
-# Keep the Dataset folder but ignore its contents
+# Keep Dataset folders but ignore their contents
 Dataset/
 !Dataset/.gitkeep
 !Dataset/**/
@@ -32,7 +24,6 @@ pip-wheel-metadata/
 wheels/
 .pytest_cache/
 .coverage
-#.coverage.*  # uncomment if you create multiple coverage files
 htmlcov/
 .cache/
 .mypy_cache/
@@ -75,7 +66,7 @@ logs/
 Thumbs.db
 desktop.ini
 
-# Images/artifacts
+# Images/artifacts
 *.png
 *.jpg
 *.jpeg
@@ -143,5 +134,4 @@ cmake-build-*/
 *.class
 .gradle/
 build/
-EOF
 
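One caveat on the resulting rules: git cannot re-include a file whose parent directory is excluded, so with `Dataset/` ignored, the `!Dataset/.gitkeep` and `!Dataset/**/` negations have no effect. The conventional form ignores the directory's contents instead (`Dataset/*` plus `!Dataset/.gitkeep`), which keeps the folder tracked while dropping everything inside it.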
Metrics/.gitkeep
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f67f029777a688ea90615d9a21b1935347c102ecee39cb5d50f740a4e95095eb
+size 122
Metrics/training_metrics.txt
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:edef5b371d7b2c75153063f41c43f0e3dff8d58d5fda50e7a0db52d230e04f3c
+size 807
README.md
CHANGED

@@ -13,18 +13,31 @@ This project implements an end-to-end CAPTCHA OCR system that can recognize text
 ## 🏗️ Current Status
 
 ### ✅ Completed Components
-- **Dataset Generation**: Synthetic CAPTCHA creation with train/val/test splits
+- **Dataset Generation**: Synthetic CAPTCHA creation with train/val/test splits (8k train, 1k val)
 - **Configuration**: Centralized config with image dimensions and training parameters
-- **Vocabulary System**: Character encoding/decoding with CTC blank token support
+- **Vocabulary System**: Character encoding/decoding with CTC blank token support (63 classes)
 - **CTC Collate Function**: Proper batching for variable-length sequences
 - **CTC Decoding**: Greedy decode for inference
-
-
-- **
-- **
-- **
-
-
+- **PyTorch Dataset Class**: Image loading and preprocessing with proper cv2 resizing
+- **CRNN Model**: CNN encoder + BiLSTM + LayerNorm + linear output (working!)
+- **Training Loop**: Complete epoch-based training pipeline with validation
+- **Metrics & Plotting**: Training/validation loss tracking with beautiful visualizations
+- **Debugging Tools**: Comprehensive logging of logits, predictions, and model health
+
+### ✅ What's Working
+- **Training Pipeline**: Stable training loop with proper loss convergence
+- **Model Architecture**: CRNN produces correct output shapes (56×batch×63)
+- **Data Loading**: Proper image preprocessing and CTC batching
+- **Early Learning**: Model outputs first characters after 3 epochs (blank prob: 1.0→0.975)
+
+### ❌ What's Not Working Yet
+- **Accuracy**: Still very low, mostly single characters (`'t', 'tu'`)
+- **Sequence Length**: Not yet producing full CAPTCHA sequences
+- **Character Diversity**: Limited to a few characters, needs more training
+
+### 🎯 Training Status
+- **Current**: Epoch 3, basic character recognition starting
+- **Estimated**: 20-40 epochs needed for decent CAPTCHA accuracy
 
 ## 📁 Project Structure
 
@@ -38,10 +51,14 @@ CaptchaDetect/
 │   └── test/          # 10% of data
 ├── src/
 │   ├── config.py      # Configuration and hyperparameters
-│   ├── vocab.py       # Character vocabulary and CTC encoding
+│   ├── vocab.py       # Character vocabulary and CTC encoding/decoding
 │   ├── data.py        # Dataset generation script
 │   ├── collate.py     # CTC batching function
-│
+│   ├── captcha_dataset.py  # PyTorch Dataset class
+│   ├── model_crnn.py  # CRNN model architecture
+│   └── plotting.py    # Training metrics and visualization
+├── train.py           # Main training script (✅ WORKING!)
+├── Metrics/           # Training plots and logs (auto-generated)
 ├── .gitignore         # Ignores dataset contents, keeps structure
 └── README.md          # This file
 ```
@@ -57,18 +74,24 @@ pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu12
 pip install captcha pandas pillow
 ```
 
-### 2. Generate
+### 2. Generate Training Dataset
 ```bash
 cd src
 python data.py
 ```
-This creates
+This creates 10,000 synthetic CAPTCHAs in `Dataset_test/captchas/` with proper train/val/test splits.
 
-### 3.
-
-
-
-
+### 3. Start Training
+```bash
+python train.py
+```
+This starts the full training pipeline with automatic metrics generation.
+
+### 4. Monitor Progress
+Training will show:
+- Real-time loss and prediction samples
+- Automatic plot generation in `Metrics/` folder
+- Comprehensive training logs and summaries
 
 ## 🎮 Usage
 
@@ -84,21 +107,29 @@ Edit `src/config.py` to adjust:
 
 ## 🔬 Technical Details
 
-### Model Architecture
-- **CNN Encoder**:
-- **BiLSTM**:
-- **
+### Model Architecture (CRNN)
+- **CNN Encoder**: SmallCNN with stride=4, reduces W=224→56 timesteps
+- **BiLSTM**: 2-layer bidirectional LSTM (256 hidden, dropout=0.1)
+- **LayerNorm**: Stabilizes training before output layer
+- **Linear Output**: Maps to 63 classes (62 chars + 1 blank token)
+
+### Training Optimizations
+- **AdamW Optimizer**: lr=3e-4, weight_decay=1e-4
+- **Gradient Clipping**: max_norm=1.0 prevents exploding gradients
+- **Weight Initialization**: Small uniform weights (-1e-3, 1e-3) for stability
+- **Numeric Stability**: AMP disabled during initial training for stability
 
 ### CTC Training
-- **Input**: Images resized to 48×224
+- **Input**: Images resized to 48×224 (height×width)
 - **Output**: Character sequences (a-z, A-Z, 0-9)
-- **Loss**: CTCLoss with blank=0
-- **Decoding**: Greedy CTC decode
+- **Loss**: CTCLoss with blank=0, zero_infinity=True
+- **Decoding**: Greedy CTC decode with duplicate removal
 
-### Data
-- **Images**: Grayscale, normalized
+### Data Pipeline
+- **Images**: Grayscale, normalized to [0,1], proper cv2 resizing
 - **Labels**: CSV with filename and text label
-- **Batching**: Variable-length sequences
+- **Batching**: Variable-length sequences with custom CTC collate function
+- **Debugging**: Real-time monitoring of logits, blank probability, predictions
 
 ## 📊 Performance Expectations
 
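The CTC items above (blank=0, 63 classes, greedy decode with duplicate removal) live in `src/vocab.py`, which this commit does not touch. For reference, a minimal sketch of that decoding rule; the real `ctc_greedy_decode` and `itos` may differ in detail:

```python
import torch

def greedy_ctc_decode_sketch(logits: torch.Tensor, itos: list) -> list:
    """logits: [T, B, V] with index 0 as the CTC blank."""
    best = logits.argmax(dim=-1)  # [T, B] best class per timestep
    decoded = []
    for b in range(best.shape[1]):
        prev, chars = 0, []
        for idx in best[:, b].tolist():
            # Collapse consecutive repeats, then drop blanks
            if idx != prev and idx != 0:
                chars.append(itos[idx])
            prev = idx
        decoded.append("".join(chars))
    return decoded
```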
src/captcha_dataset.py
ADDED

@@ -0,0 +1,49 @@
import glob
import cv2
import pandas as pd
import torch
import os
from src.config import cfg
from dataclasses import dataclass

@dataclass
class CaptchaDataset(torch.utils.data.Dataset):
    def __init__(self, folder: str):
        self.data_root = cfg.data_root
        df = pd.read_csv(f"{self.data_root}/{folder}/labels.csv")
        self.data = []
        for _, row in df.iterrows():
            filename = row['filename']
            label = row['label']
            img_path = f"{self.data_root}/{folder}/{row['filename']}"

            # Check if file actually exists
            if os.path.exists(img_path):
                self.data.append((img_path, label, folder))
            else:
                print(f"Warning: Image file not found: {img_path}")

        print(f"Loaded {len(self.data)} valid images from {folder}")
        self.img_dim = (cfg.W_max, cfg.H)  # cv2.resize expects (width, height)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        img_path, label_string, folder = self.data[idx]

        # Load image with error checking
        img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE if cfg.grayscale else cv2.IMREAD_COLOR)

        if img is None:
            raise ValueError(f"Failed to load image: {img_path}")

        img = cv2.resize(img, self.img_dim)
        img_tensor = torch.from_numpy(img).float() / 255.0  # Normalize to [0,1]
        img_tensor = img_tensor.unsqueeze(0)  # Add channel dimension
        return img_tensor, label_string, img_path
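`CaptchaDataset` returns a `[1, H, W]` float tensor in [0,1] plus the raw label string and path (label encoding is left to the collate function). Incidentally, the `@dataclass` decorator is inert here: the class defines its own `__init__` and declares no annotated fields. A quick illustrative check under the current config (H=60, W_max=256):

```python
from src.captcha_dataset import CaptchaDataset

ds = CaptchaDataset("val")
img, label, path = ds[0]
print(img.shape)                           # torch.Size([1, 60, 256]) -> [C, H, W]
print(img.min().item(), img.max().item())  # values within [0.0, 1.0]
print(label, path)                         # a 5-7 char string and the image path
```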
src/config.py
CHANGED

@@ -7,14 +7,27 @@ class Config:
     data_root: str = os.getenv("DATA_ROOT","Dataset_test\captchas")
 
     chars: str = string.ascii_letters + string.digits
-
-
-
-
-
-
-
+
+    # Image dimensions - increased for better character detail
+    H: int = 60        # Increased from 48 for more vertical detail
+    W_max: int = 256   # Increased from 224 for more time steps (T=64)
+    grayscale: bool = True
+
+    # Model architecture
+    total_stride: int = 4   # CNN width downsampling factor
+
+    # Training hyperparameters
+    batch_size: int = 32      # Local testing
+    batch_size_t4: int = 128  # Colab T4 recommendation
     num_workers: int = 4
-    amp: bool = True
+    amp: bool = True
+
+    # Learning rate and optimization
+    lr: float = 3e-4
+    weight_decay: float = 1e-4
+
+    # Training duration
+    epochs: int = 40       # For 100k dataset
+    epochs_test: int = 10  # For 1k test dataset
 
 cfg = Config()
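With these values, the CNN's width downsampling yields T = W_max / total_stride = 256 / 4 = 64 timesteps per image, comfortably above the 7-character upper limit that CTC must align (the README's "48×224 → 56" figures describe the previous dimensions). Note also that `data_root` keeps a Windows-style backslash path; forward slashes would be portable. An illustrative check:

```python
from src.config import cfg

T = cfg.W_max // cfg.total_stride
print(T)  # 64 timesteps out of the CNN
# CTC requires input_length >= target_length for every sample
assert T >= 7, "timesteps must cover the longest CAPTCHA (7 chars)"
```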
src/data.py
CHANGED

@@ -8,11 +8,11 @@ import pandas as pd
 # config
 DATASET_DIR = "Dataset_test/captchas"
 LABELS = "Dataset_test/labels.csv"
-NUM_IMAGES =
+NUM_IMAGES = 10000
 CHARS = string.ascii_letters + string.digits
 CAPTCHA_LEN_LOWER_LIMIT = 5
 CAPTCHA_LEN_UPPER_LIMIT = 7
-directories = [["train",0.8],["
+directories = [["train",0.8],["val",0.1],["test",0.1]]
 
 os.makedirs(DATASET_DIR, exist_ok=True)
 image = ImageCaptcha(width=160, height=60)
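The split fractions work out to 8,000/1,000/1,000 images for train/val/test, matching the README's "(8k train, 1k val)". The generation loop itself is outside this hunk, so the sketch below only illustrates how these constants combine; any name not shown above is an assumption:

```python
import random
import string

CHARS = string.ascii_letters + string.digits
CAPTCHA_LEN_LOWER_LIMIT = 5
CAPTCHA_LEN_UPPER_LIMIT = 7

def random_label() -> str:
    # Draw a 5-7 character alphanumeric CAPTCHA label
    n = random.randint(CAPTCHA_LEN_LOWER_LIMIT, CAPTCHA_LEN_UPPER_LIMIT)
    return "".join(random.choices(CHARS, k=n))

NUM_IMAGES = 10000
directories = [["train", 0.8], ["val", 0.1], ["test", 0.1]]
for name, frac in directories:
    print(name, int(NUM_IMAGES * frac))  # train 8000, val 1000, test 1000
```

Note the generator draws 160×60 images, which the dataset later stretches to 256×60 at load time.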
src/model_crnn.py
ADDED

@@ -0,0 +1,80 @@
import torch
import torch.nn as nn
from src.config import cfg

class SmallCNN(nn.Module):
    """
    Improved CNN with BatchNorm and residual connections.
    Produces feature map with total stride 4 along width,
    and compresses height to ~1 via pooling.
    """
    def __init__(self, in_ch=1) -> None:
        super().__init__()
        # First conv block: H,W -> H/2, W/2
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))  # stride 2x2
        )

        # Second conv block: maintain H/2, W/2 -> W/4
        self.conv2 = nn.Sequential(
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=(1, 2), stride=(1, 2))  # height stride 1, width stride 2
        )

        # Residual block at 128 channels
        self.residual = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1),
            nn.BatchNorm2d(128)
        )
        self.residual_relu = nn.ReLU(inplace=True)

        self.height_pool = nn.AdaptiveAvgPool2d((1, None))  # squeeze height to 1

    def forward(self, x):
        # First two conv blocks
        f = self.conv1(x)  # [B, 64, H/2, W/2]
        f = self.conv2(f)  # [B, 128, H/2, W/4]

        # Residual connection
        residual = f
        f = self.residual(f)       # [B, 128, H/2, W/4]
        f = f + residual           # Skip connection
        f = self.residual_relu(f)  # [B, 128, H/2, W/4]

        # Height pooling
        f = self.height_pool(f)  # [B, 128, 1, W/4]
        f = f.squeeze(2)         # [B, 128, W/4]
        f = f.permute(2, 0, 1)   # [T(=W/4), B, 128]
        return f

class CRNN(nn.Module):
    def __init__(self, vocab_size: int, in_ch: int = 1, hidden: int = 320, layers: int = 2, dropout: float = 0.05):
        super().__init__()
        self.cnn = SmallCNN(in_ch=in_ch)
        self.rnn = nn.LSTM(input_size=128, hidden_size=hidden, num_layers=layers,
                           bidirectional=True, dropout=dropout, batch_first=False)
        self.norm = nn.LayerNorm(2 * hidden)  # Add LayerNorm for stability
        self.fc = nn.Linear(2 * hidden, vocab_size)

        # Initialize weights properly
        self._init_weights()

    def _init_weights(self):
        # Initialize final linear layer with small weights
        nn.init.xavier_uniform_(self.fc.weight)
        nn.init.zeros_(self.fc.bias)

    def forward(self, x):
        seq = self.cnn(x)     # [T,B,C=128]
        y, _ = self.rnn(seq)  # [T,B,2H]
        y = self.norm(y)      # [T,B,2H] - Apply LayerNorm
        logits = self.fc(y)   # [T,B,V]
        return logits
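A dummy forward pass confirms the shape comments. With the current config (60×256 input) the time axis comes out at 64; the README's "56×batch×63" figure assumed the older 224-pixel width. Illustrative only:

```python
import torch
from src.model_crnn import CRNN

model = CRNN(vocab_size=63)
x = torch.randn(2, 1, 60, 256)  # [B, C, H, W] dummy batch
logits = model(x)
print(logits.shape)  # torch.Size([64, 2, 63]) -> [T, B, V]
```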
src/plotting.py
ADDED

@@ -0,0 +1,107 @@
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
import os


class TrainingMetrics:
    def __init__(self):
        self.train_losses = []
        self.val_losses = []
        self.epochs = []
        self.sample_predictions = []
        self.sample_targets = []

    def add_epoch(self, epoch, train_loss, val_loss):
        self.epochs.append(epoch)
        self.train_losses.append(train_loss)
        self.val_losses.append(val_loss)

    def add_predictions(self, predictions, targets):
        self.sample_predictions.extend(predictions)
        self.sample_targets.extend(targets)

    def plot_losses(self, save_path="Metrics/training_losses.png"):
        plt.figure(figsize=(10, 6))
        plt.plot(self.epochs, self.train_losses, 'b-', label='Training Loss', linewidth=2)
        plt.plot(self.epochs, self.val_losses, 'r-', label='Validation Loss', linewidth=2)
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title('Training and Validation Loss Over Time')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
        plt.close()
        print(f"Loss plot saved to: {save_path}")

    def plot_loss_comparison(self, save_path="Metrics/loss_comparison.png"):
        plt.figure(figsize=(12, 8))

        # Main loss plot
        plt.subplot(2, 2, 1)
        plt.plot(self.epochs, self.train_losses, 'b-', label='Training Loss')
        plt.plot(self.epochs, self.val_losses, 'r-', label='Validation Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title('Training vs Validation Loss')
        plt.legend()
        plt.grid(True, alpha=0.3)

        # Loss difference plot
        plt.subplot(2, 2, 2)
        loss_diff = [t - v for t, v in zip(self.train_losses, self.val_losses)]
        plt.plot(self.epochs, loss_diff, 'g-', label='Train - Val Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss Difference')
        plt.title('Overfitting Indicator')
        plt.legend()
        plt.grid(True, alpha=0.3)

        # Loss ratio plot
        plt.subplot(2, 2, 3)
        loss_ratio = [v / t if t > 0 else 0 for t, v in zip(self.train_losses, self.val_losses)]
        plt.plot(self.epochs, loss_ratio, 'm-', label='Val/Train Loss Ratio')
        plt.xlabel('Epoch')
        plt.ylabel('Ratio')
        plt.title('Validation/Training Loss Ratio')
        plt.legend()
        plt.grid(True, alpha=0.3)

        # Loss improvement plot
        plt.subplot(2, 2, 4)
        train_improvement = [self.train_losses[0] - t for t in self.train_losses]
        val_improvement = [self.val_losses[0] - v for v in self.val_losses]
        plt.plot(self.epochs, train_improvement, 'b-', label='Training Improvement')
        plt.plot(self.epochs, val_improvement, 'r-', label='Validation Improvement')
        plt.xlabel('Epoch')
        plt.ylabel('Loss Improvement')
        plt.title('Loss Improvement from Start')
        plt.legend()
        plt.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
        plt.close()
        print(f"Loss comparison plot saved to: {save_path}")

    def save_metrics(self, save_path="Metrics/training_metrics.txt"):
        with open(save_path, 'w') as f:
            f.write("CAPTCHA OCR Training Metrics\n")
            f.write("=" * 50 + "\n\n")
            f.write(f"Training completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write(f"Total epochs: {len(self.epochs)}\n\n")

            f.write("Loss Summary:\n")
            f.write("-" * 20 + "\n")
            f.write(f"Final training loss: {self.train_losses[-1]:.4f}\n")
            f.write(f"Final validation loss: {self.val_losses[-1]:.4f}\n")
            f.write(f"Best training loss: {min(self.train_losses):.4f}\n")
            f.write(f"Best validation loss: {min(self.val_losses):.4f}\n")
            f.write(f"Training loss improvement: {self.train_losses[0] - self.train_losses[-1]:.4f}\n")
            f.write(f"Validation loss improvement: {self.val_losses[0] - self.val_losses[-1]:.4f}\n\n")

            f.write("Sample Predictions:\n")
            f.write("-" * 20 + "\n")
            for i, (pred, target) in enumerate(zip(self.sample_predictions[:10], self.sample_targets[:10])):
                f.write(f"Sample {i+1}: Predicted='{pred}', Target='{target}'\n")
src/test.py
ADDED

@@ -0,0 +1,49 @@
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")

    # GPU detection and info
    gpu_count = torch.cuda.device_count()
    print(f"Number of GPUs: {gpu_count}")

    for i in range(gpu_count):
        gpu_name = torch.cuda.get_device_name(i)
        gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3  # Convert to GB
        print(f"GPU {i}: {gpu_name}")
        print(f"GPU {i} Memory: {gpu_memory:.1f} GB")

    # Current GPU
    current_gpu = torch.cuda.current_device()
    print(f"Current GPU: {current_gpu}")

    # Test GPU tensor operations
    print("\nTesting GPU operations...")
    try:
        # Create a test tensor on GPU
        test_tensor = torch.randn(1000, 1000).cuda()
        print(f"✓ Successfully created tensor on GPU: {test_tensor.shape}")
        print(f"✓ Tensor device: {test_tensor.device}")

        # Test basic operations
        result = torch.mm(test_tensor, test_tensor.T)
        print(f"✓ Matrix multiplication successful: {result.shape}")

        # Memory usage
        allocated = torch.cuda.memory_allocated() / 1024**2  # MB
        cached = torch.cuda.memory_reserved() / 1024**2  # MB
        print(f"✓ GPU Memory allocated: {allocated:.1f} MB")
        print(f"✓ GPU Memory cached: {cached:.1f} MB")

        # Clean up
        del test_tensor, result
        torch.cuda.empty_cache()
        print("✓ GPU memory cleaned up successfully")

    except Exception as e:
        print(f"✗ GPU test failed: {e}")
else:
    print("CUDA not available - PyTorch will use CPU only")
train.py
ADDED

@@ -0,0 +1,206 @@
import os
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from src.config import cfg
from src.collate import ctc_collate
from src.captcha_dataset import CaptchaDataset
from src.vocab import vocab_size, ctc_greedy_decode, decode_indices, itos
from src.plotting import TrainingMetrics
from src.model_crnn import CRNN
import difflib

def cer(pred: str, tgt: str) -> float:
    """Approximate Character Error Rate using difflib."""
    sm = difflib.SequenceMatcher(a=pred, b=tgt)
    return 1 - sm.ratio()

def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    in_ch = 1 if cfg.grayscale else 3

    print("Creating datasets...")
    train_ds = CaptchaDataset("train")
    val_ds = CaptchaDataset("val")

    # Debug: Check vocabulary
    print(f"Vocabulary size: {vocab_size()}")
    print(f"First 10 characters: {list(cfg.chars)[:10]}")
    print(f"First 10 itos: {itos[:10]}")

    print(f"Training dataset size: {len(train_ds)}")
    print(f"Validation dataset size: {len(val_ds)}")

    train_dl = DataLoader(train_ds, batch_size=cfg.batch_size, shuffle=True,
                          num_workers=cfg.num_workers, pin_memory=True,
                          drop_last=True, collate_fn=ctc_collate)
    val_dl = DataLoader(val_ds, batch_size=cfg.batch_size, shuffle=False,
                        num_workers=cfg.num_workers, pin_memory=True,
                        drop_last=True, collate_fn=ctc_collate)

    model = CRNN(vocab_size=vocab_size()).to(device)

    # Initialize final layer with small weights for stability
    with torch.no_grad():
        torch.nn.init.uniform_(model.fc.weight, -1e-3, 1e-3)
        torch.nn.init.zeros_(model.fc.bias)

    criterion = nn.CTCLoss(blank=0, zero_infinity=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
    scaler = torch.amp.GradScaler('cuda', enabled=False)  # Disable AMP for stability

    # Epoch-based training with scheduler
    epochs = 20  # Increased for OneCycleLR
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=3e-4, steps_per_epoch=len(train_dl), epochs=epochs
    )
    print(f"\nStarting training for {epochs} epochs...")

    metrics = TrainingMetrics()

    for epoch in range(epochs):
        # Training phase
        model.train()
        total_train_loss = 0
        num_batches = 0

        print(f"\nEpoch {epoch+1}/{epochs}")
        print("Training...")

        for batch_idx, batch in enumerate(train_dl):
            images, targets_flat, target_lengths, input_lengths, paths = batch

            # CTC sanity checks (first batch of each epoch)
            if batch_idx == 0:
                assert targets_flat.numel() == target_lengths.sum().item(), "Target lengths mismatch"
                assert torch.all(target_lengths <= input_lengths), "Target longer than input"
                print(f"  Batch 0 sanity: input_lens={input_lengths[:5].tolist()}, target_lens={target_lengths[:5].tolist()}")
                print(f"  Image stats: min={images.min():.3f}, max={images.max():.3f}, mean={images.mean():.3f}")

            images = images.to(device)
            targets_flat = targets_flat.to(device)
            target_lengths = target_lengths.to(device)
            input_lengths = input_lengths.to(device)

            optimizer.zero_grad(set_to_none=True)

            with torch.amp.autocast('cuda', enabled=False):
                logits = model(images)
                log_probs = logits.log_softmax(dim=-1)
                loss = criterion(log_probs, targets_flat, input_lengths, target_lengths)

            loss.backward()

            # Gradient clipping to prevent exploding gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()
            scheduler.step()  # OneCycleLR step per batch

            total_train_loss += loss.item()
            num_batches += 1

            # Progress update every 50 batches
            if batch_idx % 50 == 0:
                print(f"  Batch {batch_idx}/{len(train_dl)} - Loss: {loss.item():.4f}")

        avg_train_loss = total_train_loss / num_batches

        # Validation phase
        model.eval()
        total_val_loss = 0
        num_val_batches = 0

        print("Validating...")
        with torch.no_grad():
            for batch in val_dl:
                images, targets_flat, target_lengths, input_lengths, paths = batch
                images = images.to(device)
                targets_flat = targets_flat.to(device)
                target_lengths = target_lengths.to(device)
                input_lengths = input_lengths.to(device)

                logits = model(images)
                log_probs = logits.log_softmax(dim=-1)
                loss = criterion(log_probs, targets_flat, input_lengths, target_lengths)

                total_val_loss += loss.item()
                num_val_batches += 1

        avg_val_loss = total_val_loss / num_val_batches

        print(f"Epoch {epoch+1}/{epochs} Summary:")
        print(f"  Train Loss: {avg_train_loss:.4f}")
        print(f"  Val Loss: {avg_val_loss:.4f}")
        metrics.add_epoch(epoch+1, avg_train_loss, avg_val_loss)

        # Test some predictions
        if epoch % 2 == 0:  # Every 2 epochs
            print("Sample predictions:")
            with torch.no_grad():
                test_batch = next(iter(val_dl))
                test_images = test_batch[0][:5].to(device)  # First 5 images
                print(f"  Input image shape: {test_images.shape}")
                print(f"  Input image min/max: {test_images.min():.4f}/{test_images.max():.4f}")
                test_logits = model(test_images)

                # Debug: Check logits shape and values
                print(f"  Logits shape: {test_logits.shape}")
                print(f"  Expected logits shape: [W//stride, B, V] = [{cfg.W_max}//{cfg.total_stride}, 5, 63] = [{cfg.W_max//cfg.total_stride}, 5, 63]")
                print(f"  Logits min/max: {test_logits.min():.4f}/{test_logits.max():.4f}")

                # Check raw predictions and blank probability (from softmax)
                raw_preds = test_logits.argmax(dim=-1)
                probs = test_logits.log_softmax(-1).exp()
                avg_blank_prob = probs[..., 0].mean().item()
                print(f"  Raw predictions shape: {raw_preds.shape}")
                print(f"  Raw predictions sample: {raw_preds[:10, 0].tolist()}")
                print(f"  Avg blank prob (softmax): {avg_blank_prob:.4f}")
                print(f"  Blank probability (argmax): {(raw_preds == 0).float().mean():.4f}")

                test_preds = ctc_greedy_decode(test_logits)

                # Decode the target integers back to text strings with proper offsets
                targets_flat, target_lengths = test_batch[1], test_batch[2]
                offsets = torch.zeros(len(target_lengths), dtype=torch.long)
                offsets[1:] = torch.cumsum(target_lengths[:-1], dim=0)
                test_targets = []
                for i in range(min(5, len(target_lengths))):
                    s = offsets[i].item()
                    e = s + target_lengths[i].item()
                    indices = targets_flat[s:e].tolist()
                    test_targets.append(decode_indices(indices))

                # Calculate CER for this batch
                batch_cer = sum(cer(p, t) for p, t in zip(test_preds, test_targets)) / len(test_targets)
                print(f"  Val CER (approx): {batch_cer:.3f}")

                for i, (pred, target) in enumerate(zip(test_preds, test_targets)):
                    print(f"  {i}: Predicted='{pred}', Target='{target}'")

                metrics.add_predictions(test_preds, test_targets)

    print("\nTraining complete!")
    print("\nGenerating training metrics and plots...")
    os.makedirs("Metrics", exist_ok=True)
    metrics.plot_losses()
    metrics.plot_loss_comparison()
    metrics.save_metrics()

    # Final validation test
    model.eval()
    with torch.no_grad():
        images, targets_flat, target_lengths, input_lengths, paths = next(iter(val_dl))
        images = images.to(device)
        logits = model(images)
        preds = ctc_greedy_decode(logits)

    print("\nFinal validation predictions:")
    for i, pred in enumerate(preds[:10]):
        print(f"  {i}: {pred}")


if __name__ == "__main__":
    os.makedirs("checkpoints", exist_ok=True)
    main()
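Two notes on train.py. First, `cer` is a similarity-based approximation: difflib's `ratio()` is 2M/(len(a)+len(b)), so `cer("tu", "tuVX9")` = 1 − 4/7 ≈ 0.43. Second, `main()` creates a checkpoints/ directory but never writes to it; a hedged sketch of per-epoch saving that could slot in after the validation phase (not part of this commit):

```python
import torch

def save_checkpoint(model, optimizer, epoch, val_loss,
                    path="checkpoints/crnn_last.pt"):
    # Illustrative helper -- train.py does not yet save checkpoints.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "val_loss": val_loss,
    }, path)
```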
train_sanity.py
ADDED

@@ -0,0 +1,96 @@
import os
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from src.config import cfg
from src.collate import ctc_collate
from src.captcha_dataset import CaptchaDataset
from src.vocab import vocab_size, ctc_greedy_decode
from src.model_crnn import CRNN


def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    in_ch = 1 if cfg.grayscale else 3

    print("Creating datasets...")
    train_ds = CaptchaDataset("train")
    val_ds = CaptchaDataset("val")

    print(f"Training dataset size: {len(train_ds)}")
    print(f"Validation dataset size: {len(val_ds)}")

    train_dl = DataLoader(train_ds, batch_size=cfg.batch_size, shuffle=True,
                          num_workers=cfg.num_workers, pin_memory=True,
                          drop_last=True, collate_fn=ctc_collate)
    val_dl = DataLoader(val_ds, batch_size=cfg.batch_size, shuffle=False,
                        num_workers=cfg.num_workers, pin_memory=True,
                        drop_last=True, collate_fn=ctc_collate)

    # # Test training data
    # print("\nTesting training data...")
    # for batch in train_dl:
    #     images, targets_flat, target_lengths, input_lengths, paths = batch
    #     print(f"Training batch shape: {images.shape}")
    #     print(f"Sample labels: {targets_flat[:10]}")
    #     break

    # # Test validation data
    # print("\nTesting validation data...")
    # try:
    #     for batch in val_dl:
    #         images, targets_flat, target_lengths, input_lengths, paths = batch
    #         print(f"Validation batch shape: {images.shape}")
    #         print(f"Sample labels: {targets_flat[:10]}")
    #         break
    # except Exception as e:
    #     print(f"Error in validation data: {e}")
    #     print("This suggests there are issues with some validation images")

    model = CRNN(vocab_size=vocab_size()).to(device)
    criterion = nn.CTCLoss(blank=0, zero_infinity=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scaler = torch.amp.GradScaler('cuda', enabled=cfg.amp and device.type == "cuda")

    model.train()
    steps = 200
    it = iter(train_dl)
    for step in range(1, steps+1):
        try:
            images, targets_flat, target_lengths, input_lengths, paths = next(it)
        except StopIteration:
            it = iter(train_dl)
            images, targets_flat, target_lengths, input_lengths, paths = next(it)

        images = images.to(device)
        targets_flat = targets_flat.to(device)
        target_lengths = target_lengths.to(device)
        input_lengths = input_lengths.to(device)

        optimizer.zero_grad(set_to_none=True)

        with torch.amp.autocast('cuda', enabled=scaler.is_enabled()):
            logits = model(images)
            log_probs = logits.log_softmax(dim=-1)
            loss = criterion(log_probs, targets_flat, input_lengths, target_lengths)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        if step % 20 == 0:
            print(f"step {step}/{steps} - loss {loss.item():.4f}")

    model.eval()
    with torch.no_grad():
        images, targets_flat, target_lengths, input_lengths, paths = next(iter(val_dl))
        images = images.to(device)
        logits = model(images)
        preds = ctc_greedy_decode(logits)

    print("Sanity check complete")


if __name__ == "__main__":
    os.makedirs("checkpoints", exist_ok=True)
    main()