✨ DeepCaptcha-CRNN: Sequential Vision for OCR

CRNN Fine-Tuned

License: MIT Python 3.13+ Hugging Face Model


Captcha Example

Advanced sequence recognition using a Convolutional Recurrent Neural Network (CRNN) with Connectionist Temporal Classification (CTC) loss.


📋 Model Details

  • Task: Alphanumeric Captcha Recognition
  • Input: Images
  • Output: String sequences (Length 1–8 characters)
  • Vocabulary: Alphanumeric (a-z, A-Z, 0-9)
  • Architecture: CRNN (CNN + Bi-LSTM)
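A CRNN turns an image into a sequence of per-column predictions: a CNN extracts features, each column of the feature map becomes one time step, and a Bi-LSTM reads those steps before a linear layer scores every class. The following minimal PyTorch sketch shows that data flow. The layer sizes, the grayscale input and the 63-class output (62 characters plus one CTC blank) are assumptions. This is not the released model and is not sized to match its 3,570,943 parameters.

```python
import torch
from torch import nn

class TinyCRNN(nn.Module):
    """Hypothetical CRNN: CNN feature extractor -> Bi-LSTM -> per-time-step classifier."""

    def __init__(self, num_classes: int = 63, hidden: int = 128):
        super().__init__()
        # Grayscale input (1 channel); the last pooling keeps width so enough time steps remain
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),         # H/2, W/2
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),        # H/4, W/4
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),  # H/8, W/4
        )
        self.rnn = nn.LSTM(128, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, 1, H, W)
        f = self.cnn(x)            # (N, C, H', W')
        f = f.mean(dim=2)          # collapse the height axis -> (N, C, W')
        f = f.permute(0, 2, 1)     # (N, T=W', C): each feature column is one time step
        out, _ = self.rnn(f)       # (N, T, 2 * hidden), context from both directions
        return self.fc(out)        # (N, T, num_classes) logits for CTC

model = TinyCRNN()
logits = model(torch.randn(2, 1, 32, 128))
print(logits.shape)  # torch.Size([2, 32, 63])
```

With a 128-pixel-wide input this yields 32 time steps. CTC needs at least as many steps as the label length (plus one blank between repeated characters), so 32 steps leave ample room for labels of up to 8 characters.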

📊 Performance Metrics

This project features four models that explore the trade-offs between recurrent (LSTM) and attention-based (Transformer) architectures, as well as the effect of fine-tuning on CAPTCHAs generated by the Python Captcha Library.

| Metric | CRNN (Base) | CRNN (Finetuned) | Conv-Transformer (Base) | Conv-Transformer (Finetuned) |
|---|---|---|---|---|
| Architecture | CRNN | CRNN | Convolutional Transformer | Convolutional Transformer |
| Training Data | hammer888/captcha-data | hammer888/captcha-data + Python Captcha Library | hammer888/captcha-data | hammer888/captcha-data + Python Captcha Library |
| # Parameters | 3,570,943 | 3,570,943 | 12,279,551 | 12,279,551 |
| Model Size | 14.3 MB | 14.3 MB | 51.7 MB | 51.7 MB |
| Sequence Accuracy (hammer888/captcha-data) | 96.81% | 92.98% | 97.38% | 95.36% |
| Character Error Rate (CER) (hammer888/captcha-data) | 0.70% | 1.59% | 0.57% | 1.03% |
| Sequence Accuracy (Python Captcha Library) | 9.65% | 86.20% | 11.59% | 88.42% |
| Character Error Rate (CER) (Python Captcha Library) | 43.98% | 2.53% | 38.63% | 2.08% |
| Throughput (img/sec) | 447.26 | 447.26 | 733.00 | 733.00 |
| Compute Hardware | NVIDIA RTX A6000 | NVIDIA RTX A6000 | NVIDIA RTX A6000 | NVIDIA RTX A6000 |
| Link | Graf-J/captcha-crnn-base | Graf-J/captcha-crnn-finetuned | Graf-J/captcha-conv-transformer-base | Graf-J/captcha-conv-transformer-finetuned |
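Sequence accuracy counts a prediction as correct only if the whole string matches, while CER is the character-level edit (Levenshtein) distance between prediction and reference, divided by the reference length. A minimal sketch of both metrics follows. Aggregating CER at corpus level (total edits over total reference characters) is an assumption, since the card does not state how CER was averaged.

```python
from typing import List, Tuple

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when characters match)
            ))
        prev = curr
    return prev[-1]

def evaluate(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (sequence_accuracy, cer) for (prediction, reference) pairs."""
    exact = sum(pred == ref for pred, ref in pairs)
    edits = sum(levenshtein(pred, ref) for pred, ref in pairs)
    ref_chars = sum(len(ref) for _, ref in pairs)
    return exact / len(pairs), edits / ref_chars

pairs = [("aZ93eiL", "aZ93eiL"), ("O1lx", "01lx")]
print(evaluate(pairs))  # (0.5, 0.09090909090909091)
```

A single wrong character fails the whole sequence, which is why a CER of a few percent can coexist with a noticeably lower sequence accuracy.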

🧪 Try It With Sample Images

The following images were sampled from the test set of the hammer888/captcha-data dataset. Click any image below to download it and test the model locally.


🚀 Quick Start (Pipeline - Recommended)

The easiest way to run inference is through the custom Hugging Face pipeline that ships with the model repository (which is why trust_remote_code=True is required).

from transformers import pipeline
from PIL import Image

# Initialize the pipeline
pipe = pipeline(
    task="captcha-recognition", 
    model="Graf-J/captcha-crnn-finetuned", 
    trust_remote_code=True
)

# Load and predict
img = Image.open("path/to/image.png")
result = pipe(img)
print(f"Decoded Text: {result['prediction']}")

🔬 Advanced Usage (Raw Logits & Custom Decoding)

Use this method if you need access to the raw logits or internal hidden states.

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load Model & Custom Processor
repo_id = "Graf-J/captcha-crnn-finetuned"
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

model.eval()

# Load and process image
img = Image.open("path/to/image.png")
inputs = processor(img) 

# Inference
with torch.no_grad():
    outputs = model(inputs["pixel_values"])
    logits = outputs.logits

# Decode the prediction via CTC logic
prediction = processor.batch_decode(logits)[0]
print(f"Prediction: '{prediction}'")
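The processor's batch_decode performs the CTC decoding for you. For intuition, greedy (best-path) CTC decoding takes the highest-scoring class at each time step, collapses consecutive repeats and then drops blank tokens. The plain-Python sketch below decodes one sequence of shape (time, classes). The vocabulary ordering and the blank at index 0 are hypothetical; the real mapping is defined by the model's processor.

```python
from typing import List, Optional, Sequence

# Hypothetical label map: index 0 is the CTC blank, followed by 0-9, a-z, A-Z.
# The real model's ordering is defined by its processor and may differ.
BLANK = 0
VOCAB = ["<blank>"] + list("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

def greedy_ctc_decode(logits: Sequence[Sequence[float]],
                      vocab: List[str] = VOCAB, blank: int = BLANK) -> str:
    """Best-path CTC decoding for one sequence of per-time-step scores (time, classes)."""
    # 1. Take the highest-scoring class at every time step
    best_path = [max(range(len(step)), key=lambda c: step[c]) for step in logits]
    chars: List[str] = []
    prev: Optional[int] = None
    for idx in best_path:
        # 2. Collapse consecutive repeats and 3. drop blanks
        if idx != prev and idx != blank:
            chars.append(vocab[idx])
        prev = idx
    return "".join(chars)

# Usage: one-hot scores for the path "a a <blank> b b <blank> b" ("-" marks the blank)
path = [BLANK if c == "-" else VOCAB.index(c) for c in "aa-bb-b"]
scores = [[1.0 if k == i else 0.0 for k in range(len(VOCAB))] for i in path]
print(greedy_ctc_decode(scores))  # abb
```

The blank between the two b runs is what lets the repeated character survive; without it, "bbb" would collapse to a single "b".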

🎨 Generate Your Own CAPTCHAs

You can generate synthetic CAPTCHA images using the Python Captcha Library. To reproduce the same style, download the Nunito font and pass it to the ImageCaptcha class as shown below.

pip install captcha matplotlib

from transformers import pipeline
from PIL import Image
import matplotlib.pyplot as plt
from captcha.image import ImageCaptcha

# Define CAPTCHA text to be generated (a-z A-Z 0-9 and length between 1 and 8)
captcha_text = "aZ93eiL"
assert captcha_text.isascii() and captcha_text.isalnum() and 1 <= len(captcha_text) <= 8  # isalnum() alone also accepts non-ASCII letters

# Generate CAPTCHA image
generator = ImageCaptcha(fonts=["path/to/Nunito.ttf"])
image_data = generator.generate(captcha_text)
image = Image.open(image_data)

# Initialize the pipeline
pipe = pipeline(
    task="captcha-recognition",
    model="Graf-J/captcha-crnn-finetuned",
    trust_remote_code=True
)

# Predict
result = pipe(image)["prediction"]

# Display result
plt.imshow(image)
plt.title(f"Model Prediction: {result}", fontsize=14, fontweight='bold', pad=12)
plt.axis('off')
plt.show()
Captcha Prediction
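To produce many images in this style (for example, a fine-tuning set), you could sample random labels from the model's vocabulary and render each one with ImageCaptcha. The sketch below assumes a uniform distribution over lengths 1 to 8 and over characters, plus a hypothetical file-naming scheme; it is not the project's documented generator.

```python
import os
import random
import string
from typing import Iterable, List

# Vocabulary from Model Details: a-z, A-Z, 0-9
ALPHABET = string.ascii_letters + string.digits

def random_labels(n: int, min_len: int = 1, max_len: int = 8, seed: int = 0) -> List[str]:
    """Sample n labels with uniformly random length and characters (assumed distribution)."""
    rng = random.Random(seed)
    return ["".join(rng.choices(ALPHABET, k=rng.randint(min_len, max_len))) for _ in range(n)]

def write_dataset(labels: Iterable[str], font_path: str, out_dir: str) -> None:
    """Render each label with the Python Captcha Library (requires `pip install captcha`)."""
    from captcha.image import ImageCaptcha  # imported lazily so label sampling works without it
    os.makedirs(out_dir, exist_ok=True)
    generator = ImageCaptcha(fonts=[font_path])
    for i, text in enumerate(labels):
        # Hypothetical naming: zero-padded index keeps names unique on case-insensitive file systems
        generator.write(text, os.path.join(out_dir, f"{i:06d}_{text}.png"))

labels = random_labels(5, seed=42)
print(labels)
# write_dataset(labels, "path/to/Nunito.ttf", "generated/")
```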

βš™οΈ Training

The model was developed in two distinct stages:

  • Pre-training: The base model was trained on a refined version of the hammer888/captcha-data dataset (1,365,874 images). The dataset was cleaned by using multiple pre-trained models to identify and prune inconsistent labels: images on which the models were "confidently incorrect" about casing (upper- vs. lower-case errors) were removed to ensure high-fidelity ground truth for the final training run.
  • Fine-tuning: To improve robustness on a different CAPTCHA style, the model was then fine-tuned on 200,000 images generated specifically for this project with the Python Captcha Library. On images of this style, the base model reaches only 9.65% sequence accuracy.
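The casing-based cleaning step can be sketched as a filter over (image, label, predictions) records. The per-sequence confidence score, the 0.9 threshold and the rule that every model must make the same kind of error are hypothetical choices; the card does not specify them.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

@dataclass
class ModelPrediction:
    text: str
    confidence: float  # hypothetical per-sequence confidence in [0, 1]

def confident_case_error(label: str, preds: Sequence[ModelPrediction],
                         threshold: float = 0.9) -> bool:
    """True if every model reads the label correctly up to case, gets the casing
    wrong, and is at least `threshold` confident about it."""
    return bool(preds) and all(
        p.text != label
        and p.text.lower() == label.lower()
        and p.confidence >= threshold
        for p in preds
    )

def prune(samples: List[Tuple[str, str, List[ModelPrediction]]]) -> List[Tuple[str, str]]:
    """Keep (image_path, label) pairs whose label is not flagged as a likely casing error."""
    return [(path, label) for path, label, preds in samples
            if not confident_case_error(label, preds)]

samples = [
    ("a.png", "aBc", [ModelPrediction("abc", 0.97), ModelPrediction("abc", 0.95)]),  # flagged
    ("b.png", "xy", [ModelPrediction("xy", 0.99)]),                                   # correct
    ("c.png", "Q1", [ModelPrediction("q1", 0.50)]),                                   # not confident
]
print(prune(samples))  # [('b.png', 'xy'), ('c.png', 'Q1')]
```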

Parameters

  • Optimizer: Adam (lr=0.002)
  • Scheduler: ReduceLROnPlateau (factor=0.5, patience=3)
  • Batch Size: 128
  • Loss Function: CTCLoss
  • Augmentations: ElasticTransform, Random Rotation, Grayscale Resize
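These hyperparameters map onto standard PyTorch components. The sketch below wires them together for one CTC training step. The blank index, `zero_infinity=True`, the 63-class output, the (N, T, C) logit layout and monitoring the loss in "min" mode are assumptions, not taken from the project's training code.

```python
import torch
from torch import nn

NUM_CLASSES = 63  # assumption: 62 alphanumeric characters + 1 CTC blank at index 0

def build_training_objects(model: nn.Module):
    """Optimizer, LR scheduler and loss configured with the listed hyperparameters."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.002)
    # Halve the LR when the monitored loss has not improved for more than 3 epochs
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=3
    )
    # blank=0 and zero_infinity=True are assumptions, not documented settings
    criterion = nn.CTCLoss(blank=0, zero_infinity=True)
    return optimizer, scheduler, criterion

def ctc_step(logits: torch.Tensor, targets: torch.Tensor, target_lengths: torch.Tensor,
             optimizer: torch.optim.Optimizer, criterion: nn.CTCLoss) -> float:
    """One optimisation step; logits are raw scores shaped (N, T, C)."""
    # CTCLoss expects log-probabilities shaped (T, N, C)
    log_probs = logits.log_softmax(dim=2).permute(1, 0, 2)
    T, N = log_probs.shape[0], log_probs.shape[1]
    # Every sequence uses all T time steps here (no padding along the width)
    input_lengths = torch.full((N,), T, dtype=torch.long)
    loss = criterion(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full loop you would call `ctc_step` for every batch of 128 images and `scheduler.step(val_loss)` once per epoch with the validation loss.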

🔍 Error Analysis

The following confusion matrices show the character-level performance across the alphanumeric vocabulary on the test set of images generated with the Python Captcha Library.

Full Confusion Matrix

Full-Confusion-Matrix

Misclassification Deep Dive

This matrix highlights only the misclassification patterns, stripping away correct predictions to visualize which character pairs (such as '0' vs 'O' or '1' vs 'l') the model most frequently confuses.

Full-Confusion-Matrix
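One simple way to build such matrices is to compare predictions and labels position by position and count (true, predicted) character pairs; dropping the diagonal leaves only the confusions. The sketch below counts only equal-length pairs, because unequal lengths would first need an alignment step. How the author aligned sequences is not stated, so treat this as an illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

def char_confusions(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count (true_char, predicted_char) pairs position by position for (prediction, reference) pairs.
    Pairs with different lengths are skipped because they would need alignment first."""
    counts: Counter = Counter()
    for pred, ref in pairs:
        if len(pred) != len(ref):
            continue
        counts.update(zip(ref, pred))
    return counts

def misclassifications(counts: Counter) -> Counter:
    """Drop the diagonal (correct predictions), keeping only confused pairs."""
    return Counter({(t, p): n for (t, p), n in counts.items() if t != p})

pairs = [("O1lx", "01lx"), ("abc", "abc"), ("l0", "10"), ("ab", "abc")]
print(misclassifications(char_confusions(pairs)).most_common())
# [(('0', 'O'), 1), (('1', 'l'), 1)]
```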


βš–οΈ License & Citation

This project is licensed under the MIT License. If you use this model in your research, portfolio, or applications, please attribute the author.
