# ✨ DeepCaptcha-Conv-Transformer: Sequential Vision for OCR
Convolutional Transformer Base
Advanced sequence recognition using a Convolutional Transformer Encoder with Connectionist Temporal Classification (CTC) loss.
## Model Details
- Task: Alphanumeric Captcha Recognition
- Input: Images
- Output: String sequences (length 1–8 characters)
- Vocabulary: Alphanumeric (a-z, A-Z, 0-9)
- Architecture: Convolutional Transformer Encoder (CNN + Transformer Encoder)
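CTC-based recognizers pair the character vocabulary with an extra blank symbol used during decoding. A minimal sketch of how such a vocabulary could be laid out; reserving index 0 for the blank is an assumption for illustration, not a documented convention of this model:

```python
import string

# 62 alphanumeric characters; index 0 reserved for the CTC blank (assumption)
CHARSET = string.ascii_lowercase + string.ascii_uppercase + string.digits
BLANK_ID = 0
char_to_id = {c: i + 1 for i, c in enumerate(CHARSET)}  # ids 1..62
id_to_char = {i: c for c, i in char_to_id.items()}
```

The blank lets CTC emit "no character" at a timestep, which is how repeated characters (e.g. "aa") stay distinguishable from a single stretched character.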
## Performance Metrics

This project features four models exploring the trade-offs between recurrent (LSTM) and attention-based (Transformer) architectures, as well as the effects of fine-tuning on captchas generated by the Python Captcha Library.
| Metric | CRNN (Base) | CRNN (Finetuned) | Conv-Transformer (Base) | Conv-Transformer (Finetuned) |
|---|---|---|---|---|
| Architecture | CRNN | CRNN | Convolutional Transformer | Convolutional Transformer |
| Training Data | hammer888/captcha-data | hammer888/captcha-data + Python Captcha Library | hammer888/captcha-data | hammer888/captcha-data + Python Captcha Library |
| # Parameters | 3,570,943 | 3,570,943 | 12,279,551 | 12,279,551 |
| Model Size | 14.3 MB | 14.3 MB | 51.7 MB | 51.7 MB |
| Sequence Accuracy (hammer888/captcha-data) | 96.81% | 92.98% | 97.38% | 95.36% |
| Character Error Rate (CER) (hammer888/captcha-data) | 0.70% | 1.59% | 0.57% | 1.03% |
| Sequence Accuracy (Python Captcha Library) | 9.65% | 86.20% | 11.59% | 88.42% |
| Character Error Rate (CER) (Python Captcha Library) | 43.98% | 2.53% | 38.63% | 2.08% |
| Throughput (img/sec) | 447.26 | 447.26 | 733.00 | 733.00 |
| Compute Hardware | NVIDIA RTX A6000 | NVIDIA RTX A6000 | NVIDIA RTX A6000 | NVIDIA RTX A6000 |
| Link | Graf-J/captcha-crnn-base | Graf-J/captcha-crnn-finetuned | Graf-J/captcha-conv-transformer-base | Graf-J/captcha-conv-transformer-finetuned |
## Try It With Sample Images

The following images are sampled from the test set of the hammer888/captcha-data dataset. Click any image below to download it and test the model locally.
## Quick Start (Pipeline - Recommended)

The easiest way to perform inference is using the custom Hugging Face pipeline.

```python
from transformers import pipeline
from PIL import Image

# Initialize the pipeline
pipe = pipeline(
    task="captcha-recognition",
    model="Graf-J/captcha-conv-transformer-base",
    trust_remote_code=True,
)

# Load and predict
img = Image.open("path/to/image.png")
result = pipe(img)
print(f"Decoded Text: {result['prediction']}")
```
## Advanced Usage (Raw Logits & Custom Decoding)

Use this method if you need access to the raw logits or internal hidden states.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the model and its custom processor
repo_id = "Graf-J/captcha-conv-transformer-base"
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# Load and process the image
img = Image.open("path/to/image.png")
inputs = processor(img)

# Inference
with torch.no_grad():
    outputs = model(inputs["pixel_values"])
    logits = outputs.logits

# Decode the prediction via CTC logic
prediction = processor.batch_decode(logits)[0]
print(f"Prediction: '{prediction}'")
```
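If you want to decode the raw logits yourself, greedy CTC decoding is the standard starting point: take the argmax per timestep, collapse consecutive repeats, and drop blanks. The sketch below assumes a `(batch, time, vocab)` logits layout and a blank at index 0; this model's actual conventions live in its custom processor and may differ:

```python
import torch

def ctc_greedy_decode(logits, charset, blank_id=0):
    """Greedy CTC decode: argmax per timestep, collapse repeats, drop blanks.

    Assumes logits of shape (batch, time, vocab) with the blank at index 0
    and character ids 1..len(charset).
    """
    ids = logits.argmax(dim=-1)  # (batch, time)
    decoded = []
    for seq in ids.tolist():
        chars, prev = [], None
        for i in seq:
            if i != blank_id and i != prev:
                chars.append(charset[i - 1])  # shift past the blank slot
            prev = i
        decoded.append("".join(chars))
    return decoded
```

Greedy decoding is fast and usually sufficient for short sequences like captchas; beam search over the CTC lattice can squeeze out a little extra accuracy at the cost of speed.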
## Training
The base model was trained on a refined version of the hammer888/captcha-data dataset (1,365,874 images). This dataset underwent a specialized cleaning process in which multiple pre-trained models were used to identify and prune inconsistent data. Specifically, images where models were "confidently incorrect" regarding casing (upper/lower-case errors) were removed to ensure high-fidelity ground truth for the final training run.
### Parameters
- Optimizer: Adam (lr=0.0005)
- Scheduler: ReduceLROnPlateau (factor=0.5, patience=3)
- Batch Size: 128
- Loss Function: CTCLoss
- Augmentations: ElasticTransform, Random Rotation, Grayscale Resize
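The hyperparameters above can be wired together in PyTorch roughly as follows. This is an illustrative sketch, not the actual training script: the model here is a trivial stand-in, and the tensor shapes and blank index are assumptions:

```python
import torch
import torch.nn as nn

# Stand-in model: the real Conv-Transformer is loaded from the Hub;
# a Linear layer just gives the optimizer some parameters to manage.
model = nn.Linear(32, 63)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=3
)
criterion = nn.CTCLoss(blank=0, zero_infinity=True)  # blank index is an assumption

# One illustrative step with random tensors (shapes are assumptions):
# nn.CTCLoss expects log_probs of shape (time, batch, vocab)
log_probs = torch.randn(40, 128, 63, requires_grad=True).log_softmax(-1)
targets = torch.randint(1, 63, (128, 6))           # label indices, blank excluded
input_lengths = torch.full((128,), 40, dtype=torch.long)
target_lengths = torch.full((128,), 6, dtype=torch.long)

loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
optimizer.step()
scheduler.step(loss.item())  # plateau scheduler steps on a monitored metric
```

Note that `ReduceLROnPlateau` is normally stepped on a validation metric at epoch boundaries rather than on a single batch loss as shown here.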
## Error Analysis

The following confusion matrices illustrate character-level performance across the alphanumeric vocabulary on the test set of images generated via the Python Captcha Library.
### Full Confusion Matrix
### Misclassification Deep Dive
This matrix highlights only the misclassification patterns, stripping away correct predictions to visualize which character pairs (such as '0' vs 'O' or '1' vs 'l') the model most frequently confuses. While the dataset underwent a specialized cleaning process to minimize noisy labels, the confusion matrix reveals a residual pattern of misclassifications between visually similar upper and lowercase characters.
## License & Citation
This project is licensed under the MIT License. If you use this model in your research, portfolio, or applications, please attribute the author.