---
tags:
- ocr
- pytorch
license: mit
datasets:
- hammer888/captcha-data
metrics:
- accuracy
- cer
pipeline_tag: image-to-text
library_name: transformers
---

# ✨ DeepCaptcha-Conv-Transformer: Sequential Vision for OCR

### Convolutional Transformer Base

[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT) [![Python 3.13+](https://img.shields.io/badge/python-3.13+-blue.svg)](https://www.python.org/downloads/release/python-3130/) [![Hugging Face Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-orange)](https://huggingface.co/Graf-J/captcha-crnn-finetuned)

---

*Advanced sequence recognition using a Convolutional Transformer Encoder with Connectionist Temporal Classification (CTC) loss.*

---

## 📋 Model Details

- **Task:** Alphanumeric Captcha Recognition
- **Input:** Images
- **Output:** String sequences (Length 1–8 characters)
- **Vocabulary:** Alphanumeric (`a-z`, `A-Z`, `0-9`)
- **Architecture:** Convolutional Transformer Encoder (CNN + Transformer Encoder)

---

## 📊 Performance Metrics

This project features four models that explore the trade-offs between recurrent (LSTM) and attention-based (Transformer) architectures, as well as the effects of fine-tuning on captchas generated by the [Python Captcha Library](https://captcha.lepture.com/).

| Metric | **CRNN (Base)** | **CRNN (Finetuned)** | **Conv-Transformer (Base)** | **Conv-Transformer (Finetuned)** |
|--------|-----------------|----------------------|-----------------------------|----------------------------------|
| Architecture | CRNN | CRNN | Convolutional Transformer | Convolutional Transformer |
| Training Data | [hammer888/captcha-data](https://huggingface.co/datasets/hammer888/captcha-data) | [hammer888/captcha-data](https://huggingface.co/datasets/hammer888/captcha-data)<br>[Python Captcha Library](https://captcha.lepture.com/) | [hammer888/captcha-data](https://huggingface.co/datasets/hammer888/captcha-data) | [hammer888/captcha-data](https://huggingface.co/datasets/hammer888/captcha-data)<br>[Python Captcha Library](https://captcha.lepture.com/) |
| # Parameters | **3,570,943** | **3,570,943** | 12,279,551 | 12,279,551 |
| Model Size | **14.3 MB** | **14.3 MB** | 51.7 MB | 51.7 MB |
| Sequence Accuracy<br>([hammer888/captcha-data](https://huggingface.co/datasets/hammer888/captcha-data)) | 96.81% | 92.98% | **97.38%** | 95.36% |
| Character Error Rate (CER)<br>([hammer888/captcha-data](https://huggingface.co/datasets/hammer888/captcha-data)) | 0.70% | 1.59% | **0.57%** | 1.03% |
| Sequence Accuracy<br>([Python Captcha Library](https://captcha.lepture.com/)) | 9.65% | 86.20% | 11.59% | **88.42%** |
| Character Error Rate (CER)<br>([Python Captcha Library](https://captcha.lepture.com/)) | 43.98% | 2.53% | 38.63% | **2.08%** |
| Throughput (img/sec) | 447.26 | 447.26 | **733.00** | **733.00** |
| Compute Hardware | NVIDIA RTX A6000 | NVIDIA RTX A6000 | NVIDIA RTX A6000 | NVIDIA RTX A6000 |
| Link | [Graf-J/captcha-crnn-base](https://huggingface.co/Graf-J/captcha-crnn-base) | [Graf-J/captcha-crnn-finetuned](https://huggingface.co/Graf-J/captcha-crnn-finetuned) | [Graf-J/captcha-conv-transformer-base](https://huggingface.co/Graf-J/captcha-conv-transformer-base) | [Graf-J/captcha-conv-transformer-finetuned](https://huggingface.co/Graf-J/captcha-conv-transformer-finetuned) |

---

## 🧪 Try It With Sample Images

The following images are sampled from the test set of the [hammer888/captcha-data](https://huggingface.co/datasets/hammer888/captcha-data) dataset. Click any image below to download it and test the model locally.
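
When testing locally, predictions can be scored with the two metrics reported in the table above. The following is a minimal sketch, assuming the common definitions: sequence accuracy as the fraction of exact string matches, and CER as the total edit distance divided by the total number of reference characters. The card does not state how its figures were computed, so the function names and the exact CER normalization here are assumptions.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sequence_accuracy(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Fraction of predictions that match their label exactly (case-sensitive)."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Total edit distance divided by total reference length (assumed definition)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


# Illustrative labels and predictions, not taken from the dataset.
labels = ["aB3x", "Q7kz"]
predictions = ["aB3x", "Q7kZ"]
print(sequence_accuracy(labels, predictions))     # 0.5
print(character_error_rate(labels, predictions))  # 0.125
```

Both metrics are case-sensitive here, which matters for this vocabulary because upper- and lowercase letters are distinct classes.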

---

## 🚀 Quick Start (Pipeline - Recommended)

The easiest way to perform inference is using the custom Hugging Face pipeline.

```python
from transformers import pipeline
from PIL import Image

# Initialize the pipeline
pipe = pipeline(
    task="captcha-recognition",
    model="Graf-J/captcha-conv-transformer-base",
    trust_remote_code=True
)

# Load and predict
img = Image.open("path/to/image.png")
result = pipe(img)
print(f"Decoded Text: {result['prediction']}")
```

## 🔬 Advanced Usage (Raw Logits & Custom Decoding)

Use this method if you need access to the raw logits or internal hidden states.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load Model & Custom Processor
repo_id = "Graf-J/captcha-conv-transformer-base"
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# Load and process image
img = Image.open("path/to/image.png")
inputs = processor(img)

# Inference
with torch.no_grad():
    outputs = model(inputs["pixel_values"])
    logits = outputs.logits

# Decode the prediction via CTC logic
prediction = processor.batch_decode(logits)[0]
print(f"Prediction: '{prediction}'")
```

---

## ⚙️ Training

The base model was trained on a refined version of the [hammer888/captcha-data](https://huggingface.co/datasets/hammer888/captcha-data) dataset (1,365,874 images). The dataset went through a specialized cleaning process in which multiple pre-trained models were used to identify and prune inconsistent data. Specifically, images where the models were "confidently incorrect" about casing (upper/lower-case errors) were removed to ensure high-fidelity ground truth for the final training run.

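
The casing-based pruning described above could look roughly like the sketch below. It is only an illustration: the card does not publish the cleaning code, so the function names, the confidence threshold, and the rule that at least two models must agree are all hypothetical.

```python
from typing import List, Tuple


def is_casing_conflict(label: str, prediction: str) -> bool:
    """True when prediction and label differ only in upper/lower case."""
    return prediction != label and prediction.lower() == label.lower()


def should_prune(
    label: str,
    model_outputs: List[Tuple[str, float]],
    min_confidence: float = 0.95,  # hypothetical threshold
    min_votes: int = 2,            # hypothetical agreement rule
) -> bool:
    """Flag a sample when enough models confidently disagree on casing only.

    model_outputs holds one (predicted_text, confidence) pair per model.
    """
    votes = sum(
        1
        for text, confidence in model_outputs
        if confidence >= min_confidence and is_casing_conflict(label, text)
    )
    return votes >= min_votes


# Illustrative values, not taken from the dataset.
# Two models confidently read lowercase "b" where the label says "B".
print(should_prune("aBc1", [("abc1", 0.99), ("abc1", 0.97), ("aBc1", 0.60)]))  # True
# Only one low-confidence disagreement: keep the sample.
print(should_prune("aBc1", [("aBc1", 0.99), ("abc1", 0.50)]))  # False
```

Requiring agreement across several models, rather than trusting a single one, is one way to reduce the risk of discarding correctly labeled but difficult images.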
### **Parameters**

- **Optimizer:** Adam (lr=0.0005)
- **Scheduler:** ReduceLROnPlateau (factor=0.5, patience=3)
- **Batch Size:** 128
- **Loss Function:** CTCLoss
- **Augmentations:** ElasticTransform, Random Rotation, Grayscale Resize

---

## 🔍 Error Analysis

The following confusion matrices show character-level performance across the alphanumeric vocabulary on the test set of images generated via Python.

### **Full Confusion Matrix**

![Full-Confusion-Matrix](images/confusion-matrix.png)

### **Misclassification Deep Dive**

This matrix shows only the misclassifications, with correct predictions removed, to make visible which character pairs (such as '0' vs 'O' or '1' vs 'l') the model confuses most often. Although the dataset was cleaned to reduce noisy labels, the matrix still shows a residual pattern of confusion between visually similar uppercase and lowercase characters.

![Full-Confusion-Matrix](images/confusion-matrix-no-diagonal.png)

---

## ⚖️ **License & Citation**

This project is licensed under the **MIT License**. If you use this model in your research, portfolio, or applications, please attribute the author.