Graf-J committed fa98216 (verified, parent: 802e393): Initial Commit

README.md ADDED
---
tags:
- ocr
- pytorch
license: mit
datasets:
- hammer888/captcha-data
metrics:
- accuracy
- cer
pipeline_tag: image-to-text
library_name: transformers
---

<div align="center">

# ✨ DeepCaptcha-Conv-Transformer: Sequential Vision for OCR
### Convolutional Transformer Base

[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![Python 3.13+](https://img.shields.io/badge/python-3.13+-blue.svg)](https://www.python.org/downloads/release/python-3130/)
[![Hugging Face Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-orange)](https://huggingface.co/Graf-J/captcha-crnn-finetuned)

---

<img src="images/CAPTCHA.png" alt="Captcha Example" width="500">

*Advanced sequence recognition using a Convolutional Transformer Encoder with Connectionist Temporal Classification (CTC) loss.*

</div>

---

## 📋 Model Details
- **Task:** Alphanumeric captcha recognition
- **Input:** Captcha images (converted to grayscale and resized to 150×40 internally)
- **Output:** Character strings, 1–8 characters long
- **Vocabulary:** Alphanumeric (`a-z`, `A-Z`, `0-9`)
- **Architecture:** Convolutional Transformer Encoder (CNN feature extractor + Transformer Encoder + CTC head)

---
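The 62-character vocabulary plus the CTC blank symbol accounts for the model's 63 output classes (`num_chars` in `config.json`). A minimal sketch of the index mapping, mirroring the logic in `processing_captcha.py` (index 0 is reserved for the blank):

```python
import string

# Alphanumeric vocabulary in the order the processor uses it
vocab = string.ascii_lowercase + string.ascii_uppercase + string.digits

# Characters start at index 1; index 0 is the CTC blank
idx_to_char = {i + 1: c for i, c in enumerate(vocab)}
idx_to_char[0] = ""  # the blank decodes to the empty string

print(len(vocab) + 1)  # 63 output classes, matching num_chars
```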

## 📊 Performance Metrics

### **Test Set Results**

| Dataset | Sequence Accuracy | Character Error Rate (CER) |
| --- | --- | --- |
| **[hammer888/captcha-data](https://huggingface.co/datasets/hammer888/captcha-data)** | `97.38%` | `0.57%` |

### **Hardware & Efficiency**
| Metric | Value |
| --- | --- |
| **Model Parameters** | `12,279,551` |
| **Model Size (Disk)** | `51.7 MB` |
| **Throughput (Images/sec)** | `733.00 – 751.11` |
| **Compute Hardware** | **NVIDIA RTX A6000** |

---

## 🧪 Try It With Sample Images

The following images are sampled from the test set of the [hammer888/captcha-data](https://huggingface.co/datasets/hammer888/captcha-data) dataset. Click any image below to download it and test the model locally.

<div align="center">
<table>
<tr>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/46CN5W.jpg"><img src="images/46CN5W.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/5820.jpg"><img src="images/5820.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/6521.jpg"><img src="images/6521.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/abfsh.jpg"><img src="images/abfsh.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/67qas.jpg"><img src="images/67qas.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/75ke.jpg"><img src="images/75ke.jpg" width="120"/></a></td>
</tr>
<tr>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/8JKM.jpg"><img src="images/8JKM.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/8jpwt0.jpg"><img src="images/8jpwt0.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/B1QAZ6.jpg"><img src="images/B1QAZ6.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/CCX8.jpg"><img src="images/CCX8.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/EPOD.jpg"><img src="images/EPOD.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/ER6Y.jpg"><img src="images/ER6Y.jpg" width="120"/></a></td>
</tr>
<tr>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/EWSP.jpg"><img src="images/EWSP.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/GIOGp.jpg"><img src="images/GIOGp.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/HCDS.jpg"><img src="images/HCDS.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/JBWkEs.jpg"><img src="images/JBWkEs.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/kJtOfk.jpg"><img src="images/kJtOfk.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/MFMH.jpg"><img src="images/MFMH.jpg" width="120"/></a></td>
</tr>
<tr>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/NJSEX.jpg"><img src="images/NJSEX.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/R6AB.jpg"><img src="images/R6AB.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/TVHF.jpg"><img src="images/TVHF.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/Vb4cG.jpg"><img src="images/Vb4cG.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/XaNqQx.jpg"><img src="images/XaNqQx.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/YULM.jpg"><img src="images/YULM.jpg" width="120"/></a></td>
</tr>
<tr>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/b6yc.jpg"><img src="images/b6yc.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/bCWaLR.jpg"><img src="images/bCWaLR.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/d3no.jpg"><img src="images/d3no.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/3eplzv.jpg"><img src="images/3eplzv.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/iq1sZo.jpg"><img src="images/iq1sZo.jpg" width="120"/></a></td>
<td><a href="https://huggingface.co/Graf-J/captcha-crnn-finetuned/resolve/main/images/KKh8Q.jpg"><img src="images/KKh8Q.jpg" width="120"/></a></td>
</tr>
</table>
</div>

---

## 🚀 Quick Start (Pipeline - Recommended)

The easiest way to run inference is the custom Hugging Face pipeline shipped with this repository.

```python
from transformers import pipeline
from PIL import Image

# Initialize the custom captcha-recognition pipeline
pipe = pipeline(
    task="captcha-recognition",
    model="Graf-J/captcha-conv-transformer-base",
    trust_remote_code=True,
)

# Load an image and predict
img = Image.open("path/to/image.png")
result = pipe(img)
print(f"Decoded Text: {result['prediction']}")
```

## 🔬 Advanced Usage (Raw Logits & Custom Decoding)

Use this path if you need access to the raw logits or want to plug in your own decoding logic.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the model and its custom processor
repo_id = "Graf-J/captcha-conv-transformer-base"
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# Load and preprocess the image
img = Image.open("path/to/image.png")
inputs = processor(img)

# Inference
with torch.no_grad():
    outputs = model(inputs["pixel_values"])
    logits = outputs.logits  # (batch, time_steps, num_chars)

# Decode the prediction via greedy CTC logic
prediction = processor.batch_decode(logits)[0]
print(f"Prediction: '{prediction}'")
```
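Under the hood, `batch_decode` applies greedy CTC decoding: take the argmax class per timestep, collapse consecutive repeats, and drop the blank token (index 0). A minimal, framework-free sketch of that rule:

```python
def ctc_greedy_decode(token_ids, idx_to_char, blank=0):
    """Collapse repeats and drop blanks from a per-timestep argmax sequence."""
    chars = []
    prev = None
    for tok in token_ids:
        # Only emit a character when it differs from the previous timestep
        # (the CTC collapse rule) and is not the blank token
        if tok != prev and tok != blank:
            chars.append(idx_to_char[tok])
        prev = tok
    return "".join(chars)

# A blank between two identical tokens separates two genuine characters
idx_to_char = {1: "a", 2: "b"}
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2], idx_to_char))  # -> "aab"
```

This is why the model can emit the same character twice in a row in the final string: the blank acts as a separator between repeated letters.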

---

## ⚙️ Training
The base model was trained on a refined version of [hammer888/captcha-data](https://huggingface.co/datasets/hammer888/captcha-data) (1,365,874 images). The dataset underwent a specialized cleaning pass in which multiple pre-trained models were used to identify and prune inconsistent samples. Specifically, images on which the models were "confidently incorrect" about casing (upper/lower-case errors) were removed, giving high-fidelity ground truth for the final training run.

### **Parameters**
- **Optimizer:** Adam (lr=0.0005)
- **Scheduler:** ReduceLROnPlateau (factor=0.5, patience=3)
- **Batch Size:** 128
- **Loss Function:** CTCLoss
- **Augmentations:** ElasticTransform, random rotation, grayscale resize

---
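The parameters above can be sketched as a single PyTorch training step. This is an illustrative outline, not the author's exact script: the tensor shapes are assumptions derived from the architecture (150 px width pooled twice to 37 timesteps, 63 classes), and a dummy tensor stands in for the model's output. `nn.CTCLoss` expects log-probabilities shaped (time, batch, classes) plus per-sample input and target lengths:

```python
import torch
import torch.nn as nn

# Assumed shapes: 37 timesteps (150 px / 4 after pooling), 63 classes (62 chars + blank)
B, T, C = 128, 37, 63
logits = torch.randn(B, T, C)                  # stand-in for the model output
targets = torch.randint(1, C, (B, 6))          # dummy labels, indices 1..62 (0 = blank)
target_lengths = torch.full((B,), 6, dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)

criterion = nn.CTCLoss(blank=0)                # index 0 is the CTC blank
optimizer = torch.optim.Adam([logits.requires_grad_()], lr=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=3
)

# One training step: CTCLoss wants (time, batch, classes) log-probabilities
log_probs = logits.log_softmax(-1).permute(1, 0, 2)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
optimizer.step()
scheduler.step(loss.item())                    # in practice, step on validation loss
```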

## 🔍 Error Analysis

The following confusion matrices illustrate character-level performance across the alphanumeric vocabulary on the test split of the Python-generated captcha images.

### **Full Confusion Matrix**
![Full-Confusion-Matrix](images/confusion-matrix.png)

### **Misclassification Deep Dive**

This matrix shows only the misclassification patterns, stripping away correct predictions to visualize which character pairs (such as '0' vs 'O' or '1' vs 'l') the model most frequently confuses. Although the dataset underwent a specialized cleaning process to minimize noisy labels, a residual pattern of confusions between visually similar upper- and lowercase characters remains.

![Misclassification-Confusion-Matrix](images/confusion-matrix-no-diagonal.png)

---

## ⚖️ **License & Citation**

This project is licensed under the **MIT License**. If you use this model in your research, portfolio, or applications, please attribute the author.
__pycache__/configuration_captcha.cpython-313.pyc ADDED
Binary file (1.11 kB). View file
 
__pycache__/modeling_captcha.cpython-313.pyc ADDED
Binary file (5.05 kB). View file
 
__pycache__/processing_captcha.cpython-313.pyc ADDED
Binary file (3.1 kB). View file
 
config.json ADDED
{
  "architectures": [
    "CaptchaConvolutionalTransformer"
  ],
  "d_model": 1280,
  "dim_feedforward": 2048,
  "dropout": 0.1,
  "dtype": "float32",
  "model_type": "captcha_convolutional_transformer",
  "nhead": 8,
  "num_chars": 63,
  "num_layers": 1,
  "transformers_version": "5.1.0",
  "auto_map": {
    "AutoConfig": "configuration_captcha.CaptchaConfig",
    "AutoModel": "modeling_captcha.CaptchaConvolutionalTransformer",
    "AutoProcessor": "processing_captcha.CaptchaProcessor"
  },
  "custom_pipelines": {
    "captcha-recognition": {
      "impl": "pipeline.CaptchaPipeline",
      "pt": ["AutoModel"],
      "type": "multimodal"
    }
  }
}
configuration_captcha.py ADDED
from transformers import PretrainedConfig


class CaptchaConfig(PretrainedConfig):
    model_type = "captcha_convolutional_transformer"

    def __init__(
        self,
        num_chars=63,
        d_model=1280,
        nhead=8,
        num_layers=1,
        dim_feedforward=2048,
        dropout=0.1,
        **kwargs
    ):
        self.num_chars = num_chars
        self.d_model = d_model
        self.nhead = nhead
        self.num_layers = num_layers
        self.dim_feedforward = dim_feedforward
        self.dropout = dropout
        super().__init__(**kwargs)
images/3eplzv.jpg ADDED
images/46CN5W.jpg ADDED
images/5820.jpg ADDED
images/6521.jpg ADDED
images/67qas.jpg ADDED
images/75ke.jpg ADDED
images/8JKM.jpg ADDED
images/8jpwt0.jpg ADDED
images/B1QAZ6.jpg ADDED
images/CAPTCHA.png ADDED
images/CCX8.jpg ADDED
images/EPOD.jpg ADDED
images/ER6Y.jpg ADDED
images/EWSP.jpg ADDED
images/GIOGp.jpg ADDED
images/HCDS.jpg ADDED
images/JBWkEs.jpg ADDED
images/KKh8Q.jpg ADDED
images/MFMH.jpg ADDED
images/NJSEX.jpg ADDED
images/R6AB.jpg ADDED
images/TVHF.jpg ADDED
images/Vb4cG.jpg ADDED
images/XaNqQx.jpg ADDED
images/YULM.jpg ADDED
images/abfsh.jpg ADDED
images/b6yc.jpg ADDED
images/bCWaLR.jpg ADDED
images/confusion-matrix-no-diagonal.png ADDED
images/confusion-matrix.png ADDED
images/d3no.jpg ADDED
images/iq1sZo.jpg ADDED
images/kJtOfk.jpg ADDED
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:83da1e97cd334935d9c29b25d64d87fa1c993ab16966512f00710392d4f8bfc9
size 51685900
modeling_captcha.py ADDED
import math
import torch
import torch.nn as nn
from transformers import PreTrainedModel
from transformers.modeling_outputs import SequenceClassifierOutput
from .configuration_captcha import CaptchaConfig


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=500):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        return x + self.pe[:, : x.size(1)]


class CaptchaConvolutionalTransformer(PreTrainedModel):
    config_class = CaptchaConfig

    def __init__(self, config):
        super().__init__(config)

        # CNN feature extractor: 1 -> 32 -> 64 -> 128 -> 256 channels
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.SiLU(),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.SiLU(),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.SiLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),  # halve height only, preserve width

            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.SiLU(),
        )

        # Positional encoding over the width (time) dimension
        self.positional_encoding = PositionalEncoding(config.d_model)

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=config.d_model,
            nhead=config.nhead,
            dim_feedforward=config.dim_feedforward,
            dropout=config.dropout,
            activation="gelu",
            batch_first=True,
            norm_first=True,
        )
        self.transformer = nn.TransformerEncoder(
            encoder_layer,
            num_layers=config.num_layers,
        )

        # Classification head: one set of character logits per timestep
        self.classifier = nn.Linear(config.d_model, config.num_chars)

        # Initialize weights and apply final processing
        self.post_init()

    def forward(self, pixel_values, labels=None):
        """
        pixel_values: (batch, 1, H, W)
        """
        # Extract features
        x = self.conv(pixel_values)  # (B, 256, H_final, W_final)

        # Prepare sequence: permute to (batch, width, channels, height)
        x = x.permute(0, 3, 1, 2)
        b, t, c, h = x.size()

        # Flatten channels and height into the d_model dimension
        x = x.reshape(b, t, c * h)  # (B, T, d_model)

        # Apply positional encoding and the transformer encoder
        x = self.positional_encoding(x)
        x = self.transformer(x)

        # Map to character logits
        logits = self.classifier(x)  # (B, T, num_chars)

        # Return an output object
        return SequenceClassifierOutput(logits=logits)
pipeline.py ADDED
from transformers import Pipeline
import torch


class CaptchaPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        return {}, {}, {}

    def preprocess(self, image):
        return self.processor(image)

    def _forward(self, model_inputs):
        with torch.no_grad():
            outputs = self.model(model_inputs["pixel_values"])
        return outputs

    def postprocess(self, model_outputs):
        logits = model_outputs.logits
        prediction = self.processor.batch_decode(logits)[0]
        return {"prediction": prediction}
processing_captcha.py ADDED
import string
import torch
import torchvision.transforms.functional as F
from transformers.processing_utils import ProcessorMixin


class CaptchaProcessor(ProcessorMixin):
    attributes = []

    def __init__(self, vocab=None, **kwargs):
        super().__init__(**kwargs)
        self.vocab = vocab or (string.ascii_lowercase + string.ascii_uppercase + string.digits)
        self.idx_to_char = {i + 1: c for i, c in enumerate(self.vocab)}
        self.idx_to_char[0] = ""  # index 0 is the CTC blank

    def __call__(self, images):
        """
        Converts PIL images to the tensor format the model expects.
        """
        if not isinstance(images, list):
            images = [images]

        processed_images = []
        for img in images:
            # Convert to grayscale
            img = img.convert("L")
            # Resize to the model's expected input (width, height)
            img = img.resize((150, 40))
            # Convert to tensor and scale to [0, 1]
            img_tensor = F.to_tensor(img)
            processed_images.append(img_tensor)

        return {"pixel_values": torch.stack(processed_images)}

    def batch_decode(self, logits):
        """
        Greedy CTC decoding: argmax per timestep, collapse repeats, drop blanks.
        """
        tokens = torch.argmax(logits, dim=-1)
        if len(tokens.shape) == 1:
            tokens = tokens.unsqueeze(0)

        decoded_strings = []
        for batch_item in tokens:
            char_list = []
            for i in range(len(batch_item)):
                token = batch_item[i].item()
                if token != 0:
                    # Skip repeats of the previous timestep (CTC collapse rule)
                    if i > 0 and batch_item[i] == batch_item[i - 1]:
                        continue
                    char_list.append(self.idx_to_char.get(token, ""))
            decoded_strings.append("".join(char_list))
        return decoded_strings
processor_config.json ADDED
{
  "processor_class": "CaptchaProcessor",
  "vocab": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
  "auto_map": {
    "AutoProcessor": "processing_captcha.CaptchaProcessor"
  }
}