Desant Phishing Detection Model
CLIP RN50x64-based binary classifier for detecting phishing web pages from screenshots.
Built by Desant.ai for real-time phishing protection in the Desant Phishing Detectior Chrome extension platform.
Model Description
This model classifies web page screenshots as SAFE (Class 0) or MALICIOUS/phishing (Class 1). It uses OpenAI's CLIP RN50x64 as a frozen visual feature extractor, with a custom 3-layer MLP classifier head trained on thousands of real-world phishing and legitimate screenshots.
Note: The model was trained using OpenAI's CLIP (clip-by-openai) and is also compatible with OpenCLIP (open_clip_torch) for inference. The production backend uses OpenCLIP for serving.
The model is designed to detect phishing login forms: fake pages that mimic legitimate services (banks, email providers, social media, etc.) to steal user credentials.
Key Features
- High-resolution analysis: 448x448 pixel input (4x as many pixels as the 224x224 input of ViT-B/32)
- Real-world training data: Sourced from PhishTank, OpenPhish, URLhaus, and AlienVault OTX
- Production-deployed: Powers the Desant Phishing Detectior Chrome extension and backend API used in the Hugging Face Space demo
- Fast inference: ~50ms per screenshot on an NVIDIA RTX 4090 (~30ms model inference plus ~20ms preprocessing)
Architecture
Input: Web page screenshot (any resolution)
           │
           ▼
┌──────────────────────────────────┐
│  Preprocessing                   │
│  • Aspect-ratio preserving       │
│    resize to 448×448             │
│  • Mean color padding            │
│    (CLIP mean: 123, 117, 104)    │
│  • CLIP normalization            │
│    mean=[0.481, 0.458, 0.408]    │
│    std=[0.269, 0.261, 0.276]     │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  CLIP RN50x64 Vision Encoder     │ ← Frozen (pre-trained weights)
│  ResNet-50 variant scaled to     │
│  ~64× the compute of ResNet-50   │
│  Output: 1024-dim feature vector │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  Classifier Head (trainable)     │
│                                  │
│  Dropout(0.5)                    │
│  Linear(1024 → 512) + ReLU       │
│  Dropout(0.3)                    │
│  Linear(512 → 128) + ReLU        │
│  Dropout(0.2)                    │
│  Linear(128 → 2)                 │
│                                  │
│  Output: [safe_logit, mal_logit] │
└──────────┬───────────────────────┘
           │
           ▼
Softmax → Probabilities
  Class 0: SAFE
  Class 1: MALICIOUS (phishing)
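The final softmax step in the diagram can be checked by hand. The sketch below is a pure-Python illustration rather than the production code; the function name `softmax2` and the logit values are made up for the example.

```python
import math


def softmax2(safe_logit: float, mal_logit: float) -> tuple[float, float]:
    """Return (p_safe, p_malicious) for the two-logit output of the head."""
    # Subtract the larger logit first for numerical stability.
    m = max(safe_logit, mal_logit)
    e_safe = math.exp(safe_logit - m)
    e_mal = math.exp(mal_logit - m)
    total = e_safe + e_mal
    return e_safe / total, e_mal / total


# Hypothetical logits, chosen only for illustration.
p_safe, p_mal = softmax2(-1.2, 2.3)
label = "MALICIOUS" if p_mal > 0.5 else "SAFE"
print(label, round(p_safe, 4), round(p_mal, 4))
```

With only two classes, the test `p_mal > 0.5` is equivalent to `mal_logit > safe_logit`, so the probabilities matter mainly when a threshold other than 0.5 is wanted.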
Training Details
Training Data
| Source | Class | Description |
|---|---|---|
| PhishTank, OpenPhish, URLhaus, AlienVault OTX | MALICIOUS (Class 1) | Real phishing login form screenshots captured at 1920x941 |
| Curated safe URLs | SAFE (Class 0) | Legitimate login pages, normal web pages |
Training Configuration
| Parameter | Value |
|---|---|
| Base model | CLIP RN50x64 (OpenAI CLIP, compatible with OpenCLIP) |
| Input resolution | 448 × 448 pixels |
| Original screenshot resolution | 1920 × 941 pixels |
| Batch size | 32 (effective 64 with gradient accumulation) |
| Gradient accumulation steps | 2 |
| Max epochs | 25 |
| Early stopping patience | 10 epochs |
| Optimizer | AdamW (lr=1e-4, weight_decay=1e-4, betas=(0.9, 0.999)) |
| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=3) |
| Loss function | CrossEntropyLoss (unweighted) |
| Class balancing | WeightedRandomSampler |
| Data split | 80% train / 20% validation |
| Mixed precision | Enabled (AMP) |
| CLIP encoder | Frozen (only classifier head is trained) |
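The configuration above could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the actual training script: random tensors stand in for frozen CLIP features, mixed precision and early stopping are omitted, and the metric passed to the scheduler is a guess because the table does not name it.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Stand-in data (assumption): 1024-dim vectors in place of frozen CLIP features.
features = torch.randn(256, 1024)
labels = torch.cat([torch.zeros(192), torch.ones(64)]).long()  # imbalanced on purpose

# Class balancing: draw each sample with weight inversely
# proportional to the size of its class.
class_counts = torch.bincount(labels, minlength=2).float()
sample_weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(features, labels), batch_size=32, sampler=sampler)

head = nn.Sequential(
    nn.Dropout(0.5), nn.Linear(1024, 512), nn.ReLU(),
    nn.Dropout(0.3), nn.Linear(512, 128), nn.ReLU(),
    nn.Dropout(0.2), nn.Linear(128, 2),
)
optimizer = torch.optim.AdamW(
    head.parameters(), lr=1e-4, weight_decay=1e-4, betas=(0.9, 0.999)
)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)
criterion = nn.CrossEntropyLoss()
accum_steps = 2  # 32 x 2 = effective batch size of 64

head.train()
optimizer.zero_grad()
optimizer_steps = 0
epoch_loss = 0.0
for step, (x, y) in enumerate(loader, start=1):
    loss = criterion(head(x), y)
    (loss / accum_steps).backward()  # scale so accumulated gradients average
    epoch_loss += loss.item()
    if step % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

# Assumption: a real run would pass the validation loss here;
# the training loss stands in because this sketch has no validation set.
scheduler.step(epoch_loss / len(loader))
```

One epoch over 256 samples gives 8 mini-batches and therefore 4 optimizer updates, which is the effect of the two-step gradient accumulation listed in the table.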
Data Augmentation
| Augmentation | Details |
|---|---|
| Aspect-ratio preserving resize | Resize to 448x448 with CLIP mean color padding |
| Random horizontal flip | p=0.5 |
| Color jitter | brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1 |
Preprocessing Pipeline
- Load screenshot (PNG, 1920x941 original resolution)
- Preserve aspect ratio, resize to fit 448x448
- Pad with CLIP mean color (123, 117, 104) to fill 448x448 canvas
- Convert to tensor [0, 1]
- Normalize with CLIP statistics: mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711]
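To make the resize and padding steps concrete, the following pure-Python sketch works out where a 1920x941 screenshot lands on the 448x448 canvas and how the pad color follows from the CLIP mean. The helper name `letterbox_geometry` is hypothetical and used only for this illustration.

```python
def letterbox_geometry(width, height, target=448):
    """Return (new_w, new_h, left, top) for an aspect-ratio preserving fit."""
    scale = min(target / width, target / height)
    new_w, new_h = int(width * scale), int(height * scale)
    left, top = (target - new_w) // 2, (target - new_h) // 2
    return new_w, new_h, left, top


CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
# Rounding to the nearest integer reproduces the (123, 117, 104) pad color.
pad_color = tuple(round(c * 255) for c in CLIP_MEAN)

geometry = letterbox_geometry(1920, 941)
print(geometry)   # (448, 219, 0, 114)
print(pad_color)  # (123, 117, 104)
```

A 1920x941 screenshot therefore fills only a 448x219 strip of the input, with padding bands of 114 and 115 pixels above and below it.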
Performance
| Metric | Score |
|---|---|
| Accuracy | 92% |
| Malicious Recall | 93% |
| Safe Precision | 94% |
| False Positive Rate | 2–6% |
| F1 Score | ~0.94 |
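For readers who want to reproduce these metrics on their own labeled data, the sketch below shows how each one follows from a confusion matrix. It assumes MALICIOUS is the positive class and that F1 refers to the malicious class, which the card does not state; the counts are invented for illustration and are not the model's evaluation data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Metrics with MALICIOUS as the positive class and SAFE as the negative class."""
    recall = tp / (tp + fn)      # share of phishing pages that were caught
    precision = tp / (tp + fp)   # share of MALICIOUS predictions that were right
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "malicious_recall": recall,
        "safe_precision": tn / (tn + fn),  # share of SAFE predictions that were right
        "false_positive_rate": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only; NOT the model's evaluation data.
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```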
Inference Speed
| Hardware | Inference Time | Preprocessing |
|---|---|---|
| NVIDIA RTX 4090 | ~30ms | ~20ms |
| NVIDIA T4 | ~80ms | ~25ms |
| CPU (i7-13700K) | ~500ms | ~30ms |
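The table's two columns can be combined into an end-to-end latency and a rough throughput figure. The sketch assumes screenshots are processed one at a time, with no batching and no overlap between preprocessing and inference; the timings are the approximate values from the table.

```python
# Approximate per-image timings in milliseconds, copied from the table above:
# (model inference, preprocessing)
timings_ms = {
    "NVIDIA RTX 4090": (30, 20),
    "NVIDIA T4": (80, 25),
    "CPU (i7-13700K)": (500, 30),
}


def end_to_end(inference_ms, preprocessing_ms):
    """Return (total latency in ms, screenshots per second) for sequential processing."""
    total = inference_ms + preprocessing_ms
    return total, 1000 / total


summary = {hw: end_to_end(*t) for hw, t in timings_ms.items()}
for hw, (total, rate) in summary.items():
    print(f"{hw}: ~{total} ms per screenshot, ~{rate:.1f} screenshots/s")
```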
Usage
Quick Start (PyTorch)
import torch
import torch.nn as nn
import clip
from PIL import Image
from torchvision import transforms


# Define the classifier architecture (must match training)
class CLIPClassifier(nn.Module):
    def __init__(self, clip_model, num_classes=2):
        super().__init__()
        self.clip_visual = clip_model.visual
        # Probe the encoder with a dummy batch to infer the feature dimension
        with torch.no_grad():
            dummy = torch.randn(1, 3, 448, 448).float().to(next(clip_model.parameters()).device)
            features = self.clip_visual(dummy)
        feature_dim = features.shape[1]
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(feature_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        features = self.clip_visual(x.float())
        return self.classifier(features)


# Load CLIP base model
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("RN50x64", device=device, jit=False)
clip_model = clip_model.float()

# Build classifier and load trained weights
model = CLIPClassifier(clip_model, num_classes=2).to(device)
state_dict = torch.load("model_1920x941_CLIP_RN50x64_best.pth", map_location=device)
model.load_state_dict(state_dict)
model.eval()  # disables dropout for inference


# Preprocess a screenshot
def aspect_ratio_resize(image, target_size=(448, 448)):
    """Resize preserving aspect ratio with CLIP mean color padding."""
    tw, th = target_size
    w, h = image.size
    scale = min(tw / w, th / h)
    nw, nh = int(w * scale), int(h * scale)
    resized = image.resize((nw, nh), Image.LANCZOS)
    # int() truncates, so this evaluates to (122, 116, 104)
    pad_color = (int(0.48145466*255), int(0.4578275*255), int(0.40821073*255))
    canvas = Image.new("RGB", (tw, th), pad_color)
    # Center the resized screenshot on the padded canvas
    canvas.paste(resized, ((tw - nw) // 2, (th - nh) // 2))
    return canvas


preprocess = transforms.Compose([
    transforms.Lambda(lambda img: aspect_ratio_resize(img, (448, 448))),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.48145466, 0.4578275, 0.40821073],
        std=[0.26862954, 0.26130258, 0.27577711],
    ),
])

# Run inference
image = Image.open("screenshot.png").convert("RGB")
input_tensor = preprocess(image).unsqueeze(0).to(device)  # add batch dimension
with torch.no_grad():
    logits = model(input_tensor)
    probs = torch.softmax(logits, dim=1)

safe_prob = probs[0][0].item()
malicious_prob = probs[0][1].item()
prediction = "MALICIOUS" if malicious_prob > 0.5 else "SAFE"
print(f"Prediction: {prediction}")
print(f"Safe probability: {safe_prob:.4f}")
print(f"Malicious probability: {malicious_prob:.4f}")
Intended Use
Primary use case: Real-time phishing detection in web browsers via the Desant Phishing Detectior Chrome extension
Suitable for:
- Browser extensions that analyze page screenshots
- Email security systems checking embedded links
- Web crawlers classifying pages at scale
- Security research on phishing detection
Not suitable for:
- General image classification (model is specialized for web page screenshots)
- Detecting non-visual phishing attacks (e.g., homograph attacks without visual cues)
- Replacing comprehensive security solutions (this is one layer of defense)
Limitations
- Training bias: Model is primarily trained on English-language phishing pages; performance may vary for other languages
- Evasion: Sophisticated attackers may craft pages that visually differ from training data
- Screenshot dependency: Requires a full-page screenshot; partial captures may reduce accuracy
- Resolution sensitivity: Best performance with screenshots at or near 1920x941; very small or very large screenshots may see degraded accuracy
- Login form focus: Model is optimized for detecting fake login forms specifically; other phishing types (e.g., fake payment pages without login fields) may be less reliably detected
Ethical Considerations
This model is designed for defensive cybersecurity (protecting users from phishing attacks). It should not be used to:
- Create or improve phishing pages
- Bypass existing security systems
- Target or profile individuals
Citation
@software{desant_phishing_detection_2025,
author = {Desant.ai},
title = {CLIP-based Phishing Screenshot Detection Model},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/desant-ai/desant-phishing-inference}
}
Links
- Live Demo: Hugging Face Space
- Chrome Extension: Desant Phishing Detectior Chrome extension
- Organization: Desant.ai
Evaluation results
- Accuracy (self-reported): 0.950
- Malicious Recall (self-reported): 0.930
- F1 Score (self-reported): 0.940