Desant Phishing Detection Model

CLIP RN50x64-based binary classifier for detecting phishing web pages from screenshots.

Built by Desant.ai for real-time phishing protection in the Desant Phishing Detector Chrome extension.

Model Description

This model classifies web page screenshots as SAFE (Class 0) or MALICIOUS/phishing (Class 1). It uses OpenAI's CLIP RN50x64 as a frozen visual feature extractor, with a custom 3-layer MLP classifier head trained on thousands of real-world phishing and legitimate screenshots.

Note: The model is trained using OpenAI's CLIP (clip-by-openai) and is also compatible with OpenCLIP (open_clip_torch) for inference. The production backend uses OpenCLIP for serving.

The model is designed to detect phishing login forms: fake pages that mimic legitimate services (banks, email providers, social media, etc.) to steal user credentials.

Architecture

Input: Web page screenshot (any resolution)
        │
        ▼
┌──────────────────────────────────┐
│  Preprocessing                   │
│  • Aspect-ratio preserving       │
│    resize to 448×448             │
│  • Mean color padding            │
│    (CLIP mean: 123, 117, 104)    │
│  • CLIP normalization            │
│    mean=[0.481, 0.458, 0.408]    │
│    std=[0.269, 0.261, 0.276]     │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  CLIP RN50x64 Vision Encoder     │  ← Frozen (pre-trained weights)
│  ResNet scaled to ~64× the       │
│  compute of RN50                 │
│  Output: 1024-dim feature vector │
└──────────┬───────────────────────┘
           │
           ▼
┌──────────────────────────────────┐
│  Classifier Head (trainable)     │
│                                  │
│  Dropout(0.5)                    │
│  Linear(1024 → 512) + ReLU       │
│  Dropout(0.3)                    │
│  Linear(512 → 128) + ReLU        │
│  Dropout(0.2)                    │
│  Linear(128 → 2)                 │
│                                  │
│  Output: [safe_logit, mal_logit] │
└──────────┬───────────────────────┘
           │
           ▼
      Softmax → Probabilities
      Class 0: SAFE
      Class 1: MALICIOUS (phishing)
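For a sense of scale, only the MLP head is trained; a quick sketch of its trainable parameter count, derived from the three Linear layers above (Dropout and ReLU add no parameters):

```python
# Parameter count of the trainable classifier head shown in the diagram.
def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases

total = (
    linear_params(1024, 512)   # 524,800
    + linear_params(512, 128)  #  65,664
    + linear_params(128, 2)    #     258
)
print(total)  # 590722
```

Roughly 0.6M trainable parameters, versus hundreds of millions frozen in the RN50x64 encoder.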

Training Details

Training Data

| Source | Class | Description |
|--------|-------|-------------|
| PhishTank, OpenPhish, URLhaus, AlienVault OTX | MALICIOUS (Class 1) | Real phishing login form screenshots captured at 1920×941 |
| Curated safe URLs | SAFE (Class 0) | Legitimate login pages, normal web pages |

Training Configuration

| Parameter | Value |
|-----------|-------|
| Base model | CLIP RN50x64 (OpenAI CLIP, compatible with OpenCLIP) |
| Input resolution | 448 × 448 pixels |
| Original screenshot resolution | 1920 × 941 pixels |
| Batch size | 32 (effective 64 with gradient accumulation) |
| Gradient accumulation steps | 2 |
| Max epochs | 25 |
| Early stopping patience | 10 epochs |
| Optimizer | AdamW (lr=1e-4, weight_decay=1e-4, betas=(0.9, 0.999)) |
| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=3) |
| Loss function | CrossEntropyLoss (unweighted) |
| Class balancing | WeightedRandomSampler |
| Data split | 80% train / 20% validation |
| Mixed precision | Enabled (AMP) |
| CLIP encoder | Frozen (only classifier head is trained) |
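The optimizer and scheduler rows can be sketched in PyTorch as follows. This is a minimal illustration against a stand-in module, not the actual training script:

```python
import torch
import torch.nn as nn

# Stand-in for the classifier head; the real one is the 3-layer MLP above.
head = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 2))

optimizer = torch.optim.AdamW(
    head.parameters(), lr=1e-4, weight_decay=1e-4, betas=(0.9, 0.999)
)
# Halve the learning rate after 3 epochs without validation-loss improvement.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)
criterion = nn.CrossEntropyLoss()  # unweighted, per the table
```

In the training loop, `scheduler.step(val_loss)` would be called once per epoch with the validation loss.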

Data Augmentation

| Augmentation | Details |
|--------------|---------|
| Aspect-ratio preserving resize | Resize to 448×448 with CLIP mean color padding |
| Random horizontal flip | p=0.5 |
| Color jitter | brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1 |

Preprocessing Pipeline

  1. Load screenshot (PNG, 1920x941 original resolution)
  2. Preserve aspect ratio, resize to fit 448x448
  3. Pad with CLIP mean color (123, 117, 104) to fill 448x448 canvas
  4. Convert to tensor [0, 1]
  5. Normalize with CLIP statistics: mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711]
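As a sanity check on steps 2 and 3, the resize arithmetic for the original 1920x941 resolution works out like this (pure Python, mirroring the `aspect_ratio_resize` helper in the Usage section):

```python
# Resize arithmetic for a 1920x941 screenshot fit into a 448x448 canvas.
src_w, src_h = 1920, 941
tgt_w, tgt_h = 448, 448

# Scale so the image fits entirely inside the canvas.
scale = min(tgt_w / src_w, tgt_h / src_h)  # 448/1920, width is the limit
new_w, new_h = int(src_w * scale), int(src_h * scale)  # (448, 219)

# The rest of the canvas is filled with the CLIP mean color.
pad_left = (tgt_w - new_w) // 2  # 0
pad_top = (tgt_h - new_h) // 2   # 114

print(new_w, new_h, pad_left, pad_top)  # 448 219 0 114
```

So a full-width screenshot keeps its aspect ratio and gets 114-pixel mean-color bands above and below.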

Performance

| Metric | Score |
|--------|-------|
| Accuracy | 92% |
| Malicious Recall | 93% |
| Safe Precision | 94% |
| False Positive Rate | 2–6% |
| F1 Score | ~0.94 |

Inference Speed

| Hardware | Inference Time | Preprocessing |
|----------|----------------|---------------|
| NVIDIA RTX 4090 | ~30 ms | ~20 ms |
| NVIDIA T4 | ~80 ms | ~25 ms |
| CPU (i7-13700K) | ~500 ms | ~30 ms |

Usage

Quick Start (PyTorch)

import torch
import torch.nn as nn
import clip
from PIL import Image

# Define the classifier architecture (must match training)
class CLIPClassifier(nn.Module):
    def __init__(self, clip_model, num_classes=2):
        super().__init__()
        self.clip_visual = clip_model.visual
        with torch.no_grad():
            dummy = torch.randn(1, 3, 448, 448).float().to(next(clip_model.parameters()).device)
            features = self.clip_visual(dummy)
            feature_dim = features.shape[1]
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(feature_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        features = self.clip_visual(x.float())
        return self.classifier(features)

# Load CLIP base model
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("RN50x64", device=device, jit=False)
clip_model = clip_model.float()

# Build classifier and load trained weights
model = CLIPClassifier(clip_model, num_classes=2).to(device)
state_dict = torch.load("model_1920x941_CLIP_RN50x64_best.pth", map_location=device)
model.load_state_dict(state_dict)
model.eval()

# Preprocess a screenshot
from torchvision import transforms

def aspect_ratio_resize(image, target_size=(448, 448)):
    """Resize preserving aspect ratio with CLIP mean color padding."""
    tw, th = target_size
    w, h = image.size
    scale = min(tw / w, th / h)
    nw, nh = int(w * scale), int(h * scale)
    resized = image.resize((nw, nh), Image.LANCZOS)
    pad_color = (round(0.48145466*255), round(0.4578275*255), round(0.40821073*255))  # (123, 117, 104)
    canvas = Image.new("RGB", (tw, th), pad_color)
    canvas.paste(resized, ((tw - nw) // 2, (th - nh) // 2))
    return canvas

preprocess = transforms.Compose([
    transforms.Lambda(lambda img: aspect_ratio_resize(img, (448, 448))),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.48145466, 0.4578275, 0.40821073],
        std=[0.26862954, 0.26130258, 0.27577711],
    ),
])

# Run inference
image = Image.open("screenshot.png").convert("RGB")
input_tensor = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(input_tensor)
    probs = torch.softmax(logits, dim=1)
    safe_prob = probs[0][0].item()
    malicious_prob = probs[0][1].item()
    prediction = "MALICIOUS" if malicious_prob > 0.5 else "SAFE"

print(f"Prediction: {prediction}")
print(f"Safe probability:      {safe_prob:.4f}")
print(f"Malicious probability: {malicious_prob:.4f}")
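The 0.5 cutoff above is not mandatory. If a deployment is sensitive to the reported 2–6% false positive rate, raising the threshold trades malicious recall for fewer false alarms. A hypothetical helper (not part of the shipped model):

```python
# Hypothetical helper: apply a configurable decision threshold to the
# malicious-class probability instead of the default 0.5 cutoff.
def classify(malicious_prob: float, threshold: float = 0.5) -> str:
    """Raise `threshold` to lower false positives at the cost of recall."""
    return "MALICIOUS" if malicious_prob >= threshold else "SAFE"

print(classify(0.62))                 # MALICIOUS
print(classify(0.62, threshold=0.7))  # SAFE
```

A suitable threshold should be tuned on a held-out validation set for the target false-positive budget.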

Intended Use

Primary use case: Real-time phishing detection in web browsers via the Desant Phishing Detector Chrome extension

Suitable for:

  • Browser extensions that analyze page screenshots
  • Email security systems checking embedded links
  • Web crawlers classifying pages at scale
  • Security research on phishing detection

Not suitable for:

  • General image classification (model is specialized for web page screenshots)
  • Detecting non-visual phishing attacks (e.g., homograph attacks without visual cues)
  • Replacing comprehensive security solutions (this is one layer of defense)

Limitations

  • Training bias: Model is primarily trained on English-language phishing pages; performance may vary for other languages
  • Evasion: Sophisticated attackers may craft pages that visually differ from training data
  • Screenshot dependency: Requires a full-page screenshot; partial captures may reduce accuracy
  • Resolution sensitivity: Best performance with screenshots at or near 1920x941; very small or very large screenshots may see degraded accuracy
  • Login form focus: Model is optimized for detecting fake login forms specifically; other phishing types (e.g., fake payment pages without login fields) may be less reliably detected

Ethical Considerations

This model is designed for defensive cybersecurity: protecting users from phishing attacks. It should not be used to:

  • Create or improve phishing pages
  • Bypass existing security systems
  • Target or profile individuals

Citation

@software{desant_phishing_detection_2025,
  author = {Desant.ai},
  title = {CLIP-based Phishing Screenshot Detection Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/desant-ai/desant-phishing-inference}
}
