🛡️ ModernBERT URL Phishing Detector

This is a fine-tuned ModernBERT-base model (149M parameters) for detecting phishing and malicious URLs. It was trained with three measures aimed at real-world generalization (Global Deduplication, URL Sanitization, and Domain-Based Group Splitting) to prevent structural overfitting and data leakage.

🧑‍💻 Authors & Researchers

Ilkay ONAY - LinkedIn | GitHub
Bayram BAYRAKTAR - LinkedIn | GitHub

🚀 Quick Start (Inference)

You can use this model directly with the transformers library.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F

# Load Model
model_id = "ilkayO/modernbert-phishing-detector"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

url = "https://secure-login-verify.suspicious-domain.tk/login"

# Tokenize
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=256, padding=True)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=-1)

# Probability of class index 1, which this example treats as the phishing class
phish_prob = probs[0][1].item()
print(f"Phishing Probability: {phish_prob * 100:.2f}%")
# A probability above 60% indicates SUSPICIOUS; above 85% indicates PHISHING
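
The two thresholds in the final comment can be turned into a small helper. This is a minimal sketch, not part of the released model: the function name `verdict` and the `BENIGN` label for scores at or below 60% are assumptions made for illustration, and the comparisons are strict because the comment says "above".

```python
def verdict(phish_prob: float) -> str:
    """Map a phishing probability in [0, 1] to a coarse label.

    Thresholds follow the Quick Start comment; the label names other than
    SUSPICIOUS and PHISHING are hypothetical.
    """
    if phish_prob > 0.85:
        return "PHISHING"
    if phish_prob > 0.60:
        return "SUSPICIOUS"
    return "BENIGN"  # assumed name for everything at or below 60%
```

With the snippet above, `verdict(phish_prob)` would give the label for the scored URL.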

🔬 Model Performance & R&D Process

Phishing detectors are notoriously sensitive to dataset bias. Early iterations (Phase 1) on standard datasets reached a 100% F1-score but dropped to 46% accuracy on out-of-distribution (OOD) testing, because the model had memorized domain names instead of learning malicious morphological patterns.

To solve this, we applied:

Hybrid Dataset Pooling: Combined and normalized labels from PhiUSIIL and Kaggle Malicious Phish datasets.
Global Deduplication: Stripped URL prefixes (the protocol, such as http, and www) and then removed duplicates; when the same URL carried conflicting labels, the 'malicious' label was kept to increase the safety margin.
Domain-Based Group Splitting: Used GroupShuffleSplit so that all URLs from a given domain fall entirely in either the training set or the test set, preventing data leakage.
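
The deduplication and splitting steps above could look roughly like the following. This is a sketch under stated assumptions, not the authors' pipeline: the helper names (`sanitize`, `deduplicate`, `domain_of`, `group_split`) are hypothetical, labels are assumed to be 1 for malicious and 0 for benign, and the full hostname is used as the group key (grouping by registered domain would need a public-suffix list). Only `GroupShuffleSplit` comes from the description.

```python
import re

from sklearn.model_selection import GroupShuffleSplit

# Optional "http://" or "https://" followed by an optional "www."
_PREFIX = re.compile(r"^(?:https?://)?(?:www\.)?", re.IGNORECASE)


def sanitize(url: str) -> str:
    """Strip the scheme, a leading 'www.', and trailing slashes."""
    return _PREFIX.sub("", url.strip()).rstrip("/")


def deduplicate(rows):
    """Collapse (url, label) rows by sanitized URL; malicious (1) wins on conflict."""
    merged = {}
    for url, label in rows:
        key = sanitize(url)
        merged[key] = max(merged.get(key, 0), label)
    return list(merged.items())


def domain_of(clean_url: str) -> str:
    """Hostname of a sanitized URL (assumed group key)."""
    return clean_url.split("/", 1)[0].lower()


def group_split(pairs, test_size=0.2, seed=42):
    """Split so that every domain lands in exactly one of train or test."""
    urls = [u for u, _ in pairs]
    labels = [y for _, y in pairs]
    groups = [domain_of(u) for u in urls]
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(urls, labels, groups))
    return [pairs[i] for i in train_idx], [pairs[i] for i in test_idx]
```

Note that `test_size` here is a fraction of the domains, not of the URLs, so the resulting row counts can differ from the nominal ratio when domains have very different numbers of URLs.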

Final Validation Metrics (Unseen Domains)

F1-Score: 0.8300
Accuracy: 83.00%
Precision: 0.84
Recall: 0.83
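
For reference, the four metrics relate to the confusion counts as sketched below. The card does not say whether the reported values are per-class or averaged across classes, so this sketch computes the plain binary form for the positive (phishing) class. The function name and the toy labels in the usage note are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Calling `binary_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])` on these toy labels gives three true positives, one false negative, one false positive and three true negatives.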

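Unlike models that report a misleading 99% accuracy, these scores are measured on domains the model never saw during training, so they give an honest picture of how well it generalizes. The model draws on cues such as subdirectory depth, character entropy, and suspicious keyword sequences.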
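
To make two of those cues concrete, the sketch below computes them by hand. These functions are illustrative only: the model receives the raw URL string and is not fed these values as explicit features, and the names `char_entropy` and `path_depth` are hypothetical.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def path_depth(url: str) -> int:
    """Number of non-empty path segments after the host."""
    without_scheme = url.split("://", 1)[-1]
    # Drop the query string and fragment before counting segments
    path = without_scheme.split("?", 1)[0].split("#", 1)[0]
    segments = path.split("/")[1:]  # the first element is the host
    return sum(1 for s in segments if s)
```

Randomly generated hostnames tend to score higher on `char_entropy` than dictionary words, and deeply nested paths raise `path_depth`.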

⚙️ Training Details

Base Model: answerdotai/ModernBERT-base
Precision: BF16 Mixed Precision
Hardware Optimizations: TorchDynamo (torch.compile), Flash Attention 2 & Unpadding
Learning Rate Scheduler: Cosine Decay (Max LR: 2e-5)
Weight Decay: 0.1 (Strict L2 Regularization)
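
As a rough guide, the settings above map onto Hugging Face keyword arguments as follows. This is a sketch, assuming the `transformers` Trainer was used, which the card does not state. The output directory and `num_labels=2` are assumptions, and values the card does not list (batch size, epochs, warmup) are left out rather than guessed.

```python
# Keyword arguments for transformers.TrainingArguments
training_kwargs = {
    "output_dir": "modernbert-phishing-detector",  # hypothetical path
    "learning_rate": 2e-5,          # peak learning rate
    "lr_scheduler_type": "cosine",  # cosine decay
    "weight_decay": 0.1,
    "bf16": True,                   # BF16 mixed precision
    "torch_compile": True,          # TorchDynamo via torch.compile
}

# Keyword arguments for AutoModelForSequenceClassification.from_pretrained
model_kwargs = {
    "num_labels": 2,  # assumed: binary benign vs. phishing
    "attn_implementation": "flash_attention_2",
}
```

These would be passed as `TrainingArguments(**training_kwargs)` and `AutoModelForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base", **model_kwargs)`. Flash Attention 2 requires the `flash-attn` package and a supported GPU.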

📄 License

This project is licensed under the Apache 2.0 License.
