🛡️ ModernBERT URL Phishing Detector
This is a fine-tuned ModernBERT-base model (149M parameters) for detecting phishing and malicious URLs. To prevent data leakage and structural overfitting, it was trained with three safeguards aimed at real-world generalization: global deduplication, URL sanitization, and domain-based group splitting.
🧑💻 Authors & Researchers
🚀 Quick Start (Inference)
You can use this model directly with the transformers library.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the model and tokenizer
# (replace [YOUR_USERNAME] with the Hub account that hosts the model)
model_id = "[YOUR_USERNAME]/modernbert-phishing-detector"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

url = "https://secure-login-verify.suspicious-domain.tk/login"

# Tokenize the URL
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=256, padding=True)

# Predict (class index 1 is read as the phishing class)
with torch.no_grad():
    outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=-1)
    phish_prob = probs[0][1].item()

print(f"Phishing Probability: {phish_prob * 100:.2f}%")
# A probability above 60% indicates SUSPICIOUS; above 85% indicates PHISHING
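The two thresholds noted at the end of the snippet above can be wrapped in a small helper. This is only a minimal sketch: the function name `classify_url_risk` is hypothetical, and the `LIKELY_SAFE` label for scores at or below 60% is an assumption, since the card names only the SUSPICIOUS and PHISHING bands.

```python
def classify_url_risk(phish_prob: float) -> str:
    """Map a phishing probability in [0, 1] to a coarse risk label."""
    if not 0.0 <= phish_prob <= 1.0:
        raise ValueError("phish_prob must be between 0 and 1")
    if phish_prob > 0.85:
        return "PHISHING"
    if phish_prob > 0.60:
        return "SUSPICIOUS"
    # Assumption: the card does not name this bucket.
    return "LIKELY_SAFE"


# Hypothetical scores, not real model outputs
for score in (0.12, 0.72, 0.97):
    print(f"{score:.2f} -> {classify_url_risk(score)}")
```

In practice the cut-offs would likely be tuned on a validation set to match the acceptable false-positive rate.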
🔬 Model Performance & R&D Process
Phishing detectors are notoriously prone to dataset bias. Early iterations (Phase 1) trained on standard datasets reached a 100% F1-score but dropped to 46% accuracy on out-of-distribution (OOD) testing, because the model had memorized domain names instead of learning malicious morphological patterns.
To solve this, we applied:
- Hybrid Dataset Pooling: Combined and normalized labels from the PhiUSIIL and Kaggle Malicious Phish datasets.
- Global Deduplication: Stripped protocols and prefixes (http, www) before deduplicating, and kept the 'malicious' label when duplicates had conflicting labels, to increase the safety margin.
- Domain-Based Group Splitting: Used GroupShuffleSplit to ensure that all URLs of a given domain appear in either the training set or the test set, never both, preventing data leakage.
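The deduplication and splitting steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: all function names are hypothetical, labels are assumed to be 1 for malicious and 0 for benign, the host is used as the group key (so subdomains form separate groups), and the split is a dependency-free stand-in for scikit-learn's GroupShuffleSplit, which the card names.

```python
import random
import re


def sanitize_url(url: str) -> str:
    """Strip the http/https scheme and a leading 'www.' from a URL."""
    url = url.strip()
    for scheme in ("https://", "http://"):
        if url.lower().startswith(scheme):
            url = url[len(scheme):]
            break
    if url.lower().startswith("www."):
        url = url[len("www."):]
    return url


def deduplicate(samples):
    """Keep one label per sanitized URL; malicious (1) wins any conflict."""
    merged = {}
    for url, label in samples:
        key = sanitize_url(url)
        merged[key] = max(label, merged.get(key, 0))
    return merged


def domain_of(clean_url: str) -> str:
    """Group key: the host part before the first '/', '?' or '#'."""
    return re.split(r"[/?#]", clean_url, maxsplit=1)[0].lower()


def group_split(urls, test_fraction=0.2, seed=42):
    """Assign whole domains to either train or test, never both."""
    domains = sorted({domain_of(u) for u in urls})
    random.Random(seed).shuffle(domains)
    n_test = max(1, round(len(domains) * test_fraction))
    test_domains = set(domains[:n_test])
    train = [u for u in urls if domain_of(u) not in test_domains]
    test = [u for u in urls if domain_of(u) in test_domains]
    return train, test


# Toy samples for illustration only (1 = malicious, 0 = benign)
samples = [
    ("http://www.example.com/login", 0),
    ("https://example.com/login", 1),  # same URL once sanitized, conflicting label
    ("https://example.com/about", 0),
    ("http://phish.example.net/verify", 1),
    ("http://phish.example.net/update", 1),
    ("https://example.org/docs/", 0),
]

dataset = deduplicate(samples)
train_urls, test_urls = group_split(list(dataset))
train_domains = {domain_of(u) for u in train_urls}
test_domains = {domain_of(u) for u in test_urls}
print(len(dataset), "unique URLs")
print("domains shared between splits:", train_domains & test_domains)
```

Grouping by registered domain rather than full host would need a public suffix list, which is omitted here to keep the sketch self-contained.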
Final Validation Metrics (Unseen Domains)
- F1-Score: 0.8300
- Accuracy: 83.00%
- Precision: 0.84
- Recall: 0.83
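Unlike models that report a deceptive 99% accuracy on leaky splits, these metrics are measured on domains unseen during training, so they give an honest picture of generalization. The model analyzes structural cues such as subdirectory depth, character entropy, and suspicious keyword sequences rather than memorized domain names.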
⚙️ Training Details
- Base Model: answerdotai/ModernBERT-base
- Precision: BF16 Mixed Precision
- Hardware Optimizations: TorchDynamo (torch.compile), Flash Attention 2 & Unpadding
- Learning Rate Scheduler: Cosine Decay (Max LR: 2e-5)
- Weight Decay: 0.1 (Strict L2 Regularization)
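To make the scheduler setting concrete, the sketch below computes a cosine decay starting from the stated maximum learning rate of 2e-5. The total step count, the absence of warmup, and the decay floor of zero are assumptions; the card does not state them, and the function name is hypothetical.

```python
import math

MAX_LR = 2e-5  # maximum learning rate stated in the training details


def cosine_decay_lr(step: int, total_steps: int, max_lr: float = MAX_LR) -> float:
    """Learning rate at `step` for a cosine decay from max_lr to 0 (no warmup)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step into [0, total_steps] and convert to progress in [0, 1]
    progress = min(max(step, 0), total_steps) / total_steps
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical value for illustration
for step in (0, 250, 500, 750, 1000):
    print(f"step {step:4d}: lr = {cosine_decay_lr(step, total_steps):.2e}")
```

With the Hugging Face Trainer, a comparable schedule is usually selected through `lr_scheduler_type="cosine"` in `TrainingArguments`, alongside `learning_rate`, `weight_decay`, and `bf16`; whether the authors used the Trainer is not stated.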
📄 License
This project is licensed under the Apache 2.0 License.