🛡️ ModernBERT URL Phishing Detector

This is a fine-tuned ModernBERT-base model (149M parameters) for detecting phishing and malicious URLs. To prevent structural overfitting and data leakage, it was trained with three generalization safeguards: Global Deduplication, URL Sanitization, and Domain-Based Group Splitting.

🧑‍💻 Authors & Researchers


🚀 Quick Start (Inference)

You can use this model directly with the transformers library.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F

# Load Model
model_id = "ilkayO/modernbert-phishing-detector"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

url = "https://secure-login-verify.suspicious-domain.tk/login"

# Tokenize
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=256, padding=True)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=-1)

phish_prob = probs[0][1].item()  # index 1 is read as the phishing class
print(f"Phishing Probability: {phish_prob * 100:.2f}%")
# Above 60% indicates SUSPICIOUS; above 85% indicates PHISHING
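
The two thresholds in the closing comment above can be wrapped in a small helper. This is only a minimal sketch: the function name `verdict_from_probability` and the `LIKELY_SAFE` label for scores at or below 60% are illustrative assumptions, not part of the model or its API.

```python
def verdict_from_probability(phish_prob: float) -> str:
    """Map a phishing probability in [0, 1] to a coarse verdict."""
    if not 0.0 <= phish_prob <= 1.0:
        raise ValueError("phish_prob must be between 0 and 1")
    if phish_prob > 0.85:
        return "PHISHING"
    if phish_prob > 0.60:
        return "SUSPICIOUS"
    # Assumed label: the model card only names the two verdicts above.
    return "LIKELY_SAFE"
```

Calling `verdict_from_probability(phish_prob)` after the prediction step would turn the raw score into one of the three strings.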

🔬 Model Performance & R&D Process

Phishing detectors are notoriously prone to dataset bias. Early iterations (Phase 1) on standard datasets reached a 100% F1-Score but dropped to 46% accuracy on Out-of-Distribution (OOD) testing, because the model had memorized domain names instead of learning malicious morphological patterns.

To solve this, we applied:

  1. Hybrid Dataset Pooling: Combined the PhiUSIIL and Kaggle Malicious Phish datasets and normalized their labels.
  2. Global Deduplication: Stripped scheme and prefix tokens (http, www) before deduplicating, and kept the 'malicious' label when duplicates had conflicting labels, to increase the safety margin.
  3. Domain-Based Group Splitting: Used GroupShuffleSplit so that all URLs from a given domain fall entirely in either the training set or the test set, preventing data leakage.
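
The sanitization and deduplication step might look roughly like the sketch below. It is not the project's actual preprocessing code: the helper names `sanitize_url` and `deduplicate`, the 0/1 label encoding, the lowercasing of the host, and the trailing-slash trimming are all assumptions made for illustration.

```python
import re

# Matches a leading URL scheme such as "http://" or "https://".
_SCHEME = re.compile(r"^[a-z][a-z0-9+.\-]*://", re.IGNORECASE)


def sanitize_url(url: str) -> str:
    """Strip the scheme and a leading 'www.', lowercasing only the host."""
    url = _SCHEME.sub("", url.strip())
    host, sep, rest = url.partition("/")
    host = host.lower()
    if host.startswith("www."):
        host = host[4:]
    return (host + sep + rest).rstrip("/")


def deduplicate(rows):
    """Collapse (url, label) pairs by sanitized URL.

    Labels are assumed to be 1 = malicious, 0 = benign. On conflict the
    malicious label wins, mirroring the 'safety margin' rule above.
    """
    merged = {}
    for url, label in rows:
        key = sanitize_url(url)
        merged[key] = max(merged.get(key, 0), label)
    return merged
```

Taking the maximum label is one simple way to let 'malicious' override 'benign' when the same sanitized URL appears in both source datasets.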

Final Validation Metrics (Unseen Domains)

  • F1-Score: 0.8300
  • Accuracy: 83.00%
  • Precision: 0.84
  • Recall: 0.83
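
These figures only count as "unseen domain" results if no test domain also appears in training. A minimal sketch of such a split and check follows, using scikit-learn's `GroupShuffleSplit`. The toy URLs, the use of the hostname as the group key, the 25% test size, and the random seed are illustrative assumptions rather than the values used for this model.

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy, already-sanitized URLs (hypothetical data for illustration only).
urls = [
    "example.com/signin", "example.com/help",
    "phish-a.test/verify", "phish-a.test/account",
    "example.org/login", "example.org/pricing",
    "phish-b.test/bank", "phish-b.test/reset",
]
labels = [0, 0, 1, 1, 0, 0, 1, 1]

# Group key: everything before the first "/" (the hostname).
groups = [u.split("/", 1)[0] for u in urls]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(urls, labels, groups=groups))

train_domains = {groups[i] for i in train_idx}
test_domains = {groups[i] for i in test_idx}

# Leakage check: no domain may appear on both sides of the split.
assert train_domains.isdisjoint(test_domains)
```

Because the split is done per group, every URL of a held-out domain lands in the test set together, so the model cannot score well by recalling a domain it saw during training.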

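Unlike models that report a deceptive 99% accuracy on leaky splits, these figures are measured on domains the model never saw during training, so they give a more honest estimate of real-world generalization. Without memorized domain names to lean on, the model has to rely on URL structure such as subdirectory depth, character entropy, and suspicious keyword sequences.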


⚙️ Training Details

  • Base Model: answerdotai/ModernBERT-base
  • Precision: BF16 Mixed Precision
  • Hardware Optimizations: TorchDynamo (torch.compile), Flash Attention 2 & Unpadding
  • Learning Rate Scheduler: Cosine Decay (Max LR: 2e-5)
  • Weight Decay: 0.1 (Strict L2 Regularization)
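
For readers who want to reproduce a similar setup, the listed settings could be expressed as below. This is a hedged sketch, not the original training script: it assumes the Hugging Face `Trainer` was used (the keyword names follow `transformers.TrainingArguments` and `from_pretrained`), and the `cosine_lr` helper assumes zero warmup and decay to zero, neither of which the list above specifies.

```python
import math

# Keyword names as accepted by transformers.TrainingArguments (assumed usage).
training_kwargs = {
    "learning_rate": 2e-5,          # Max LR
    "lr_scheduler_type": "cosine",  # Cosine Decay
    "weight_decay": 0.1,
    "bf16": True,                   # BF16 mixed precision
    "torch_compile": True,          # TorchDynamo
}

# Keyword passed to from_pretrained to request Flash Attention 2 (assumed usage).
model_kwargs = {"attn_implementation": "flash_attention_2"}


def cosine_lr(step: int, total_steps: int, max_lr: float = 2e-5) -> float:
    """Cosine decay from max_lr to 0 over total_steps, with no warmup."""
    progress = step / total_steps
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```

With these assumptions the learning rate starts at 2e-5, passes through 1e-5 at the midpoint, and reaches 0 at the final step.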

📄 License

This project is licensed under the Apache 2.0 License.