Random Forest / XGBoost Model for URL Phishing Detection
This repository contains a trained tree-based classifier for detecting phishing URLs. The model was trained on PhiUSIIL_Phishing_URL_Dataset.csv using lightweight, URL-only lexical and structural features. On the held-out test split it reached 0.9952 accuracy and 0.9958 F1.
Highlights
- Backend: gradient-boosted trees via XGBoost (uses GPU if available; falls back to CPU).
- Input: raw URL string only (no external DNS/WHOIS calls needed).
- Features: length, character counts, digit ratio, IPv4 presence, common phishing tokens, scheme/TLD heuristics.
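The feature families listed above can be sketched roughly as follows. This is an illustration only: the feature names, the token list, and the exact heuristics are assumptions, not the definitions used in training. The authoritative names are the ones stored in the bundle's feature_cols.

```python
import re
from urllib.parse import urlsplit

# Hypothetical token list; the tokens used in training are not listed in this README.
SUSPICIOUS_TOKENS = ("login", "secure", "account", "update", "verify")

# Dotted-quad IPv4 address with each octet limited to 0-255.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(rf"^{_OCTET}(?:\.{_OCTET}){{3}}$")


def extract_features(url: str) -> dict:
    """Compute illustrative lexical/structural features from a raw URL string."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    lowered = url.lower()
    digits = sum(ch.isdigit() for ch in url)
    is_ip = bool(IPV4_RE.match(host))
    # Last host label, skipped for IP hosts and hosts without a dot.
    tld = host.rsplit(".", 1)[-1] if "." in host and not is_ip else ""
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": digits,
        "digit_ratio": digits / len(url) if url else 0.0,
        "has_ipv4_host": int(is_ip),
        "num_suspicious_tokens": sum(tok in lowered for tok in SUSPICIOUS_TOKENS),
        "is_https": int(parts.scheme == "https"),
        "tld_length": len(tld),
    }


if __name__ == "__main__":
    print(extract_features("http://secure-login-account-update.example.com/session?id=123"))
```

Everything here is computed from the string itself, which matches the "no external DNS/WHOIS calls" constraint.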
Test metrics (from notebook)
- accuracy: 0.9952
- precision: 0.9928
- recall: 0.9989
- f1: 0.9958
- roc_auc: 0.9976
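As a quick consistency check, the reported F1 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The snippet uses only the numbers listed above.

```python
precision = 0.9928
recall = 0.9989

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.9958, matching the reported value
```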
Files
- rf_url_phishing_xgboost_bst.joblib: joblib bundle with the trained model and metadata.
- inference.py: helpers to load the bundle and run predict_url().
- requirements.txt: minimal dependencies for local inference.
Quick start (local)
- Install dependencies
pip install -r requirements.txt
- Predict a single URL
from inference import load_bundle, predict_url
bundle = load_bundle("rf_url_phishing_xgboost_bst.joblib")
result = predict_url(
    url="http://secure-login-account-update.example.com/session?id=123",
    bundle=bundle,
    threshold=0.5,
)
print(result)
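The threshold argument controls how a model score is turned into a hard label. The sketch below shows the general idea under stated assumptions: that the score is a probability-like value in [0, 1] for the positive class, and that the comparison is "greater than or equal". Neither the structure of result nor which class is treated as positive is documented here, and label_from_score is a hypothetical helper, not part of inference.py.

```python
def label_from_score(score: float, threshold: float = 0.5) -> int:
    """Return 1 when the score reaches the threshold, else 0 (assumed convention)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    return int(score >= threshold)


# Made-up score, used only to show how the label changes with the threshold.
score = 0.62
for t in (0.3, 0.5, 0.8):
    print(t, label_from_score(score, t))
```

Raising the threshold generally trades recall for precision on the positive class, so it is worth tuning on your own data rather than keeping 0.5 by default.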
Bundle contents
The joblib bundle contains:
- model: trained XGBoost booster
- feature_cols: ordered list of feature names expected by the model
- url_col: original URL column name
- label_col: label column name used in training
- model_type: string identifying the backend (here: xgboost_bst)
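Because feature_cols is an ordered list, features must be passed to the model in exactly that order. A minimal sketch of checking a loaded bundle and arranging a feature dict accordingly is shown below. It assumes the bundle behaves like a dict with the keys listed above; check_bundle and to_feature_row are hypothetical helpers, and the stand-in bundle and feature names are invented for the demo.

```python
REQUIRED_KEYS = ("model", "feature_cols", "url_col", "label_col", "model_type")


def check_bundle(bundle: dict) -> None:
    """Raise KeyError if any documented key is missing from the bundle."""
    missing = [key for key in REQUIRED_KEYS if key not in bundle]
    if missing:
        raise KeyError(f"bundle is missing keys: {missing}")


def to_feature_row(features: dict, feature_cols: list) -> list:
    """Arrange feature values in the column order the model expects."""
    missing = [col for col in feature_cols if col not in features]
    if missing:
        raise KeyError(f"features are missing columns: {missing}")
    return [float(features[col]) for col in feature_cols]


# Stand-in bundle for illustration; a real one would come from the joblib file.
demo_bundle = {
    "model": None,
    "feature_cols": ["url_length", "digit_ratio"],  # made-up names
    "url_col": "URL",
    "label_col": "label",
    "model_type": "xgboost_bst",
}
check_bundle(demo_bundle)
row = to_feature_row({"digit_ratio": 0.05, "url_length": 61}, demo_bundle["feature_cols"])
print(row)  # [61.0, 0.05]
```

Failing loudly on a missing column is deliberate: silently filling a default would let a feature mismatch go unnoticed at inference time.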
License
This model is provided for research and educational purposes only. Evaluate thoroughly before use in production.