Random Forest / XGBoost Model for URL Phishing Detection

This repository contains a trained tree-based classifier for detecting phishing URLs. The model was trained on PhiUSIIL_Phishing_URL_Dataset.csv using lightweight, URL-only lexical and structural features. On the held-out test split it reached an accuracy of 0.9952 and an F1 score of 0.9958.

Highlights

Backend: gradient-boosted trees via XGBoost (uses GPU if available; falls back to CPU).
Input: raw URL string only (no external DNS/WHOIS calls needed).
Features: length, character counts, digit ratio, IPv4 presence, common phishing tokens, scheme/TLD heuristics.
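To make the feature families above concrete, here is a minimal sketch of URL-only feature extraction using just the Python standard library. The function name, the feature names, and the token list are hypothetical and chosen for illustration; the actual feature set is defined by feature_cols in the bundle and may differ.

```python
import re
from urllib.parse import urlparse

# Hypothetical token list; the tokens used in training are not documented here.
SUSPICIOUS_TOKENS = ("login", "secure", "account", "update", "verify")

# Matches a dotted-quad host such as 192.168.0.1 (does not validate octet ranges).
IPV4_RE = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")


def extract_url_features(url: str) -> dict:
    """Compute illustrative lexical and structural features from a raw URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    lowered = url.lower()
    n_digits = sum(ch.isdigit() for ch in url)
    is_ipv4 = bool(IPV4_RE.match(host))
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "digit_ratio": n_digits / len(url) if url else 0.0,
        "has_ipv4_host": int(is_ipv4),
        "num_suspicious_tokens": sum(tok in lowered for tok in SUSPICIOUS_TOKENS),
        "is_https": int(parsed.scheme == "https"),
        # Last host label, left empty for IP hosts and single-label hosts.
        "tld": host.rsplit(".", 1)[-1] if "." in host and not is_ipv4 else "",
    }


features = extract_url_features(
    "http://secure-login-account-update.example.com/session?id=123"
)
print(features)
```

Everything here is computed from the string itself, which is what keeps inference free of DNS or WHOIS lookups.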

Test metrics (from notebook)

accuracy: 0.9952
precision: 0.9928
recall: 0.9989
f1: 0.9958
roc_auc: 0.9976
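As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported f1 can be recomputed from the two values above. This is only arithmetic on the published numbers, not a re-evaluation of the model.

```python
precision = 0.9928
recall = 0.9989

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9958
```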

Files

rf_url_phishing_xgboost_bst.joblib: joblib bundle with the trained model and metadata.
inference.py: helpers to load the bundle and run predict_url().
requirements.txt: minimal dependencies for local inference.

Quick start (local)

Install dependencies

pip install -r requirements.txt

Predict a single URL

from inference import load_bundle, predict_url

bundle = load_bundle("rf_url_phishing_xgboost_bst.joblib")
result = predict_url(
    url="http://secure-login-account-update.example.com/session?id=123",
    bundle=bundle,
    threshold=0.5,
)
print(result)
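The threshold argument presumably sets the cut-off applied to the model's score. The sketch below shows that kind of decision rule in isolation; flag_url is a hypothetical helper, not part of inference.py, and it assumes a score in [0, 1] where higher means more likely phishing. Check the label encoding used in training before relying on that polarity.

```python
def flag_url(score: float, threshold: float = 0.5) -> bool:
    """Return True when the score meets or exceeds the threshold (assumed rule)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return score >= threshold


scores = [0.12, 0.50, 0.97]
flags_default = [flag_url(s) for s in scores]                # threshold=0.5
flags_strict = [flag_url(s, threshold=0.9) for s in scores]  # fewer URLs flagged
print(flags_default, flags_strict)
```

Raising the threshold flags fewer URLs, which generally trades recall for precision.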

Bundle contents

The joblib bundle contains:

model: trained XGBoost booster
feature_cols: ordered list of feature names expected by the model
url_col: original URL column name
label_col: label column name used in training
model_type: string identifying the backend (here: xgboost_bst)
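A loaded bundle can be sanity-checked against the keys listed above before it is used for prediction. The sketch below assumes the bundle is a plain dict; validate_bundle is a hypothetical helper, and the stand-in values (including the column names) are placeholders rather than the real bundle contents.

```python
EXPECTED_KEYS = ("model", "feature_cols", "url_col", "label_col", "model_type")


def validate_bundle(bundle: dict) -> None:
    """Raise if the bundle lacks an expected key or has no feature columns."""
    missing = [key for key in EXPECTED_KEYS if key not in bundle]
    if missing:
        raise KeyError(f"bundle is missing keys: {missing}")
    if not bundle["feature_cols"]:
        raise ValueError("feature_cols must list at least one feature")


# Stand-in with placeholder values; a real bundle would come from
# joblib.load("rf_url_phishing_xgboost_bst.joblib").
example_bundle = {
    "model": object(),
    "feature_cols": ["url_length", "digit_ratio"],
    "url_col": "URL",
    "label_col": "label",
    "model_type": "xgboost_bst",
}
validate_bundle(example_bundle)
print("bundle ok:", example_bundle["model_type"])
```

Because joblib files are pickle-based, loading one can execute arbitrary code, so only load bundles from sources you trust.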

License

This model is provided for research and educational purposes only. Evaluate thoroughly before use in production.