Random Forest / XGBoost Model for URL Phishing Detection

This repository contains a trained tree-based classifier for detecting phishing URLs. The model was trained on PhiUSIIL_Phishing_URL_Dataset.csv using lightweight, URL-only lexical and structural features. On the held-out test split it reached 0.9952 accuracy and 0.9958 F1.

Highlights

  • Backend: gradient-boosted trees via XGBoost (uses GPU if available; falls back to CPU).
  • Input: raw URL string only (no external DNS/WHOIS calls needed).
  • Features: length, character counts, digit ratio, IPv4 presence, common phishing tokens, scheme/TLD heuristics.
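
As a rough illustration of the feature families listed above, the sketch below derives a few of them from a raw URL string. The feature names, the token list, and the function name extract_url_features are assumptions made for this example; the actual columns used in training are the ones stored in the bundle's feature_cols and may differ.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical token list; the tokens used in training are not documented here.
SUSPICIOUS_TOKENS = ("login", "secure", "update", "verify", "account")


def extract_url_features(url: str) -> dict:
    """Compute a few illustrative URL-only features (names are assumptions)."""
    parts = urlsplit(url)
    host = parts.hostname or ""

    # IPv4 presence: does the host parse as a literal IPv4 address?
    try:
        has_ipv4 = isinstance(ipaddress.ip_address(host), ipaddress.IPv4Address)
    except ValueError:
        has_ipv4 = False

    digits = sum(ch.isdigit() for ch in url)
    lowered = url.lower()

    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "digit_ratio": digits / len(url) if url else 0.0,
        "has_ipv4_host": int(has_ipv4),
        "num_suspicious_tokens": sum(tok in lowered for tok in SUSPICIOUS_TOKENS),
        "is_https": int(parts.scheme == "https"),
        # Last host label as a crude TLD heuristic (empty for IP hosts).
        "tld": host.rsplit(".", 1)[-1] if "." in host and not has_ipv4 else "",
    }


features = extract_url_features(
    "http://secure-login-account-update.example.com/session?id=123"
)
print(features)
```

Everything here uses only the standard library and makes no network calls, which matches the URL-only design described above.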

Test metrics (from notebook)

  • accuracy: 0.9952
  • precision: 0.9928
  • recall: 0.9989
  • f1: 0.9958
  • roc_auc: 0.9976
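
Because F1 is the harmonic mean of precision and recall, the reported figures can be cross-checked against each other. The short sketch below recomputes F1 from the precision and recall listed above; it only verifies internal consistency and does not reproduce the notebook's evaluation.

```python
# Values copied from the test metrics listed above.
precision = 0.9928
recall = 0.9989

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.9958, matching the reported f1
```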

Files

  • rf_url_phishing_xgboost_bst.joblib: joblib bundle with the trained model and metadata.
  • inference.py: helpers to load the bundle and run predict_url().
  • requirements.txt: minimal dependencies for local inference.

Quick start (local)

  1. Install dependencies
pip install -r requirements.txt
  2. Predict a single URL
from inference import load_bundle, predict_url

bundle = load_bundle("rf_url_phishing_xgboost_bst.joblib")
result = predict_url(
    url="http://secure-login-account-update.example.com/session?id=123",
    bundle=bundle,
    threshold=0.5,
)
print(result)

Bundle contents

The joblib bundle contains:

  • model: trained XGBoost booster
  • feature_cols: ordered list of feature names expected by the model
  • url_col: original URL column name
  • label_col: label column name used in training
  • model_type: string identifying the backend (here: xgboost_bst)
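
Since feature_cols is an ordered list, any code that builds model input by hand has to emit values in exactly that order. The sketch below shows one way to check a bundle for the documented keys and to order a feature dictionary accordingly. It uses a stand-in dictionary rather than the real file, and the feature names, column names, and the helpers check_bundle and to_feature_row are hypothetical, not part of inference.py.

```python
# Stand-in bundle with the documented keys. A real bundle would come from
# load_bundle("rf_url_phishing_xgboost_bst.joblib"); values here are made up.
bundle = {
    "model": None,  # placeholder for the trained XGBoost booster
    "feature_cols": ["url_length", "num_dots", "digit_ratio"],  # hypothetical
    "url_col": "URL",  # hypothetical
    "label_col": "label",  # hypothetical
    "model_type": "xgboost_bst",
}

REQUIRED_KEYS = {"model", "feature_cols", "url_col", "label_col", "model_type"}


def check_bundle(bundle: dict) -> None:
    """Raise KeyError if any documented key is absent from the bundle."""
    missing = REQUIRED_KEYS - set(bundle)
    if missing:
        raise KeyError(f"bundle is missing keys: {sorted(missing)}")


def to_feature_row(features: dict, feature_cols: list) -> list:
    """Return feature values as floats, ordered to match feature_cols."""
    missing = [col for col in feature_cols if col not in features]
    if missing:
        raise KeyError(f"features are missing columns: {missing}")
    return [float(features[col]) for col in feature_cols]


check_bundle(bundle)
row = to_feature_row(
    {"digit_ratio": 0.05, "url_length": 61, "num_dots": 2},
    bundle["feature_cols"],
)
print(row)  # [61.0, 2.0, 0.05]
```

Ordering by feature_cols rather than by dictionary insertion order guards against silently feeding the model misaligned columns.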

License

This model is provided for research and educational purposes only. Evaluate thoroughly before use in production.