Perth0603 committed · Commit b0ecf99 · verified · 1 Parent(s): ab41a73

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +52 -0
README.md ADDED
@@ -0,0 +1,52 @@
Random Forest / XGBoost Model for URL Phishing Detection

This repository contains a trained tree-based classifier for detecting phishing URLs. The model was trained on `PhiUSIIL_Phishing_URL_Dataset.csv` using lightweight, URL-only lexical and structural features. On the held-out test split it reached an accuracy of 0.9952 and an F1 of 0.9958.

Highlights
- Backend: gradient-boosted trees via XGBoost (uses GPU if available; falls back to CPU).
- Input: raw URL string only (no external DNS/WHOIS calls needed).
- Features: length, character counts, digit ratio, IPv4 presence, common phishing tokens, scheme/TLD heuristics.

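The feature families listed above can be illustrated with a minimal sketch that uses only the Python standard library. This is not the feature code shipped in `inference.py`: the function name `extract_url_features`, the feature names, and the token list `SUSPICIOUS_TOKENS` are all hypothetical, since the README does not enumerate the exact features or tokens used in training.

```python
import re
from urllib.parse import urlsplit

# Hypothetical token list; the tokens used by the real model are not documented here.
SUSPICIOUS_TOKENS = ("login", "secure", "account", "update", "verify")

# Matches a host made of four dot-separated groups of 1-3 digits (loose IPv4 check).
IPV4_RE = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")


def extract_url_features(url: str) -> dict:
    """Compute illustrative URL-only lexical and structural features."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    lowered = url.lower()
    digits = sum(ch.isdigit() for ch in url)
    is_ip = bool(IPV4_RE.match(host))
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": digits,
        "digit_ratio": digits / len(url) if url else 0.0,
        "has_ipv4_host": int(is_ip),
        "num_suspicious_tokens": sum(tok in lowered for tok in SUSPICIOUS_TOKENS),
        "is_https": int(parts.scheme == "https"),
        # A TLD is only meaningful for named hosts, not for raw IP addresses.
        "tld": "" if is_ip or "." not in host else host.rsplit(".", 1)[-1],
    }


features = extract_url_features(
    "http://secure-login-account-update.example.com/session?id=123"
)
print(features)
```

Everything here is computed from the URL string alone, which matches the "no external DNS/WHOIS calls" constraint. A real pipeline would also have to emit the features under the names and in the order stored in the bundle's `feature_cols`.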
Test metrics (from notebook)
- accuracy: 0.9952
- precision: 0.9928
- recall: 0.9989
- f1: 0.9958
- roc_auc: 0.9976

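As a quick sanity check, the reported F1 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from_precision_recall` is illustrative and is not part of this repository.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the metrics list above.
reported = {"precision": 0.9928, "recall": 0.9989, "f1": 0.9958}

computed = f1_from_precision_recall(reported["precision"], reported["recall"])
print(round(computed, 4))
```

The recomputed value rounds to 0.9958, which agrees with the reported F1 to four decimal places.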
Files
- `rf_url_phishing_xgboost_bst.joblib`: joblib bundle with the trained model and metadata.
- `inference.py`: helpers to load the bundle and run `predict_url()`.
- `requirements.txt`: minimal dependencies for local inference.

Quick start (local)
1) Install dependencies
```bash
pip install -r requirements.txt
```

2) Predict a single URL
```python
from inference import load_bundle, predict_url

# Load the trained model and its metadata from the joblib bundle.
bundle = load_bundle("rf_url_phishing_xgboost_bst.joblib")

# Score a single raw URL string.
result = predict_url(
    url="http://secure-login-account-update.example.com/session?id=123",
    bundle=bundle,
    threshold=0.5,
)
print(result)
```

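The return format of `predict_url()` is not documented here. The sketch below only illustrates what a `threshold` argument typically does, under the assumption that the model outputs a score in [0, 1] that is the probability of the phishing class. The README does not state which class the score refers to, so check the label encoding of the training data before relying on this. The function `label_from_probability` is hypothetical.

```python
def label_from_probability(p_phishing: float, threshold: float = 0.5) -> str:
    """Map an assumed phishing probability to a label using a decision threshold."""
    if not 0.0 <= p_phishing <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return "phishing" if p_phishing >= threshold else "legitimate"


# The same score can flip labels as the threshold moves.
for threshold in (0.3, 0.5, 0.8):
    print(threshold, label_from_probability(0.62, threshold))
```

Under this assumption, raising the threshold trades recall for precision (fewer URLs flagged, fewer false alarms), and lowering it does the opposite.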
Bundle contents
The joblib bundle contains:
- `model`: trained XGBoost booster
- `feature_cols`: ordered list of feature names expected by the model
- `url_col`: original URL column name
- `label_col`: label column name used in training
- `model_type`: string identifying the backend (here: `xgboost_bst`)

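A minimal sketch of how the bundle keys listed above might be checked and used, assuming the bundle loads as a plain `dict`. The helpers `missing_bundle_keys` and `order_features` are hypothetical, and the stand-in bundle uses made-up feature and column names so that the example runs without the model file. In practice the dict would come from `load_bundle(...)` or `joblib.load(...)`.

```python
EXPECTED_KEYS = ("model", "feature_cols", "url_col", "label_col", "model_type")


def missing_bundle_keys(bundle: dict) -> list:
    """Return the expected keys that are absent from the bundle."""
    return [key for key in EXPECTED_KEYS if key not in bundle]


def order_features(features: dict, feature_cols: list) -> list:
    """Arrange a feature dict into the column order the model expects."""
    missing = [col for col in feature_cols if col not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [features[col] for col in feature_cols]


# Stand-in bundle with hypothetical values; the real one holds a trained booster.
bundle = {
    "model": None,
    "feature_cols": ["url_length", "num_digits", "has_ipv4_host"],
    "url_col": "URL",
    "label_col": "label",
    "model_type": "xgboost_bst",
}

print(missing_bundle_keys(bundle))
row = order_features(
    {"num_digits": 3, "has_ipv4_host": 0, "url_length": 61},
    bundle["feature_cols"],
)
print(row)
```

Keeping `feature_cols` in the bundle matters because a booster consumes a positional feature vector, so features supplied in a different order would be silently misinterpreted.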
License
This model is provided for research and educational purposes only. Evaluate it thoroughly before using it in production.