| language: en | |
| license: other | |
| tags: | |
| - security | |
| - phishing-detection | |
| - url-classification | |
| - xgboost | |
| # XGBoost Model for URL Phishing Detection | |
| ## Model Details | |
| - Architecture: Gradient-boosted decision trees (XGBoost) | |
| - Input: Single URL string (no external queries) | |
| - Features: Lexical and structural URL features (lengths, symbol counts, digit ratio, IPv4 pattern, common phishing tokens, scheme/TLD heuristics) | |
| - Training data: `PhiUSIIL_Phishing_URL_Dataset.csv` | |
| - Intended use: Binary classification (phishing vs. legitimate) | |
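The feature families listed above can be illustrated with a minimal sketch. This is not the model's actual feature pipeline: the function name `extract_features`, the feature names, and the `SUSPICIOUS_TOKENS` list are all assumptions chosen for illustration, and the real token vocabulary and TLD heuristics may differ.

```python
import re
from urllib.parse import urlsplit

# Hypothetical token list; the tokens the trained model actually uses are not documented here.
SUSPICIOUS_TOKENS = ("login", "verify", "secure", "account", "update")

# Matches a dotted-quad host such as 192.168.0.1 (octet ranges are not validated).
IPV4_RE = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")


def extract_features(url: str) -> dict:
    """Compute illustrative lexical/structural features from a URL string only."""
    # urlsplit needs a scheme to locate the host, so add one if it is missing.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    lowered = url.lower()
    is_ipv4 = bool(IPV4_RE.match(host))
    digits = sum(ch.isdigit() for ch in url)
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_at": url.count("@"),
        "digit_ratio": digits / len(url) if url else 0.0,
        "has_ipv4_host": int(is_ipv4),
        "num_suspicious_tokens": sum(tok in lowered for tok in SUSPICIOUS_TOKENS),
        "is_https": int(parts.scheme == "https"),
        # An IP-address host has no TLD, so leave it empty in that case.
        "tld": "" if is_ipv4 or "." not in host else host.rsplit(".", 1)[-1],
    }


features = extract_features("http://192.168.0.1/login")
```

Every feature here is derived from the string alone, which matches the "no external queries" constraint: no DNS lookups, WHOIS checks, or page fetches are involved.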
| ## Metrics (test set) | |
| - Accuracy: 0.9952 | |
| - Precision: 0.9928 | |
| - Recall: 0.9989 | |
| - F1: 0.9958 | |
| - ROC-AUC: 0.9976 | |
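As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the two values above. The helper name `f1_from_precision_recall` is only illustrative.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the metrics list above.
reported_precision = 0.9928
reported_recall = 0.9989

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(round(f1, 4))  # 0.9958
```

The recomputed value agrees with the reported F1 to four decimal places.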
| ## Usage | |
| See `README.md` and `inference.py` for how to load the model and call `predict_url()`. | |
| ## Limitations and Biases | |
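| - Because the model sees only the URL string, attackers can evade it, for example by hosting phishing pages on compromised legitimate domains or behind URL shorteners. | |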
| - Dataset shifts and novel TLDs may degrade performance. | |
| - Always validate on your own traffic before deployment. | |
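One way to validate on your own traffic is to score a labeled sample and compare the results with the reported test metrics. The sketch below is a generic harness, not part of this repository: `evaluate` and `toy_predict` are hypothetical names, and the label convention (1 = phishing, 0 = legitimate) is an assumption that should be matched to whatever your predictor actually returns.

```python
from typing import Callable, Iterable, Tuple


def evaluate(predict: Callable[[str], int],
             labeled_urls: Iterable[Tuple[str, int]]) -> dict:
    """Score a predictor on (url, label) pairs; assumes 1 = phishing, 0 = legitimate."""
    tp = fp = fn = tn = 0
    for url, label in labeled_urls:
        pred = predict(url)
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }


def toy_predict(url: str) -> int:
    """Stand-in classifier for demonstration; replace with a wrapper around the real model."""
    return int("login" in url.lower())


sample = [
    ("http://a.example/login", 1),
    ("https://b.example/", 0),
    ("http://c.example/login", 0),
    ("http://d.example/pay", 1),
]
report = evaluate(toy_predict, sample)
```

Tracking false positives and false negatives separately is useful here, since the cost of blocking a legitimate URL and the cost of missing a phishing URL usually differ between deployments.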
| ## License | |
| Provided for research/educational purposes. Ensure compliance with local laws and organizational policies. | |