### **SecureBERT Phishing Detection Model**

This repository hosts a fine-tuned **SecureBERT** model for **phishing URL detection**, trained on a cybersecurity dataset. The model classifies each URL as either **phishing (malicious)** or **safe (benign)**.

---

## **Model Details**

- **Model Architecture**: SecureBERT (a RoBERTa-based model pretrained on cybersecurity text)
- **Task**: Binary Classification (Phishing vs. Safe)
- **Dataset**: shashwatwork/web-page-phishing-detection-dataset (11,431 URLs, 88 features)
- **Framework**: PyTorch & Hugging Face Transformers
- **Input Data**: URL strings & extracted numerical features
- **Number of Classes**: 2 (**Phishing, Safe**)
- **Quantization**: FP16 half precision (for efficiency)

---

## **Usage**

### **Installation**

```bash
pip install torch transformers scikit-learn pandas
```

### **Loading the Model**

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer from a local directory
model_path = "./fine_tuned_SecureBERT"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()  # Set model to evaluation mode (disables dropout)

print("SecureBERT model loaded successfully and ready for inference!")
```

---

### **Perform Phishing Detection**

```python
def predict_url(url):
    """Classify a single URL string as "Phishing" or "Safe"."""
    # Tokenize input, truncating to at most 512 tokens
    encoding = tokenizer(url, truncation=True, padding=True, max_length=512, return_tensors="pt")

    # Perform inference without tracking gradients
    with torch.no_grad():
        output = model(**encoding)

    # Get predicted class index from the logits
    predicted_class = torch.argmax(output.logits, dim=1).item()

    # Map label (class 1 = phishing, anything else = safe)
    label = "Phishing" if predicted_class == 1 else "Safe"
    return label


# Example usage
custom_url = "http://example.com/free-gift"
prediction = predict_url(custom_url)
print(f"Predicted label: {prediction}")
```
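
`predict_url` returns only the argmax label. Where a tunable decision threshold is useful (for example, to trade some recall for fewer false positives), the two logits can first be converted to a phishing probability. The following is a minimal sketch in plain Python, not part of this repository: `phishing_probability`, `classify`, and the threshold values are illustrative, and it assumes, as in the label mapping above, that index 1 is the phishing class. The logits for one URL could be taken from `output.logits[0].tolist()`.

```python
import math


def phishing_probability(logits):
    """Softmax over two logits; returns the probability of class 1 (phishing)."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def classify(logits, threshold=0.5):
    """Label as phishing when the class-1 probability exceeds the threshold."""
    return "Phishing" if phishing_probability(logits) > threshold else "Safe"


# Example with made-up logits in [safe, phishing] order
example_logits = [-1.2, 2.3]
print(classify(example_logits))
print(classify(example_logits, threshold=0.99))
```
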

---

## **Evaluation Results**

After fine-tuning, the model was evaluated on a **test set**, with the following results:

| **Metric** | **Score** |
|------------------|-----------|
| **Accuracy** | 97.2% |
| **Precision** | 96.8% |
| **Recall** | 97.5% |
| **F1-Score** | 97.1% |
| **Inference Speed** | Fast (FP16 weights; no latency figure reported) |

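
For reference, the metrics in the table relate to confusion-matrix counts as sketched below. This is plain Python with phishing treated as the positive class; `classification_metrics` is an illustrative helper, and the counts in the example are made up, not the model's actual test results. The last lines cross-check the reported F1-score against the reported precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only
print(classification_metrics(tp=90, fp=10, fn=5, tn=95))

# F1 is the harmonic mean of precision and recall, so the reported
# values can be cross-checked against each other.
reported_precision, reported_recall = 0.968, 0.975
f1_from_table = (
    2 * reported_precision * reported_recall / (reported_precision + reported_recall)
)
print(f"{f1_from_table:.3f}")  # 0.971, consistent with the reported 97.1%
```
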

---

## **Fine-Tuning Details**

### **Dataset**

The model was trained on the **shashwatwork/web-page-phishing-detection-dataset**, which contains **11,431 URLs** labeled as either **phishing** or **safe**. Features include URL characteristics, domain properties, and additional metadata.

### **Training Configuration**

- **Number of epochs**: 5
- **Batch size**: 16
- **Optimizer**: AdamW
- **Learning rate**: 2e-5
- **Loss Function**: Cross-Entropy
- **Evaluation Strategy**: Validation at each epoch
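
The configuration above fixes the batch size and epoch count, so the number of optimizer updates follows from the size of the training split. The split is not stated here, so the sketch below assumes an 80/20 train/test split and no gradient accumulation, purely for illustration. `TrainConfig` and `total_steps` are hypothetical names, not part of the training code.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Values taken from the configuration listed above
    epochs: int = 5
    batch_size: int = 16
    learning_rate: float = 2e-5


def total_steps(num_train_examples, config):
    """Optimizer updates over all epochs, counting a final partial batch per epoch."""
    steps_per_epoch = math.ceil(num_train_examples / config.batch_size)
    return steps_per_epoch * config.epochs


# Assumption: 80% of the 11,431 examples are used for training
num_train = int(11_431 * 0.8)
print(num_train, total_steps(num_train, TrainConfig()))
```
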

### **Quantization**

The model was converted to **FP16 (half-precision floating point)**, reducing latency and memory usage while maintaining high accuracy.

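
To make the memory saving concrete: FP32 stores each parameter in 4 bytes and FP16 in 2, so weight storage halves. The sketch below assumes roughly 125 million parameters; that figure is an assumption for illustration, since the exact count for this checkpoint is not stated here. It covers weights only, not activations. In PyTorch a loaded model can be cast with `model.half()`, though whether this checkpoint was produced that way is not stated.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_mib(num_params, dtype):
    """Approximate memory for model weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


# Assumption: about 125M parameters (not a figure from this repository)
assumed_params = 125_000_000
for dtype in ("fp32", "fp16"):
    print(dtype, round(weight_memory_mib(assumed_params, dtype)), "MiB")
```
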

---

## **Limitations**

- **Evasion Techniques**: Attackers constantly evolve their phishing techniques, which may reduce the model's effectiveness over time.
- **Dataset Bias**: The model was trained on a single dataset; new phishing tactics may require retraining.
- **False Positives**: Some legitimate but unusual URLs may be classified as phishing.

---

**Use this fine-tuned SecureBERT model for accurate and efficient phishing detection!**