---
license: mit
tags:
- phishing-detection
- url-classification
- text-classification
- roberta
task: text-classification
datasets:
- custom
---

# URL Phishing Classifier

This model is fine-tuned to classify URLs as phishing (1) or safe (0).

## Model Description

This model is **roberta-base** fine-tuned for binary phishing detection on URLs.

## Training Details

- **Base Model**: roberta-base
- **Training Samples**: 1,629,193
- **Validation Samples**: 325,839
- **Test Samples**: 217,226
- **Epochs**: 5
- **Batch Size**: 24
- **Learning Rate**: 2e-05
- **Max Length**: 256 tokens

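
As a rough sanity check on the training figures, the sketch below computes each split's share of the data and estimates the number of optimizer steps. The step estimate assumes a single device, no gradient accumulation, and that the final partial batch is kept; the card states none of these, so treat the step counts as an approximation rather than the actual training schedule.

```python
import math

# Figures copied from the Training Details list.
TRAIN_SAMPLES = 1_629_193
VAL_SAMPLES = 325_839
TEST_SAMPLES = 217_226
BATCH_SIZE = 24
EPOCHS = 5


def split_shares(train, val, test):
    """Return each split's fraction of the combined dataset."""
    total = train + val + test
    return {"train": train / total, "val": val / total, "test": test / total}


def estimate_steps(samples, batch_size, epochs):
    """Estimate steps per epoch and total steps.

    Assumption (not from the card): one device, no gradient
    accumulation, last partial batch kept.
    """
    steps_per_epoch = math.ceil(samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


shares = split_shares(TRAIN_SAMPLES, VAL_SAMPLES, TEST_SAMPLES)
steps_per_epoch, total_steps = estimate_steps(TRAIN_SAMPLES, BATCH_SIZE, EPOCHS)

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```

Under these assumptions the split comes out to roughly 75% / 15% / 10% for train, validation, and test.
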
## Evaluation Results

### Test Set Metrics

- **Loss**: 0.1483
- **Accuracy**: 0.9463
- **F1**: 0.9262
- **Precision**: 0.9259
- **Recall**: 0.9264
- **ROC AUC**: 0.9890
- **True Positives**: 73,116
- **True Negatives**: 132,450
- **False Positives**: 5,851
- **False Negatives**: 5,809
- **Runtime**: 142.5284 s
- **Samples Per Second**: 1524.09
- **Steps Per Second**: 31.755
- **Epoch**: 5

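
The accuracy, precision, recall, and F1 values reported for the test set follow directly from the confusion-matrix counts. The sketch below recomputes them using the standard definitions, with phishing (1) as the positive class; the function name `confusion_metrics` is illustrative and not part of any library.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute standard binary-classification metrics from raw counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "total": total,
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        # F1 is the harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }


# Test-set counts from the list above.
test_metrics = confusion_metrics(tp=73116, tn=132450, fp=5851, fn=5809)

for name in ("accuracy", "precision", "recall", "f1"):
    print(f"{name}: {test_metrics[name]:.4f}")
print(f"total samples: {test_metrics['total']}")
```

The four counts sum to 217,226, matching the test-set size, and the recomputed metrics agree with the reported values to four decimal places. The same function can be applied to any other set of counts.
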
### Validation Set Metrics

- **Loss**: 0.1483
- **Accuracy**: 0.9455
- **F1**: 0.9250
- **Precision**: 0.9246
- **Recall**: 0.9255
- **ROC AUC**: 0.9888
- **True Positives**: 109,566
- **True Negatives**: 198,511
- **False Positives**: 8,940
- **False Negatives**: 8,822
- **Runtime**: 195.9861 s
- **Samples Per Second**: 1662.561
- **Steps Per Second**: 34.64
- **Epoch**: 5

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "nhellyercreek/url-phishing-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example inference; max_length matches the training setting (256)
url = "https://example.com/login"  # replace with the URL to classify
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Get the most likely class and its probability
predicted_class = probabilities.argmax(dim=-1).item()
confidence = probabilities[0][predicted_class].item()

print(f"Predicted class: {predicted_class} (phishing=1, safe=0)")
print(f"Confidence: {confidence:.4f}")
```

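
Taking the argmax over two classes is equivalent to flagging a URL when its phishing probability is at least 0.5. If false positives and false negatives carry different costs, a different cutoff may be preferable. The sketch below shows one way to apply a custom threshold to a pair of logits in plain Python; the function names, the example logits, and the 0.9 threshold are all hypothetical, and the card reports metrics only for the default argmax decision.

```python
import math


def phishing_probability(logits):
    """Softmax over [safe, phishing] logits; returns P(phishing)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


def classify(logits, threshold=0.5):
    """Return (label, probability), with label 1 = phishing, 0 = safe."""
    prob = phishing_probability(logits)
    return (1 if prob >= threshold else 0), prob


# Hypothetical logits, e.g. from outputs.logits[0].tolist()
example_logits = [0.0, 2.0]

default_label, prob = classify(example_logits)
strict_label, _ = classify(example_logits, threshold=0.9)

print(f"P(phishing) = {prob:.4f}")
print(f"label at 0.5: {default_label}, label at 0.9: {strict_label}")
```

Raising the threshold trades recall for precision, so any cutoff other than 0.5 should be chosen on held-out data rather than taken from this sketch.
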
## Limitations

This model was trained on a custom dataset and may not generalize to all types of phishing attempts. Even on its own test set it missed 5,809 phishing URLs and flagged 5,851 safe URLs as phishing, so pair it with additional security measures in production environments.

## Citation

If you use this model, please cite:

```bibtex
@misc{nhellyercreek_url_phishing_classifier,
  title={URL Phishing Classifier},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nhellyercreek/url-phishing-classifier}}
}
```