---
pipeline_tag: text-classification
license: apache-2.0
base_model: bert-large-uncased
tags:
- generated_from_trainer
- phishing
- BERT
- cybersecurity
- text-classification
metrics:
- accuracy
- precision
- recall
model-index:
- name: phishing-email-detector-capstone
  results: []
widget:
- text: https://www.verif22.com
  example_title: Phishing URL
- text: >
    Dear colleague,
    An important update about your email has exceeded your storage limit.
    You will not be able to send or receive messages until you reactivate your account.
    We will close all older versions of our Mailbox as of Friday, June 12, 2023.
    To activate and complete the required information, click here (https://ec-ec.squarespace.com).
    Your account must be reactivated today to regenerate new space.
    - Management Team
  example_title: Phishing Email
- text: >
    You have access to FREE Video Streaming in your plan.
    REGISTER with your email and password, then select the monthly subscription option.
    https://bit.ly/3vNrU5r
  example_title: Phishing SMS
- text: >
    if(data.selectedIndex > 0){$('#hidCflag').val(data.selectedData.value);};
    var sprypassword1 = new Spry.Widget.ValidationPassword("sprypassword1");
    var sprytextfield1 = new Spry.Widget.ValidationTextField("sprypassword1", "email");
  example_title: Phishing Script
- text: Hi, this model is really accurate :)
  example_title: Benign Message
language:
- en
---

# 🧠 Phishing Detection Model (BERT-Large-Uncased)

A transformer-based model fine-tuned to detect **phishing content** across multiple formats, including **emails, URLs, SMS messages, and scripts**.
Built on **BERT-Large-Uncased**, it classifies text as *phishing* or *benign* and reaches an accuracy of 0.9717 in evaluation.

---

## 📌 Model Details

- **Base model:** `bert-large-uncased`
- **Architecture:** 24 layers • 1024 hidden size • 16 attention heads • ~336M parameters
- **License:** Apache 2.0
- **Language:** English
- **Pipeline tag:** `text-classification`

---

## 🧩 Model Description

This model was trained to identify phishing-related content by analyzing linguistic and structural patterns commonly found in malicious communications.
Because BERT's bidirectional attention reads each token in the context of the whole message, the model can flag phishing attempts even when the text looks legitimate and is well written.

### Key Features
- Detects **phishing attempts** in emails, URLs, SMS messages, and scripts
- Useful for **cybersecurity applications**, such as email gateways or web filtering systems
- Capable of identifying **varied phishing tactics** (impersonation, link manipulation, credential harvesting, etc.)

---

## 🎯 Intended Uses

**Recommended use cases:**
- Classify messages, emails, and URLs as *phishing* or *benign*
- Integrate into automated **security pipelines**, email filtering tools, or chat moderation systems
- Aid in **phishing research** or awareness programs

**Limitations:**
- May trigger **false positives** on legitimate content with financial or urgent language
- Optimized for **English text** only
- Should be part of a **multi-layered defense strategy**, not a standalone cybersecurity control
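
One way to act on the limitations above is to route predictions by confidence instead of blocking on the label alone, so that borderline cases reach a human reviewer. The sketch below is illustrative only: the label string `"phishing"`, both thresholds, and the action names are assumptions, not values published with this model. Check the model's `id2label` mapping for the actual label names before adapting it.

```python
# Hypothetical triage policy: labels, thresholds, and actions are assumptions.
PHISHING_LABEL = "phishing"   # assumed label name; verify via model.config.id2label
BLOCK_THRESHOLD = 0.98        # assumed: quarantine only when very confident
REVIEW_THRESHOLD = 0.60       # assumed: send uncertain cases to a human


def triage(prediction: dict) -> str:
    """Map one pipeline-style result ({'label': ..., 'score': ...}) to an action."""
    label = prediction["label"].lower()
    score = float(prediction["score"])
    if label != PHISHING_LABEL:
        return "deliver"
    if score >= BLOCK_THRESHOLD:
        return "quarantine"
    if score >= REVIEW_THRESHOLD:
        return "flag_for_review"
    # Low-confidence phishing prediction: let it through to other defenses.
    return "deliver"


for sample in (
    {"label": "phishing", "score": 0.995},
    {"label": "phishing", "score": 0.70},
    {"label": "benign", "score": 0.99},
):
    print(sample, "->", triage(sample))
```
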

---

## 📊 Evaluation Results

Metrics after the final training epoch (epoch 4), matching the last row of the training results table:

| Metric | Score |
|--------|--------|
| **Loss** | 0.1953 |
| **Accuracy** | 0.9717 |
| **Precision** | 0.9658 |
| **Recall** | 0.9670 |
| **False Positive Rate** | 0.0249 |
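
For readers less familiar with these metrics, the sketch below gives the standard definitions in terms of confusion-matrix counts. It assumes phishing is the positive class, which the card does not state explicitly. The counts are made up for illustration; they are not this model's confusion matrix, which is not reported here.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),            # flagged items that are truly phishing
        "recall": tp / (tp + fn),               # phishing items that were flagged
        "false_positive_rate": fp / (fp + tn),  # benign items wrongly flagged
    }


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```
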

---

## ⚙️ Training Details

### Hyperparameters
| Parameter | Value |
|------------|--------|
| **Learning rate** | 2e-05 |
| **Train batch size** | 16 |
| **Eval batch size** | 16 |
| **Seed** | 42 |
| **Optimizer** | Adam (β₁=0.9, β₂=0.999, ε=1e-08) |
| **LR scheduler** | Linear |
| **Epochs** | 4 |
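
To make the "Linear" scheduler entry concrete: a linear schedule decays the learning rate from its peak to zero over the run. The sketch below is a plain-Python illustration, not the training code. It assumes zero warmup steps (the card does not say whether warmup was used) and takes the total of 15464 steps from the final row of the training results.

```python
PEAK_LR = 2e-5        # learning rate from the hyperparameter table
TOTAL_STEPS = 15464   # final step reported in the training results
WARMUP_STEPS = 0      # assumption: the card does not mention warmup


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step under linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / max(1, TOTAL_STEPS - WARMUP_STEPS)


# Learning rate at the end of each epoch (3866 steps per epoch).
for epoch, step in enumerate([3866, 7732, 11598, 15464], start=1):
    print(f"end of epoch {epoch}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the learning rate has halved by the end of epoch 2 and reaches zero at the final step, so later epochs make progressively smaller updates.
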

### Training Results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | False Positive Rate |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:---------:|:------:|:-------------------:|
| 0.1487 | 1.0 | 3866 | 0.1454 | 0.9596 | 0.9709 | 0.9320 | 0.0203 |
| 0.0805 | 2.0 | 7732 | 0.1389 | 0.9691 | 0.9663 | 0.9601 | 0.0243 |
| 0.0389 | 3.0 | 11598 | 0.1779 | 0.9683 | 0.9778 | 0.9461 | 0.0156 |
| 0.0091 | 4.0 | 15464 | 0.1953 | 0.9717 | 0.9658 | 0.9670 | 0.0249 |
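
In the table above, validation loss is lowest at epoch 2 (0.1389) and rises afterward while training loss keeps falling, a pattern that often indicates overfitting, even though accuracy peaks at epoch 4. The sketch below compares the epochs using the reported numbers. F1 is not reported on this card; it is derived here from the precision and recall columns only.

```python
# (epoch, validation_loss, accuracy, precision, recall), copied from the table above
rows = [
    (1, 0.1454, 0.9596, 0.9709, 0.9320),
    (2, 0.1389, 0.9691, 0.9663, 0.9601),
    (3, 0.1779, 0.9683, 0.9778, 0.9461),
    (4, 0.1953, 0.9717, 0.9658, 0.9670),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_by_epoch = {epoch: f1_score(p, r) for epoch, _, _, p, r in rows}
lowest_loss_epoch = min(rows, key=lambda row: row[1])[0]
best_f1_epoch = max(f1_by_epoch, key=f1_by_epoch.get)

for epoch, f1 in f1_by_epoch.items():
    print(f"epoch {epoch}: derived F1 = {f1:.4f}")
print(f"lowest validation loss: epoch {lowest_loss_epoch}")
print(f"highest derived F1: epoch {best_f1_epoch}")
```
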

---

## 🧠 Example Inference

Try the model in Python using the `transformers` library:

```python
from transformers import pipeline

# Load the phishing detection model
# ("your-username" is a placeholder for the account that hosts the model)
classifier = pipeline("text-classification", model="your-username/phishing-email-detector-capstone")

# Example texts
examples = [
    "Dear colleague, your email storage is full. Click here to verify your account: https://secure-update-login.com",
    "Hi team, the meeting starts at 2 PM today.",
    "You have won a free gift card! Claim now at http://bit.ly/3xYzabc"
]

# Run inference
for text in examples:
    result = classifier(text)[0]
    print(f"Text: {text}\nPrediction: {result['label']} (score: {result['score']:.4f})\n")