---
license: cc
language:
- en
base_model:
- distilbert/distilbert-base-uncased
---
# Model Card for PhishingDistilBERT
## Model Summary
**PhishingDistilBERT** is a DistilBERT-based NLP model fine-tuned specifically for email understanding tasks, particularly phishing and suspicious email detection.
The model introduces **custom special tokens** to explicitly encode email structure such as subject, body, links, and phone numbers, making it more robust for email-based security applications.
It can be used both as:
- a **sequence classification model** for email safety detection, and
- an **embedding generator** for downstream ML pipelines (e.g., XGBoost).
---
## Model Details
### Model Description
This model is fine-tuned from `distilbert-base-uncased` on curated email datasets. During preprocessing, email-specific entities such as URLs and phone numbers are replaced with dedicated tokens, and the subject and body are explicitly separated using structural markers.
**Special Tokens Used**
- `[SSUB]`, `[ESUB]` – Start/End of Subject
- `[SBODY]`, `[EBODY]` – Start/End of Body
- `[LINK]` – URLs
- `[PHONE]` – Phone numbers
These design choices help the model better learn semantic and structural patterns commonly found in phishing emails.
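As an illustration, the structural markers can be applied with a small helper. This is only a sketch; `format_email` is a hypothetical name, not part of the released code:

```python
def format_email(subject: str, body: str) -> str:
    """Wrap an email's subject and body with the model's structural markers."""
    return f"[SSUB] {subject} [ESUB] [SBODY] {body} [EBODY]"

print(format_email("Urgent Account Alert", "Click [LINK] to verify your account."))
# -> [SSUB] Urgent Account Alert [ESUB] [SBODY] Click [LINK] to verify your account. [EBODY]
```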
- **Developed by:** Atharva Gaykar
- **Model type:** Transformer-based text classification & embedding model
- **Language:** English
- **License:** Artistic-2.0
- **Finetuned from:** distilbert/distilbert-base-uncased
---
## Intended Uses
### Primary Use Cases
- Phishing email classification
- Suspicious vs safe email detection
- Feature extraction for traditional ML models
- Email embedding generation for downstream classifiers
### Out-of-Scope Uses
- Non-text email analysis (images, attachments)
- Commercial deployment without proper evaluation and compliance
- Tasks unrelated to email or message-level text analysis
---
## Bias, Risks, and Limitations
- The model is trained on public phishing datasets and may reflect biases present in those sources.
- Performance may degrade on highly obfuscated or novel phishing techniques.
- Not recommended for direct commercial use without extensive validation.
Users should carefully evaluate the model in their target environment before deployment.
---
## How to Get Started
```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

bert_path = "Gaykar/PhishingDistilBERT"
tokenizer = DistilBertTokenizerFast.from_pretrained(bert_path)
model = DistilBertForSequenceClassification.from_pretrained(bert_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def get_cls_embedding(text, model, tokenizer, device):
    """Return the [CLS] embedding from the DistilBERT encoder as a NumPy array."""
    with torch.no_grad():
        inputs = tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=256,
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        # Run only the encoder, bypassing the classification head
        outputs = model.distilbert(**inputs)
        # Position 0 holds the [CLS] token's hidden state
        cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().cpu().numpy()
    return cls_embedding

text = "[SSUB] Urgent Account Alert [ESUB] [SBODY] Click [LINK] to verify your account. [EBODY]"
embedding = get_cls_embedding(text, model, tokenizer, device)
print("Embedding shape:", embedding.shape)
print("First 10 dimensions:", embedding[:10])
```
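When using the model as a classifier rather than an embedder, the head's raw logits can be turned into a label with a standard softmax. The snippet below is a sketch: the two-class ordering (`safe`, `suspicious`) is an assumption that should be checked against the model's `config.id2label`, and the logit values are made up for illustration.

```python
import numpy as np

def logits_to_prediction(logits, labels=("safe", "suspicious")):
    """Softmax over raw logits, then pick the highest-probability label."""
    exp = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    probs = exp / exp.sum()
    return labels[int(np.argmax(probs))], probs

# Dummy logits standing in for model(**inputs).logits
label, probs = logits_to_prediction(np.array([-1.2, 2.7]))
print(label, probs.round(3))  # -> suspicious [0.02 0.98]
```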
---
## Training Details
### Training Data
The model was trained using well-known phishing and email security datasets, including **CEAS**, combined with additional curated CSV sources.
### Data Preprocessing
1. Cleaned and merged multiple CSV datasets
2. Replaced:
* URLs → `[LINK]`
* Phone numbers → `[PHONE]`
3. Combined subject and body using structural tokens:
* `[SSUB]`, `[ESUB]`, `[SBODY]`, `[EBODY]`
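The entity-replacement step can be sketched with simple regular expressions. These patterns are illustrative only; the exact regexes used during training are not published:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")        # illustrative URL pattern
PHONE_RE = re.compile(r"\+?\d[\d\-\s()]{7,}\d")      # loose illustrative phone pattern

def mask_entities(text: str) -> str:
    """Replace URLs and phone numbers with the model's dedicated tokens."""
    text = URL_RE.sub("[LINK]", text)
    return PHONE_RE.sub("[PHONE]", text)

cleaned = mask_entities("Verify at http://example.com/a1 or call +1 800 555 0199.")
print(cleaned)  # -> Verify at [LINK] or call [PHONE].
```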
### Training Hyperparameters
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./distilbert_safe_suspicious",
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    learning_rate=4e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=0.01,
    logging_strategy="steps",
    logging_steps=50,
    seed=42,
)
```
---
## Evaluation
### Evaluation Metrics
* Accuracy
* F1 Score
### Testing Setup
* 10% held-out test split from the full dataset
### Results
* **DistilBERT (standalone):** Strong classification performance
* **DistilBERT embeddings + XGBoost + URL features:** **99.4% accuracy**
![Evaluation Result](https://cdn-uploads.huggingface.co/production/uploads/685998a37db0a027171ecb9f/Dr3okP_bmVOxHgeaqIQDM.png)
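The embeddings-plus-URL-features pipeline can be sketched as follows. This is a hedged illustration only: it uses random arrays in place of real CLS embeddings and scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost, so it shows how the feature blocks are concatenated, not how the reported accuracy was reached.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Stand-ins for real data: 100 emails, 768-dim CLS embeddings plus 2 URL features
n, dim = 100, 768
X_emb = rng.normal(size=(n, dim))          # would come from get_cls_embedding(...)
X_url = rng.integers(0, 5, size=(n, 2))    # e.g. link count, has-IP-URL flag (assumed features)
y = rng.integers(0, 2, size=n)             # 0 = safe, 1 = suspicious

# Concatenate embedding and hand-crafted feature blocks, then fit a boosted-tree model
X = np.hstack([X_emb, X_url])
clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)
print("feature matrix shape:", X.shape)
print("train accuracy:", clf.score(X, y))
```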
---
## Technical Specifications
### Model Architecture
* DistilBERT encoder
* Sequence classification head
* CLS-token embedding extraction supported
### Compute Infrastructure
* **Hardware:** NVIDIA T4 GPU
* **Frameworks:** PyTorch, Hugging Face Transformers
---
## Environmental Impact
Carbon emissions were not explicitly measured.
Users may estimate emissions using the Machine Learning Impact Calculator if needed.
---
## Model Card Authors
* **Atharva Gaykar**
---
## Contact
For questions, feedback, or research collaboration, please reach out via the Hugging Face model repository.
---