A fine-tuned BERT (bert-base-uncased) classifier for phishing email detection, developed as part of the NTUST MI5137701 course term project (Group 72).
This model classifies email text into two categories:
- phishing email (label 1)
- safe email (label 0)

The model is trained on body-only email text (headers stripped) so that it learns genuine phishing indicators rather than dataset artifacts.
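Body-only preprocessing of this kind can be sketched with Python's standard `email` package. This is a minimal illustration, not the project's actual preprocessing code: the helper name `extract_body` and the sample message are hypothetical, and it assumes the raw input is an RFC 5322 style message with headers separated from the body by a blank line.

```python
from email import message_from_string


def extract_body(raw_email: str) -> str:
    """Return only the plain-text body of a raw message, dropping all headers."""
    msg = message_from_string(raw_email)
    chunks = []
    # walk() yields the message itself and, for multipart mail, every sub-part
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # bytes, or None for containers
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks).strip()


# Hypothetical sample message, used only for illustration
raw = (
    "From: attacker@example.com\n"
    "Subject: Urgent\n"
    "\n"
    "Dear user, please verify your account.\n"
)
body = extract_body(raw)
print(body)
```

Only the text returned by such a helper would be passed to the tokenizer, so sender addresses, subject lines, and corpus-specific header fields never reach the model.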
| Parameter | Value |
|---|---|
| Base model | bert-base-uncased (110M params) |
| Architecture | BertForSequenceClassification |
| Labels | 2 (phishing email, safe email) |
| Learning rate | 2e-5 |
| Optimizer | AdamW |
| Loss | Cross-entropy |
| Epochs | 3 |
| Max sequence length | 512 tokens |
| Training dataset | drorrabin/phishing_emails-data (26,946 emails, 50/50 split) |
| Random seed | 42 |
All results are evaluated on a disjoint test corpus (Phishing_Email.csv, 18,460 emails) that has zero overlap with the training data.
| Metric | Value |
|---|---|
| Accuracy | 80.71% |
| F1 (macro) | 71.22% |
| Recall | 60.43% |
| Precision | 86.70% |
| ROC-AUC | 0.919 |
| ECE | 0.178 |
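For readers unfamiliar with ECE (expected calibration error), the sketch below shows the common equal-width binning definition: predictions are grouped by confidence, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The bin count of 10 is an assumption; the card does not state which binning was used for the 0.178 figure, and the toy inputs are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (n_b / N) * |acc_b - conf_b|."""
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hits[idx] += int(is_correct)
        counts[idx] += 1
    ece = 0.0
    for total_conf, bin_hits, count in zip(conf_sum, hits, counts):
        if count:
            accuracy = bin_hits / count
            mean_conf = total_conf / count
            ece += (count / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (not from the evaluation): 90% confident, but only 3 of 4 correct
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(f"{toy_ece:.3f}")
```

Under this definition, a value of 0.178 would mean that stated confidence and observed accuracy differ by roughly 18 percentage points on a size-weighted average across bins.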
| Metric | Value |
|---|---|
| Accuracy | 81.07% |
| F1 (macro) | 71.94% |
| Recall | 61.45% |
| Precision | 86.75% |
| Escalation rate | 1.21% |
| Amortized latency | ~0.36s/email |
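A quick arithmetic check on the two tables above: in both, the reported F1 equals the harmonic mean of the reported precision and recall. That is what one would expect if precision, recall, and F1 all describe the same class (presumably phishing, label 1), even though the F1 rows are labeled macro. This is an observation from the published numbers only, not a statement about how the evaluation script computes them.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall for a single class."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the two results tables
first_table_f1 = f1_from_precision_recall(0.8670, 0.6043)
second_table_f1 = f1_from_precision_recall(0.8675, 0.6145)

print(f"{first_table_f1:.4f}")   # compare with the reported 71.22%
print(f"{second_table_f1:.4f}")  # compare with the reported 71.94%
```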
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("harrynguyen5/phishing-bert-model")
model = AutoModelForSequenceClassification.from_pretrained("harrynguyen5/phishing-bert-model")

email_text = "Dear user, your account has been compromised. Click here to verify."

# Truncate to the 512-token maximum sequence length used in training
inputs = tokenizer(email_text, return_tensors="pt", truncation=True, max_length=512, padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities and pick the most likely class
probs = torch.softmax(outputs.logits, dim=-1)
prediction = torch.argmax(probs, dim=-1).item()
label = model.config.id2label[prediction]
confidence = probs[0][prediction].item()
print(f"Prediction: {label} (confidence: {confidence:.4f})")
```
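The usage code above picks the class with the highest probability, which for two classes amounts to a 0.5 cutoff on the phishing probability. Given the reported precision (86.70%) and lower recall (60.43%), a caller who cares more about missed phishing emails might apply a lower cutoff instead. The sketch below is one way to do that; the helper name `classify` and the threshold 0.3 are hypothetical, and any real threshold would need tuning on held-out data.

```python
def classify(phishing_prob: float, threshold: float = 0.5) -> int:
    """Return 1 (phishing email) if the probability meets the threshold, else 0 (safe email)."""
    return 1 if phishing_prob >= threshold else 0


# With the usage code above, the phishing probability would be probs[0][1].item(),
# assuming index 1 is the phishing class as stated in the label list.
phishing_prob = 0.4  # made-up value for illustration

print(classify(phishing_prob))                 # default 0.5 cutoff
print(classify(phishing_prob, threshold=0.3))  # hypothetical lower cutoff
```

Lowering the threshold generally raises recall at the cost of precision, so the trade-off should be checked against the intended deployment.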
ceas-challenge header). The cross-corpus results above are the reliable benchmark.

| Member | Student ID | Role |
|---|---|---|
| Bui The Hien | M11409806 | Dataset & Fine-Tuning |
| Le Trung Kien | M11415803 | Evaluation & Benchmarking |
| Nguyen Quoc Nguyen | M11409814 | Prompt Pipeline & Logic |