---
license: cc
language:
- en
base_model:
- distilbert/distilbert-base-uncased
---

# Model Card for PhishingDistilBERT

## Model Summary

**PhishingDistilBERT** is a DistilBERT-based NLP model fine-tuned for email understanding tasks, particularly phishing and suspicious-email detection. The model introduces **custom special tokens** that explicitly encode email structure such as the subject, body, links, and phone numbers, making it more robust for email-based security applications.

It can be used both as:

- a **sequence classification model** for email safety detection, and
- an **embedding generator** for downstream ML pipelines (e.g., XGBoost).

---

## Model Details

### Model Description

This model is fine-tuned from `distilbert-base-uncased` on curated email datasets. During preprocessing, email-specific entities such as URLs and phone numbers are replaced with dedicated tokens, and the subject and body are explicitly separated using structural markers.

**Special Tokens Used**

- `[SSUB]`, `[ESUB]` – start/end of subject
- `[SBODY]`, `[EBODY]` – start/end of body
- `[LINK]` – URLs
- `[PHONE]` – phone numbers

These design choices help the model learn the semantic and structural patterns commonly found in phishing emails.

- **Developed by:** Atharva Gaykar
- **Model type:** Transformer-based text classification & embedding model
- **Language:** English
- **License:** Artistic-2.0
- **Finetuned from:** distilbert/distilbert-base-uncased

---

## Intended Uses

### Primary Use Cases

- Phishing email classification
- Suspicious vs. safe email detection
- Feature extraction for traditional ML models
- Email embedding generation for downstream classifiers

### Out-of-Scope Uses

- Non-text email analysis (images, attachments)
- Commercial deployment without proper evaluation and compliance
- Tasks unrelated to email or message-level text analysis

---

## Bias, Risks, and Limitations

- The model is trained on public phishing datasets and may reflect biases present in those sources.
- Performance may degrade on highly obfuscated or novel phishing techniques.
- Not recommended for direct commercial use without extensive validation.

Users should carefully evaluate the model in their target environment before deployment.

---

## How to Get Started

Load the model and extract the [CLS] embedding for a formatted email:

```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch
import numpy as np

bert_path = "Gaykar/PhishingDistilBERT"
tokenizer = DistilBertTokenizerFast.from_pretrained(bert_path)
model = DistilBertForSequenceClassification.from_pretrained(bert_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def get_cls_embedding(text, model, tokenizer, device):
    # Tokenize, run the DistilBERT encoder only, and return the
    # [CLS] token embedding as a NumPy vector.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=256,
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.distilbert(**inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().cpu().numpy()
    return cls_embedding

text = "[SSUB] Urgent Account Alert [ESUB] [SBODY] Click [LINK] to verify your account. [EBODY]"
embedding = get_cls_embedding(text, model, tokenizer, device)
print("Embedding shape:", embedding.shape)
print("First 10 dimensions:", embedding[:10])
```
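To run the fine-tuned classification head directly (the first use case listed above), a minimal inference sketch, continuing from the setup in the previous snippet, looks like this. The id-to-label mapping is not documented in this card, so the sketch reads it from `model.config.id2label` rather than assuming an order:

```python
def classify_email(text, model, tokenizer, device):
    # Run the full model (encoder + classification head) and return the
    # predicted label with its softmax probability.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=256,
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze()
    pred_id = int(torch.argmax(probs))
    # Label names ship with the checkpoint's config; inspect id2label
    # rather than hard-coding a "safe"/"suspicious" order.
    label = model.config.id2label.get(pred_id, str(pred_id))
    return label, probs[pred_id].item()

label, confidence = classify_email(text, model, tokenizer, device)
print(f"Prediction: {label} (p={confidence:.3f})")
```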
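Raw emails must be rewritten into the token format shown above before inference. The exact preprocessing code is not published with this card, so the helper below is only a rough regex-based approximation of the steps described under Data Preprocessing; the patterns are assumptions, not the author's pipeline:

```python
import re

# Hypothetical patterns: the card does not publish the exact regexes used
# during training, so treat these as illustrative approximations.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def format_email(subject: str, body: str) -> str:
    # Mask URLs first (they may contain digits), then phone numbers,
    # then wrap subject and body in the structural markers.
    for pattern, token in ((URL_RE, "[LINK]"), (PHONE_RE, "[PHONE]")):
        subject = pattern.sub(token, subject)
        body = pattern.sub(token, body)
    return f"[SSUB] {subject} [ESUB] [SBODY] {body} [EBODY]"

print(format_email(
    "Urgent Account Alert",
    "Call +1 (555) 123-4567 or visit https://example.com/verify today.",
))
# [SSUB] Urgent Account Alert [ESUB] [SBODY] Call [PHONE] or visit [LINK] today. [EBODY]
```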
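The embeddings can also feed a classical classifier, as in the embeddings-plus-XGBoost pipeline reported under Evaluation below. The sketch that follows is a minimal illustration only: it omits the extra URL features used for the reported 99.4% result, `emails` and `labels` are hypothetical placeholders for your own data, and `xgboost` must be installed separately:

```python
import numpy as np
from xgboost import XGBClassifier  # pip install xgboost

# Hypothetical toy data: formatted email strings (see format_email above)
# and their 0/1 classes. Replace with a real labeled dataset.
emails = [
    format_email("Urgent Account Alert", "Click https://example.com/verify now."),
    format_email("Team lunch Friday", "See you at noon in the kitchen."),
]
labels = [1, 0]

# Stack [CLS] embeddings into a feature matrix and fit a gradient-boosted tree.
X = np.stack([get_cls_embedding(e, model, tokenizer, device) for e in emails])
clf = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
clf.fit(X, labels)
print(clf.predict(X))
```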
---

## Training Details

### Training Data

The model was trained on well-known phishing and email-security datasets, including **CEAS**, combined with additional curated CSV sources.

### Data Preprocessing

1. Cleaned and merged multiple CSV datasets
2. Replaced:
   - URLs → `[LINK]`
   - Phone numbers → `[PHONE]`
3. Combined subject and body using structural tokens: `[SSUB]`, `[ESUB]`, `[SBODY]`, `[EBODY]`

### Training Hyperparameters

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./distilbert_safe_suspicious",
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    learning_rate=4e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=0.01,
    logging_strategy="steps",
    logging_steps=50,
    seed=42,
)
```

---

## Evaluation

### Evaluation Metrics

- Accuracy
- F1 score

### Testing Setup

- 10% held-out test split from the full dataset

### Results

- **DistilBERT (standalone):** strong classification performance
- **DistilBERT embeddings + XGBoost + URL features:** **99.4% accuracy**

![Evaluation Result](https://cdn-uploads.huggingface.co/production/uploads/685998a37db0a027171ecb9f/Dr3okP_bmVOxHgeaqIQDM.png)

---

## Technical Specifications

### Model Architecture

- DistilBERT encoder
- Sequence classification head
- CLS-token embedding extraction supported

### Compute Infrastructure

- **Hardware:** NVIDIA T4 GPU
- **Frameworks:** PyTorch, Hugging Face Transformers

---

## Environmental Impact

Carbon emissions were not explicitly measured. Users may estimate emissions with the Machine Learning Impact Calculator if needed.

---

## Model Card Authors

- **Atharva Gaykar**

---

## Contact

For questions, feedback, or research collaboration, please reach out via the Hugging Face model repository.

---