	# PhilBERT: Phishing Detection with DistilBERT

	PhilBERT is a fine-tuned DistilBERT model optimized for detecting phishing threats across multiple communication channels, including emails, SMS, URLs, and websites. It is trained on a diverse dataset sourced from Kaggle, Mendeley, Phishing.Database, and Bancolombia, ensuring high adaptability and real-world applicability.

	---

	## Key Features

	- Multi-Channel Detection – Analyzes text, URLs, and web content to detect phishing patterns.
	- Fine-Tuned on Real-World Data – Includes three recent months of data from a financial institution (Bancolombia).
	- Lightweight & Efficient – Based on DistilBERT, providing high performance with reduced computational costs.
	- High Accuracy – Achieves 85.22% precision, 93.81% recall, and 88.77% accuracy on unseen data.
	- Self-Adaptive Learning – Continuously evolves using real-time phishing simulations generated with GPT-4o.
	- Scalability – Designed to support 7,000–25,000 simultaneous users in production environments.

	---

	## Model Architecture

	PhilBERT builds on DistilBERT, a distilled version of BERT that keeps the same general Transformer architecture with half the layers and about 40% fewer parameters, making it lightweight while retaining most of BERT's accuracy. The final model includes:

	- Tokenizer: Trained to recognize phishing-specific patterns (URLs, obfuscation, domain misspellings).
	- Custom Classifier: A fully connected dense layer added for binary classification (phishing vs. benign).
	- Risk Scoring Mechanism: A weighted confidence score applied to enhance detection reliability.
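The README does not specify how the weighted confidence score is computed. As a purely illustrative sketch, one plausible reading is a weighted average of per-channel phishing probabilities. The function name `risk_score`, the channel names, and the weights below are all hypothetical and are not taken from the PhilBERT implementation.

```python
def risk_score(channel_probs, weights):
    """Weighted average of per-channel phishing probabilities, in [0, 1].

    Hypothetical sketch: `channel_probs` maps a channel name to the model's
    phishing probability for that channel, and `weights` maps the same
    channel names to non-negative importance weights.
    """
    total_weight = sum(weights[ch] for ch in channel_probs)
    if total_weight <= 0:
        raise ValueError("weights for the supplied channels must sum to a positive value")
    return sum(p * weights[ch] for ch, p in channel_probs.items()) / total_weight


# Made-up probabilities and weights, for illustration only.
score = risk_score({"url": 0.9, "email": 0.5}, {"url": 3.0, "email": 1.0})
print(f"Risk score: {score:.2f}")
```

Normalizing by the total weight keeps the score in the same 0 to 1 range as the underlying probabilities, so a single decision threshold can still be applied.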

	---

	## Data Preprocessing

	Before fine-tuning, the dataset underwent extensive preprocessing to ensure balance and quality:

	- Duplicate Removal & Balancing: Maintained a near 50-50 phishing-to-benign ratio to prevent model bias.
	- Feature Extraction: Applied to URLs, HTML, email bodies, and SMS content to enrich input representations.
	- Dataset Split: The final dataset included:
	  - 427,028 benign URLs and 381,014 phishing URLs
	  - 17,536 unique email samples
	  - 5,949 SMS samples
	  - Web entries filtered for efficiency (entries larger than 100 KB removed)
	- Export Format: Data transformed and stored in JSON for efficient training.
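The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the record layout (`text`, `label`, `channel`), the function names, and the byte-based reading of the 100 KB cutoff are assumptions. It also balances the classes exactly by downsampling, whereas the README only describes a near 50-50 ratio.

```python
import json
import random

MAX_WEB_ENTRY_BYTES = 100 * 1024  # assumed interpretation of the ">100KB" cutoff


def preprocess(records, seed=0):
    """Deduplicate, size-filter web entries, and balance the two classes.

    Each record is assumed to be a dict with "text", "label" (1 = phishing,
    0 = benign), and "channel" keys; this layout is hypothetical.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (rec["text"], rec["label"])
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        # Drop oversized web entries, measured in UTF-8 bytes.
        if rec.get("channel") == "web" and len(rec["text"].encode("utf-8")) > MAX_WEB_ENTRY_BYTES:
            continue
        unique.append(rec)

    phishing = [r for r in unique if r["label"] == 1]
    benign = [r for r in unique if r["label"] == 0]

    # Downsample the larger class so both classes end up the same size.
    n = min(len(phishing), len(benign))
    rng = random.Random(seed)
    balanced = rng.sample(phishing, n) + rng.sample(benign, n)
    rng.shuffle(balanced)
    return balanced


def export_json(records, path):
    """Write the processed records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
```

Fixing the random seed makes the downsampling reproducible, which helps when comparing training runs on the same data.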

	---

	## Training & Evaluation

	PhilBERT was fine-tuned on multi-channel phishing datasets (URLs, emails, SMS, and web content) using transfer learning, achieving:

	| Metric             | Value    |
	|--------------------|----------|
	| Accuracy           | 88.77%   |
	| Precision          | 85.22%   |
	| Recall             | 93.81%   |
	| F1-Score           | 89.31%   |
	| Evaluation Runtime | 130.46 s |
	| Samples/sec        | 58.701   |

	- False Positive Reduction: Multi-layered filtering minimized false positives while maintaining high recall.
	- Scalability: Successfully stress-tested for up to 25,000 simultaneous users.
	- Compliance: Meets ISO 27001 and GDPR standards for security and privacy.
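As a quick consistency check on the reported metrics, the F1-score can be recomputed from precision and recall, and the approximate size of the evaluation set can be inferred from the runtime and throughput. The snippet below uses only the numbers from the table; the inferred sample count is an estimate and is not stated in the original.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.8522, 0.9381
f1 = f1_score(precision, recall)

# Throughput times runtime gives the approximate number of evaluated samples.
runtime_s, samples_per_s = 130.46, 58.701
n_samples = runtime_s * samples_per_s

print(f"F1: {f1:.2%}")                                  # F1: 89.31%
print(f"Implied evaluation samples: {n_samples:.0f}")   # Implied evaluation samples: 7658
```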

	---

	## Usage

	### Installation

	```bash
	pip install transformers torch
	```

	### Inference

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	model_name = "your_username/PhilBERT"

	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	text = "Click this link to update your bank details: http://fakebank.com"
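	# Truncate to the model's maximum input length so long messages don't raise an error
	inputs = tokenizer(text, return_tensors="pt", truncation=True)

	# Run the forward pass without tracking gradients
	with torch.no_grad():
	    outputs = model(**inputs)
	    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

	# Assumes label index 1 is the phishing class; check model.config.id2label to confirm
	print(f"Phishing probability: {predictions[0][1].item():.4f}")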
	```
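To make the last step of the inference snippet concrete, here is a dependency-free sketch of what the softmax does to the two output logits. The logit values are made up, and the `[benign, phishing]` label order is an assumption carried over from the snippet above.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits in the assumed order [benign, phishing].
probs = softmax([-1.2, 2.3])
print(f"Phishing probability: {probs[1]:.4f}")
```

A larger gap between the two logits pushes the phishing probability closer to 0 or 1, which is why the probability can be read as the model's confidence in its prediction.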

	---

	## License

	This model is proprietary and protected under a custom license. Please refer to the [LICENSE](LICENSE) file for terms of use.

	---
