AventIQ-AI
/

Bert_Email_Spam_Detaction


	# Email-spam-detection

	This model detects whether an email message is spam or not spam (ham) using a fine-tuned transformer-based classifier.

	## Model Details

	### Model Description

	This is a binary text classification model trained to distinguish spam emails from legitimate (ham) emails. The model is based on a pretrained transformer architecture (e.g., BERT, RoBERTa) and fine-tuned on a labeled email dataset containing both spam and non-spam messages.

	- Model type: Transformer-based binary classifier
	- Language(s) (NLP): English
	- License: MIT License
	- Fine-tuned from model: bert-base-uncased (example)

	### Model Sources

	- Repository: https://huggingface.co/AventIQ-AI/email-spam-detection

	## Loading the Model

	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	# Model ID as given in this card; substitute the repository that hosts the checkpoint if it differs.
	model_id = "deepak/email-spam-detection"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForSequenceClassification.from_pretrained(model_id)
	model.eval()  # disable dropout for inference

	emails = [
	    "Congratulations! You have won a $1000 gift card. Click here to claim.",
	    "Meeting moved to 3 PM today in the conference room.",
	]

	# Tokenize as a padded batch; truncation caps inputs at the model's maximum length.
	inputs = tokenizer(emails, padding=True, truncation=True, return_tensors="pt")

	# No gradients are needed at inference time.
	with torch.no_grad():
	    logits = model(**inputs).logits

	predictions = torch.argmax(logits, dim=-1)
	print(predictions)  # 1 = spam, 0 = ham
	```
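
The loading example returns only the arg-max class. When a confidence score or a custom decision threshold is useful, the logits can be turned into probabilities with a softmax. The following is a minimal sketch in plain Python, assuming the label order from the example (index 0 = ham, index 1 = spam); the helper names `softmax` and `classify`, the `spam_threshold` parameter, and the logit values are hypothetical and not part of the model.

```python
import math

# Label mapping stated in the loading example: 1 = spam, 0 = ham.
LABELS = {0: "ham", 1: "spam"}


def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, spam_threshold=0.5):
    """Return (label, spam_probability) for one [ham_logit, spam_logit] row."""
    spam_prob = softmax(logits)[1]
    label = LABELS[1] if spam_prob >= spam_threshold else LABELS[0]
    return label, spam_prob


# Hypothetical logits for two emails, not real model output.
example_logits = [[-2.0, 3.0], [2.5, -1.5]]
results = [classify(row) for row in example_logits]
for label, prob in results:
    print(f"{label}: spam probability {prob:.3f}")
```

With PyTorch tensors, `torch.softmax(logits, dim=-1)` gives the same probabilities for the whole batch. Raising `spam_threshold` trades recall for precision, which may suit inboxes where a false spam flag is costly.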

	## Training Details

	The model was trained on a labeled dataset of emails combining public spam corpora such as the Enron Spam dataset and other sources, balanced between spam and ham emails. Data preprocessing included cleaning email text, removing metadata, and tokenization.

	### Training Procedure

	Emails were normalized by removing special characters and tokenized using the pretrained tokenizer.
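
The cleaning steps described above (removing metadata, removing special characters) might look roughly like the sketch below. It uses only the Python standard library. The exact rules used for training are not documented in this card, so the character whitelist, the lowercasing step, and the helper names `extract_body` and `normalize` are assumptions for illustration.

```python
import re
from email import message_from_string


def extract_body(raw_email: str) -> str:
    """Drop the headers (metadata) and return the plain-text body."""
    message = message_from_string(raw_email)
    if message.is_multipart():
        # Keep only the text/plain parts of a multipart message.
        parts = [
            part.get_payload()
            for part in message.walk()
            if part.get_content_type() == "text/plain"
        ]
        return "\n".join(parts)
    return message.get_payload()


def normalize(text: str) -> str:
    """Lowercase, replace special characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # assumed definition of "special"
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical raw email with two header lines and a one-line body.
raw = (
    "From: promo@example.com\n"
    "Subject: You won!\n"
    "\n"
    "Congratulations!!! You have WON a $1000 gift card.\n"
)
cleaned = normalize(extract_body(raw))
print(cleaned)
```

The cleaned string would then be passed to the pretrained tokenizer. Inputs at inference time should go through the same normalization as the training data, whatever the actual rules were.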

	#### Training Hyperparameters

	- Training regime: fine-tuning with fp16 mixed precision on NVIDIA GPUs
	- Batch size: 32
	- Learning rate: 2e-5
	- Epochs: 4

	#### Speeds, Sizes, Times

	- Checkpoint size: ~400 MB
	- Training time: ~3 hours on 1 GPU
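
To make the listed hyperparameters concrete, the sketch below records them in a small config object and derives the number of optimizer steps for a given training-set size. The card does not state the dataset size, so `hypothetical_train_size` is an invented value, and the `FineTuneConfig` class is illustrative only. The step count also assumes a single device and no gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as listed in this card."""
    batch_size: int = 32
    learning_rate: float = 2e-5
    epochs: int = 4
    fp16: bool = True

    def steps_per_epoch(self, num_examples: int) -> int:
        # The last, partially filled batch still counts as one step.
        return math.ceil(num_examples / self.batch_size)

    def total_steps(self, num_examples: int) -> int:
        return self.steps_per_epoch(num_examples) * self.epochs


config = FineTuneConfig()
hypothetical_train_size = 30_000  # assumption: not stated in the card
print(config.steps_per_epoch(hypothetical_train_size))
print(config.total_steps(hypothetical_train_size))
```

The total step count is what a learning-rate scheduler with warmup would typically need as input.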

	## Evaluation

	### Testing Data, Factors & Metrics

	#### Testing Data

	Evaluation was performed on a held-out test split from the same dataset, containing unseen emails.

	#### Factors

	- No explicit subpopulation disaggregation.

	#### Metrics

	- Accuracy
	- Precision
	- Recall
	- F1-score

	### Results

	| Metric | Score |
	|---|---|
	| Accuracy | 0.95 |
	| Precision | 0.93 |
	| Recall | 0.92 |
	| F1-score | 0.925 |
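
For reference, the four metrics can be computed from predictions as in the sketch below. It assumes spam (label 1) is the positive class, which the card does not state explicitly; the label lists are toy data, not the actual test split. The last lines check that the reported F1-score is the harmonic mean of the reported precision and recall.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


# Toy labels for illustration only (1 = spam, 0 = ham).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)

# Consistency check on the reported scores: F1 = 2PR / (P + R).
reported_precision, reported_recall = 0.93, 0.92
reported_f1 = (
    2 * reported_precision * reported_recall
    / (reported_precision + reported_recall)
)
print(round(reported_f1, 3))
```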

	## Model Examination
	Attention analysis indicates the model focuses on key spam indicators like suspicious URLs, urgent calls to action, and financial keywords.