Spam Email Classifier
This model classifies emails as spam or ham (legitimate), with a self-reported accuracy above 96% on a held-out test set.
Model Details
- Model Type: Ensemble of MultinomialNB and Logistic Regression, or whichever of the tested models performed best
- Training Data: 190,000 spam/ham emails from Kaggle
- Features: TF-IDF vectorization, capped at 10,000 features, over n-grams of 1 to 3 words
- Accuracy: 96%+ on test set
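The configuration listed above could be assembled in scikit-learn roughly as follows. This is a minimal sketch, not the author's training code: the soft-voting `VotingClassifier`, the `max_iter` value, and the four toy emails are assumptions made for illustration. Only the 10,000-feature cap, the 1 to 3 word n-gram range, and the two model families come from this card.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data (hypothetical); the real model was trained on ~190,000 emails.
texts = [
    "win a free prize now click here",
    "limited offer claim your cash reward",
    "meeting moved to three pm tomorrow",
    "please review the attached project report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Settings taken from this card: at most 10,000 features, unigrams to trigrams.
vectorizer = TfidfVectorizer(max_features=10_000, ngram_range=(1, 3))
features = vectorizer.fit_transform(texts)

# Assumed ensemble: soft voting averages the two models' predicted probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
ensemble.fit(features, labels)

# predict_proba returns one row per email: [P(ham), P(spam)].
probabilities = ensemble.predict_proba(vectorizer.transform(["claim your free prize"]))
print(probabilities.shape)
```

Soft voting is used here because the usage code below relies on `predict_proba`; a hard-voting ensemble would not expose class probabilities.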
Files Included
- spam_classifier_model.pkl: Trained classification model
- tfidf_vectorizer.pkl: TF-IDF vectorizer (required for inference)
Usage
from huggingface_hub import hf_hub_download
import pickle
import re
# Download model and vectorizer
model_path = hf_hub_download(
    repo_id="satyam2025/spam-email-classifier",
    filename="spam_classifier_model.pkl"
)
vectorizer_path = hf_hub_download(
    repo_id="satyam2025/spam-email-classifier",
    filename="tfidf_vectorizer.pkl"
)
# Load files (only unpickle files from a source you trust)
with open(model_path, 'rb') as f:
    model = pickle.load(f)
with open(vectorizer_path, 'rb') as f:
    vectorizer = pickle.load(f)
# Preprocessing function
def clean_email_text(text):
    text = text.lower()
    text = re.sub(r'\S*@\S*\s?', '', text)       # remove email addresses
    text = re.sub(r'http\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'<.*?>', '', text)             # remove HTML tags
    text = re.sub(r'[^a-zA-Z\s!?.]', ' ', text)   # keep only letters, whitespace, and ! ? .
    text = ' '.join(text.split())                 # collapse repeated whitespace
    return text
# Predict function
def predict_spam(email_text, threshold=0.8):
    cleaned = clean_email_text(email_text)
    features = vectorizer.transform([cleaned])
    # Column 1 of predict_proba is the probability of the spam class
    spam_probability = float(model.predict_proba(features)[0][1])
    is_spam = bool(spam_probability >= threshold)
    return {
        'spam_probability': spam_probability,
        'is_spam': is_spam,
        'classification': 'SPAM' if is_spam else 'HAM'
    }
# Example
email = "Congratulations! You won $1000. Click here now!"
result = predict_spam(email)
print(result)
# Example output (probability rounded): {'spam_probability': 0.966, 'is_spam': True, 'classification': 'SPAM'}
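The `threshold` argument of `predict_spam` sets how confident the model must be before an email is flagged. The sketch below applies three thresholds to made-up probabilities (hypothetical values, not model output) to show that raising the threshold flags fewer emails, trading some spam recall for fewer false positives. The helper name `count_flagged` is introduced here for illustration only.

```python
# Hypothetical spam probabilities for five emails (not real model output).
spam_probabilities = [0.97, 0.85, 0.62, 0.40, 0.05]

def count_flagged(probabilities, threshold):
    """Count emails whose spam probability meets or exceeds the threshold."""
    return sum(p >= threshold for p in probabilities)

for threshold in (0.5, 0.8, 0.95):
    flagged = count_flagged(spam_probabilities, threshold)
    print(f"threshold={threshold}: {flagged} of {len(spam_probabilities)} flagged as SPAM")
```

With these inputs, 3 emails are flagged at 0.5, 2 at the default 0.8, and 1 at 0.95.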
Performance Metrics
- Accuracy: 96%+
- Precision (Spam): 95%+
- Recall (Spam): 91%+
- F1-Score: 93%+
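As a consistency check, the F1-score is the harmonic mean of precision and recall. The snippet below uses the lower bounds reported above (95% precision, 91% recall) as illustrative inputs; the exact figures behind the "+" are not given in this card.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.95, 0.91)
print(round(f1, 3))  # prints 0.93
```

The result, roughly 0.93, is in line with the reported F1-score.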
Training Details
- Dataset Size: 190,000 emails
- Train/Test Split: 80/20
- Preprocessing: URL removal, email address removal, punctuation normalization
- Vectorization: TF-IDF over n-grams of 1 to 3 words (unigrams through trigrams)
- Models Tested: MultinomialNB, Logistic Regression, Ensemble
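To make the preprocessing steps concrete, the sketch below repeats the `clean_email_text` function from the Usage section and runs it on a made-up message (the sample text is hypothetical). It assumes the same cleaning was applied at training time, which this card implies but does not state outright.

```python
import re

def clean_email_text(text):
    text = text.lower()
    text = re.sub(r'\S*@\S*\s?', '', text)       # remove email addresses
    text = re.sub(r'http\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'<.*?>', '', text)             # remove HTML tags
    text = re.sub(r'[^a-zA-Z\s!?.]', ' ', text)   # keep only letters, whitespace, and ! ? .
    text = ' '.join(text.split())                 # collapse repeated whitespace
    return text

# Hypothetical sample message, for illustration only.
sample = "Hi <b>Bob</b>, email me at alice@example.com or visit http://spam.example/win NOW!!!"
cleaned = clean_email_text(sample)
print(cleaned)  # prints: hi bob email me at or visit now!!!
```

Note that digits and currency symbols are also stripped by the fourth substitution, so an amount such as "$1000" does not reach the vectorizer.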
Limitations
- Trained on English emails only
- May not perform well on non-standard text formats
- Requires both .pkl files for inference
Citation
@misc{spam-classifier-2024,
author = {Satyam},
title = {Spam Email Classifier},
year = {2024},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/satyam2025/spam-email-classifier}}
}
License
MIT License
Evaluation results
- Accuracy (self-reported): 0.960