Spam Email Classifier

This model classifies emails as spam or ham (legitimate), with a reported accuracy of 96% or higher on the held-out test set.

Model Details

Model Type: Ensemble of MultinomialNB and Logistic Regression, or whichever single model performed best
Training Data: 190,000 spam/ham emails from Kaggle
Features: TF-IDF vectorization, 10,000 features, n-grams of 1 to 3 words (unigrams through trigrams)
Accuracy: 96%+ on test set
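
The setup described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the author's training script: the soft-voting combination, `max_iter=1000`, and the toy corpus are all assumptions made here, since the card does not say how the ensemble was built. Only the 10,000-feature limit and the 1-3 word n-gram range come from the card.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for illustration only (1 = spam, 0 = ham)
texts = [
    "win a free prize now",
    "click here to claim your free money",
    "limited offer, click now to win cash",
    "meeting moved to monday morning",
    "please review the attached report",
    "lunch tomorrow with the project team",
]
labels = [1, 1, 1, 0, 0, 0]

# From the card: 10,000 features, n-grams of 1 to 3 words
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))

# Assumption: the two models are combined by averaging predicted probabilities
ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)

pipeline = make_pipeline(vectorizer, ensemble)
pipeline.fit(texts, labels)

# One row per input, one column per class, in the order of pipeline.classes_
proba = pipeline.predict_proba(["claim your free prize, click now"])
print(pipeline.classes_, proba.shape)
```

Soft voting is used in the sketch because the usage code relies on `predict_proba`, which a hard-voting ensemble does not provide.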

Files Included

spam_classifier_model.pkl - Trained classification model
tfidf_vectorizer.pkl - TF-IDF vectorizer (required for inference)

Usage

from huggingface_hub import hf_hub_download
import pickle
import re

# Download model and vectorizer
model_path = hf_hub_download(
    repo_id="satyam2025/spam-email-classifier",
    filename="spam_classifier_model.pkl"
)
vectorizer_path = hf_hub_download(
    repo_id="satyam2025/spam-email-classifier",
    filename="tfidf_vectorizer.pkl"
)

# Load files
with open(model_path, 'rb') as f:
    model = pickle.load(f)
with open(vectorizer_path, 'rb') as f:
    vectorizer = pickle.load(f)

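# Preprocessing function
def clean_email_text(text):
    text = text.lower()
    # Remove email addresses (any token containing '@')
    text = re.sub(r'\S*@\S*\s?', '', text)
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Replace everything except letters, whitespace, and ! ? . with a space
    text = re.sub(r'[^a-zA-Z\s!?.]', ' ', text)
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    return text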

# Predict function
def predict_spam(email_text, threshold=0.8):
    cleaned = clean_email_text(email_text)
    features = vectorizer.transform([cleaned])
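    # Column 1 is assumed to be the spam class; check model.classes_ to confirm.
    # float() converts the NumPy scalar so the result prints as a plain number.
    spam_probability = float(model.predict_proba(features)[0][1])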
    is_spam = spam_probability >= threshold
    return {
        'spam_probability': spam_probability,
        'is_spam': is_spam,
        'classification': 'SPAM' if is_spam else 'HAM'
    }

# Example
email = "Congratulations! You won $1000. Click here now!"
result = predict_spam(email)
print(result)
# Example output (the exact probability may differ):
# {'spam_probability': 0.966, 'is_spam': True, 'classification': 'SPAM'}
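
The `threshold` argument of `predict_spam` controls how confident the model must be before an email is flagged. The sketch below isolates that decision rule to show its effect. The `classify` helper and the probabilities are hypothetical, made up for illustration, and are not model output.

```python
def classify(spam_probability, threshold=0.8):
    """Mirror the decision rule in predict_spam: spam at or above the threshold."""
    return "SPAM" if spam_probability >= threshold else "HAM"

# Hypothetical probabilities, chosen only to illustrate the cut-off
example_probabilities = [0.35, 0.65, 0.80, 0.97]

for threshold in (0.5, 0.8):
    labels = [classify(p, threshold) for p in example_probabilities]
    print(threshold, labels)
# 0.5 ['HAM', 'SPAM', 'SPAM', 'SPAM']
# 0.8 ['HAM', 'HAM', 'SPAM', 'SPAM']
```

A higher threshold flags fewer emails, which generally lowers false positives at the cost of missing some spam.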

Performance Metrics

Accuracy: 96%+
Precision (Spam): 95%+
Recall (Spam): 91%+
F1-Score: 93%+
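
As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the figures above. The sketch treats the reported lower bounds (95% and 91%) as exact values, which is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported lower bounds from the card, treated as exact for this check
reported_precision = 0.95
reported_recall = 0.91

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")
# F1 = 0.930
```

The result is in line with the reported F1 of roughly 93%.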

Training Details

Dataset Size: 190,000 emails
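Train/Test Split: 80/20
Preprocessing: lowercasing, email address removal, URL removal, HTML tag removal, replacement of characters other than letters and ! ? . with spaces, whitespace collapsing
Vectorization: TF-IDF with unigrams, bigrams, and trigrams (1-3 word combinations)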
Models Tested: MultinomialNB, Logistic Regression, Ensemble
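
To make the preprocessing and n-gram steps concrete, the sketch below runs the cleaning function from the Usage section on a made-up message and then lists its 1-3 word combinations. The `word_ngrams` helper and the letters-only tokenization are simplifications introduced here; the TF-IDF vectorizer applies its own tokenizer, so its actual vocabulary may differ.

```python
import re

def clean_email_text(text):
    # Same steps as the cleaning function in the Usage section
    text = text.lower()
    text = re.sub(r'\S*@\S*\s?', '', text)
    text = re.sub(r'http\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^a-zA-Z\s!?.]', ' ', text)
    return ' '.join(text.split())

def word_ngrams(tokens, max_n=3):
    """Hypothetical helper: all contiguous word sequences of length 1..max_n."""
    return [
        ' '.join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Made-up message for illustration
raw = "<b>FREE</b> prize!!! Claim now: http://example.com"
cleaned = clean_email_text(raw)
print(cleaned)
# free prize!!! claim now

# Simplified tokenization (letters only), standing in for the vectorizer's tokenizer
tokens = re.findall(r'[a-z]+', cleaned)
ngrams = word_ngrams(tokens)
print(ngrams)
# ['free', 'prize', 'claim', 'now', 'free prize', 'prize claim', 'claim now',
#  'free prize claim', 'prize claim now']
```

Four words yield nine candidate features, which shows how quickly trigram vocabularies grow and why the feature count is capped at 10,000.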

Limitations

Trained on English emails only
May not perform well on non-standard text formats
Requires both .pkl files for inference

Citation

@misc{spam-classifier-2024,
  author = {Satyam},
  title = {Spam Email Classifier},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/satyam2025/spam-email-classifier}}
}

License

MIT License

