Enron Email Spam Detector

A lightweight email spam classifier that is free to use and runs entirely on CPU. It needs no GPU, no paid API, and no large language model: just a classic, fast, and accurate ML pipeline.

Model: TF-IDF (1–2 gram) + Logistic Regression (scikit-learn Pipeline)
Trained on: SetFit/enron_spam (real Enron emails, subject + body)
Size: ~4.6 MB · Inference: sub-millisecond on CPU
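
As a rough illustration of the architecture summarized above, the following sketch builds a scikit-learn Pipeline with a 1–2 gram TF-IDF vectorizer followed by Logistic Regression. Only the n-gram range comes from this card; the step names, max_iter, and the toy texts are assumptions made for the example, and the actual settings live in train.py.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline():
    """Return an unfitted TF-IDF (1-2 gram) + Logistic Regression pipeline."""
    return Pipeline([
        # Unigrams and bigrams, as stated in the model summary.
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        # max_iter is an assumed value, not taken from train.py.
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Toy data for illustration only (1 = spam, 0 = ham).
toy_texts = [
    "win a free prize now",
    "meeting moved to 3pm",
    "free gift card click here",
    "please review the attached report",
]
toy_labels = [1, 0, 1, 0]

sketch = build_pipeline().fit(toy_texts, toy_labels)
proba = sketch.predict_proba(["free prize inside"])[0]  # [P(ham), P(spam)]
```

Keeping the vectorizer and classifier in one Pipeline means a single joblib file carries both, so raw strings can be passed straight to predict_proba at inference time.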

Test-set results (2,000 held-out emails)

Metric	Score
Accuracy	0.9900
Precision (spam)	0.9862
Recall (spam)	0.9940
F1	0.9901
ROC-AUC	0.9996

Confusion matrix (rows: true class, columns: predicted class, order [ham, spam]): [[978, 14], [6, 1002]]
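
The reported scores can be checked directly against the confusion matrix. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predicted labels, spam is the positive class), which is the reading consistent with the precision and recall in the table; the function name is hypothetical.

```python
def metrics_from_confusion(cm):
    """Compute accuracy, precision, recall and F1 for the positive (spam) class.

    cm is [[TN, FP], [FN, TP]] with class order [ham, spam].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": round(accuracy, 4),
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1, 4),
    }


scores = metrics_from_confusion([[978, 14], [6, 1002]])
print(scores)
```

Rounded to four decimal places, the results agree with the table: 14 ham emails were flagged as spam and 6 spam emails were missed, out of 2,000.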

Usage

pip install scikit-learn joblib huggingface_hub

import joblib
from huggingface_hub import hf_hub_download

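# Download the serialized pipeline from the Hub (cached locally after the first call).
path = hf_hub_download("Anurag43/enron-spam-detector", "spam_model.joblib")
model = joblib.load(path)

def is_spam(text):
    # Column 1 of predict_proba is the spam probability; 0.5 is the decision threshold.
    p = float(model.predict_proba([text])[0][1])
    return {"label": "spam" if p >= 0.5 else "ham", "spam_probability": round(p, 4)}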

print(is_spam("URGENT! You won a $1000 gift card, click now!!!"))
# {'label': 'spam', 'spam_probability': 0.999}

For email input, concatenate the subject and body: text = subject + "\n" + body.
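
A small helper can make the concatenation above harder to get wrong. This is a minimal sketch: the function name email_to_text is hypothetical, and treating a missing subject or body as an empty string is an assumption rather than documented behavior of the training script.

```python
def email_to_text(subject, body):
    """Join subject and body with a newline, the input format described above."""
    # Assumption: a missing field (None) is treated as an empty string.
    subject = subject or ""
    body = body or ""
    return subject + "\n" + body


example = email_to_text("Quarterly report", "Please find the figures attached.")
```

The result can then be passed to is_spam(example) once the model from the Usage section is loaded.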

Training

Reproduce the model with train.py in this repo. Before fitting, the training data is cleaned: empty texts are dropped and duplicates are removed, which reduces the training set from 31,716 to 28,811 rows.
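
The cleaning step might look roughly like the sketch below. It assumes rows are (text, label) pairs, that "empty" includes whitespace-only texts, and that duplicates are exact text matches with the first occurrence kept; train.py may apply different rules, and the function name is hypothetical.

```python
def clean_rows(rows):
    """Drop empty texts and exact duplicate texts, keeping the first occurrence."""
    seen = set()
    cleaned = []
    for text, label in rows:
        # Skip missing or whitespace-only texts.
        if text is None or not text.strip():
            continue
        # Skip texts already seen (exact match on the raw string).
        if text in seen:
            continue
        seen.add(text)
        cleaned.append((text, label))
    return cleaned


# Toy rows for illustration only (1 = spam, 0 = ham).
raw = [
    ("win a prize", 1),
    ("", 0),
    ("win a prize", 1),
    ("   ", 0),
    ("lunch at noon?", 0),
]
cleaned = clean_rows(raw)
```

Removing duplicates before fitting keeps repeated messages from being over-weighted in the TF-IDF statistics and in the classifier.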

License

MIT. Trained on the public Enron spam corpus.

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern
