Enron Email Spam Detector
A lightweight, free, CPU-only email spam classifier. No GPU, no API costs, and no large language model: just a classic, fast, and accurate ML pipeline.
- Model: TF-IDF (1–2 gram) + Logistic Regression (scikit-learn Pipeline)
- Trained on: SetFit/enron_spam (real Enron emails, subject + body)
- Size: ~4.6 MB · Inference: sub-millisecond on CPU
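To make the architecture concrete, here is a minimal sketch of a TF-IDF (1–2 gram) + Logistic Regression pipeline in scikit-learn. Only the n-gram range and the two components come from the description above; the step names, `max_iter`, all other hyperparameters, and the toy data are assumptions for illustration and may differ from what `train.py` uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline() -> Pipeline:
    """Hypothetical reconstruction of the model architecture."""
    return Pipeline([
        # Unigrams and bigrams, as stated in the model description.
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        # max_iter is an assumed value, not taken from the repo.
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Toy data (made up) just to show the fit/predict flow; 1 = spam, 0 = ham.
toy_texts = [
    "win a free prize now",
    "meeting moved to 3pm",
    "claim your free gift card",
    "please review the attached report",
]
toy_labels = [1, 0, 1, 0]

sketch = build_pipeline().fit(toy_texts, toy_labels)
probs = sketch.predict_proba(["free prize click now"])  # shape (1, 2): [P(ham), P(spam)]
```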
Test-set results (2,000 held-out emails)
| Metric | Score |
|---|---|
| Accuracy | 0.9900 |
| Precision (spam) | 0.9862 |
| Recall (spam) | 0.9940 |
| F1 | 0.9901 |
| ROC-AUC | 0.9996 |
Confusion matrix [ham, spam]: [[978, 14], [6, 1002]]
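The table values can be re-derived from the confusion matrix. The sketch below assumes the scikit-learn convention (rows are true labels, columns are predicted labels), which is consistent with the reported precision and recall; the variable names are illustrative.

```python
# Confusion matrix [[978, 14], [6, 1002]] with label order [ham, spam].
tn, fp = 978, 14    # true ham: predicted ham, predicted spam
fn, tp = 6, 1002    # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of emails flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"n={total} acc={accuracy:.4f} prec={precision:.4f} rec={recall:.4f} f1={f1:.4f}")
# n=2000 acc=0.9900 prec=0.9862 rec=0.9940 f1=0.9901
```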
Usage
```bash
pip install scikit-learn joblib huggingface_hub
```

```python
import joblib
from huggingface_hub import hf_hub_download

# Download the serialized pipeline from the Hub and load it.
path = hf_hub_download("Anurag43/enron-spam-detector", "spam_model.joblib")
model = joblib.load(path)

def is_spam(text):
    # Index 1 of predict_proba is used as the spam probability.
    p = float(model.predict_proba([text])[0][1])
    return {"label": "spam" if p >= 0.5 else "ham", "spam_probability": round(p, 4)}

print(is_spam("URGENT! You won a $1000 gift card, click now!!!"))
# {'label': 'spam', 'spam_probability': 0.999}
```
For email input, concatenate the subject and body: text = subject + "\n" + body.
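If the input is a raw RFC 822 message rather than separate fields, the subject and plain-text body can be extracted with Python's standard `email` package before concatenating. This is a sketch: `email_to_text` is a hypothetical helper, not part of this repo, and it assumes the message has a plain-text part.

```python
from email import message_from_string
from email.policy import default


def email_to_text(raw: str) -> str:
    """Hypothetical helper: build 'subject + newline + body' from a raw email."""
    msg = message_from_string(raw, policy=default)
    subject = str(msg["Subject"] or "")
    # Prefer the plain-text part; fall back to an empty body if none exists.
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content() if part is not None else ""
    return subject + "\n" + body


raw_email = "Subject: Quarterly report\n\nPlease find the numbers attached.\n"
text = email_to_text(raw_email)
```

The resulting `text` can be passed to the classifier in the same way as any other string.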
Training
Reproduce with train.py in this repo. Data is cleaned (empty texts dropped, duplicates removed: 31,716 → 28,811 train rows) before fitting.
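The cleaning step removed 2,905 of the 31,716 training rows. A minimal sketch of that kind of filter is shown below; it assumes whitespace-only texts count as empty and that duplicates are exact text matches, and `clean_rows` is a hypothetical name. The actual logic in `train.py` may differ.

```python
def clean_rows(rows):
    """Drop empty texts and exact duplicate texts, keeping first occurrences.

    `rows` is an iterable of (text, label) pairs. This is an illustrative
    stand-in for the cleaning described above, not the repo's code.
    """
    seen = set()
    cleaned = []
    for text, label in rows:
        text = (text or "").strip()
        if not text:
            continue  # empty or whitespace-only text
        if text in seen:
            continue  # exact duplicate of an earlier row
        seen.add(text)
        cleaned.append((text, label))
    return cleaned


sample = [("win cash now", 1), ("", 0), ("win cash now", 1), ("see you at lunch", 0), (None, 0)]
cleaned_sample = clean_rows(sample)
```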
License
MIT. Trained on the public Enron spam corpus.
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern