📧 Spam Detector — TF-IDF + Logistic Regression

A lightweight, free spam email/SMS classifier trained on the SMS Spam Collection dataset.

Zero cost to run — no GPU needed, CPU-only inference.

📊 Performance

Metric	Score
Accuracy	98.65%
F1 Score (spam)	94.88%
Precision (spam)	96.53%
Recall (spam)	93.29%

Evaluated on 1,115 held-out test messages (20% split).

Confusion Matrix:

              Predicted Ham   Predicted Spam
Actual Ham         961              5
Actual Spam         10            139

Only 10 false negatives (spam missed) and 5 false positives (ham flagged as spam) out of 1,115 test messages.
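
The scores in the Performance table follow directly from the confusion matrix. As a quick check, this sketch recomputes them from the four counts, treating spam as the positive class; the variable names are illustrative only.

```python
# Counts copied from the confusion matrix above (spam is the positive class).
tn, fp = 961, 5    # actual ham:  predicted ham / predicted spam
fn, tp = 10, 139   # actual spam: predicted ham / predicted spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)             # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1)]:
    print(f"{name:<9} {value:.2%}")
```

The four printed values should match the table: 98.65%, 96.53%, 93.29%, and 94.88%.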

🗂️ Dataset

Source: ucirvine/sms_spam
Size: 5,574 SMS messages (4,827 ham + 747 spam)
Split: 80% train / 20% test, stratified
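
A stratified split keeps the ham/spam ratio roughly equal in both partitions. The sketch below shows one way to produce such a split with scikit-learn's `train_test_split`, using stand-in labels with the class counts listed above. The `random_state` value is an assumption, since the seed used for the published split is not stated.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Stand-in labels with the reported class counts: 0 = ham, 1 = spam.
labels = [0] * 4827 + [1] * 747
indices = list(range(len(labels)))

# random_state=42 is an assumption; any fixed seed makes the split repeatable.
train_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42
)

test_counts = Counter(labels[i] for i in test_idx)
print("train:", len(train_idx), "test:", len(test_idx))
print("test ham:", test_counts[0], "test spam:", test_counts[1])
```

With these counts the split sizes should come out as 4,459 train and 1,115 test, and the test partition should hold 966 ham and 149 spam messages, which agrees with the row totals of the confusion matrix.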

🚀 Quick Start

import pickle
from huggingface_hub import hf_hub_download

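# Download and load the vectorizer and classifier.
# pickle.load can execute arbitrary code, so only load files from repos you trust.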
tfidf_path = hf_hub_download("anu56787ty/spam-detector-tfidf-lr", "tfidf_vectorizer.pkl")
clf_path   = hf_hub_download("anu56787ty/spam-detector-tfidf-lr", "logistic_regression.pkl")

with open(tfidf_path, "rb") as f:
    tfidf = pickle.load(f)
with open(clf_path, "rb") as f:
    clf = pickle.load(f)

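def predict(text):
    """Classify one message and return its label with the probability of that label."""
    vec   = tfidf.transform([text])      # 1 x n_features sparse row
    pred  = int(clf.predict(vec)[0])     # 0 = ham, 1 = spam
    proba = clf.predict_proba(vec)[0]    # [P(ham), P(spam)]
    return {"label": "spam" if pred == 1 else "ham", "confidence": round(float(proba[pred]), 3)}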

# Try it!
print(predict("Congratulations! You won a FREE iPhone! Click now!"))
# → {'label': 'spam', 'confidence': 0.977}

print(predict("Hey, what time are we meeting for lunch?"))
# → {'label': 'ham', 'confidence': 0.986}

🏗️ Architecture

Input text
    ↓
TF-IDF Vectorizer
  • max_features=10,000
  • ngram_range=(1,2)   ← unigrams + bigrams
  • sublinear_tf=True
    ↓
Logistic Regression
  • C=5.0
  • class_weight='balanced'   ← handles class imbalance
    ↓
Output: ham / spam + confidence score
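
The diagram above maps naturally onto a scikit-learn `Pipeline`. This is a minimal sketch under assumptions, not the training script: the toy messages are invented for illustration, `max_iter=1000` is not taken from the model card, and the published repository stores the vectorizer and classifier as two separate pickles rather than as one pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    # Hyperparameters copied from the architecture diagram.
    ("tfidf", TfidfVectorizer(max_features=10_000, ngram_range=(1, 2), sublinear_tf=True)),
    # max_iter=1000 is an assumption added to avoid convergence warnings.
    ("clf", LogisticRegression(C=5.0, class_weight="balanced", max_iter=1000)),
])

# Toy messages for illustration only (not from the dataset): 0 = ham, 1 = spam.
texts = [
    "Win a free prize now, click here",
    "Free entry to claim your cash reward",
    "Are we still on for lunch today?",
    "Can you send me the meeting notes?",
]
labels = [1, 1, 0, 0]

pipeline.fit(texts, labels)
proba = pipeline.predict_proba(["free prize, claim now"])[0]  # [P(ham), P(spam)]
print(dict(zip(pipeline.classes_, proba.round(3))))
```

Bundling both steps in one object means raw text goes in and probabilities come out, so the vectorizer and classifier cannot drift out of sync.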

✅ Why This Model?

Feature	Value
💰 Cost	Free — $0
⚡ Speed	< 1ms per prediction
💾 Size	~2 MB total
🖥️ Hardware	CPU only
📦 Dependencies	scikit-learn, huggingface_hub
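
The speed figure depends on hardware, so it is worth measuring locally. The helper below is a hypothetical sketch (`mean_latency_ms` and `dummy_predict` are not part of this repository) that averages wall-clock time over repeated calls. Pass the `predict` function from Quick Start in place of the stand-in to time the real model.

```python
import time

def mean_latency_ms(fn, sample, n_calls=1000):
    """Average wall-clock time of fn(sample) in milliseconds over n_calls runs."""
    fn(sample)  # warm-up call so one-time setup cost is excluded
    start = time.perf_counter()
    for _ in range(n_calls):
        fn(sample)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / n_calls

# Stand-in callable so the snippet runs on its own; swap in predict() from Quick Start.
def dummy_predict(text):
    return {"label": "ham", "confidence": 1.0}

latency = mean_latency_ms(dummy_predict, "Hey, what time are we meeting for lunch?")
print(f"{latency:.4f} ms per call")
```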

📈 Training

Dataset : ucirvine/sms_spam (5,574 messages)
Train   : 4,459 messages
Test    : 1,115 messages
Time    : < 5 seconds on CPU
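
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the training split. The per-class training counts are derived rather than stated in the card: they are the dataset totals minus the test-set row totals from the confusion matrix.

```python
# Dataset totals minus test-set counts (966 ham, 149 spam) give the training counts.
train_counts = {"ham": 4827 - 966, "spam": 747 - 149}   # 3,861 ham and 598 spam

n_samples = sum(train_counts.values())   # 4,459 training messages
n_classes = len(train_counts)

# Formula used by class_weight='balanced': n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * count) for label, count in train_counts.items()}

for label, weight in weights.items():
    print(f"{label:<5} weight = {weight:.3f}")
```

Each spam message therefore counts roughly 6.5 times as much as a ham message in the training loss, which offsets the imbalance between the two classes.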

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern

Usage

This repository contains two pickled scikit-learn objects (tfidf_vectorizer.pkl and logistic_regression.pkl), not a transformers checkpoint, so AutoTokenizer and AutoModelForCausalLM are not expected to load it. Load the files with pickle as shown in Quick Start.
