📧 Spam Detector: TF-IDF + Logistic Regression

A lightweight, free spam email/SMS classifier trained on the SMS Spam Collection dataset.

Zero cost to run: no GPU needed, CPU-only inference.

📊 Performance

Metric             Score
Accuracy           98.65%
F1 Score (spam)    94.88%
Precision (spam)   96.53%
Recall (spam)      93.29%

Evaluated on 1,115 held-out test messages (20% split).

Confusion Matrix:

              Predicted Ham   Predicted Spam
Actual Ham         961              5
Actual Spam         10            139

Only 10 false negatives (spam missed) and 5 false positives out of 1,115 test samples.
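
The four scores above can be re-derived from the confusion matrix. The snippet below is a small arithmetic check, not the author's evaluation script; the variable names are chosen here for illustration.

```python
# Counts taken from the confusion matrix above (spam is the positive class).
tn, fp = 961, 5    # actual ham:  predicted ham, predicted spam
fn, tp = 10, 139   # actual spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total}")
print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
# → total=1115
# → accuracy=98.65% precision=96.53% recall=93.29% f1=94.88%
```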

πŸ—‚οΈ Dataset

  • Source: ucirvine/sms_spam
  • Size: 5,574 SMS messages (4,827 ham + 747 spam)
  • Split: 80% train / 20% test, stratified
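
A stratified split keeps the ham/spam ratio roughly the same in both parts. The sketch below assumes scikit-learn's train_test_split was used, which the card does not state. It runs on placeholder labels built from the class counts above rather than on the real messages, and random_state=42 is an arbitrary seed chosen for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder labels with the class counts listed above: 0 = ham, 1 = spam.
labels = [0] * 4827 + [1] * 747
messages = [f"message {i}" for i in range(len(labels))]  # stand-ins for the real texts

# stratify=labels keeps the class ratio (roughly) equal in both splits.
# random_state=42 is an arbitrary seed, not taken from the model card.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))
print(Counter(y_train), Counter(y_test))
```

Under these assumptions the split sizes come out at 4,459 training and 1,115 test messages, in line with the figures reported in this card.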

🚀 Quick Start

import pickle
from huggingface_hub import hf_hub_download

# Load model
tfidf_path = hf_hub_download("anu56787ty/spam-detector-tfidf-lr", "tfidf_vectorizer.pkl")
clf_path   = hf_hub_download("anu56787ty/spam-detector-tfidf-lr", "logistic_regression.pkl")

with open(tfidf_path, "rb") as f:
    tfidf = pickle.load(f)
with open(clf_path, "rb") as f:
    clf = pickle.load(f)

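def predict(text):
    """Classify one message; label 1 is treated as spam, 0 as ham."""
    vec   = tfidf.transform([text])
    pred  = int(clf.predict(vec)[0])
    proba = clf.predict_proba(vec)[0]
    # proba is indexed by class position, which equals the label
    # when clf.classes_ is [0, 1] (assumed here).
    confidence = round(float(proba[pred]), 3)
    return {"label": "spam" if pred == 1 else "ham", "confidence": confidence}

# Try it!
print(predict("Congratulations! You won a FREE iPhone! Click now!"))
# → {'label': 'spam', 'confidence': 0.977}

print(predict("Hey, what time are we meeting for lunch?"))
# → {'label': 'ham', 'confidence': 0.986}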

πŸ—οΈ Architecture

Input text
    ↓
TF-IDF Vectorizer
  β€’ max_features=10,000
  β€’ ngram_range=(1,2)   ← unigrams + bigrams
  β€’ sublinear_tf=True
    ↓
Logistic Regression
  β€’ C=5.0
  β€’ class_weight='balanced'   ← handles class imbalance
    ↓
Output: ham / spam + confidence score
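
With class_weight='balanced', scikit-learn weights each class by n_samples / (n_classes * class_count). The worked example below applies that formula to the full-dataset counts from the Dataset section. The actual fit would use the training-split counts, which the card does not list, so the numbers are illustrative only.

```python
# Class counts for the full dataset, as listed in the Dataset section.
class_counts = {"ham": 4827, "spam": 747}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {name: n_samples / (n_classes * count) for name, count in class_counts.items()}

for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
# → ham: 0.577
# → spam: 3.731
```

On these counts a misclassified spam message costs about 6.5 times as much as a misclassified ham message during training, which offsets the roughly 6.5:1 class imbalance.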

✅ Why This Model?

Feature            Value
💰 Cost            Free ($0)
⚡ Speed           < 1 ms per prediction
💾 Size            ~2 MB total
🖥️ Hardware        CPU only
📦 Dependencies    scikit-learn, huggingface_hub

📈 Training

Dataset : ucirvine/sms_spam (5,574 messages)
Train   : 4,459 messages
Test    : 1,115 messages
Time    : < 5 seconds on CPU
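
The training run can be approximated with a scikit-learn Pipeline that uses the hyperparameters from the Architecture section. This is a sketch, not the original training script: the toy corpus, the build_pipeline helper, and max_iter=1000 are assumptions added for illustration, whereas the real model was fitted on the 4,459 training messages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline():
    """Hypothetical helper: TF-IDF + logistic regression with the card's settings."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(max_features=10000, ngram_range=(1, 2), sublinear_tf=True)),
        # max_iter=1000 is an assumption; the card does not state it.
        ("clf", LogisticRegression(C=5.0, class_weight="balanced", max_iter=1000)),
    ])


# Tiny made-up corpus so the example runs without downloading the dataset.
texts = [
    "Win a free prize now, click here",
    "Free entry to win cash, text now",
    "Are we still meeting for lunch today?",
    "Can you call me when you get home?",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

pipeline = build_pipeline()
pipeline.fit(texts, labels)

# Class probabilities for one new message, keyed by label.
probabilities = pipeline.predict_proba(["Free prize, click now"])[0]
print({int(c): round(float(p), 3) for c, p in zip(pipeline.classes_, probabilities)})
```

Bundling both steps in one Pipeline keeps vectorizer and classifier together, although the published repository stores them as two separate pickle files.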

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

This repository contains pickled scikit-learn objects (tfidf_vectorizer.pkl and logistic_regression.pkl), not a transformers checkpoint, so the AutoTokenizer and AutoModelForCausalLM loaders do not apply to it. Load the two files with hf_hub_download and pickle, as shown in Quick Start.
