Senior Project Notice

This repository was created for a senior project in ENGT 375 Applied Machine Learning at Old Dominion University. It is provided for educational and research demonstration purposes only. It is not intended for production use, security filtering, or making real-world spam/phishing decisions. Always use established security tools for operational email protection.

Spam Email Classifier - sklearn Voting Ensemble (Gradio)

⚠️ DEPRECATED - This model has been merged into spam-xai-classifier-v2. v2 retrains the same RF + LR + SVM VotingClassifier ensemble on the full 99,999-sample corpus (vs. the smaller subset used here) and adds the LIME / SHAP / ELI5 XAI workflow. This repository is preserved for archival reference only; no further updates will be made. Please use v2 for new work.

ENGT 375 - Applied Machine Learning | Spring 2026 | ODU

A voting ensemble classifier (Random Forest + Logistic Regression + SVM) for spam email detection, with LIME and SHAP explainability support.

Model Details

  • Architecture: VotingClassifier (soft voting)
    • Random Forest
    • Logistic Regression
    • Calibrated LinearSVC
  • Features: TF-IDF (text) + 24 hand-crafted metadata features
  • Framework: scikit-learn
  • Task: Binary classification (spam / ham)
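
The architecture listed above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's training script: the hyperparameters (`n_estimators=50`, `max_iter=1000`, `cv=3`) and the synthetic data are assumptions made so the example runs quickly. Soft voting averages each member's predicted class probabilities, which is why the LinearSVC is wrapped in a calibrator that supplies `predict_proba`.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC


def build_voting_ensemble(random_state=42):
    """Assemble a soft-voting ensemble of RF, LR, and a calibrated LinearSVC."""
    # LinearSVC has no predict_proba, so it is wrapped in a probability calibrator.
    svm = CalibratedClassifierCV(LinearSVC(random_state=random_state), cv=3)
    return VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=random_state)),
            ("lr", LogisticRegression(max_iter=1000, random_state=random_state)),
            ("svm", svm),
        ],
        voting="soft",  # average the members' class probabilities
    )


# Synthetic stand-in for the real TF-IDF + metadata feature matrix (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

ensemble = build_voting_ensemble().fit(X, y)
proba = ensemble.predict_proba(X[:3])  # one row per sample: [P(class 0), P(class 1)]
```

Each row of `proba` sums to 1, and the second column is the value that would be compared against a decision threshold.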

Evaluation Results

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| VotingEnsemble | 0.974 | 0.966 | 0.980 | 0.973 |
| RandomForest | 0.978 | 0.976 | 0.977 | 0.976 |
| LogisticRegression | 0.966 | 0.955 | 0.973 | 0.964 |
| SVM | 0.969 | 0.963 | 0.972 | 0.967 |
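
As a quick consistency check on the results above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below does that for each row. Because the reported precision and recall are already rounded to three decimals, small differences in the last digit are expected.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the evaluation results.
reported = {
    "VotingEnsemble": (0.966, 0.980, 0.973),
    "RandomForest": (0.976, 0.977, 0.976),
    "LogisticRegression": (0.955, 0.973, 0.964),
    "SVM": (0.963, 0.972, 0.967),
}

recomputed = {name: f1_from(p, r) for name, (p, r, _) in reported.items()}

for name, (_, _, f1) in reported.items():
    print(f"{name}: reported {f1:.3f}, recomputed {recomputed[name]:.4f}")
```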

Training Details

| Parameter | Value |
|---|---|
| Training examples | 70,000 |
| Test examples | 30,000 |
| Random state | 42 |
| Optimal threshold | 0.3714 |
| Total features | 3,024 (3,000 TF-IDF + 24 metadata) |
| Voting strategy | Soft voting |
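
This card does not say how the 0.3714 threshold was selected. One common approach, assumed here purely for illustration, is to scan candidate thresholds on held-out predictions and keep the one with the highest F1. The labels and scores below are invented toy values, not project data.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when predicting 1 for scores >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores):
    """Return the observed score that maximizes F1 when used as the cutoff."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))


# Toy held-out labels (1 = spam) and predicted spam probabilities (assumption).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.70, 0.10, 0.90]

chosen = best_threshold(y_true, scores)
```

A cutoff below 0.5 trades some precision for recall, which matters when missing a spam message is costlier than flagging a legitimate one.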

Dataset

Files

| File | Purpose |
|---|---|
| voting_model.joblib | Trained VotingClassifier ensemble (145 MB) |
| tfidf_vectorizer.joblib | Fitted TF-IDF vectorizer |
| meta_scaler.joblib | MinMaxScaler for metadata features |
| feature_names.joblib | Feature name list for explainability |
| optimal_threshold.joblib | Calibrated decision threshold |
| training_sample.joblib | Sample of training data for LIME/SHAP |
| training_report.json | Training metrics and classification report |

Usage

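import joblib
from scipy.sparse import csr_matrix, hstack

# preprocess_text and compute_metadata_features come from this repository's utils module
from utils import preprocess_text, compute_metadata_features

# Load the trained ensemble and the fitted preprocessing artifacts
model = joblib.load("voting_model.joblib")
tfidf = joblib.load("tfidf_vectorizer.joblib")
scaler = joblib.load("meta_scaler.joblib")
threshold = joblib.load("optimal_threshold.joblib")

email = "Congratulations! You've won a free iPhone!"

# Build one feature row: TF-IDF text features followed by scaled metadata features
text_features = tfidf.transform([preprocess_text(email)])
meta_features = scaler.transform([compute_metadata_features(email)])
features = hstack([text_features, csr_matrix(meta_features)])

# Spam probability (column 1), compared against the tuned decision threshold
proba = model.predict_proba(features)[0][1]
label = "SPAM" if proba >= threshold else "HAM"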
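
To make the feature-assembly step concrete without the saved artifacts, the sketch below fits a small `TfidfVectorizer` and `MinMaxScaler` on three toy emails and stacks the two blocks column-wise. The `toy_metadata` function is a hypothetical stand-in for `compute_metadata_features`: it returns 3 values rather than the project's 24, and the real feature definitions are not shown in this card.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler


def toy_metadata(text):
    """Hypothetical metadata: length, exclamation marks, uppercase letters."""
    return [len(text), text.count("!"), sum(ch.isupper() for ch in text)]


emails = [
    "Congratulations! You've won a free iPhone!",
    "Agenda for Monday's project meeting attached",
    "FREE entry!!! Claim your prize now",
]

# Fit stand-ins for the saved vectorizer and scaler on the toy corpus.
tfidf = TfidfVectorizer(max_features=3000).fit(emails)
scaler = MinMaxScaler().fit([toy_metadata(e) for e in emails])

text_features = tfidf.transform(emails)  # sparse TF-IDF block
meta_features = scaler.transform([toy_metadata(e) for e in emails])  # dense, scaled to [0, 1]

# Column-wise stack: TF-IDF columns first, then the metadata columns.
features = hstack([text_features, csr_matrix(meta_features)]).tocsr()
print(features.shape)
```

With the real artifacts the same stacking yields the 3,024 columns listed under Training Details. The column order has to match the order used at training time.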

Interactive Demo

Intended Use

This model is an educational demonstration of sklearn ensemble methods with explainable AI (XAI), created as part of a university course project. It is suitable for:

  • Learning how voting ensembles combine multiple classifiers
  • Understanding TF-IDF text vectorization with metadata feature engineering
  • Exploring LIME and SHAP explanations for model interpretability

It is not intended for production spam filtering.

Limitations

  • Bag-of-words approach: TF-IDF cannot distinguish legitimate marketing from spam when vocabulary overlaps significantly
  • Binary classification only (spam/ham): no multi-class or severity ranking
  • Trained on English emails only: not suitable for other languages
  • Static vocabulary: cannot adapt to new spam patterns without retraining
  • Threshold tuning is dataset-specific and may not generalize
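
The static-vocabulary limitation can be seen in a simplified sketch. It uses plain token counts rather than TF-IDF weights (a deliberate simplification), and the example texts are invented. Once the vocabulary is fixed at fit time, tokens that never appeared in training contribute nothing to the feature vector.

```python
from collections import Counter


def fit_vocabulary(documents):
    """Map each token seen during fitting to a fixed column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Count known tokens; tokens outside the vocabulary are silently dropped."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        idx = vocabulary.get(tok)
        if idx is not None:
            vector[idx] = n
    return vector


vocab = fit_vocabulary(["win a free prize", "meeting agenda attached"])

seen = vectorize("free prize inside", vocab)          # "inside" is dropped
unseen = vectorize("crypto airdrop giveaway", vocab)  # every token is new
```

Here `unseen` is an all-zero vector, so a classifier would have only the metadata features to go on for that message until the vectorizer is refit on newer data.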

Related Models

| Model | Description | Link |
|---|---|---|
| spam-classifier-mlx | Qwen 3.5 0.8B MLX LoRA fine-tune | VoltageVagabond/spam-classifier-mlx |
| spam-classifier-liquid | Liquid AI LFM2.5-1.2B LoRA fine-tune | VoltageVagabond/spam-classifier-liquid |
| spam-xai-model | Calibrated Random Forest with XAI | VoltageVagabond/spam-xai-model |

Citation

@misc{voltagevagabond2026spamgradio,
  title={Spam Email Classifier - sklearn Voting Ensemble (Gradio)},
  author={VoltageVagabond},
  year={2026},
  howpublished={\url{https://huggingface.co/VoltageVagabond/spam-classifier-gradio-model}},
  note={ENGT 375 - Applied Machine Learning, Old Dominion University, Spring 2026}
}