Senior Project Notice

This repository was created for a senior project in ENGT 375 Applied Machine Learning at Old Dominion University. It is provided for educational and research demonstration purposes only. It is not intended for production use, security filtering, or making real-world spam/phishing decisions. Always use established security tools for operational email protection.

Spam Email Classifier — sklearn Voting Ensemble (Gradio)

⚠️ DEPRECATED — This model has been merged into spam-xai-classifier-v2. v2 retrains the same RF + LR + SVM VotingClassifier ensemble on the full 99,999-sample corpus (vs. the smaller subset used here) and adds the LIME / SHAP / ELI5 XAI workflow. This repository is preserved for archival reference only — no further updates will be made. Please use v2 for new work.

ENGT 375 — Applied Machine Learning | Spring 2026 | ODU

A voting ensemble classifier (Random Forest + Logistic Regression + SVM) for spam email detection, with LIME and SHAP explainability support.

Model Details

Architecture: VotingClassifier (soft voting)
- Random Forest
- Logistic Regression
- Calibrated LinearSVC
Features: TF-IDF (text) + 24 hand-crafted metadata features
Framework: scikit-learn
Task: Binary classification (spam / ham)
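
As a rough illustration of the architecture listed above, the following sketch assembles a soft-voting ensemble from the same three estimator types. The hyperparameters, the estimator names (`rf`, `lr`, `svm`), and the synthetic data are assumptions for demonstration only and are not taken from the actual training script.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption); the real model is trained on
# TF-IDF plus metadata features.
X, y = make_classification(n_samples=400, n_features=20, random_state=42)

# LinearSVC has no predict_proba, so it is wrapped in a calibrator
# to make it usable with soft voting.
svm = CalibratedClassifierCV(LinearSVC(), cv=3)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", svm),
    ],
    voting="soft",  # average class probabilities instead of counting votes
)
ensemble.fit(X, y)

# One row per sample, one column per class (ham, spam).
proba = ensemble.predict_proba(X[:5])
```

Wrapping `LinearSVC` in `CalibratedClassifierCV` is what makes the "Calibrated LinearSVC" entry compatible with soft voting, since soft voting requires every member to expose class probabilities.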

Evaluation Results

Model	Accuracy	Precision	Recall	F1 Score
VotingEnsemble	0.974	0.966	0.980	0.973
RandomForest	0.978	0.976	0.977	0.976
LogisticRegression	0.966	0.955	0.973	0.964
SVM	0.969	0.963	0.972	0.967
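
As a quick consistency check, F1 is the harmonic mean of precision and recall, so each F1 value in the table can be recomputed from the other two columns. The helper name `f1_from` is illustrative; the numbers are copied from the table above.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    "VotingEnsemble": (0.966, 0.980, 0.973),
    "RandomForest": (0.976, 0.977, 0.976),
    "LogisticRegression": (0.955, 0.973, 0.964),
    "SVM": (0.963, 0.972, 0.967),
}

for name, (p, r, f1) in reported.items():
    computed = f1_from(p, r)
    # The inputs are rounded to three decimals, so allow a small tolerance.
    assert abs(computed - f1) < 1e-3, name
    print(f"{name}: computed F1 = {computed:.4f}, reported = {f1:.3f}")
```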

Training Details

Parameter	Value
Training examples	70,000
Test examples	30,000
Random state	42
Optimal threshold	0.3714
Total features	3,024 (3,000 TF-IDF + 24 metadata)
Voting strategy	Soft voting
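
The card does not state how the 0.3714 threshold was selected. One common approach, shown here purely as an assumption, is to pick the probability cutoff that maximizes F1 on held-out data. The sketch uses synthetic data, a single logistic regression in place of the ensemble, and the same 70/30 split ratio and random state listed in the table.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), split 70/30 with random_state=42.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_val)[:, 1]  # probability of the positive class

precision, recall, thresholds = precision_recall_curve(y_val, scores)

# precision and recall have one more entry than thresholds; drop the
# final point, which has no associated threshold.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)

best_threshold = float(thresholds[np.argmax(f1)])
print(f"F1-maximizing threshold: {best_threshold:.4f}")
```

A threshold below 0.5, like the one reported here, labels more messages as spam and so tends to trade some precision for recall.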

Dataset

VoltageVagabond/spam-email-dataset
Sources: Kaggle 190K spam/ham + GitHub email-dataset

Files

File	Purpose
`voting_model.joblib`	Trained VotingClassifier ensemble (145MB)
`tfidf_vectorizer.joblib`	Fitted TF-IDF vectorizer
`meta_scaler.joblib`	MinMaxScaler for metadata features
`feature_names.joblib`	Feature name list for explainability
`optimal_threshold.joblib`	Calibrated decision threshold
`training_sample.joblib`	Sample of training data for LIME/SHAP
`training_report.json`	Training metrics and classification report
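
To show how artifacts like `tfidf_vectorizer.joblib` and `meta_scaler.joblib` fit together, here is a minimal sketch that fits both transformers on toy data, saves and reloads them with joblib, and stacks their outputs into one feature matrix. The toy texts, the two made-up metadata columns, and `max_features=3000` (inferred from the 3,000 TF-IDF features in the training table) are assumptions, not the actual training configuration.

```python
import os
import tempfile

import joblib
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

texts = ["win a free prize now", "meeting moved to friday", "free offer click now"]
# Hypothetical metadata: character count and exclamation-mark count.
meta = np.array([[20, 1], [23, 0], [20, 3]], dtype=float)

tfidf = TfidfVectorizer(max_features=3000).fit(texts)
scaler = MinMaxScaler().fit(meta)

with tempfile.TemporaryDirectory() as tmp:
    tfidf_path = os.path.join(tmp, "tfidf_vectorizer.joblib")
    scaler_path = os.path.join(tmp, "meta_scaler.joblib")
    joblib.dump(tfidf, tfidf_path)
    joblib.dump(scaler, scaler_path)
    # Reload, as an inference script would.
    tfidf_loaded = joblib.load(tfidf_path)
    scaler_loaded = joblib.load(scaler_path)

# Text columns first, scaled metadata columns appended on the right.
features = hstack(
    [tfidf_loaded.transform(texts), csr_matrix(scaler_loaded.transform(meta))]
).tocsr()
n_text = len(tfidf_loaded.vocabulary_)
print(features.shape, n_text)
```

The column order matters at inference time: the model expects the same layout it was trained on, which is why the fitted transformers are shipped alongside the model file.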

Usage

import joblib
from scipy.sparse import csr_matrix, hstack
from utils import preprocess_text, compute_metadata_features

# Load the trained ensemble and the fitted preprocessing artifacts
model = joblib.load("voting_model.joblib")
tfidf = joblib.load("tfidf_vectorizer.joblib")
scaler = joblib.load("meta_scaler.joblib")
threshold = joblib.load("optimal_threshold.joblib")

email = "Congratulations! You've won a free iPhone!"

# Build one feature row: TF-IDF columns followed by scaled metadata columns
text_features = tfidf.transform([preprocess_text(email)])
meta_features = scaler.transform([compute_metadata_features(email)])
features = hstack([text_features, csr_matrix(meta_features)])

# Compare the spam-class probability against the tuned threshold
proba = model.predict_proba(features)[0][1]
label = "SPAM" if proba >= threshold else "HAM"

Interactive Demo

Gradio Space

Intended Use

This model is an educational demonstration of sklearn ensemble methods with explainable AI (XAI), created as part of a university course project. It is suitable for:

- Learning how voting ensembles combine multiple classifiers
- Understanding TF-IDF text vectorization with metadata feature engineering
- Exploring LIME and SHAP explanations for model interpretability

It is not intended for production spam filtering.
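
For the first learning goal, the difference between soft and hard voting can be worked through by hand. The probabilities below are invented for illustration, and the 0.5 cutoff is used only to keep the arithmetic simple; equal weighting matches the default behavior of scikit-learn's `VotingClassifier`.

```python
import numpy as np

# Hypothetical spam probabilities for four emails.
# Rows: classifiers (RF, LR, SVM); columns: emails.
spam_proba = np.array([
    [0.90, 0.20, 0.95, 0.40],
    [0.80, 0.10, 0.45, 0.70],
    [0.70, 0.30, 0.45, 0.60],
])

# Soft voting: average the probabilities, then threshold the mean.
soft_score = spam_proba.mean(axis=0)
soft_label = soft_score >= 0.5

# Hard voting: threshold each classifier first, then take the majority.
hard_votes = (spam_proba >= 0.5).sum(axis=0)
hard_label = hard_votes >= 2

print("soft:", soft_label.tolist())
print("hard:", hard_label.tolist())
```

The third email is where the two schemes disagree: one very confident classifier pulls the averaged probability above the cutoff, while a majority count would label it ham.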

Limitations

- Bag-of-words approach: TF-IDF cannot distinguish legitimate marketing from spam when vocabulary overlaps significantly
- Binary classification only (spam/ham): no multi-class or severity ranking
- Trained on English emails only; not suitable for other languages
- Static vocabulary: cannot adapt to new spam patterns without retraining
- Threshold tuning is dataset-specific and may not generalize
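
The static-vocabulary limitation is easy to reproduce with a toy vectorizer. In this sketch (toy texts are assumptions, not samples from the dataset), words that were absent at fit time simply produce no features, so a message built entirely from new terms becomes an all-zero TF-IDF row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["win a free prize now", "meeting moved to friday"]
vectorizer = TfidfVectorizer().fit(train_texts)

# A message made only of tokens the vectorizer never saw during fitting.
unseen = vectorizer.transform(["crypto airdrop giveaway"])

# A message that shares two tokens ("free", "prize") with the training texts.
seen = vectorizer.transform(["free crypto prize"])

# nnz is the number of nonzero features in each sparse row.
print("unseen message nonzero features:", unseen.nnz)
print("partially seen message nonzero features:", seen.nnz)
```

In that situation the text portion of the model has nothing to work with, and any signal has to come from the metadata features alone.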

Related Models

Model	Description	Link
spam-classifier-mlx	Qwen 3.5 0.8B MLX LoRA fine-tune	VoltageVagabond/spam-classifier-mlx
spam-classifier-liquid	Liquid AI LFM2.5-1.2B LoRA fine-tune	VoltageVagabond/spam-classifier-liquid
spam-xai-model	Calibrated Random Forest with XAI	VoltageVagabond/spam-xai-model

Citation

@misc{voltagevagabond2026spamgradio,
  title={Spam Email Classifier — sklearn Voting Ensemble (Gradio)},
  author={VoltageVagabond},
  year={2026},
  howpublished={\url{https://huggingface.co/VoltageVagabond/spam-classifier-gradio-model}},
  note={ENGT 375 — Applied Machine Learning, Old Dominion University, Spring 2026}
}
