Senior Project Notice
This repository was created for a senior project in ENGT 375 Applied Machine Learning at Old Dominion University. It is provided for educational and research demonstration purposes only. It is not intended for production use, security filtering, or making real-world spam/phishing decisions. Always use established security tools for operational email protection.
Spam Email Classifier: sklearn Voting Ensemble (Gradio)
⚠️ DEPRECATED: This model has been merged into spam-xai-classifier-v2. v2 retrains the same RF + LR + SVM VotingClassifier ensemble on the full 99,999-sample corpus (vs. the smaller subset used here) and adds the LIME / SHAP / ELI5 XAI workflow. This repository is preserved for archival reference only; no further updates will be made. Please use v2 for new work.
ENGT 375 - Applied Machine Learning | Spring 2026 | ODU
A voting ensemble classifier (Random Forest + Logistic Regression + SVM) for spam email detection, with LIME and SHAP explainability support.
Model Details
- Architecture: VotingClassifier (soft voting)
  - Random Forest
  - Logistic Regression
  - Calibrated LinearSVC
- Features: TF-IDF (text) + 24 hand-crafted metadata features
- Framework: scikit-learn
- Task: Binary classification (spam / ham)
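The architecture listed above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's training script: the hyperparameters (`n_estimators=50`, `max_iter=1000`, `cv=3`) and the `make_classification` stand-in for the real feature matrix are assumptions chosen only to keep the example small.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real TF-IDF + metadata feature matrix (assumption).
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# LinearSVC has no predict_proba, so it is wrapped in a calibrator
# before it can take part in soft voting.
calibrated_svm = CalibratedClassifierCV(LinearSVC(random_state=42), cv=3)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", calibrated_svm),
    ],
    voting="soft",  # average the members' predicted probabilities
)
ensemble.fit(X_train, y_train)

# Column 1 holds the probability of the positive (spam) class.
spam_proba = ensemble.predict_proba(X_test)[:, 1]
```

Soft voting averages class probabilities rather than counting votes, which is why the SVM member needs calibration first.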
Evaluation Results
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| VotingEnsemble | 0.974 | 0.966 | 0.980 | 0.973 |
| RandomForest | 0.978 | 0.976 | 0.977 | 0.976 |
| LogisticRegression | 0.966 | 0.955 | 0.973 | 0.964 |
| SVM | 0.969 | 0.963 | 0.972 | 0.967 |
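As a quick sanity check, the F1 column can be recomputed from the precision and recall columns, since F1 is their harmonic mean. The sketch below uses only the rounded values from the table, so a small tolerance is expected; the dictionary and function names are illustrative.

```python
# (precision, recall, reported F1) copied from the evaluation table.
reported = {
    "VotingEnsemble": (0.966, 0.980, 0.973),
    "RandomForest": (0.976, 0.977, 0.976),
    "LogisticRegression": (0.955, 0.973, 0.964),
    "SVM": (0.963, 0.972, 0.967),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1_from(p, r) for name, (p, r, _) in reported.items()}

for name, (_, _, f1) in reported.items():
    # Inputs are rounded to three decimals, so compare with a tolerance.
    print(f"{name}: reported={f1:.3f} recomputed={recomputed[name]:.4f}")
```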
Training Details
| Parameter | Value |
|---|---|
| Training examples | 70,000 |
| Test examples | 30,000 |
| Random state | 42 |
| Optimal threshold | 0.3714 |
| Total features | 3,024 (3,000 TF-IDF + 24 metadata) |
| Voting strategy | Soft voting |
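The table reports a tuned decision threshold of 0.3714 but not how it was chosen. One common approach, shown here purely as an assumption, is to pick the threshold that maximizes F1 along the precision-recall curve. The labels and scores below are synthetic, so the resulting value will not match the table.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

rng = np.random.default_rng(42)

# Toy labels and scores: spam (1) scores skew higher than ham (0).
y_true = rng.integers(0, 2, size=500)
scores = np.clip(0.35 * y_true + rng.normal(0.3, 0.15, size=500), 0.0, 1.0)

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision/recall carry one more entry than thresholds; drop the final point.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against division by zero

best_threshold = float(thresholds[np.argmax(f1)])

# Compare the tuned threshold with the default 0.5 cut-off.
f1_at_best = f1_score(y_true, (scores >= best_threshold).astype(int))
f1_at_half = f1_score(y_true, (scores >= 0.5).astype(int))
```

A threshold below 0.5, like the one reported, trades some precision for higher recall on the spam class.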
Dataset
- VoltageVagabond/spam-email-dataset
- Sources: Kaggle 190K spam/ham + GitHub email-dataset
Files
| File | Purpose |
|---|---|
| `voting_model.joblib` | Trained VotingClassifier ensemble (145 MB) |
| `tfidf_vectorizer.joblib` | Fitted TF-IDF vectorizer |
| `meta_scaler.joblib` | MinMaxScaler for metadata features |
| `feature_names.joblib` | Feature name list for explainability |
| `optimal_threshold.joblib` | Calibrated decision threshold |
| `training_sample.joblib` | Sample of training data for LIME/SHAP |
| `training_report.json` | Training metrics and classification report |
Usage
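```python
import joblib
from scipy.sparse import csr_matrix, hstack

# Project helper module; it must be importable from the working directory.
from utils import preprocess_text, compute_metadata_features

# Load the fitted artifacts listed under Files.
model = joblib.load("voting_model.joblib")
tfidf = joblib.load("tfidf_vectorizer.joblib")
scaler = joblib.load("meta_scaler.joblib")
threshold = joblib.load("optimal_threshold.joblib")

email = "Congratulations! You've won a free iPhone!"

# Build one feature row: TF-IDF columns followed by scaled metadata columns.
text_features = tfidf.transform([preprocess_text(email)])
meta_features = scaler.transform([compute_metadata_features(email)])
features = hstack([text_features, csr_matrix(meta_features)]).tocsr()

# Spam-class probability, compared against the tuned threshold.
proba = model.predict_proba(features)[0][1]
label = "SPAM" if proba >= threshold else "HAM"
```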
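To make the feature-assembly step concrete without the saved artifacts, here is a small self-contained sketch. `toy_metadata` is a hypothetical stand-in for `compute_metadata_features` (three illustrative features rather than the model's 24), and the vectorizer and scaler are fitted on three toy emails rather than loaded from disk.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

emails = [
    "Congratulations! You've won a free iPhone!",
    "Meeting moved to 3pm, see agenda attached.",
    "FREE prize!!! Click now to claim your reward",
]


def toy_metadata(text):
    """Hypothetical metadata: length, '!' count, uppercase ratio."""
    n_chars = len(text)
    n_exclaim = text.count("!")
    upper_ratio = sum(c.isupper() for c in text) / max(n_chars, 1)
    return [float(n_chars), float(n_exclaim), upper_ratio]


# Sparse text block: one column per vocabulary term (capped at 3,000).
tfidf = TfidfVectorizer(max_features=3000)
text_features = tfidf.fit_transform(emails)

# Dense metadata block, scaled column-wise into [0, 1].
scaler = MinMaxScaler()
meta_features = scaler.fit_transform(np.array([toy_metadata(e) for e in emails]))

# Text columns first, metadata columns last, kept sparse throughout.
features = hstack([text_features, csr_matrix(meta_features)]).tocsr()
print(features.shape)
```

The column order matters: a model trained on text-then-metadata columns expects the same layout at inference time.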
Intended Use
This model is an educational demonstration of sklearn ensemble methods with explainable AI (XAI), created as part of a university course project. It is suitable for:
- Learning how voting ensembles combine multiple classifiers
- Understanding TF-IDF text vectorization with metadata feature engineering
- Exploring LIME and SHAP explanations for model interpretability
It is not intended for production spam filtering.
Limitations
- Bag-of-words approach: TF-IDF cannot distinguish legitimate marketing from spam when vocabulary overlaps significantly
- Binary classification only (spam/ham): no multi-class or severity ranking
- Trained on English emails only: not suitable for other languages
- Static vocabulary: cannot adapt to new spam patterns without retraining
- Threshold tuning is dataset-specific and may not generalize
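The static-vocabulary limitation is easy to see in isolation. In this minimal sketch with made-up documents, a fitted `TfidfVectorizer` silently drops any token it did not see during fitting, so a message made entirely of new terms produces an all-zero text vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "win a free prize now",
    "claim your free reward today",
    "project meeting agenda attached",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)

# Every token here is absent from the fitted vocabulary.
unseen = vectorizer.transform(["crypto airdrop wallet giveaway"])

# Only "free" is known; the other tokens are ignored.
partial = vectorizer.transform(["free crypto airdrop"])

print(unseen.nnz, partial.nnz)  # number of non-zero entries in each row
```

In such a case the classifier would be left with only the metadata features, which is why new spam patterns require retraining the vectorizer and the model.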
Related Models
| Model | Description | Link |
|---|---|---|
| spam-classifier-mlx | Qwen 3.5 0.8B MLX LoRA fine-tune | VoltageVagabond/spam-classifier-mlx |
| spam-classifier-liquid | Liquid AI LFM2.5-1.2B LoRA fine-tune | VoltageVagabond/spam-classifier-liquid |
| spam-xai-model | Calibrated Random Forest with XAI | VoltageVagabond/spam-xai-model |
Citation
```bibtex
@misc{voltagevagabond2026spamgradio,
  title={Spam Email Classifier: sklearn Voting Ensemble (Gradio)},
  author={VoltageVagabond},
  year={2026},
  howpublished={\url{https://huggingface.co/VoltageVagabond/spam-classifier-gradio-model}},
  note={ENGT 375 - Applied Machine Learning, Old Dominion University, Spring 2026}
}
```