title: Spam Filter App
emoji: π
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
Spam Email Classification System (ML + Gradio)
An end-to-end Spam Email Classification project built using Machine Learning, following a modular, production-ready architecture, and deployed with an interactive Gradio UI.
This system classifies emails as Spam or Not Spam using TF-IDF feature extraction and a Support Vector Machine (SVM) classifier, prioritizing high precision to reduce false positives.
Project Overview
Spam emails often contain promotions, scams, or malicious content. Manual filtering is inefficient and error-prone.
This project automates spam detection by leveraging Natural Language Processing (NLP) and Machine Learning, providing a reliable and scalable solution.
Objectives
- Clean and preprocess raw email text
- Extract meaningful textual features
- Train and compare multiple ML models
- Evaluate performance using standard classification metrics
- Select the best-performing model
- Deploy the model with a user-friendly web interface
Dataset
Source: Kaggle, Spam Email Dataset
https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset/data
Columns:
- text: Email content
- spam: Target label
  - 1 = Spam
  - 0 = Not Spam
The dataset contains a mix of promotional, scam, and legitimate emails.
Project Workflow
1. Data Understanding
- Loaded and inspected dataset structure
- Checked shape, missing values, and duplicates
- Reviewed sample emails for context
2. Text Preprocessing
Applied NLP techniques to clean and normalize text:
- Lowercasing
- Removing special characters and punctuation
- Tokenization
- Stopword removal
- Lemmatization
This ensured consistent and noise-free input for modeling.
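The cleaning steps listed above can be sketched as a single function. This is a minimal illustration and not the project's actual `preprocessing.py`: the stopword set is a tiny hand-picked stand-in for NLTK's English stopword list, and the `lemmatize` parameter is a hypothetical hook that defaults to a no-op so the sketch runs without any NLTK data downloads.

```python
import re
from typing import Callable, List

# Tiny illustrative stopword set (assumption); the project uses NLTK's list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you", "your"}


def preprocess(text: str, lemmatize: Callable[[str], str] = lambda w: w) -> str:
    """Lowercase, strip punctuation, tokenize, drop stopwords, then lemmatize."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens: List[str] = text.split()  # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [lemmatize(t) for t in tokens]  # identity unless a lemmatizer is passed
    return " ".join(tokens)


print(preprocess("WIN a FREE prize!!! Claim your reward, now."))
# -> win free prize claim reward now
```

With NLTK installed and its WordNet data available, `WordNetLemmatizer().lemmatize` could be passed as the `lemmatize` argument.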
3. Exploratory Data Analysis (EDA)
- Analyzed class distribution (Spam vs Not Spam)
- Studied email length (words & characters)
- Identified frequent words in spam and non-spam emails
- Visualized patterns to understand data behavior
4. Feature Engineering
- Generated numerical features:
  - Word count
  - Character count
- Compared feature distributions between spam and ham emails
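As a rough sketch of the two length features and a per-class comparison, the helpers below use only the standard library. The function names and the sample emails are made up for illustration; the notebook most likely does the equivalent with Pandas.

```python
from collections import defaultdict
from statistics import mean


def length_features(text):
    """Return the word count and character count of one email."""
    return {"word_count": len(text.split()), "char_count": len(text)}


def mean_features_by_class(emails):
    """Average each length feature per label (1 = spam, 0 = ham)."""
    grouped = defaultdict(list)
    for text, label in emails:
        grouped[label].append(length_features(text))
    return {
        label: {
            name: mean(f[name] for f in feats)
            for name in ("word_count", "char_count")
        }
        for label, feats in grouped.items()
    }


# Made-up sample rows, not taken from the dataset.
sample = [("free prize now", 1), ("win", 1), ("see you at noon", 0)]
print(mean_features_by_class(sample))
```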
5. Model Building
Text was vectorized using:
- Bag of Words (BoW)
- TF-IDF
- TF-IDF (1-2 grams)
Models trained and evaluated:
- Naive Bayes (Multinomial, Bernoulli, Gaussian)
- Random Forest
- Extra Trees
- Linear Support Vector Machine (SVM)
Sparse feature matrices were converted to dense arrays where a model required it (for example, Gaussian Naive Bayes).
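The vectorizer and model combinations can be sketched with scikit-learn as follows. The toy corpus is invented for illustration, only three of the listed models are shown, and all hyperparameters are library defaults rather than the notebook's settings. The `needs_dense` flag is a hypothetical name marking models that cannot take sparse input.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import LinearSVC

# Made-up toy corpus for illustration only (1 = spam, 0 = not spam).
TEXTS = [
    "win free prize now",
    "claim free cash reward",
    "cheap loans apply now",
    "meeting moved to monday",
    "please review the attached report",
    "lunch tomorrow at noon",
]
LABELS = [1, 1, 1, 0, 0, 0]


def fit_all(texts, labels):
    """Fit every vectorizer/model pair; return {(vec_name, model_name): (vec, model)}."""
    vectorizers = {
        "BoW": CountVectorizer(),
        "TF-IDF": TfidfVectorizer(),
        "TF-IDF (1-2 grams)": TfidfVectorizer(ngram_range=(1, 2)),
    }
    # Each entry: (constructor, needs_dense)
    models = {
        "MultinomialNB": (MultinomialNB, False),
        "GaussianNB": (GaussianNB, True),  # does not accept sparse matrices
        "LinearSVC": (LinearSVC, False),
    }
    fitted = {}
    for vec_name, vec in vectorizers.items():
        X = vec.fit_transform(texts)  # sparse document-term matrix
        for model_name, (make_model, needs_dense) in models.items():
            model = make_model()
            model.fit(X.toarray() if needs_dense else X, labels)
            fitted[(vec_name, model_name)] = (vec, model)
    return fitted


fitted = fit_all(TEXTS, LABELS)
vec, svm = fitted[("TF-IDF", "LinearSVC")]
print(svm.predict(vec.transform(["free cash prize"])))
```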
6. Model Evaluation
Models were evaluated using:
- Accuracy
- Precision
- Recall
- F1-score
- Confusion Matrix
Precision was prioritized to minimize the number of legitimate emails flagged as spam (false positives).
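To make the link between precision and false positives concrete, the sketch below computes the listed metrics from confusion-matrix counts in plain Python. The labels are invented for illustration; in practice `sklearn.metrics` provides the same quantities.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the spam class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # legitimate email flagged as spam
        elif truth == positive:
            fn += 1  # spam that slipped through
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: the classifier misses two spam emails but never flags ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
```

In this example precision is 1.0 because there are no false positives, while recall is 0.5 because two of the four spam emails were missed. A precision-first filter accepts that trade-off.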
7. Final Model Selection
- TF-IDF + Linear SVM delivered the best balance of performance and reliability
- Final model and vectorizer were saved using pickle
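Saving and reloading the fitted vectorizer and model might look like the sketch below. The file names follow the `saved_models/` listing in this README; the helper names, the toy training data, and the temporary directory are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

VECTORIZER_FILE = "vectorizer_TF-IDF.pkl"
MODEL_FILE = "SVM_TF-IDF.pkl"


def save_artifacts(vectorizer, model, out_dir):
    """Pickle the fitted vectorizer and model into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / VECTORIZER_FILE, "wb") as f:
        pickle.dump(vectorizer, f)
    with open(out_dir / MODEL_FILE, "wb") as f:
        pickle.dump(model, f)


def load_artifacts(model_dir):
    """Load the (vectorizer, model) pair back from model_dir."""
    model_dir = Path(model_dir)
    with open(model_dir / VECTORIZER_FILE, "rb") as f:
        vectorizer = pickle.load(f)
    with open(model_dir / MODEL_FILE, "rb") as f:
        model = pickle.load(f)
    return vectorizer, model


# Made-up toy data so the round trip can be run end to end.
texts = ["win free prize now", "claim free cash", "meeting on monday", "see attached report"]
labels = [1, 1, 0, 0]
vectorizer = TfidfVectorizer()
model = LinearSVC().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    save_artifacts(vectorizer, model, tmp)
    loaded_vectorizer, loaded_model = load_artifacts(tmp)
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.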
8. Prediction on New Emails
- New email text goes through the same preprocessing pipeline
- TF-IDF vectorization is applied
- Model predicts:
  - Spam
  - Not Spam
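The prediction path above can be sketched as one function that chains preprocessing, vectorization, and the model call. This is not the project's `predict.py`; the function signature is an assumption, and the two stub classes are stand-ins so the example runs without the pickled artifacts. Any objects with scikit-learn style `transform` and `predict` methods would fit.

```python
LABEL_NAMES = {1: "Spam", 0: "Not Spam"}  # matches the dataset's label encoding


def predict_email(text, preprocess, vectorizer, model):
    """Clean the text, vectorize it, and map the model output to a label name."""
    cleaned = preprocess(text)
    features = vectorizer.transform([cleaned])  # one-document batch
    label = int(model.predict(features)[0])
    return LABEL_NAMES[label]


class StubVectorizer:
    """Stand-in for the fitted TF-IDF vectorizer (illustration only)."""

    def transform(self, docs):
        return docs


class StubModel:
    """Stand-in for the trained SVM: flags any text containing 'free'."""

    def predict(self, docs):
        return [1 if "free" in doc else 0 for doc in docs]


print(predict_email("FREE money inside", str.lower, StubVectorizer(), StubModel()))
# -> Spam
```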
Project Architecture (Modular Design)
spam-filter-app/
│
├── app.py                            # Gradio application
├── utils/
│   ├── model_loader.py               # Loads trained model & vectorizer
│   ├── preprocessing.py              # Text cleaning & NLP pipeline
│   └── predict.py                    # Prediction logic
│
├── saved_models/
│   ├── vectorizer_TF-IDF.pkl
│   └── SVM_TF-IDF.pkl
│
├── notebook/
│   └── spam_classification.ipynb     # Complete ML workflow
│
├── requirements.txt
└── README.md
- Clean separation of concerns
- Reusable utility modules
- Production-friendly structure
Web Application (Gradio)
- Interactive UI for email classification
- Input full email content
- One-click prediction
- Example emails included
- Clean, minimal interface
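An interface with these features could be wired up roughly as below. This is a sketch, not the project's `app.py`: `make_classifier` and `build_demo` are hypothetical helper names, the example emails are invented, and `predict` stands for whatever function maps raw email text to "Spam" or "Not Spam". Gradio is imported inside `build_demo` so the input-handling helper can be exercised without it.

```python
from typing import Callable


def make_classifier(predict: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a predictor with basic handling for empty input."""

    def classify(email_text: str) -> str:
        if not email_text or not email_text.strip():
            return "Please enter an email to classify."
        return predict(email_text)

    return classify


def build_demo(predict: Callable[[str], str]):
    """Build a Gradio Interface around the given predictor."""
    import gradio as gr  # lazy import; only needed when the UI is built

    return gr.Interface(
        fn=make_classifier(predict),
        inputs=gr.Textbox(
            lines=10,
            label="Email content",
            placeholder="Paste the full email here",
        ),
        outputs=gr.Textbox(label="Prediction"),
        # Made-up example emails shown as clickable samples in the UI.
        examples=[
            ["Congratulations! You have won a free prize. Click here to claim."],
            ["Hi team, the meeting has moved to Monday at 10am."],
        ],
        title="Spam Filter App",
    )
```

Calling `build_demo(my_predict).launch()` from `app.py` would start the app, where `my_predict` is the project's own prediction function.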
Technologies Used
- Python
- Scikit-learn
- NLTK
- Gradio
- Pandas & NumPy
- Pickle
- Jupyter Notebook
Results & Conclusion
- Successfully built a robust spam classification system
- Achieved strong precision, reducing false spam flags
- Modular architecture supports easy scaling and reuse
- UI enables real-world usability and testing
This project demonstrates end-to-end ML development, from data exploration to deployment.
Future Improvements
- Support batch email classification
- Deploy on cloud (Hugging Face / AWS / GCP)
- Add confidence scores for predictions
Author
Syeda Arifa Batool
SE @ Karachi University | AI & ML Practitioner
Applying technology to create real-world value
Connect with Me
- LinkedIn: https://www.linkedin.com/in/arifa-batool/
- Kaggle: https://www.linkedin.com/in/arifa-batool/
- Email: thearifabatool@gmail.com
If you find this project useful, feel free to star the repository!
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference