title: Spam Filter App
emoji: πŸ‘
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false

🚨 Spam Email Classification System (ML + Gradio)

An end-to-end Spam Email Classification project built using Machine Learning, following a modular, production-ready architecture, and deployed with an interactive Gradio UI.

This system classifies emails as Spam or Not Spam using TF-IDF feature extraction and a Support Vector Machine (SVM) classifier. It prioritizes high precision to reduce false positives, that is, legitimate emails flagged as spam.


📌 Project Overview

Spam emails often contain promotions, scams, or malicious content. Manual filtering is inefficient and error-prone.
This project automates spam detection by leveraging Natural Language Processing (NLP) and Machine Learning, providing a reliable and scalable solution.


🎯 Objectives

  • Clean and preprocess raw email text
  • Extract meaningful textual features
  • Train and compare multiple ML models
  • Evaluate performance using standard classification metrics
  • Select the best-performing model
  • Deploy the model with a user-friendly web interface

📂 Dataset

The dataset contains a mix of promotional, scam, and legitimate emails.


🔄 Project Workflow

1️⃣ Data Understanding

  • Loaded and inspected dataset structure
  • Checked shape, missing values, and duplicates
  • Reviewed sample emails for context
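
The structure checks above might look like this in pandas. This is a minimal sketch: the column names `text` and `label` and the inline sample rows are hypothetical, since the README does not describe the dataset schema.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset and its column names are not shown in this README.
df = pd.DataFrame(
    {
        "text": ["Win a free prize now", "Meeting at 10am", "Win a free prize now", None],
        "label": ["spam", "ham", "spam", "ham"],
    }
)


def summarize(frame: pd.DataFrame) -> dict:
    """Return the basic structure checks: shape, missing values, duplicate rows."""
    return {
        "shape": frame.shape,
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


summary = summarize(df)
print(summary)
print(df.head(3))  # quick look at a few sample emails
```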

2️⃣ Text Preprocessing

Applied NLP techniques to clean and normalize text:

  • Lowercasing
  • Removing special characters and punctuation
  • Tokenization
  • Stopword removal
  • Lemmatization

This ensured consistent and noise-free input for modeling.
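
The cleaning steps listed above can be sketched with the standard library alone. The stopword set below is a tiny illustrative stand-in (the project lists NLTK, whose English stopword list would normally be used), and the `lemmatize` parameter is a hypothetical hook that defaults to leaving tokens unchanged.

```python
import re
from typing import Callable, Iterable

# Tiny illustrative stopword set; NLTK's English stopword list would normally replace it.
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you", "your"}
)


def preprocess(
    text: str,
    stopwords: Iterable[str] = STOPWORDS,
    lemmatize: Callable[[str], str] = lambda token: token,
) -> str:
    """Lowercase, strip non-letters, tokenize, drop stopwords, then lemmatize."""
    stopword_set = set(stopwords)
    lowered = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    tokens = letters_only.split()  # simple whitespace tokenization
    kept = [lemmatize(tok) for tok in tokens if tok not in stopword_set]
    return " ".join(kept)


print(preprocess("Congratulations!!! You are the WINNER of a $1,000 prize."))
# congratulations winner prize
```

With NLTK installed and the WordNet corpus downloaded, `WordNetLemmatizer().lemmatize` could be passed as the `lemmatize` argument.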


3️⃣ Exploratory Data Analysis (EDA)

  • Analyzed class distribution (Spam vs Not Spam)
  • Studied email length (words & characters)
  • Identified frequent words in spam and non-spam emails
  • Visualized patterns to understand data behavior

4️⃣ Feature Engineering

  • Generated numerical features:
    • Word count
    • Character count
  • Compared feature distributions between spam and non-spam (ham) emails
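
As an illustration of these two features, the sketch below computes word and character counts and averages them per class. The sample emails and the `spam`/`ham` label strings are hypothetical and only show the shape of the comparison.

```python
from statistics import mean


def length_features(text: str) -> dict:
    """Return word and character counts for one email."""
    return {"word_count": len(text.split()), "char_count": len(text)}


# Hypothetical labelled examples, used only to demonstrate the comparison.
emails = [
    ("Win cash now! Click here to claim your free prize today", "spam"),
    ("Lunch at noon?", "ham"),
    ("Limited offer: buy one get one free, reply now", "spam"),
    ("See attached report", "ham"),
]


def mean_by_label(rows, feature):
    """Average one numeric feature per class label."""
    grouped = {}
    for text, label in rows:
        grouped.setdefault(label, []).append(length_features(text)[feature])
    return {label: mean(values) for label, values in grouped.items()}


print(mean_by_label(emails, "word_count"))
print(mean_by_label(emails, "char_count"))
```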

5️⃣ Model Building

Text was vectorized using:

  • Bag of Words (BoW)
  • TF-IDF
  • TF-IDF (1–2 grams)

Models trained and evaluated:

  • Naive Bayes (Multinomial, Bernoulli, Gaussian)
  • Random Forest
  • Extra Trees
  • Linear Support Vector Machine (SVM)

Sparse feature matrices were converted to dense arrays for models that require dense input (for example, Gaussian Naive Bayes).
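
A rough sketch of how such a comparison loop could be set up with scikit-learn follows. The toy corpus and the `1 = spam` label encoding are assumptions, only three of the listed models are shown, and the printed training accuracy on six sentences says nothing about real performance.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; 1 = spam, 0 = not spam (the label encoding is an assumption).
texts = [
    "win free cash prize now",
    "claim your free prize today",
    "free offer click now",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch with the project team",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizers = {
    "BoW": CountVectorizer(),
    "TF-IDF": TfidfVectorizer(),
    "TF-IDF (1-2 grams)": TfidfVectorizer(ngram_range=(1, 2)),
}

results = {}
for vec_name, vectorizer in vectorizers.items():
    X = vectorizer.fit_transform(texts)  # scipy sparse matrix
    for model in (MultinomialNB(), GaussianNB(), LinearSVC()):
        # GaussianNB does not accept sparse input, so it gets a dense array.
        features = X.toarray() if isinstance(model, GaussianNB) else X
        model.fit(features, labels)
        results[(vec_name, type(model).__name__)] = model.score(features, labels)

for key, train_accuracy in results.items():
    print(key, train_accuracy)
```

In a real run the scores would come from a held-out test split rather than the training data.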


6️⃣ Model Evaluation

Models were evaluated using:

  • Accuracy
  • Precision
  • Recall
  • F1-score
  • Confusion Matrix

📌 Precision was prioritized to minimize false spam detection (false positives).
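
To make the metrics concrete, here is a small worked example that derives them from confusion-matrix counts. The label lists are made up, with `1` assumed to mean spam. It shows a classifier that never flags a legitimate email (precision 1.0) at the cost of missing half of the spam (recall 0.5).

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```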


7️⃣ Final Model Selection

  • TF-IDF + Linear SVM delivered the best balance of performance and reliability
  • Final model and vectorizer were saved using pickle
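
Saving and reloading with pickle could look roughly like this. The helper names `save_artifact` and `load_artifact` are hypothetical, and a plain dictionary stands in for the trained model so the sketch runs on its own. In the project, the targets would be the files under `saved_models/`.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifact(obj, path):
    """Pickle an object to disk, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(obj, fh)


def load_artifact(path):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Round trip with a placeholder object in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "saved_models" / "example.pkl"
    save_artifact({"kind": "placeholder"}, target)
    restored = load_artifact(target)

print(restored)
```

The vectorizer and the model need to be saved together and loaded together, because the model only understands the feature columns produced by the vectorizer it was trained with.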

8️⃣ Prediction on New Emails

  • New email text goes through the same preprocessing pipeline
  • TF-IDF vectorization is applied
  • Model predicts:
    • Spam
    • Not Spam
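
The prediction path can be sketched as below. Everything here is illustrative: `clean` is a stand-in for the project's preprocessing, the label mapping `1 = Spam` is an assumption, and the vectorizer and model are trained on a toy corpus instead of being loaded from `saved_models/`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

LABEL_NAMES = {1: "Spam", 0: "Not Spam"}  # assumed label encoding


def clean(text: str) -> str:
    """Stand-in for the project's preprocessing step (hypothetical)."""
    return " ".join(text.lower().split())


def predict_email(text, vectorizer, model) -> str:
    """Apply the same cleaning and the fitted vectorizer, then map the label to a name."""
    cleaned = clean(text)
    features = vectorizer.transform([cleaned])  # transform only, never re-fit here
    label = int(model.predict(features)[0])
    return LABEL_NAMES[label]


# Toy training data so the sketch runs without the saved artifacts.
train_texts = [
    "win free cash prize now",
    "claim your free prize today",
    "free offer click now",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch with the project team",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
model = LinearSVC().fit(vectorizer.fit_transform(train_texts), train_labels)

prediction = predict_email("Claim your FREE prize now", vectorizer, model)
print(prediction)
```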

🧠 Project Architecture (Modular Design)


spam-filter-app/
│
├── app.py                  # Gradio application
├── utils/
│   ├── model_loader.py     # Loads trained model & vectorizer
│   ├── preprocessing.py    # Text cleaning & NLP pipeline
│   └── predict.py          # Prediction logic
│
├── saved_models/
│   ├── vectorizer_TF-IDF.pkl
│   └── SVM_TF-IDF.pkl
│
├── notebook/
│   └── spam_classification.ipynb  # Complete ML workflow
│
├── requirements.txt
└── README.md

✔ Clean separation of concerns
✔ Reusable utility modules
✔ Production-friendly structure


🖥️ Web Application (Gradio)

  • Interactive UI for email classification
  • Input full email content
  • One-click prediction
  • Example emails included
  • Clean, minimal interface
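
A Gradio interface with these features might be wired up as follows. This is not the contents of `app.py`: `classify_email` is a hypothetical keyword-based stand-in so the sketch runs without the trained model, and the labels, title, and example emails are illustrative.

```python
def classify_email(email_text: str) -> str:
    """Hypothetical stand-in for the real prediction function."""
    if not email_text or not email_text.strip():
        return "Please enter an email."
    # Naive substring rule, used only so the sketch runs without the trained model.
    suspicious = ("free", "prize", "winner", "click here")
    lowered = email_text.lower()
    return "Spam" if any(word in lowered for word in suspicious) else "Not Spam"


def build_demo():
    """Build the interface; Gradio is imported lazily so classify_email stays testable."""
    import gradio as gr

    return gr.Interface(
        fn=classify_email,
        inputs=gr.Textbox(lines=10, label="Email content"),
        outputs=gr.Textbox(label="Prediction"),
        examples=[
            ["Congratulations, you are a winner! Click here to claim your free prize."],
            ["Hi team, the meeting is moved to 3 pm tomorrow."],
        ],
        title="Spam Filter App",
    )
```

Calling `build_demo().launch()` starts the local web server. In the real app, `classify_email` would delegate to the prediction logic backed by the saved model and vectorizer.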

⚙️ Technologies Used

  • Python
  • Scikit-learn
  • NLTK
  • Gradio
  • Pandas & NumPy
  • Pickle
  • Jupyter Notebook

📈 Results & Conclusion

  • Successfully built a robust spam classification system
  • Achieved strong precision, reducing false spam flags
  • Modular architecture supports easy scaling and reuse
  • UI enables real-world usability and testing

This project demonstrates end-to-end ML development, from data exploration to deployment.


🚀 Future Improvements

  • Support batch email classification
  • Deploy on cloud (Hugging Face / AWS / GCP)
  • Add confidence scores for predictions

👤 Author

Syeda Arifa Batool
SE @ Karachi University | AI & ML Practitioner
Applying technology to create real-world value 📈


🔗 Connect with Me

⭐ If you find this project useful, feel free to star the repository!

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference