arifa-batool committed
Commit c59d873 · verified · 1 Parent(s): ec159e6

Update README.md

Files changed (1)
  1. README.md +211 -0
README.md CHANGED
@@ -9,4 +9,215 @@ app_file: app.py
  pinned: false
  ---
 
+
+ # 🚨 Spam Email Classification System (ML + Gradio)
+
+ An end-to-end **Spam Email Classification** project built using **Machine Learning**, following a **modular, production-ready architecture**, and deployed with an interactive **Gradio UI**.
+
+ This system classifies emails as **Spam** or **Not Spam** using **TF-IDF feature extraction** and a **Support Vector Machine (SVM)** classifier, prioritizing **high precision** to reduce false positives.
+
+ ---
+
+ ## 📌 Project Overview
+
+ Spam emails often contain promotions, scams, or malicious content. Manual filtering is inefficient and error-prone.
+ This project automates spam detection by leveraging **Natural Language Processing (NLP)** and **Machine Learning**, providing a reliable and scalable solution.
+
+ ---
+
+ ## 🎯 Objectives
+
+ - Clean and preprocess raw email text
+ - Extract meaningful textual features
+ - Train and compare multiple ML models
+ - Evaluate performance using standard classification metrics
+ - Select the best-performing model
+ - Deploy the model with a user-friendly web interface
+
+ ---
+
+ ## 📂 Dataset
+
+ - **Source:** Kaggle – Spam Email Dataset
+   https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset/data
+
+ - **Columns:**
+   - `text` → Email content
+   - `spam` → Target label
+     - `1` = Spam
+     - `0` = Not Spam
+
+ The dataset contains a mix of promotional, scam, and legitimate emails.
+
+ ---
+
+ ## 🔄 Project Workflow
+
+ ### 1️⃣ Data Understanding
+ - Loaded and inspected dataset structure
+ - Checked shape, missing values, and duplicates
+ - Reviewed sample emails for context
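
The checks listed above can be sketched with the standard library alone. The notebook most likely uses pandas for this step; the inline sample rows and the helper name `summarize` are illustrative assumptions, while the `text` and `spam` column names come from the dataset description.

```python
import csv
import io

# Tiny inline sample standing in for the Kaggle CSV (hypothetical rows).
SAMPLE_CSV = """text,spam
"Subject: win a free prize now",1
"Subject: meeting agenda for monday",0
"Subject: win a free prize now",1
"",0
"""

def summarize(rows):
    """Return shape, empty-text count, and duplicate-row count."""
    seen = set()
    duplicates = 0
    missing = 0
    for row in rows:
        key = (row["text"], row["spam"])
        if key in seen:
            duplicates += 1      # same text and label seen before
        seen.add(key)
        if not row["text"].strip():
            missing += 1         # empty email body
    n_cols = len(rows[0]) if rows else 0
    return {"shape": (len(rows), n_cols),
            "missing_text": missing,
            "duplicates": duplicates}

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = summarize(rows)
print(summary)
```

With pandas, the equivalent checks would typically be `df.shape`, `df.isnull().sum()`, and `df.duplicated().sum()`.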
+
+ ---
+
+ ### 2️⃣ Text Preprocessing
+ Applied NLP techniques to clean and normalize text:
+ - Lowercasing
+ - Removing special characters and punctuation
+ - Tokenization
+ - Stopword removal
+ - Lemmatization
+
+ This ensured consistent and noise-free input for modeling.
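
A minimal sketch of these five steps, written without NLTK so it runs offline. The stopword set and the `LEMMAS` lookup table are small hypothetical stand-ins for NLTK's stopword corpus and a real lemmatizer such as `WordNetLemmatizer`. Dropping digits along with punctuation is also an assumption, not something the project states.

```python
import re

# Small illustrative stopword list; NLTK's English list is much larger.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "for", "and", "of", "in",
             "you", "your"}

# Hypothetical lemma lookup standing in for a real lemmatizer.
LEMMAS = {"prizes": "prize", "offers": "offer",
          "meetings": "meeting", "emails": "email"}

def preprocess(text: str) -> str:
    """Lowercase, strip non-letters, tokenize, drop stopwords, lemmatize."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)    # punctuation, digits, symbols
    tokens = text.split()                     # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [LEMMAS.get(t, t) for t in tokens]
    return " ".join(tokens)

cleaned = preprocess("WIN free PRIZES!!! Claim your offers now.")
print(cleaned)
```

The same function has to be applied at training time and at prediction time, otherwise the vectorizer sees tokens it was never fitted on.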
+
+ ---
+
+ ### 3️⃣ Exploratory Data Analysis (EDA)
+ - Analyzed class distribution (Spam vs Not Spam)
+ - Studied email length (words & characters)
+ - Identified frequent words in spam and non-spam emails
+ - Visualized patterns to understand data behavior
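
As a rough illustration of the class-distribution and frequent-word analysis, the sketch below uses `collections.Counter` on a handful of made-up, already-preprocessed emails. The sample data and the helper name `top_words` are assumptions; the notebook's plots are not reproduced here.

```python
from collections import Counter

# Hypothetical preprocessed samples: (text, label), 1 = spam, 0 = not spam.
emails = [
    ("win free prize claim now", 1),
    ("free offer click now", 1),
    ("meeting agenda monday", 0),
    ("project report attached", 0),
    ("lunch monday", 0),
]

# Class distribution: how many emails carry each label.
class_counts = Counter(label for _, label in emails)

def top_words(label, n=2):
    """Most frequent words among emails with the given label."""
    words = Counter()
    for text, y in emails:
        if y == label:
            words.update(text.split())
    return words.most_common(n)

print(dict(class_counts))
print("spam:", top_words(1))
print("not spam:", top_words(0))
```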
+
+ ---
+
+ ### 4️⃣ Feature Engineering
+ - Generated numerical features:
+   - Word count
+   - Character count
+ - Compared feature distributions between spam and ham emails
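
The two length features and a per-class comparison can be sketched as follows. The sample emails and the function names are hypothetical, and the mean is used here only as one simple way to compare distributions.

```python
from statistics import mean

# Hypothetical raw samples: (text, label), 1 = spam, 0 = not spam (ham).
emails = [
    ("Win a FREE prize now!!! Click here to claim your reward today", 1),
    ("Limited offer: cheap loans approved instantly, apply now", 1),
    ("Meeting moved to 3pm", 0),
    ("Report attached", 0),
]

def length_features(text: str) -> dict:
    """Word and character counts for one email."""
    return {"word_count": len(text.split()), "char_count": len(text)}

def mean_by_class(feature: str) -> dict:
    """Average of one feature per class label."""
    by_label = {}
    for text, label in emails:
        by_label.setdefault(label, []).append(length_features(text)[feature])
    return {label: mean(values) for label, values in by_label.items()}

word_means = mean_by_class("word_count")
char_means = mean_by_class("char_count")
print(word_means)
print(char_means)
```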
+
+ ---
+
+ ### 5️⃣ Model Building
+ Text was vectorized using:
+ - **Bag of Words (BoW)**
+ - **TF-IDF**
+ - **TF-IDF (1–2 grams)**
+
+ Models trained and evaluated:
+ - Naive Bayes (Multinomial, Bernoulli, Gaussian)
+ - Random Forest
+ - Extra Trees
+ - **Linear Support Vector Machine (SVM)**
+
+ Sparse feature matrices were converted to dense arrays for models that require dense input (for example, Gaussian Naive Bayes).
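
A minimal scikit-learn sketch of this vectorizer-by-model grid, assuming `scikit-learn` is installed. The toy corpus is hypothetical, only three of the listed models are shown, and the loop predicts on its own training data purely to show the mechanics, so the output says nothing about real performance.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; the project trains on the Kaggle dataset.
texts = [
    "win free prize claim now",
    "free offer click now",
    "cheap loans approved instantly",
    "meeting agenda monday",
    "project report attached",
    "lunch monday team",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizers = {
    "BoW": CountVectorizer(),
    "TF-IDF": TfidfVectorizer(),
    "TF-IDF (1-2 grams)": TfidfVectorizer(ngram_range=(1, 2)),
}
models = {
    "MultinomialNB": MultinomialNB,
    "GaussianNB": GaussianNB,
    "LinearSVC": LinearSVC,
}

results = {}
for vec_name, vectorizer in vectorizers.items():
    X = vectorizer.fit_transform(texts)            # sparse matrix
    for model_name, model_cls in models.items():
        # GaussianNB does not accept sparse input, so convert to dense.
        features = X.toarray() if model_name == "GaussianNB" else X
        model = model_cls().fit(features, labels)
        results[(vec_name, model_name)] = model.predict(features)

for key, preds in results.items():
    print(key, preds.tolist())
```

In a real comparison the data would be split into training and test sets before fitting, and each vectorizer would be fitted on the training portion only.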
+
+ ---
+
+ ### 6️⃣ Model Evaluation
+ Models were evaluated using:
+ - Accuracy
+ - Precision
+ - Recall
+ - F1-score
+ - Confusion Matrix
+
+ 📌 **Precision was prioritized** to minimize false positives, i.e. legitimate emails wrongly flagged as spam.
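
To make the metrics concrete, here is a small pure-Python sketch that derives them from confusion-matrix counts, treating spam (`1`) as the positive class. The labels and predictions are made up for illustration; scikit-learn's `precision_score`, `recall_score`, and related functions compute the same quantities.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with spam (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # spam that slipped through
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels and predictions for ten emails.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

metrics = classification_metrics(y_true, y_pred)
print(confusion_counts(y_true, y_pred))
print(metrics)
```

Precision only drops when `fp` grows, which is why it is the metric to watch when the costly mistake is a legitimate email landing in the spam folder.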
+
+ ---
+
+ ### 7️⃣ Final Model Selection
+ - **TF-IDF + Linear SVM** delivered the best balance of performance and reliability
+ - Final model and vectorizer were saved using `pickle`
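
Saving and restoring the two artifacts with `pickle` might look like the sketch below. The helper names `save_artifact` and `load_artifact` are hypothetical, and plain dictionaries stand in for the fitted vectorizer and SVM so the example runs without scikit-learn. The file names follow the `saved_models/` layout listed in this README.

```python
import pickle
import tempfile
from pathlib import Path

def save_artifact(obj, path: Path) -> None:
    """Serialize obj to path, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)

def load_artifact(path: Path):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with path.open("rb") as f:
        return pickle.load(f)

# Stand-ins; in the project these would be the fitted vectorizer and model.
vectorizer_stub = {"vocabulary": {"free": 0, "prize": 1}}
model_stub = {"weights": [1.2, 0.8], "bias": -0.5}

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "saved_models"
    save_artifact(vectorizer_stub, model_dir / "vectorizer_TF-IDF.pkl")
    save_artifact(model_stub, model_dir / "SVM_TF-IDF.pkl")
    restored_vectorizer = load_artifact(model_dir / "vectorizer_TF-IDF.pkl")
    restored_model = load_artifact(model_dir / "SVM_TF-IDF.pkl")

print(restored_vectorizer == vectorizer_stub, restored_model == model_stub)
```

The vectorizer has to be saved together with the model, because the model's weights only make sense for the exact vocabulary it was trained on.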
+
+ ---
+
+ ### 8️⃣ Prediction on New Emails
+ - New email text goes through the same preprocessing pipeline
+ - TF-IDF vectorization is applied
+ - Model predicts:
+   - **Spam**
+   - **Not Spam**
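
The three prediction steps can be sketched as one function. `KeywordVectorizer` and `ThresholdModel` are hypothetical stubs that only mimic the `transform()` and `predict()` methods of a fitted scikit-learn vectorizer and classifier; in the project, the unpickled TF-IDF vectorizer and SVM would be passed in instead. The `preprocess` function here is a simplified stand-in for the real pipeline.

```python
import re

LABELS = {1: "Spam", 0: "Not Spam"}

def preprocess(text: str) -> str:
    """Simplified stand-in for the project's preprocessing pipeline."""
    return " ".join(re.sub(r"[^a-z\s]", " ", text.lower()).split())

class KeywordVectorizer:
    """Hypothetical stub exposing transform() like a fitted vectorizer."""
    vocabulary = ("free", "prize", "winner")

    def transform(self, docs):
        return [[doc.split().count(word) for word in self.vocabulary]
                for doc in docs]

class ThresholdModel:
    """Hypothetical stub exposing predict() like a fitted classifier."""
    def predict(self, rows):
        return [1 if sum(row) >= 2 else 0 for row in rows]

def predict_email(text: str, vectorizer, model) -> str:
    cleaned = preprocess(text)                   # 1. same preprocessing
    features = vectorizer.transform([cleaned])   # 2. vectorize (no refit)
    return LABELS[int(model.predict(features)[0])]  # 3. map label to text

spam_result = predict_email("You are a WINNER! Claim your FREE prize.",
                            KeywordVectorizer(), ThresholdModel())
ham_result = predict_email("Agenda for Monday's meeting attached.",
                           KeywordVectorizer(), ThresholdModel())
print(spam_result, "|", ham_result)
```

Note that the vectorizer is only asked to `transform` new text; calling `fit_transform` at prediction time would rebuild the vocabulary and break the model.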
+
+ ---
+
+ ## 🧠 Project Architecture (Modular Design)
+
+ ```
+
+ spam-filter-app/
+ │
+ ├── app.py                          # Gradio application
+ ├── utils/
+ │   ├── model_loader.py             # Loads trained model & vectorizer
+ │   ├── preprocessing.py            # Text cleaning & NLP pipeline
+ │   └── predict.py                  # Prediction logic
+ │
+ ├── saved_models/
+ │   ├── vectorizer_TF-IDF.pkl
+ │   └── SVM_TF-IDF.pkl
+ │
+ ├── notebook/
+ │   └── spam_classification.ipynb   # Complete ML workflow
+ │
+ ├── requirements.txt
+ └── README.md
+
+ ```
+
+ ✔ Clean separation of concerns
+ ✔ Reusable utility modules
+ ✔ Production-friendly structure
+
+ ---
+
+ ## 🖥️ Web Application (Gradio)
+
+ - Interactive UI for email classification
+ - Input full email content
+ - One-click prediction
+ - Example emails included
+ - Clean, minimal interface
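
A rough sketch of how such an interface could be wired up with `gr.Interface`. This is not the project's `app.py`: `classify_email` is a keyword-based placeholder for the real model call, and the labels, title, and example emails are assumptions. Gradio is imported inside `build_demo` so the classifier function can be exercised without it.

```python
def classify_email(text: str) -> str:
    """Placeholder classifier; the real app would call the trained model."""
    if not text.strip():
        return "Please enter some email text."
    spam_words = {"free", "prize", "winner"}
    hits = sum(word.strip(".,!?").lower() in spam_words
               for word in text.split())
    return "Spam" if hits >= 2 else "Not Spam"

def build_demo():
    """Build the Gradio UI: one text input, one text output, two examples."""
    import gradio as gr  # lazy import; requires `pip install gradio`

    return gr.Interface(
        fn=classify_email,
        inputs=gr.Textbox(lines=10, label="Email content"),
        outputs=gr.Textbox(label="Prediction"),
        examples=[
            ["Congratulations, winner! Claim your free prize today."],
            ["Hi team, the meeting is moved to 3pm tomorrow."],
        ],
        title="Spam Email Classifier",
    )

# To start the UI locally: build_demo().launch()
```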
+
+ ---
+
+ ## ⚙️ Technologies Used
+
+ - **Python**
+ - **Scikit-learn**
+ - **NLTK**
+ - **Gradio**
+ - **Pandas & NumPy**
+ - **Pickle**
+ - **Jupyter Notebook**
+
+ ---
+
+ ## 📈 Results & Conclusion
+
+ - Successfully built a robust spam classification system
+ - Achieved strong precision, reducing false spam flags
+ - Modular architecture supports easy scaling and reuse
+ - UI enables real-world usability and testing
+
+ This project demonstrates **end-to-end ML development**, from data exploration to deployment.
+
+ ---
+
+ ## 🚀 Future Improvements
+
+ - Support batch email classification
+ - Deploy on cloud (Hugging Face / AWS / GCP)
+ - Add confidence scores for predictions
+
+ ---
+
+ ## 👀 Author
+
+ **Syeda Arifa Batool**
+ SE @ Karachi University | AI & ML Practitioner
+ Applying technology to create real-world value 📈
+
+ ---
+
+ ## 🔗 Connect with Me
+
+ - **LinkedIn:** https://www.linkedin.com/in/arifa-batool/
+ - **Kaggle:** https://www.linkedin.com/in/arifa-batool/
+ - **Email:** thearifabatool@gmail.com
+
+
+ ⭐ If you find this project useful, feel free to star the repository!
+
+
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference