---
datasets:
- AbdulHadi806/mail_spam_ham_dataset
language:
- en
metrics:
- f1
base_model:
- distilbert/distilbert-base-uncased
---
# Deep Learning Project: Spam Detection with DistilBERT

This repository contains the code, data, and report for a deep learning project that fine-tunes DistilBERT to classify emails as spam or ham.

## Project Structure
- `mail_data.csv`: The dataset used for training and evaluation.
- `eda_script.py`: Script for Exploratory Data Analysis and visualization.
- `train_model_hf.py`: Main training script using Hugging Face Trainer and DistilBERT.
- `evaluate_final.py`: Script for final evaluation from the best model checkpoint.
- `eda_plots.png`: Visualizations generated during EDA.
- `results.txt`: Detailed evaluation metrics and confusion matrix.
- `Deep_Learning_Project_Report.pdf`: The final project report (15-17 pages equivalent).

## Requirements
- Python 3.11+
- PyTorch
- Transformers
- Datasets
- Scikit-learn
- Pandas
- Matplotlib
- Seaborn
- Accelerate
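
A quick way to confirm the packages above are available before running anything is to ask Python whether each one can be found. The snippet below is a minimal sketch and not part of the repository; the helper name `missing_modules` is hypothetical, and the import names (for example `sklearn` for Scikit-learn and `torch` for PyTorch) are the usual ones for these packages.

```python
from importlib.util import find_spec

# Import names for the packages listed under Requirements.
# Note that Scikit-learn is imported as "sklearn" and PyTorch as "torch".
REQUIRED = [
    "torch",
    "transformers",
    "datasets",
    "sklearn",
    "pandas",
    "matplotlib",
    "seaborn",
    "accelerate",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only checks that each package is installed; it does not check versions.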

## How to Run
0.  **Setup**: Make sure all requirements are installed. If you hit errors while running the code, install the dependencies from `requirements.txt` in a fresh environment.
1.  **EDA**: Run `python3 eda_script.py` to see the data distribution.
2.  **Training**: Run `python3 train_model_hf.py` to fine-tune the DistilBERT model.
3.  **Evaluation**: Run `python3 evaluate_final.py` to get the final performance metrics.

## Results
The model achieves **99.10% accuracy** on the test set with an **F1-score of 96.58%** for the spam class.
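
Accuracy and the spam-class F1-score differ because spam is usually the minority class: accuracy counts every correct prediction, while F1 only looks at how well the spam class is found. The sketch below shows how both are derived from a binary confusion matrix. The function name `binary_metrics` is hypothetical, and the counts in the example are made up for illustration; they are not the values in `results.txt`.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive (spam) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only (NOT the project's results):
# 90 spam caught, 5 ham flagged as spam, 5 spam missed, 900 ham passed.
example = binary_metrics(tp=90, fp=5, fn=5, tn=900)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, the ten errors barely dent accuracy because ham dominates the total, but they weigh more heavily on the spam-class F1.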