---
datasets:
- AbdulHadi806/mail_spam_ham_dataset
language:
- en
metrics:
- f1
base_model:
- distilbert/distilbert-base-uncased
---
# Deep Learning Project: Spam Detection with DistilBERT

This repository contains the code and resources for a deep learning project on email spam detection, built by fine-tuning DistilBERT (`distilbert/distilbert-base-uncased`).

## Project Structure
- `mail_data.csv`: The dataset used for training and evaluation.
- `eda_script.py`: Script for exploratory data analysis (EDA) and visualization.
- `train_model_hf.py`: Main training script, using the Hugging Face Trainer and DistilBERT.
- `evaluate_final.py`: Script that runs the final evaluation from the best model checkpoint.
- `eda_plots.png`: Visualizations generated during EDA.
- `results.txt`: Detailed evaluation metrics and confusion matrix.
- `Deep_Learning_Project_Report.pdf`: The final project report (15-17 pages equivalent).

## Requirements
- Python 3.11+
- PyTorch
- Transformers
- Datasets
- Scikit-learn
- Pandas
- Matplotlib
- Seaborn
- Accelerate
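
As a quick sanity check before running anything, the snippet below is a minimal sketch that reports which of the packages listed above cannot be imported in the current environment. The import names are the standard ones for each package (Scikit-learn is imported as `sklearn`, PyTorch as `torch`); the helper name `missing_packages` is made up for this example and is not part of the repository.

```python
from importlib.util import find_spec

# Display name -> import name for the packages listed under Requirements.
REQUIRED_MODULES = {
    "PyTorch": "torch",
    "Transformers": "transformers",
    "Datasets": "datasets",
    "Scikit-learn": "sklearn",
    "Pandas": "pandas",
    "Matplotlib": "matplotlib",
    "Seaborn": "seaborn",
    "Accelerate": "accelerate",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the display names of packages whose import name cannot be found."""
    return [name for name, module in modules.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

This only checks that each package can be located, not that its version is compatible with the scripts.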

## How to Run
0. Make sure all requirements are installed. If you hit errors while running the code, try installing the dependencies from `requirements.txt` in a fresh environment.
1. **EDA**: Run `python3 eda_script.py` to see the data distribution.
2. **Training**: Run `python3 train_model_hf.py` to fine-tune the DistilBERT model.
3. **Evaluation**: Run `python3 evaluate_final.py` to get the final performance metrics.
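
The three steps above can also be chained from a single Python entry point. The following is a minimal sketch, assuming it is run from the repository root so the script names resolve; the function `run_pipeline` is hypothetical and not part of the repository. It runs each script with the current interpreter and stops at the first one that fails.

```python
import subprocess
import sys

# Script names as listed in the steps above, in execution order.
STEPS = ["eda_script.py", "train_model_hf.py", "evaluate_final.py"]


def run_pipeline(scripts=STEPS, python=sys.executable):
    """Run each script in order; return 0 on success or the first non-zero exit code."""
    for script in scripts:
        print(f"Running {script} ...")
        result = subprocess.run([python, script])
        if result.returncode != 0:
            print(f"{script} exited with code {result.returncode}; stopping.")
            return result.returncode
    return 0
```

Calling `run_pipeline()` runs EDA, training, and evaluation in sequence. Stopping on the first failure avoids evaluating a stale checkpoint when training did not complete.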

## Results
The model achieves **99.10% accuracy** on the test set with an **F1-score of 96.58%** for the spam class.
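
Accuracy and the spam-class F1-score measure different things, so the two numbers differ. The sketch below shows how both are derived from confusion-matrix counts, treating spam as the positive class. The counts are purely illustrative and are not the project's actual confusion matrix (see `results.txt` for that); the function name `spam_metrics` is made up for this example.

```python
def spam_metrics(tp, fp, fn, tn):
    """Compute accuracy and spam-class precision, recall, and F1 from raw counts.

    tp: spam predicted as spam      fp: ham predicted as spam
    fn: spam predicted as ham       tn: ham predicted as ham
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for an imbalanced test set (far more ham than spam).
metrics = spam_metrics(tp=90, fp=5, fn=10, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Because true negatives (correctly classified ham) count toward accuracy but not toward the spam-class F1, accuracy tends to be the higher of the two when ham dominates the data.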