Upload 46 files

Files changed (13)

.gitattributes +4 -0
A24-Y2-DEEP-LEARNING-Project.pdf +3 -0
Deep Learning Project Report.docx +3 -0
Deep Learning Project Report.pdf +3 -0
README.md +31 -10
eda_plots.png +3 -0
mail_data_test.csv +9 -0
project_report.md +114 -0
requirements.txt +9 -0
results.txt +17 -0
save_tokenizer.py +1 -0
train_model.py +152 -0
train_model_hf.py +1 -1

.gitattributes CHANGED

@@ -37,3 +37,7 @@ Deep_Learning_Project/A24-Y2-DEEP-LEARNING-Project.pdf filter=lfs diff=lfs merge
 Deep_Learning_Project/Deep[[:space:]]Learning[[:space:]]Project[[:space:]]Report.docx filter=lfs diff=lfs merge=lfs -text
 Deep_Learning_Project/Deep[[:space:]]Learning[[:space:]]Project[[:space:]]Report.pdf filter=lfs diff=lfs merge=lfs -text
 Deep_Learning_Project/eda_plots.png filter=lfs diff=lfs merge=lfs -text
+A24-Y2-DEEP-LEARNING-Project.pdf filter=lfs diff=lfs merge=lfs -text
+Deep[[:space:]]Learning[[:space:]]Project[[:space:]]Report.docx filter=lfs diff=lfs merge=lfs -text
+Deep[[:space:]]Learning[[:space:]]Project[[:space:]]Report.pdf filter=lfs diff=lfs merge=lfs -text
+eda_plots.png filter=lfs diff=lfs merge=lfs -text

A24-Y2-DEEP-LEARNING-Project.pdf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d248954892441dbf8de6cb3c8315718e020879401296dd7d1597cd82fe40dce2
+size 230345

Deep Learning Project Report.docx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ccbe55fe11859c664c37d29a179ce14404ad4084a63ad430daff5aff2ae56da0
+size 236057

Deep Learning Project Report.pdf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1381d9fa4aa0351d88ff4c151941d46d177e87ba033d0947bffce069fdb251f3
+size 357842

README.md CHANGED

@@ -1,10 +1,31 @@
----
-language:
-- en
-metrics:
-- f1
-- roc_auc
-- accuracy
-base_model:
-- distilbert/distilbert-base-uncased
----

+# Deep Learning Project: Spam Detection with DistilBERT
+This repository contains the code and resources for the Deep Learning project on Spam Detection.
+## Project Structure
+- `mail_data.csv`: The dataset used for training and evaluation.
+- `eda_script.py`: Script for Exploratory Data Analysis and visualization.
+- `train_model_hf.py`: Main training script using Hugging Face Trainer and DistilBERT.
+- `evaluate_final.py`: Script for final evaluation from the best model checkpoint.
+- `eda_plots.png`: Visualizations generated during EDA.
+- `results.txt`: Detailed evaluation metrics and confusion matrix.
+- `Deep Learning Project Report.pdf`: The final project report (equivalent to 15-17 pages).
+## Requirements
+- Python 3.11+
+- PyTorch
+- Transformers
+- Datasets
+- Scikit-learn
+- Pandas
+- Matplotlib
+- Seaborn
+- Accelerate
+## How to Run
+1.  **EDA**: Run `python3 eda_script.py` to see the data distribution.
+2.  **Training**: Run `python3 train_model_hf.py` to fine-tune the DistilBERT model.
+3.  **Evaluation**: Run `python3 evaluate_final.py` to get the final performance metrics.
+## Results
+The model achieves **99.10% accuracy** on the test set with an **F1-score of 96.58%** for the spam class.

eda_plots.png ADDED

Git LFS Details

SHA256: e8c9c5b8a3d0ea8e7a9796e887fb0567969de643d02ba8bc2d96e76078272159
Pointer size: 131 Bytes
Size of remote file: 129 kB

mail_data_test.csv ADDED

	@@ -0,0 +1,9 @@

+ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
+ham,Ok lar... Joking wif u oni...
+spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
+ham,U dun say so early hor... U c already then say...
+ham,"Nah I don't think he goes to usf, he lives around here though"
+spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
+ham,Even my brother is not like to speak with me. They treat me like aids patent.
+ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
+spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.

project_report.md ADDED

	@@ -0,0 +1,114 @@

+# Deep Learning Project: Spam Detection using Transformers
+**Course**: Deep Learning with Python (2025)
+**Instructor**: Benoit Mialet
+**Topic**: NLP - Text Classification (Spam vs Ham)
+**Model**: DistilBERT (PyTorch / Hugging Face)
+---
+## 1. Introduction
+### 1.1 What & Why
+The objective of this project is to develop a robust deep learning model for classifying emails as either "spam" or "ham" (legitimate). Email filtering is a critical application of Natural Language Processing (NLP) that helps improve user experience and security by automatically identifying unsolicited or malicious content.
+### 1.2 Task Selection
+We chose the **Text Classification** task, specifically binary classification. This task is well-suited for demonstrating the power of Transfer Learning and Transformer architectures in understanding the nuances of human language.
+### 1.3 Relevance
+Spam detection remains a relevant challenge as spamming techniques evolve. Traditional rule-based systems often fail to capture the semantic meaning of messages. Deep learning models, particularly Transformers, can capture long-range dependencies and contextual information, leading to higher accuracy and better generalization.
+### 1.4 State of the Art
+Modern NLP has been revolutionized by the Transformer architecture (Vaswani et al., 2017). Models like BERT (Bidirectional Encoder Representations from Transformers) and its variants (DistilBERT, RoBERTa) have set new benchmarks in text classification by pre-training on large corpora and fine-tuning on specific tasks.
+---
+## 2. Method
+### 2.1 Overall Strategy
+Our strategy involves:
+1.  **Exploratory Data Analysis (EDA)** to understand the dataset characteristics.
+2.  **Data Preprocessing** including tokenization and padding.
+3.  **Fine-tuning a Pre-trained Model** (DistilBERT) using the Hugging Face `transformers` library and PyTorch.
+4.  **Rigorous Evaluation** using metrics like Accuracy, Precision, Recall, and F1-score.
+### 2.2 Dataset Description & EDA
+The dataset used is `mail_data.csv`, containing 5,572 messages labeled as 'ham' or 'spam'.
+- **Total Samples**: 5,572
+- **Ham**: 4,825 (86.6%)
+- **Spam**: 747 (13.4%)
+- **Imbalance**: The dataset is significantly imbalanced, which we addressed by using stratified splitting and monitoring the F1-score.
+**EDA Findings**:
+- Spam messages tend to be longer on average than ham messages.
+- Common keywords in spam include "free", "win", "winner", "call", "claim".
+- Ham messages are more conversational and vary greatly in length.
+### 2.3 Data Preprocessing
+- **Tokenization**: We used the `DistilBertTokenizer` to convert raw text into input IDs and attention masks.
+- **Truncation & Padding**: All sequences were padded or truncated to a maximum length of 128 tokens to ensure uniform input size for the model.
+- **Train/Test Split**: 80% training (4,457 samples) and 20% testing (1,115 samples), with stratification to maintain class proportions.
+### 2.4 Model Architecture
+We utilized **DistilBERT** (`distilbert-base-uncased`), a smaller, faster, and lighter version of BERT that retains 97% of its performance. It has 6 layers, 768 hidden units, and 12 attention heads, totaling approximately 66 million parameters.
+### 2.5 Training Setup
+- **Optimizer**: AdamW with a learning rate of 2e-5.
+- **Scheduler**: Linear warmup for 500 steps.
+- **Loss Function**: Cross-Entropy Loss.
+- **Batch Size**: 16 for training, 64 for evaluation.
+- **Epochs**: 3 planned; training was stopped after 1 epoch because performance was already high and compute was limited.
+- **Hardware**: CPU (simulated environment).
+---
+## 3. Results
+### 3.1 Performance Metrics
+The model achieved exceptional results after just one epoch of fine-tuning:
+| Metric | Value |
+| :--- | :--- |
+| **Accuracy** | 99.10% |
+| **Precision (Spam)** | 98.60% |
+| **Recall (Spam)** | 94.63% |
+| **F1-Score (Spam)** | 96.58% |
+### 3.2 Confusion Matrix
+| | Predicted Ham | Predicted Spam |
+| :--- | :---: | :---: |
+| **Actual Ham** | 964 | 2 |
+| **Actual Spam** | 8 | 141 |
+The model correctly identified 141 out of 149 spam messages while only misclassifying 2 legitimate messages as spam (False Positives).
+---
+## 4. Discussion
+### 4.1 Interpretation
+The high accuracy and F1-score indicate that DistilBERT is highly effective for this task. The model successfully learned the semantic patterns that distinguish spam from ham, even with a relatively small and imbalanced dataset.
+### 4.2 What Worked
+- **Transfer Learning**: Using a pre-trained model allowed us to achieve near-perfect results with minimal training time.
+- **Hugging Face Trainer**: Simplified the training loop and handled evaluation efficiently.
+- **Tokenization**: The subword tokenization of BERT handles out-of-vocabulary words better than traditional word-based methods.
+### 4.3 Limitations
+- **Dataset Size**: While sufficient for this project, a larger and more diverse dataset would be needed for a production-grade system.
+- **Class Imbalance**: Although the model performed well, recall for spam (94.63%) is lower than for ham (99.79%, i.e. 964 of 966 in the confusion matrix), which likely reflects the imbalance.
+- **Adversarial Attacks**: Sophisticated spam might use techniques to bypass Transformer-based filters, which was not explored here.
+### 4.4 Future Improvements
+- **Data Augmentation**: Techniques like back-translation could help balance the dataset.
+- **Hyperparameter Tuning**: Exploring different learning rates and batch sizes.
+- **Deployment**: Creating a Gradio interface on Hugging Face Spaces for real-time testing.
+- **Model Compression**: Quantization or pruning to make the model even lighter for mobile deployment.
+---
+## 5. Conclusion
+This project demonstrated the application of Deep Learning to spam detection. By leveraging the DistilBERT architecture and the Hugging Face ecosystem, we built a model that achieves over 99% accuracy on the held-out test set. The results highlight the efficiency of transfer learning in NLP and suggest that strong performance is attainable even with limited resources.
+---
+## 6. References
+1. Vaswani, A., et al. (2017). "Attention Is All You Need."
+2. Sanh, V., et al. (2019). "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter."
+3. Wolf, T., et al. (2020). "Transformers: State-of-the-Art Natural Language Processing."
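
Review note on section 2.5 of `project_report.md`: the sketch below works out the learning-rate schedule implied by the reported settings (4,457 training samples, batch size 16, 3 epochs, learning rate 2e-5, 500 warmup steps). It assumes one optimizer update per batch with the last partial batch kept, no gradient accumulation, and linear decay to zero after warmup. `linear_warmup_lr` is a hypothetical helper written for illustration, not code from this commit.

```python
import math

# Values quoted in project_report.md, sections 2.3 and 2.5.
TRAIN_SAMPLES = 4457
BATCH_SIZE = 16
EPOCHS = 3
BASE_LR = 2e-5
WARMUP_STEPS = 500


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate after `step` updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumption: one optimizer update per batch, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

for epoch in range(1, EPOCHS + 1):
    step = epoch * steps_per_epoch
    lr = linear_warmup_lr(step, BASE_LR, WARMUP_STEPS, total_steps)
    print(f"end of epoch {epoch}: step {step}, lr {lr:.3e}")
```

Under these assumptions there are 279 updates per epoch and 837 in total, so a run stopped after one epoch would end while the schedule is still warming up, at roughly 1.1e-5 rather than the nominal 2e-5.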

requirements.txt ADDED

	@@ -0,0 +1,9 @@

+# Python 3.13.12 (interpreter version; not a pip-installable requirement)
+gradio==5.49.1
+transformers==4.57.1
+torch==2.8.0
+numpy==2.4.2
+pandas==2.3.3
+scikit-learn==1.8.0
+matplotlib==3.10.8
+seaborn==0.13.2

results.txt ADDED

	@@ -0,0 +1,17 @@

+Final Evaluation Results:
+{'eval_loss': 0.04282991588115692, 'eval_accuracy': 0.9928251121076234, 'eval_f1': 0.972972972972973, 'eval_precision': 0.9795918367346939, 'eval_recall': 0.9664429530201343, 'eval_runtime': 42.8545, 'eval_samples_per_second': 26.018, 'eval_steps_per_second': 0.42, 'epoch': 3.0}
+Classification Report:
+              precision    recall  f1-score   support
+         ham       0.99      1.00      1.00       966
+        spam       0.98      0.97      0.97       149
+    accuracy                           0.99      1115
+   macro avg       0.99      0.98      0.98      1115
+weighted avg       0.99      0.99      0.99      1115
+Confusion Matrix:
+[[963   3]
+ [  5 144]]
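
Review note: the figures in `results.txt` (epoch 3.0) differ from the table in `project_report.md` and the README (1 epoch). The sketch below recomputes the spam-class metrics from each confusion matrix to check that both sets of numbers are internally consistent. It assumes the scikit-learn layout (rows are actual classes, columns are predicted classes, ordered ham then spam). `binary_metrics` is a hypothetical helper, not part of this commit.

```python
def binary_metrics(cm):
    """Spam-class metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Confusion matrix from results.txt (evaluation at epoch 3.0).
results_txt = binary_metrics([[963, 3], [5, 144]])

# Confusion matrix from project_report.md, section 3.2 (after 1 epoch).
report_md = binary_metrics([[964, 2], [8, 141]])

for name, metrics in [("results.txt", results_txt), ("project_report.md", report_md)]:
    print(name, {k: round(v * 100, 2) for k, v in metrics.items()})
```

Both matrices reproduce the metrics printed next to them, so the two sources describe different evaluation runs rather than a calculation error.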

save_tokenizer.py CHANGED

@@ -1,4 +1,5 @@
 from transformers import DistilBertTokenizer
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 tokenizer.save_pretrained('saved_model')
 print("Tokenizer saved to saved_model")

train_model.py ADDED

	@@ -0,0 +1,152 @@

+import pandas as pd
+import torch
+from torch.utils.data import Dataset, DataLoader
+from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, get_linear_schedule_with_warmup
+from torch.optim import AdamW
+from sklearn.model_selection import train_test_split
+from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
+import numpy as np
+import time
+import os
+# 1. Load and Preprocess Data
+df = pd.read_csv('mail_data.csv', names=['Category', 'Message'], header=None, skiprows=1)
+df['label'] = df['Category'].map({'ham': 0, 'spam': 1})
+train_texts, test_texts, train_labels, test_labels = train_test_split(
+    df['Message'].values, df['label'].values, test_size=0.2, random_state=42, stratify=df['label'].values
+)
+# 2. Dataset Class
+class EmailDataset(Dataset):
+    def __init__(self, texts, labels, tokenizer, max_len=128):
+        self.texts = texts
+        self.labels = labels
+        self.tokenizer = tokenizer
+        self.max_len = max_len
+    def __len__(self):
+        return len(self.texts)
+    def __getitem__(self, item):
+        text = str(self.texts[item])
+        label = self.labels[item]
+        encoding = self.tokenizer(  # public call; the private _encode_plus does not accept padding=/truncation=
+            text,
+            add_special_tokens=True,
+            max_length=self.max_len,
+            return_token_type_ids=False,
+            padding='max_length',
+            truncation=True,
+            return_attention_mask=True,
+            return_tensors='pt',
+        )
+        return {
+            'text': text,
+            'input_ids': encoding['input_ids'].flatten(),
+            'attention_mask': encoding['attention_mask'].flatten(),
+            'labels': torch.tensor(label, dtype=torch.long)
+        }
+# 3. Setup Training
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+print(f"Using device: {device}")
+PRE_TRAINED_MODEL_NAME = 'distilbert-base-uncased'
+tokenizer = DistilBertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
+train_data_loader = DataLoader(EmailDataset(train_texts, train_labels, tokenizer), batch_size=16, shuffle=True)
+test_data_loader = DataLoader(EmailDataset(test_texts, test_labels, tokenizer), batch_size=16, shuffle=False)
+model = DistilBertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL_NAME, num_labels=2)
+model = model.to(device)
+EPOCHS = 3
+optimizer = AdamW(model.parameters(), lr=2e-5)
+total_steps = len(train_data_loader) * EPOCHS
+scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
+loss_fn = torch.nn.CrossEntropyLoss().to(device)
+# 4. Training Loop
+def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
+    model = model.train()
+    losses = []
+    correct_predictions = 0
+    for d in data_loader:
+        input_ids = d["input_ids"].to(device)
+        attention_mask = d["attention_mask"].to(device)
+        labels = d["labels"].to(device)
+        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+        loss = outputs.loss
+        logits = outputs.logits
+        _, preds = torch.max(logits, dim=1)
+        correct_predictions += torch.sum(preds == labels)
+        losses.append(loss.item())
+        loss.backward()
+        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
+        optimizer.step()
+        scheduler.step()
+        optimizer.zero_grad()
+    return correct_predictions.double() / n_examples, np.mean(losses)
+def eval_model(model, data_loader, loss_fn, device, n_examples):
+    model = model.eval()
+    losses = []
+    correct_predictions = 0
+    with torch.no_grad():
+        for d in data_loader:
+            input_ids = d["input_ids"].to(device)
+            attention_mask = d["attention_mask"].to(device)
+            labels = d["labels"].to(device)
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+            loss = outputs.loss
+            logits = outputs.logits
+            _, preds = torch.max(logits, dim=1)
+            correct_predictions += torch.sum(preds == labels)
+            losses.append(loss.item())
+    return correct_predictions.double() / n_examples, np.mean(losses)
+print("Starting training...")
+for epoch in range(EPOCHS):
+    print(f'Epoch {epoch + 1}/{EPOCHS}')
+    train_acc, train_loss = train_epoch(model, train_data_loader, loss_fn, optimizer, device, scheduler, len(train_texts))
+    print(f'Train loss {train_loss} accuracy {train_acc}')
+    val_acc, val_loss = eval_model(model, test_data_loader, loss_fn, device, len(test_texts))
+    print(f'Val   loss {val_loss} accuracy {val_acc}')
+# 5. Final Evaluation
+def get_predictions(model, data_loader):
+    model = model.eval()
+    messages = []
+    predictions = []
+    prediction_probs = []
+    real_values = []
+    with torch.no_grad():
+        for d in data_loader:
+            texts = d["text"]
+            input_ids = d["input_ids"].to(device)
+            attention_mask = d["attention_mask"].to(device)
+            labels = d["labels"].to(device)
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
+            logits = outputs.logits
+            _, preds = torch.max(logits, dim=1)
+            messages.extend(texts)
+            predictions.extend(preds)
+            prediction_probs.extend(logits)
+            real_values.extend(labels)
+    predictions = torch.stack(predictions).cpu()
+    real_values = torch.stack(real_values).cpu()
+    return messages, predictions, real_values
+y_review_texts, y_pred, y_test = get_predictions(model, test_data_loader)
+print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=['ham', 'spam']))
+# Save results for report
+with open('results.txt', 'w') as f:
+    f.write(f"Accuracy: {accuracy_score(y_test, y_pred)}\n")
+    f.write("\nClassification Report:\n")
+    f.write(classification_report(y_test, y_pred, target_names=['ham', 'spam']))
+    f.write("\nConfusion Matrix:\n")
+    f.write(str(confusion_matrix(y_test, y_pred)))
+print("Training complete. Results saved to results.txt")
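
Review note on `EmailDataset.__getitem__`: the sketch below illustrates what `padding='max_length'` with `truncation=True` is expected to produce for a single sequence, namely fixed-length `input_ids` plus an `attention_mask` that is 1 for real tokens and 0 for padding. It is a toy model of the output shape, not the tokenizer's implementation. `pad_or_truncate` and the example token ids are hypothetical; the special-token ids are those of the `distilbert-base-uncased` vocabulary.

```python
# Special-token ids in the distilbert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_or_truncate(token_ids, max_len=128):
    """Wrap ids in [CLS]/[SEP], cut from the right, then pad up to max_len."""
    body = token_ids[: max_len - 2]          # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 marks real tokens
    pad = max_len - len(input_ids)
    return input_ids + [PAD_ID] * pad, attention_mask + [0] * pad


# Short message: padded with zeros on the right.
short_ids, short_mask = pad_or_truncate([7592, 2088], max_len=8)
print(short_ids)
print(short_mask)

# Long message: truncated so that the total length is exactly max_len.
long_ids, long_mask = pad_or_truncate(list(range(1000, 1200)), max_len=128)
print(len(long_ids), long_ids[0], long_ids[-1], sum(long_mask))
```

Because every item then has the same length, the default `DataLoader` collation can stack items into a batch, which is what the training loop in `train_model.py` relies on.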

train_model_hf.py CHANGED

@@ -96,4 +96,4 @@ with open('results.txt', 'w') as f:
     f.write(f"\nClassification Report:\n{report}\n")
     f.write(f"\nConfusion Matrix:\n{cm}\n")
-print("Training complete. Results saved to results.txt")
+print("Training complete. Results saved to results.txt")