---
library_name: transformers
language:
- en
metrics:
- f1
base_model:
- distilbert/distilbert-base-uncased
datasets:
- AbdulHadi806/mail_spam_ham_dataset
pipeline_tag: text-classification
---

# Model Card for Model ID

A text classification model for spam detection in emails (deep learning project).


## Model Details

### Model Description

Model developed for the "Deep Learning with Python" course project.

- **Developed by:** Diavila Rostaing Engandzi
- **Model type:** Binary Text Classification
- **Language(s) (NLP):** English
- **Finetuned from model:** DistilBERT (`distilbert/distilbert-base-uncased`)

### Model Sources

- **Demo:** https://huggingface.co/picket-cliff/deepl-project

## Uses

The model is intended to filter spam from emails. To see it in action, clone the demo and run its app.py file.

## Training Details

### Training Data

A subset of the email_data.csv dataset [card].

A benchmark dataset for email classification with around 5,000 emails, each labeled as either "ham" or "spam".
To evaluate the model, the data was split into training and test sets (80-20 split).
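The 80-20 split can be sketched in plain Python as follows. This is only an illustration: the card does not say how the split was drawn, so the shuffling, the `seed` value, the absence of stratification, and the toy `emails` list are all assumptions.

```python
import random


def split_80_20(rows, seed=42):
    """Shuffle a copy of rows and return (train, test) in an 80-20 ratio.

    The seed and the plain (non-stratified) shuffle are assumptions made
    for this sketch, not details taken from the project.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 8) // 10  # index where the test portion starts
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the real dataset: (text, label) pairs.
emails = [(f"message {i}", "spam" if i % 7 == 0 else "ham") for i in range(100)]
train_rows, test_rows = split_80_20(emails)
```

With an imbalanced dataset, a stratified split (keeping the ham/spam ratio equal in both parts) is often preferred; whether that was done here is not stated.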

#### Preprocessing

Deep learning models cannot process raw text; they require numerical tensors. We utilized the Hugging Face DistilBertTokenizer.

1.	Sub-word Tokenization: Instead of splitting on spaces (which handles typos and rare words poorly), DistilBERT uses WordPiece tokenization. An out-of-vocabulary word is broken into known sub-words where possible, so the model rarely encounters the "Unknown" ([UNK]) token.

2.	Special Tokens: The tokenizer automatically prepends the [CLS] (Classification) token to the start of the sequence and appends the [SEP] (Separator) token to the end. The final hidden state corresponding to the [CLS] token is what the model uses for the binary classification decision.

3.	Truncation and Padding: Transformer models require fixed-size input matrices for batch processing. Based on our EDA length distribution, we set max_length = 128.

   - Sequences longer than 128 tokens were truncated.

   - Sequences shorter than 128 tokens were padded with the [PAD] token (ID 0).

4.	Attention Masks: To prevent the model from performing self-attention on meaningless padding tokens, the tokenizer generates an attention_mask (an array of 1s for real tokens and 0s for padding).
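The four steps above can be sketched with a toy tokenizer written in plain Python. It is a simplified illustration, not the Hugging Face implementation: the `VOCAB` dictionary is hypothetical (only the special-token IDs 0, 100, 101 and 102 match the uncased BERT vocabulary), words are split on whitespace only, and `max_length` is shortened to 12 so the output stays readable.

```python
# Hypothetical toy vocabulary; the real one has roughly 30,000 entries.
VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "free": 1, "win": 2, "##ner": 3, "prize": 4, "##s": 5, "a": 6}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, max_length=128):
    """Return (input_ids, attention_mask), both of length max_length."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    # Truncate the content so that [CLS] and [SEP] still fit.
    tokens = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    input_ids = [vocab[token] for token in tokens]
    attention_mask = [1] * len(input_ids)  # 1 marks a real token
    padding = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding  # 0 marks padding
    return input_ids, attention_mask


ids, mask = encode("Win a free prizes winner xyz", VOCAB, max_length=12)
print(ids)   # [101, 2, 6, 1, 4, 5, 2, 3, 100, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Here "prizes" becomes `prize` + `##s` and "winner" becomes `win` + `##ner`, while "xyz" has no matching piece in the toy vocabulary and falls back to `[UNK]`. With the real tokenizer, the same padding and truncation behavior is requested by calling it with `truncation=True, padding="max_length", max_length=128`.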


## Evaluation

Results were obtained by training the model on the training set and then evaluating it on the test set.
Results are compared to a baseline (dummy classifier) for reference.

### Testing Data, Factors & Metrics

#### Testing Data

The held-out 20% of the dataset described under Training Data.


#### Metrics

Accuracy and F1 score (macro and weighted averages).
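As a reminder of how these metrics differ, here is a minimal pure-Python sketch. The function name `classification_scores` and the toy labels are hypothetical; the project's own evaluation code is not shown in this card and may have used a library instead.

```python
def classification_scores(y_true, y_pred):
    """Return (accuracy, macro_f1, weighted_f1) for two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1, support = {}, {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1[label] = 2 * tp / denom if denom else 0.0
        support[label] = tp + fn  # number of true examples of this class

    # Macro: every class counts equally, regardless of its size.
    macro_f1 = sum(f1.values()) / len(labels)
    # Weighted: each class counts in proportion to its support.
    weighted_f1 = sum(f1[l] * support[l] for l in labels) / len(pairs)
    return accuracy, macro_f1, weighted_f1


# Toy example: 8 ham and 2 spam emails, one spam email missed.
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["ham", "spam"]
accuracy, macro_f1, weighted_f1 = classification_scores(y_true, y_pred)
```

On imbalanced data the macro average is the stricter of the two, because a weak score on the minority class (spam) pulls it down as much as a weak score on the majority class would.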

### Results

When evaluated on an 80-20 split, we obtained:

- Accuracy: 99.10%

- Macro Average F1-Score: 0.98

- Weighted Average F1-Score: 0.99

For comparison, the dummy classifier achieved 86.6% accuracy.
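The card does not say which strategy the dummy classifier used. A common choice is to always predict the most frequent training label, which the sketch below assumes; the function name and the toy label counts are hypothetical and are not the project's data.

```python
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Always predict the most frequent training label (assumed strategy).

    Returns the chosen label and the resulting accuracy on test_labels.
    """
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(label == majority_label for label in test_labels)
    return majority_label, correct / len(test_labels)


# Toy labels with a ham-heavy class balance.
train_labels = ["ham"] * 87 + ["spam"] * 13
test_labels = ["ham"] * 17 + ["spam"] * 3
label, baseline_accuracy = majority_baseline(train_labels, test_labels)
```

Under this assumption, the baseline accuracy equals the share of the majority class in the test set, which is why accuracy alone can look strong on imbalanced data and why the F1 scores are reported alongside it.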

#### Summary

The model performance is