---
library_name: transformers
language:
- en
metrics:
- f1
base_model:
- distilbert/distilbert-base-uncased
datasets:
- AbdulHadi806/mail_spam_ham_dataset
pipeline_tag: text-classification
---
# Model Card for deepl-project-model
Text classification model for spam detection (deep learning project).
## Model Details
### Model Description
Model developed for the "Deep Learning with Python" course project.
- **Developed by:** Diavila Rostaing Engandzi
- **Model type:** Binary Text Classification
- **Language(s) (NLP):** English
- **Finetuned from model:** DistilBERT (distilbert/distilbert-base-uncased)
### Model Sources
- **Demo:** https://huggingface.co/picket-cliff/deepl-project
## Uses
The model is intended for filtering spam out of emails. To see it in action, clone the Demo and run its app.py file.
## Training Details
### Training Data
Subset of the email_data.csv dataset (dataset card: AbdulHadi806/mail_spam_ham_dataset).
It is a benchmark dataset for email classification with around 5,000 emails, each labeled either "ham" or "spam".
To evaluate the model, the data was split into training and test sets (80-20 split).
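
The 80-20 split can be illustrated with a minimal sketch. The function name, the seed, and the use of a plain shuffled split are assumptions for illustration only; the card does not say whether the split was stratified or which tool performed it.

```python
import random


def split_80_20(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and return (train, test).

    `seed` and the non-stratified shuffle are illustrative assumptions,
    not the project's documented procedure.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for ~5,000 (text, label) rows; real rows would come from the CSV.
rows = list(range(5000))
train, test = split_80_20(rows)
print(len(train), len(test))
```

With an imbalanced spam/ham ratio, a stratified split (same class ratio in both parts) is often preferred so that the test set reflects the overall class distribution.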
#### Preprocessing
Deep learning models cannot process raw text; they require numerical tensors. We utilized the Hugging Face DistilBertTokenizer.
1. Sub-word Tokenization: Instead of splitting on spaces (which handles typos and rare words poorly), DistilBERT uses WordPiece tokenization. For example, an out-of-vocabulary word can be broken into known sub-words, so the model rarely encounters the [UNK] (Unknown) token.
2. Special Tokens: The tokenizer automatically prepends the [CLS] (Classification) token to the sequence and appends the [SEP] (Separator) token at the end. The final hidden state corresponding to the [CLS] token is what the model uses for the binary classification decision.
3. Truncation and Padding: Batch processing requires every sequence in a batch to have the same length. Based on the length distribution from our EDA, we set max_length = 128.
   - Sequences longer than 128 tokens were truncated.
   - Sequences shorter than 128 tokens were padded with the [PAD] token (ID 0).
4. Attention Masks: To prevent the model from performing self-attention on meaningless padding tokens, the tokenizer generates an attention_mask (an array of 1s for real tokens and 0s for padding).
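
The four steps above can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: the vocabulary is a toy one (only the IDs of [PAD], [UNK], [CLS] and [SEP] match the uncased BERT vocabulary that DistilBERT uses), and the helper names `wordpiece` and `encode` are hypothetical.

```python
# Toy vocabulary. Special-token IDs match the uncased BERT/DistilBERT vocab;
# the word and sub-word entries are made up for this example.
VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "free": 1, "prize": 2, "win": 3, "##ner": 4, "##s": 5}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, max_length=128):
    """Tokenize, add special tokens, truncate, pad, and build the mask."""
    tokens = [p for w in text.lower().split() for p in wordpiece(w, vocab)]
    # Truncate so that [CLS] and [SEP] still fit within max_length.
    tokens = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [vocab["[PAD]"]] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


print(wordpiece("winners", VOCAB))  # ['win', '##ner', '##s']
print(encode("Free prize winners", VOCAB, max_length=8)["input_ids"])
# [101, 1, 2, 3, 4, 5, 102, 0]
```

With the real tokenizer, the same settings correspond to calling it with `truncation=True`, `padding="max_length"` and `max_length=128`.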
## Evaluation
Results were obtained by training the model on the training set and then evaluating it on the test set.
Results are compared to a baseline (dummy classifier) for reference.
### Testing Data, Factors & Metrics
#### Testing Data
The held-out 20% of the dataset described under Training Data.
#### Metrics
Accuracy and F1 score (macro and weighted averages).
### Results
When evaluated on the test set of an 80-20 split, we obtained:
- Accuracy: 99.10%
- Macro Average F1-Score: 0.98
- Weighted Average F1-Score: 0.99

Meanwhile, the dummy classifier achieved 86.6% accuracy.
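
To show how these metrics relate to each other, here is a minimal sketch that computes accuracy, macro F1 and weighted F1, and compares a model against a dummy baseline. The labels are made-up toy data, not the project's results, and the dummy is assumed to always predict the majority class; the card does not state which dummy strategy was used.

```python
from collections import Counter


def f1_for_label(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def report(y_true, y_pred):
    """Accuracy plus macro- and weighted-average F1."""
    labels = sorted(set(y_true))
    support = Counter(y_true)  # number of true samples per class
    f1 = {lab: f1_for_label(y_true, y_pred, lab) for lab in labels}
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "macro_f1": sum(f1.values()) / len(labels),  # every class counts equally
        "weighted_f1": sum(f1[lab] * support[lab] for lab in labels) / len(y_true),
    }


# Toy labels (hypothetical): 8 ham, 2 spam.
y_true = ["ham"] * 8 + ["spam"] * 2
y_dummy = ["ham"] * len(y_true)          # assumed strategy: always the majority class
y_model = ["ham"] * 8 + ["spam", "ham"]  # a model that misses one spam email

print("dummy:", report(y_true, y_dummy))
print("model:", report(y_true, y_model))
```

Under that assumption, the dummy's accuracy equals the share of the majority class, while its macro F1 is low because it never predicts spam. This is why macro F1 is a useful complement to accuracy on imbalanced data.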
#### Summary
The model performance is