	---
	library_name: transformers
	language:
	- en
	metrics:
	- f1
	base_model:
	- distilbert/distilbert-base-uncased
	datasets:
	- AbdulHadi806/mail_spam_ham_dataset
	pipeline_tag: text-classification
	---

	# Model Card for deepl-project-model

	A DistilBERT-based text classification model that detects spam in English emails (deep learning course project).


	## Model Details

	### Model Description

	Model developed for the "Deep Learning with Python" course project.

	- Developed by: Diavila Rostaing Engandzi
	- Model type: Binary Text Classification
	- Language(s) (NLP): English
	- Finetuned from model: DistilBERT (distilbert/distilbert-base-uncased)

	### Model Sources

	- Demo: https://huggingface.co/picket-cliff/deepl-project

	## Uses

	The model is intended for filtering spam out of email. To see it in action, clone the Demo repository and run its app.py file.
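The model can presumably also be called directly from Python. The following is a minimal sketch, assuming the weights are published under the repository ID `picket-cliff/deepl-project-model` and that the `transformers` library is installed. The helper names `load_classifier` and `classify` are hypothetical, and the label strings the model returns depend on its configuration, which this card does not state.

```python
MODEL_ID = "picket-cliff/deepl-project-model"  # assumed repository ID


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads the weights on first use)."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def classify(classifier, emails):
    """Return (label, score) pairs for a list of email texts."""
    # truncation=True keeps long emails within the model's maximum input length.
    results = classifier(list(emails), truncation=True)
    return [(r["label"], r["score"]) for r in results]
```

Calling `classify(load_classifier(), ["Congratulations, you won a free prize!"])` should return one `(label, score)` pair per input email.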

	## Training Details

	### Training Data

	Subset from the email_data.csv dataset [card].

	A benchmark dataset for email classification containing around 5,000 emails, each labeled as either "ham" or "spam".
	To evaluate the model, the data was split into training and test sets (80-20 split).
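An 80-20 split can be sketched as follows. The card does not say whether the split was stratified or which random seed was used, so the stratification, the seed, and the function name `stratified_split` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, test_fraction=0.2, seed=42):
    """Split (text, label) pairs into train and test sets, keeping class ratios."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test += [(t, label) for t in items[:n_test]]
        train += [(t, label) for t in items[n_test:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data (hypothetical): 80 "ham" and 20 "spam" messages.
texts = [f"message {i}" for i in range(100)]
labels = ["ham"] * 80 + ["spam"] * 20
train_set, test_set = stratified_split(texts, labels)
```

Stratifying matters on an imbalanced dataset like this one, because a purely random split could leave the test set with very few spam examples.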

	#### Preprocessing

	Deep learning models cannot process raw text; they require numerical tensors. We utilized the Hugging Face DistilBertTokenizer.

	1. Sub-word Tokenization: Instead of splitting by spaces (which struggles with typos and rare words), DistilBERT uses WordPiece tokenization. For example, an out-of-vocabulary word can be broken into known sub-words, which greatly reduces how often the model encounters the [UNK] ("Unknown") token.

	2. Special Tokens: The tokenizer automatically prepends the [CLS] (Classification) token to the start of the sequence and appends the [SEP] (Separator) token to the end. The final hidden state corresponding to the [CLS] token is what the model uses for the binary classification decision.

	3. Truncation and Padding: Transformer models require fixed-size input matrices for batch processing. Based on the length distribution observed in our exploratory data analysis (EDA), we set max_length = 128.

	   - Sequences longer than 128 tokens were truncated.
	   - Sequences shorter than 128 tokens were padded with the [PAD] token (ID 0).

	4. Attention Masks: To prevent the model from attending to meaningless padding tokens, the tokenizer generates an attention_mask (an array of 1s for real tokens and 0s for padding).
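The four steps above can be illustrated with a simplified, pure-Python sketch. It is not the DistilBertTokenizer implementation: the vocabulary is a hypothetical toy one, text normalization and punctuation splitting are omitted, and `max_length` is set to 8 rather than 128 so the result is easy to read. The function names `wordpiece` and `encode` are made up for this example.

```python
# Toy vocabulary (hypothetical); [PAD] has ID 0 as in the text above.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]",
         "free", "prize", "win", "##ner", "##s", "click", "now"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def wordpiece(word):
    """Greedy longest-match-first split of one word into known sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            # Non-initial pieces carry the "##" continuation prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in TOKEN_TO_ID:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no sub-word matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, max_length=8):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    tokens = [p for word in text.lower().split() for p in wordpiece(word)]
    tokens = tokens[: max_length - 2]  # truncate, leaving room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [TOKEN_TO_ID[t] for t in tokens]
    attention_mask = [1] * len(input_ids)  # 1 for every real token
    padding = max_length - len(input_ids)
    input_ids += [TOKEN_TO_ID["[PAD]"]] * padding
    attention_mask += [0] * padding  # 0 for every padding position
    return input_ids, attention_mask


ids, mask = encode("Winners click now")
# "winners" is split into "win", "##ner", "##s"
# ids  -> [2, 6, 7, 8, 9, 10, 3, 0]
# mask -> [1, 1, 1, 1, 1, 1, 1, 0]
```

With the real tokenizer, the same result is normally obtained in a single call by passing the padding, truncation, and maximum-length options to the tokenizer.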


	## Evaluation

	The results were obtained by training the model on the training set and then evaluating it on the test set.
	They are compared to a baseline (dummy classifier) for reference.

	### Testing Data, Factors & Metrics

	#### Testing Data

	The held-out 20% test split described under Training Data.


	#### Metrics

	Accuracy and F1 score (macro and weighted averages).
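For reference, the three metrics can be computed as in the sketch below. The labels are hypothetical toy values, not the model's predictions, and the function names are made up for this example. Macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its number of true examples.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: (f1, support)} for every label present in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        scores[label] = (f1, tp + fn)  # support = number of true examples
    return scores


def summarize(y_true, y_pred):
    """Compute accuracy, macro F1 and weighted F1."""
    per_class = f1_per_class(y_true, y_pred)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)
    weighted_f1 = sum(f1 * n for f1, n in per_class.values()) / len(y_true)
    return accuracy, macro_f1, weighted_f1


# Toy labels (hypothetical): one "ham" email is wrongly flagged as "spam".
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 7 + ["spam"] + ["spam"] * 2
accuracy, macro_f1, weighted_f1 = summarize(y_true, y_pred)
```

On imbalanced data the macro average is usually the stricter of the two, since errors on the minority class count as much as errors on the majority class.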

	### Results

	When evaluated on the test portion of an 80-20 split, the model obtained:

	- Accuracy: 99.10%
	- Macro Average F1-Score: 0.98
	- Weighted Average F1-Score: 0.99

	For comparison, the dummy classifier baseline achieved 86.6% accuracy.
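The card does not state which strategy the dummy classifier used. A common choice is to always predict the most frequent training label, which the sketch below assumes. The label counts are hypothetical and the function name `majority_baseline` is made up for this example.

```python
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Predict the most common training label for every test example."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    predictions = [majority_label] * len(test_labels)
    accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
    return majority_label, accuracy


# Toy labels (hypothetical): an imbalanced set with 85% "ham".
train_labels = ["ham"] * 340 + ["spam"] * 60
test_labels = ["ham"] * 85 + ["spam"] * 15
label, baseline_accuracy = majority_baseline(train_labels, test_labels)
```

Under this assumption the baseline accuracy simply equals the share of the majority class in the test set, which is why a high accuracy alone says little on imbalanced data and the F1 scores are reported alongside it.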

	#### Summary

	The model performance is