How to use picket-cliff/deepl-project-model with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="picket-cliff/deepl-project-model")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("picket-cliff/deepl-project-model")
model = AutoModelForSequenceClassification.from_pretrained("picket-cliff/deepl-project-model")
```
| library_name: transformers | |
| language: | |
| - en | |
| metrics: | |
| - f1 | |
| base_model: | |
| - distilbert/distilbert-base-uncased | |
| datasets: | |
| - AbdulHadi806/mail_spam_ham_dataset | |
| pipeline_tag: text-classification | |
| # Model Card for picket-cliff/deepl-project-model | |
| Text classification model for spam detection (deep learning project). | |
| ## Model Details | |
| ### Model Description | |
| Model developed for the "Deep Learning with Python" course project. | |
| - **Developed by:** Diavila Rostaing Engandzi | |
| - **Model type:** Binary Text Classification | |
| - **Language(s) (NLP):** English | |
| - **Finetuned from model:** DistilBERT (distilbert/distilbert-base-uncased) | |
| ### Model Sources | |
| - **Demo:** https://huggingface.co/picket-cliff/deepl-project | |
| ## Uses | |
| The model is intended to filter spam from email. Clone the demo and run its app.py file to see it in action. | |
| ## Training Details | |
| ### Training Data | |
| Subset of the email_data.csv dataset (see the AbdulHadi806/mail_spam_ham_dataset dataset card). | |
| A benchmark dataset for email classification with around 5,000 emails labeled as either "ham" or "spam". | |
| To evaluate the model, the data was split into training and test sets (80-20 split). | |
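The 80-20 split can be illustrated with a minimal sketch. The card does not say how the split was drawn, so the seeded shuffle, the `train_test_split_80_20` helper, and the toy `emails` list below are assumptions made for illustration, not the project's actual code.

```python
import random

def train_test_split_80_20(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the (text, label) pairs in the dataset.
emails = [(f"message {i}", "spam" if i % 7 == 0 else "ham") for i in range(100)]

train, test = train_test_split_80_20(emails)
print(len(train), len(test))  # 80 20
```

With an imbalanced label distribution, a stratified split (one that preserves the ham/spam ratio in both parts) is a common alternative. Whether the project used one is not stated.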
| #### Preprocessing | |
| Deep learning models cannot process raw text; they require numerical tensors. We utilized the Hugging Face DistilBertTokenizer. | |
| 1. Sub-word Tokenization: Instead of splitting on spaces (which struggles with typos and rare words), DistilBERT uses WordPiece tokenization. An out-of-vocabulary word is broken into known sub-words where possible, so the model rarely encounters the "Unknown" ([UNK]) token. | |
| 2. Special Tokens: The tokenizer automatically prepends the [CLS] (Classification) token to the start of the sequence and appends the [SEP] (Separator) token to the end. The final hidden state corresponding to the [CLS] token is what the model uses for the binary classification decision. | |
| 3. Truncation and Padding: Transformer models require fixed-size input matrices for batch processing. Based on the length distribution from our EDA, we set max_length = 128. | |
|    - Sequences longer than 128 tokens were truncated. | |
|    - Sequences shorter than 128 tokens were padded with the [PAD] token (ID 0). | |
| 4. Attention Masks: To prevent the model from attending to meaningless padding tokens, the tokenizer generates an attention_mask (an array of 1s for real tokens and 0s for padding). | |
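The four preprocessing steps above can be sketched in plain Python. This is an illustration of the mechanics, not the DistilBertTokenizer itself. `TOY_VOCAB` is a hypothetical five-entry vocabulary, and the `wordpiece` and `encode` helpers are simplified stand-ins. The special-token IDs are those of the uncased BERT/DistilBERT vocabulary.

```python
# Special-token IDs as found in the uncased BERT/DistilBERT vocabulary.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102

# Hypothetical toy vocabulary; the real one has roughly 30,000 entries.
TOY_VOCAB = {"free": 1000, "win": 1001, "##ner": 1002, "##s": 1003, "prize": 1004}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            # Pieces that continue a word carry the "##" prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab, max_length=128):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = []
    for word in text.lower().split():
        ids.extend(vocab.get(p, UNK_ID) for p in wordpiece(word, vocab))
    # Truncate, leaving room for [CLS] and [SEP].
    ids = [CLS_ID] + ids[: max_length - 2] + [SEP_ID]
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_length - len(ids)    # 0 marks padding
    return ids + [PAD_ID] * padding, mask + [0] * padding

print(wordpiece("winners", TOY_VOCAB))  # ['win', '##ner', '##s']
input_ids, attention_mask = encode("Free prize winners", TOY_VOCAB, max_length=8)
print(input_ids)       # [101, 1000, 1004, 1001, 1002, 1003, 102, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 0]
```

In practice the Hugging Face tokenizer performs all of this in one call; the sketch only makes the shape of `input_ids` and `attention_mask` concrete.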
| ## Evaluation | |
| Results were obtained by training the model on the training set and then evaluating it on the test set. | |
| Results are compared to a baseline (dummy classifier) for reference. | |
| ### Testing Data, Factors & Metrics | |
| #### Testing Data | |
| #### Metrics | |
| Accuracy and F1 score (macro and weighted averages) | |
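The difference between the macro and weighted averages matters on an imbalanced dataset like this one. The following sketch computes both from scratch on toy labels. The helper names and the ten example labels are hypothetical and are not taken from the project's evaluation.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_and_weighted_f1(y_true, y_pred):
    labels = sorted(set(y_true))
    scores = {l: f1_per_class(y_true, y_pred, l) for l in labels}
    support = {l: y_true.count(l) for l in labels}
    macro = sum(scores.values()) / len(labels)  # every class counts equally
    weighted = sum(scores[l] * support[l] for l in labels) / len(y_true)  # weighted by class size
    return macro, weighted

# Toy labels: 8 ham, 2 spam, with one spam email misclassified as ham.
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["spam", "ham"]

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(round(macro, 3), round(weighted, 3))  # 0.804 0.886
```

The single missed spam email lowers the macro average more than the weighted one, because the macro average gives the small spam class the same weight as the large ham class.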
| ### Results | |
| When evaluated on the test portion of the 80-20 split, we obtained: | |
| - Accuracy: 99.10% | |
| - Macro Average F1-Score: 0.98 | |
| - Weighted Average F1-Score: 0.99 | |
| For comparison, the dummy classifier achieved 86.6% accuracy. | |
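The card does not say which strategy the dummy classifier used. Assuming it always predicts the most frequent training label, its accuracy equals the share of that label in the test set. The sketch below shows this on hypothetical toy labels; the `majority_baseline_accuracy` helper is illustrative only.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Always predict the most frequent training label; return (label, accuracy)."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(label == majority_label for label in test_labels)
    return majority_label, correct / len(test_labels)

# Hypothetical toy labels, not the project's data.
train_labels = ["ham"] * 70 + ["spam"] * 10
test_labels = ["ham"] * 17 + ["spam"] * 3

label, accuracy = majority_baseline_accuracy(train_labels, test_labels)
print(label, accuracy)  # ham 0.85
```

Such a baseline scores well on accuracy without detecting any spam, which is why the F1 scores are reported alongside accuracy.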
| #### Summary | |
| The model performance is |