---
language: en
license: mit
pipeline_tag: text-classification
library_name: transformers
tags:
- spam-detection
- text-classification
- distilbert
- nlp
---
# DistilBERT Spam Classification (Binary)
An end-to-end project for **spam vs ham** text classification, covering dataset cleaning, classical ML baselines (TF-IDF + linear models), and Transformer fine-tuning of `distilbert-base-uncased`.
## Task
- **Type:** Binary text classification
- **Labels:** `ham` (0), `spam` (1)
## Dataset
- Main columns:
  - `Content` (training text)
  - `SpamHam` (label: `spam`/`ham`, with some missing `NaN` entries)
- Dataset snapshot shown in the notebook:
  - Total rows: **2241**
  - `spam`: **830**, `ham`: **776**, `NaN`: **635**
- Extra feature engineering:
  - `URLSource`, derived from URL patterns (twitter / youtube / other)
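The exact URL patterns aren't listed in this card; a minimal sketch of how `URLSource` could be derived, with hypothetical match rules and toy rows:

```python
import pandas as pd

def derive_url_source(url):
    """Map a URL to a coarse source bucket (hypothetical patterns)."""
    if not isinstance(url, str):
        return "other"
    url = url.lower()
    if "twitter.com" in url or "//t.co/" in url:
        return "twitter"
    if "youtube.com" in url or "youtu.be" in url:
        return "youtube"
    return "other"

# Toy rows standing in for the real URL column.
df = pd.DataFrame({"URL": [
    "https://twitter.com/user/status/1",
    "https://youtu.be/abc123",
    "https://example.com/page",
    None,
]})
df["URLSource"] = df["URL"].apply(derive_url_source)
```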
## Preprocessing
- Fill missing `Content` values from `TweetText` when available.
- Drop unused columns (e.g., `Time`, `Date2`, `Date`, `Author`, `URL`, `TweetText`).
- Stratified split:
  - Test split: **15%** of the full data
  - Validation split: **15%** of the remaining train/val pool
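The two-stage stratified split can be sketched with scikit-learn's `train_test_split`; the labels here are toy stand-ins for the cleaned `SpamHam` column:

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for the cleaned dataset (hypothetical).
texts = [f"msg {i}" for i in range(100)]
labels = [0] * 50 + [1] * 50  # 0 = ham, 1 = spam

# First carve out the 15% test split, stratified on the label.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=42)

# Then take 15% of the remaining train/val pool as validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, stratify=y_trainval, random_state=42)
```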
## Models
### Classical ML Baselines
- Features:
  - Word TF-IDF n-grams: (1, 2)
  - Character TF-IDF n-grams: (3, 5)
  - Combined with `FeatureUnion`
- Algorithms:
  - Logistic Regression (`class_weight="balanced"`)
  - LinearSVC (`class_weight="balanced"`)
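A minimal sketch of the baseline pipeline wiring, assuming a `char` analyzer for the character n-grams (the notebook may use `char_wb`) and a tiny hypothetical corpus:

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Word (1, 2) and character (3, 5) TF-IDF features, stacked side by side.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
])

clf = Pipeline([
    ("features", features),
    ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Toy corpus (hypothetical) just to show the wiring end to end.
X = ["win a free prize now", "call now for cash",
     "see you at lunch", "meeting moved to 3pm"]
y = [1, 1, 0, 0]  # 1 = spam, 0 = ham
clf.fit(X, y)
```

Swapping `LogisticRegression` for `LinearSVC(class_weight="balanced")` in the final step gives the second baseline.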
### Transformer Fine-Tuning (DistilBERT)
- Base model: `distilbert-base-uncased`
- Tokenization: `MAXLEN = 128`
- Training (`Trainer`):
  - Learning rate: `2e-5`
  - Train batch size: `16`
  - Eval batch size: `32`
  - Epochs: `5`
  - Best checkpoint selected by **F1**
## Results
### Logistic Regression (TF-IDF)
- Validation accuracy: **~0.971**
- Test accuracy: **~0.946**
### LinearSVC (TF-IDF)
- Validation accuracy: **~0.967**
- Test accuracy: **~0.971**
## How to Run
```bash
pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn
```