---
language: en
license: mit
pipeline_tag: text-classification
library_name: transformers
tags:
- spam-detection
- text-classification
- distilbert
- nlp
---

# DistilBERT Spam Classification (Binary)

An end-to-end project for **spam vs ham** text classification: dataset cleaning, classical ML baselines (TF‑IDF + linear models), and Transformer fine‑tuning with `distilbert-base-uncased`.

## Task

- **Type:** Binary text classification
- **Labels:** `ham` (0), `spam` (1)

## Dataset

- Main columns:
  - `Content` (training text)
  - `SpamHam` (label: `spam`/`ham`, with some missing `NaN` entries)
- Dataset snapshot shown in the notebook:
  - Total rows: **2241**
  - `spam`: **830**, `ham`: **776**, `NaN`: **635**
- Extra feature engineering:
  - `URLSource` derived from URL patterns (twitter / youtube / other)
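
The `URLSource` derivation could look like the following minimal sketch. The helper name `url_source` and the exact domain patterns are illustrative assumptions, not taken from the notebook:

```python
def url_source(url):
    """Classify a URL into twitter / youtube / other by domain pattern.

    Hypothetical helper; the notebook's actual pattern rules may differ.
    """
    if not isinstance(url, str):  # NaN / missing URLs fall through to "other"
        return "other"
    u = url.lower()
    if "twitter.com" in u:
        return "twitter"
    if "youtube.com" in u or "youtu.be" in u:
        return "youtube"
    return "other"
```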

## Preprocessing

- Fill missing `Content` values using `TweetText` when available.
- Remove unused columns (e.g., `Time`, `Date2`, `Date`, `Author`, `URL`, `TweetText`).
- Stratified split:
  - Test split: **15%**
  - Validation split: **15%** from the remaining train/val set
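
The preprocessing steps above can be sketched with pandas and scikit-learn. This is a minimal sketch, not the notebook's exact code; the function name `make_splits` and the random seed are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def make_splits(df: pd.DataFrame, seed: int = 42):
    """Fill Content from TweetText, drop unused columns, return stratified splits."""
    df = df.copy()
    # Fill missing Content values from TweetText when available.
    df["Content"] = df["Content"].fillna(df["TweetText"])
    # Remove unused columns.
    df = df.drop(
        columns=["Time", "Date2", "Date", "Author", "URL", "TweetText"],
        errors="ignore",
    )
    # Drop rows still missing text or a label.
    df = df.dropna(subset=["Content", "SpamHam"])
    # 15% test, then 15% of the remaining train/val set for validation,
    # both stratified by label.
    trainval, test = train_test_split(
        df, test_size=0.15, stratify=df["SpamHam"], random_state=seed
    )
    train, val = train_test_split(
        trainval, test_size=0.15, stratify=trainval["SpamHam"], random_state=seed
    )
    return train, val, test
```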

## Models

### Classical ML Baselines

- Features:
  - Word TF‑IDF n‑grams: (1, 2)
  - Character TF‑IDF n‑grams: (3, 5)
  - Combined with `FeatureUnion`
- Algorithms:
  - Logistic Regression (`class_weight="balanced"`)
  - LinearSVC (`class_weight="balanced"`)
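
A minimal sketch of such a baseline pipeline, assuming a `make_baseline` factory (the function name and the `char` analyzer choice are assumptions; the notebook may use `char_wb` or different vectorizer options):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

def make_baseline(algorithm: str = "logreg") -> Pipeline:
    """TF-IDF (word 1-2 grams + char 3-5 grams) feeding a balanced linear model."""
    features = FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ])
    if algorithm == "logreg":
        clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    else:
        clf = LinearSVC(class_weight="balanced")
    return Pipeline([("features", features), ("clf", clf)])
```

Usage: `make_baseline().fit(texts, labels)` then `.predict(new_texts)`.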

### Transformer Fine-Tuning (DistilBERT)

- Base model: `distilbert-base-uncased`
- Tokenization:
  - `MAXLEN = 128`
- Training (`Trainer`):
  - Learning rate: `2e-5`
  - Train batch size: `16`
  - Eval batch size: `32`
  - Epochs: `5`
  - Best model selected by **F1**
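
A sketch of how these hyperparameters could wire into the `Trainer` API. Only the hyperparameter values come from the notebook; the function name `fine_tune`, the `output_dir`, the epoch-level eval/save strategy, and the `Content` column name in `tokenize` are assumptions:

```python
MAXLEN = 128  # max token length at tokenization time

# Hyperparameters as reported above.
TRAIN_CONFIG = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 5,
    "metric_for_best_model": "f1",  # best checkpoint selected by F1
}

def fine_tune(train_ds, val_ds, model_name="distilbert-base-uncased"):
    """Sketch of the fine-tuning loop; requires transformers/datasets installed."""
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2  # ham (0), spam (1)
    )
    args = TrainingArguments(
        output_dir="distilbert-spam",  # placeholder path
        eval_strategy="epoch",   # "evaluation_strategy" in older transformers
        save_strategy="epoch",
        load_best_model_at_end=True,
        **TRAIN_CONFIG,
    )

    def tokenize(batch):
        return tokenizer(batch["Content"], truncation=True, max_length=MAXLEN)

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds.map(tokenize, batched=True),
        eval_dataset=val_ds.map(tokenize, batched=True),
    )
    trainer.train()
    return trainer
```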

## Results

### Logistic Regression (TF‑IDF)

- Validation accuracy: **~0.971**
- Test accuracy: **~0.946**

### LinearSVC (TF‑IDF)

- Validation accuracy: **~0.967**
- Test accuracy: **~0.971**

## How to Run

```bash
pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn
```