---
language: en
license: mit
pipeline_tag: text-classification
library_name: transformers
tags:
- spam-detection
- text-classification
- distilbert
- nlp
---

# DistilBERT Spam Classification (Binary)

An end-to-end project for **spam vs. ham** text classification, covering dataset cleaning, classical ML baselines (TF-IDF + linear models), and Transformer fine-tuning with `distilbert-base-uncased`.

## Task

- **Type:** Text classification
- **Labels:** `ham` (0), `spam` (1)

## Dataset

- Main columns:
  - `Content` (training text)
  - `SpamHam` (label: spam/ham, with some missing `NaN` entries)
- Dataset snapshot from the notebook:
  - Total rows: **2241**
  - `spam`: **830**, `ham`: **776**, `NaN`: **635**
- Extra feature engineering:
  - `URLSource` derived from URL patterns (twitter / youtube / other)

## Preprocessing

- Fill missing `Content` values from `TweetText` when available.
- Drop unused columns (e.g., `Time`, `Date2`, `Date`, `Author`, `URL`, `TweetText`).
- Stratified split:
  - Test split: **15%**
  - Validation split: **15%** of the remaining train/validation pool

## Models

### Classical ML Baselines

- Features:
  - Word TF-IDF n-grams: (1, 2)
  - Character TF-IDF n-grams: (3, 5)
  - Combined with `FeatureUnion`
- Algorithms:
  - Logistic Regression (`class_weight="balanced"`)
  - LinearSVC (`class_weight="balanced"`)

### Transformer Fine-Tuning (DistilBERT)

- Base model: `distilbert-base-uncased`
- Tokenization:
  - `MAXLEN = 128`
- Training (Hugging Face `Trainer`):
  - Learning rate: `2e-5`
  - Train batch size: `16`
  - Eval batch size: `32`
  - Epochs: `5`
  - Best model selected by **F1**

## Results

### Logistic Regression (TF-IDF)

- Validation accuracy: **~0.971**
- Test accuracy: **~0.946**

### LinearSVC (TF-IDF)

- Validation accuracy: **~0.967**
- Test accuracy: **~0.971**

## How to Run

```bash
pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn
```
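The preprocessing steps described above (backfilling `Content` from `TweetText`, then dropping unused columns) can be sketched with pandas; the toy DataFrame here is illustrative, not the real dataset:

```python
import pandas as pd

# Toy frame mimicking the dataset's columns (values are illustrative).
df = pd.DataFrame({
    "Content": ["Win a free prize now!", None, "See you at lunch"],
    "TweetText": ["Win a free prize now!", "Click the link", None],
    "SpamHam": ["spam", "spam", "ham"],
    "Time": ["12:00", "13:00", "14:00"],
    "Author": ["a", "b", "c"],
})

# 1) Fill missing Content from TweetText when available (aligned by index).
df["Content"] = df["Content"].fillna(df["TweetText"])

# 2) Drop unused columns; errors="ignore" skips any that are absent here.
df = df.drop(
    columns=["Time", "Date2", "Date", "Author", "URL", "TweetText"],
    errors="ignore",
)

print(df.columns.tolist())
```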
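The two-stage stratified split above (15% held out for test, then 15% of the remainder for validation) can be sketched with scikit-learn; the labeled texts below are synthetic placeholders:

```python
from sklearn.model_selection import train_test_split

# Synthetic labeled texts (0 = ham, 1 = spam), just to demonstrate the split.
texts = [f"message {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# 1) Hold out 15% as the test set, stratified on the labels.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=42
)

# 2) Carve 15% of the remaining pool out as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, stratify=y_trainval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```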
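A minimal version of the TF-IDF baseline (word 1–2-grams plus character 3–5-grams combined via `FeatureUnion`, feeding a balanced logistic regression) might look like the sketch below; the four-sentence corpus is illustrative, and the exact vectorizer settings in the notebook may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Word- and character-level TF-IDF features, concatenated side by side.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
])

clf = Pipeline([
    ("features", features),
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tiny illustrative corpus (0 = ham, 1 = spam).
X = ["free prize click now", "win cash now", "lunch at noon?", "see you tomorrow"]
y = [1, 1, 0, 0]
clf.fit(X, y)

pred = clf.predict(["claim your free cash prize"])
print(pred)
```

The same pipeline structure works for the `LinearSVC` variant by swapping the final estimator.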
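The fine-tuning configuration listed above can be sketched with the Hugging Face `Trainer`. This is a config sketch, not a verbatim copy of the notebook: `train_ds` / `val_ds` are placeholders for tokenized datasets, and a `compute_metrics` function returning an `"f1"` key is assumed so that best-model selection by F1 works:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MAXLEN = 128
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate each text to MAXLEN subword tokens.
    return tokenizer(batch["Content"], truncation=True, max_length=MAXLEN)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="distilbert-spam",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    eval_strategy="epoch",       # named `evaluation_strategy` on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # best checkpoint selected by F1
)

# `train_ds`, `val_ds`, and `compute_metrics` are assumed to be defined.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,
                  tokenizer=tokenizer, compute_metrics=compute_metrics)
# trainer.train()
```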