# DistilBERT Spam Classification (Binary)

An end-to-end project for spam vs. ham text classification, including dataset cleaning, classical ML baselines (TF‑IDF + linear models), and Transformer fine‑tuning using `distilbert-base-uncased`.
## Task

- Type: Text classification
- Labels: `ham` (0), `spam` (1)
## Dataset

- Main columns:
  - `Content` (training text)
  - `SpamHam` (label: spam/ham, with some missing `NaN` entries)
- Dataset snapshot shown in the notebook:
  - Total rows: 2241
  - `spam`: 830, `ham`: 776, `NaN`: 635
- Extra feature engineering:
  - `URLSource` derived from URL patterns (twitter / youtube / other)
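The `URLSource` feature could be derived with simple pattern matching. The notebook's exact patterns are not shown, so the regexes below (and the helper name `url_source`) are illustrative assumptions:

```python
import re

def url_source(text: str) -> str:
    """Classify which site a URL in the text points to.
    Sketch only: the real notebook's patterns are assumptions here."""
    if re.search(r"twitter\.com|t\.co", text, re.IGNORECASE):
        return "twitter"
    if re.search(r"youtube\.com|youtu\.be", text, re.IGNORECASE):
        return "youtube"
    return "other"
```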
## Preprocessing

- Fill missing `Content` values using `TweetText` when available.
- Remove unused columns (e.g., `Time`, `Date2`, `Date`, `Author`, `URL`, `TweetText`).
- Stratified split:
  - Test split: 15%
  - Validation split: 15% of the remaining train/val set
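The preprocessing steps above can be sketched with pandas and scikit-learn. The toy DataFrame stands in for the real dataset; only the column names come from the README, and the random seed is an assumption:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (column names from the README).
content = [f"spam message {i}" for i in range(10)] + [f"ham message {i}" for i in range(10)]
content[0] = None  # simulate a missing Content entry
tweet = [None] * 20
tweet[0] = "claim your free prize"
df = pd.DataFrame({
    "Content": content,
    "TweetText": tweet,
    "SpamHam": ["spam"] * 10 + ["ham"] * 10,
})

# Fill missing Content from TweetText, then drop the unused column.
df["Content"] = df["Content"].fillna(df["TweetText"])
df = df.drop(columns=["TweetText"])

# Stratified 15% test split, then 15% of the remainder for validation.
train_val, test = train_test_split(
    df, test_size=0.15, stratify=df["SpamHam"], random_state=42)
train, val = train_test_split(
    train_val, test_size=0.15, stratify=train_val["SpamHam"], random_state=42)
```

Stratifying on `SpamHam` keeps the spam/ham ratio roughly equal across the three splits.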
## Models

### Classical ML Baselines

- Features:
  - Word TF‑IDF n‑grams: (1, 2)
  - Character TF‑IDF n‑grams: (3, 5)
  - Combined with `FeatureUnion`
- Algorithms:
  - Logistic Regression (`class_weight="balanced"`)
  - LinearSVC (`class_weight="balanced"`)
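The baseline feature setup can be sketched as a scikit-learn pipeline. The n‑gram ranges and `class_weight` come from the README; the `analyzer` choices, `max_iter`, and the toy texts are assumptions:

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Word- and character-level TF-IDF features, concatenated by FeatureUnion.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
])

clf = Pipeline([
    ("features", features),
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tiny illustrative fit; real training uses the notebook's splits.
texts = ["free prize click now", "win cash now", "lunch at noon?", "see you tomorrow"]
labels = [1, 1, 0, 0]  # spam=1, ham=0
clf.fit(texts, labels)
```

Swapping `LogisticRegression` for `LinearSVC(class_weight="balanced")` in the last pipeline step gives the second baseline.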
### Transformer Fine-Tuning (DistilBERT)

- Base model: `distilbert-base-uncased`
- Tokenization: `MAXLEN = 128`
- Training (`Trainer`):
  - Learning rate: `2e-5`
  - Train batch size: 16
  - Eval batch size: 32
  - Epochs: 5
  - Best model selected by F1
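The hyperparameters above map onto a `TrainingArguments` configuration roughly as follows. This is a configuration sketch, not the notebook's exact code; `output_dir` and the tokenization helper are assumptions:

```python
from transformers import AutoTokenizer, TrainingArguments

MAXLEN = 128
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate each Content string to MAXLEN tokens.
    return tokenizer(batch["Content"], truncation=True, max_length=MAXLEN)

args = TrainingArguments(
    output_dir="out",                    # assumption; not stated in the README
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    eval_strategy="epoch",               # `evaluation_strategy` on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",          # best checkpoint selected by F1
)
```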
## Results

### Logistic Regression (TF‑IDF)

- Validation accuracy: ~0.971
- Test accuracy: ~0.946

### LinearSVC (TF‑IDF)

- Validation accuracy: ~0.967
- Test accuracy: ~0.971
## How to Run

```shell
pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn
```