DistilBERT Spam Classification (Binary)

An end-to-end project for spam vs. ham text classification, covering dataset cleaning, classical ML baselines (TF-IDF + linear models), and Transformer fine-tuning with distilbert-base-uncased.

Task

  • Type: Binary text classification
  • Labels: ham (0), spam (1)

Dataset

  • Main columns:
    • Content (training text)
    • SpamHam (label: spam/ham, with some missing NaN entries)
  • Dataset snapshot shown in the notebook:
    • Total rows: 2241
    • spam: 830, ham: 776, NaN: 635
  • Extra feature engineering:
    • URLSource derived from URL patterns (twitter / youtube / other)
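The URLSource feature above can be derived with a small helper like the following. This is a sketch, not the notebook's actual code: the exact URL patterns (and whether `t.co` / `youtu.be` short links are included) are assumptions.

```python
import re
import pandas as pd

def url_source(url):
    """Map a URL to a coarse source category: twitter / youtube / other (patterns assumed)."""
    if not isinstance(url, str):
        return "other"
    if re.search(r"(twitter\.com|t\.co)", url, re.IGNORECASE):
        return "twitter"
    if re.search(r"(youtube\.com|youtu\.be)", url, re.IGNORECASE):
        return "youtube"
    return "other"

# Example on a toy frame with a hypothetical URL column.
df = pd.DataFrame({"URL": ["https://twitter.com/a/status/1",
                           "https://youtu.be/x",
                           "https://example.com/page",
                           None]})
df["URLSource"] = df["URL"].apply(url_source)
```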

Preprocessing

  • Fill missing Content values from TweetText when available.
  • Drop unused columns (e.g., Time, Date2, Date, Author, URL, TweetText).
  • Stratified split:
    • Test split: 15%
    • Validation split: 15% of the remaining train/val set
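The preprocessing steps above can be sketched as follows. The toy DataFrame and the `random_state` are assumptions for illustration; only the column names, the ham=0/spam=1 mapping, and the 15%/15% stratified splits come from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame mirroring the notebook's columns (contents are made up).
content = [f"free prize {i}" for i in range(10)] + [f"lunch at {i}" for i in range(10)]
tweet = [None] * 20
content[0], tweet[0] = None, "limited offer, click here"
df = pd.DataFrame({
    "Content": content,
    "TweetText": tweet,
    "SpamHam": ["spam"] * 10 + ["ham"] * 10,
})

# 1) Fill missing Content from TweetText, then drop the unused column(s).
df["Content"] = df["Content"].fillna(df["TweetText"])
df = df.drop(columns=["TweetText"])

# 2) Drop rows with no label and map ham -> 0, spam -> 1.
df = df.dropna(subset=["SpamHam"])
df["label"] = (df["SpamHam"] == "spam").astype(int)

# 3) Stratified 15% test split, then 15% of the remainder as validation.
train_val, test = train_test_split(
    df, test_size=0.15, stratify=df["label"], random_state=42)
train, val = train_test_split(
    train_val, test_size=0.15, stratify=train_val["label"], random_state=42)
```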

Models

Classical ML Baselines

  • Features:
    • Word TF-IDF n-grams: (1, 2)
    • Character TF-IDF n-grams: (3, 5)
    • Combined with FeatureUnion
  • Algorithms:
    • Logistic Regression (class_weight="balanced")
    • LinearSVC (class_weight="balanced")
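A minimal sketch of the baseline pipeline, assuming the n-gram ranges and class weights listed above. The `char_wb` analyzer and `max_iter=1000` are assumptions (the notebook may use the plain `char` analyzer and defaults); the toy corpus is made up.

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Word (1,2)-gram and character (3,5)-gram TF-IDF, combined via FeatureUnion.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
])

clf = Pipeline([
    ("features", features),
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tiny illustrative corpus; spam=1, ham=0.
X = ["win a free prize now", "claim your reward today",
     "see you at lunch", "meeting moved to 3pm"]
y = [1, 1, 0, 0]
clf.fit(X, y)
pred = clf.predict(["free reward now"])
```

Swapping `LogisticRegression` for `LinearSVC(class_weight="balanced")` gives the second baseline with the same feature union.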

Transformer Fine-Tuning (DistilBERT)

  • Base model: distilbert-base-uncased
  • Tokenization:
    • MAXLEN = 128
  • Training (Trainer):
    • Learning rate: 2e-5
    • Train batch size: 16
    • Eval batch size: 32
    • Epochs: 5
    • Best model selected by F1
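Selecting the best checkpoint by F1 works by passing a metric function to the Trainer and setting `metric_for_best_model="f1"` with `load_best_model_at_end=True` in `TrainingArguments`. A sketch of such a metric function is below; the exact implementation in the notebook may differ, and the fake logits are purely illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Metric function in the (logits, labels) shape the HF Trainer passes in.

    The returned "f1" key is what metric_for_best_model="f1" would select on.
    """
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),  # binary F1, spam = positive class
    }

# Shape check with fake logits for 4 examples and 2 classes.
logits = np.array([[0.2, 0.8], [0.9, 0.1], [0.1, 0.9], [0.7, 0.3]])
labels = np.array([1, 0, 1, 1])
metrics = compute_metrics((logits, labels))
```

With the hyperparameters above, the corresponding `TrainingArguments` would set `learning_rate=2e-5`, `per_device_train_batch_size=16`, `per_device_eval_batch_size=32`, and `num_train_epochs=5`, and texts would be tokenized with `truncation=True, max_length=128`.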

Results

Logistic Regression (TF‑IDF)

  • Validation accuracy: ~0.971
  • Test accuracy: ~0.946

LinearSVC (TF‑IDF)

  • Validation accuracy: ~0.967
  • Test accuracy: ~0.971

How to Run

```shell
pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn
```
Model Size

  • 67M parameters (F32 tensors, Safetensors format)