# DistilBERT Spam Classification (Binary)

An end-to-end project for spam vs. ham text classification, including dataset cleaning, classical ML baselines (TF‑IDF + linear models), and Transformer fine‑tuning using `distilbert-base-uncased`.
## Task

- Type: Text classification
- Labels: `ham` (0), `spam` (1)
## Dataset

- Main columns:
  - `Content` (training text)
  - `SpamHam` (label: spam/ham, with some missing `NaN` entries)
- Dataset snapshot shown in the notebook:
  - Total rows: 2241
  - `spam`: 830, `ham`: 776, `NaN`: 635
- Extra feature engineering:
  - `URLSource` derived from URL patterns (twitter / youtube / other)
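The `URLSource` feature could be derived with simple pattern matching. The notebook's exact patterns are not shown, so the regexes below (and the helper name `url_source`) are illustrative assumptions:

```python
import re

def url_source(text: str) -> str:
    """Classify which site a URL in the text points to.
    Sketch only: the real notebook's patterns are assumptions here."""
    if re.search(r"twitter\.com|t\.co", text, re.IGNORECASE):
        return "twitter"
    if re.search(r"youtube\.com|youtu\.be", text, re.IGNORECASE):
        return "youtube"
    return "other"
```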
## Preprocessing

- Fill missing `Content` values using `TweetText` when available.
- Remove unused columns (e.g., `Time`, `Date2`, `Date`, `Author`, `URL`, `TweetText`).
- Stratified split:
  - Test split: 15%
  - Validation split: 15% of the remaining train/val set
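The preprocessing steps above can be sketched with pandas and scikit-learn. The toy DataFrame stands in for the real dataset; only the column names come from the README, and the random seed is an assumption:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (column names from the README).
content = [f"spam message {i}" for i in range(10)] + [f"ham message {i}" for i in range(10)]
content[0] = None  # simulate a missing Content entry
tweet = [None] * 20
tweet[0] = "claim your free prize"
df = pd.DataFrame({
    "Content": content,
    "TweetText": tweet,
    "SpamHam": ["spam"] * 10 + ["ham"] * 10,
})

# Fill missing Content from TweetText, then drop the unused column.
df["Content"] = df["Content"].fillna(df["TweetText"])
df = df.drop(columns=["TweetText"])

# Stratified 15% test split, then 15% of the remainder for validation.
train_val, test = train_test_split(
    df, test_size=0.15, stratify=df["SpamHam"], random_state=42)
train, val = train_test_split(
    train_val, test_size=0.15, stratify=train_val["SpamHam"], random_state=42)
```

Stratifying on `SpamHam` keeps the spam/ham ratio roughly equal across the three splits.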
## Models

### Classical ML Baselines

- Features:
  - Word TF‑IDF n‑grams: (1, 2)
  - Character TF‑IDF n‑grams: (3, 5)
  - Combined with `FeatureUnion`
- Algorithms:
  - Logistic Regression (`class_weight="balanced"`)
  - LinearSVC (`class_weight="balanced"`)
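The baseline feature setup can be sketched as a scikit-learn pipeline. The n‑gram ranges and `class_weight` come from the README; the `analyzer` choices, `max_iter`, and the toy texts are assumptions:

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Word- and character-level TF-IDF features, concatenated by FeatureUnion.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
])

clf = Pipeline([
    ("features", features),
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tiny illustrative fit; real training uses the notebook's splits.
texts = ["free prize click now", "win cash now", "lunch at noon?", "see you tomorrow"]
labels = [1, 1, 0, 0]  # spam=1, ham=0
clf.fit(texts, labels)
```

Swapping `LogisticRegression` for `LinearSVC(class_weight="balanced")` in the last pipeline step gives the second baseline.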
### Transformer Fine-Tuning (DistilBERT)

- Base model: `distilbert-base-uncased`
- Tokenization: `MAXLEN = 128`
- Training (`Trainer`):
  - Learning rate: `2e-5`
  - Train batch size: 16
  - Eval batch size: 32
  - Epochs: 5
  - Best model selected by F1
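The hyperparameters above map onto a `TrainingArguments` configuration roughly as follows. This is a configuration sketch, not the notebook's exact code; `output_dir` and the tokenization helper are assumptions:

```python
from transformers import AutoTokenizer, TrainingArguments

MAXLEN = 128
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate each Content string to MAXLEN tokens.
    return tokenizer(batch["Content"], truncation=True, max_length=MAXLEN)

args = TrainingArguments(
    output_dir="out",                    # assumption; not stated in the README
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    eval_strategy="epoch",               # `evaluation_strategy` on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",          # best checkpoint selected by F1
)
```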
## Results

### Logistic Regression (TF‑IDF)

- Validation accuracy: ~0.971
- Test accuracy: ~0.946

### LinearSVC (TF‑IDF)

- Validation accuracy: ~0.967
- Test accuracy: ~0.971
## How to Run

```shell
pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn
```