---
language: en
license: mit
pipeline_tag: text-classification
library_name: transformers
tags:
- spam-detection
- text-classification
- distilbert
- nlp
---

# DistilBERT Spam Classification (Binary)

An end-to-end project for **spam vs. ham** text classification, covering dataset cleaning, classical ML baselines (TF-IDF + linear models), and Transformer fine-tuning with `distilbert-base-uncased`.

## Task

- **Type:** Text classification
- **Labels:** `ham` (0), `spam` (1)

## Dataset

- Main columns:
  - `Content` (training text)
  - `SpamHam` (label: spam/ham, with some missing `NaN` entries)
- Dataset snapshot from the notebook:
  - Total rows: **2241**
  - `spam`: **830**, `ham`: **776**, `NaN`: **635**
- Extra feature engineering:
  - `URLSource` derived from URL patterns (twitter / youtube / other)

## Preprocessing

- Fill missing `Content` values from `TweetText` when available.
- Drop unused columns (e.g., `Time`, `Date2`, `Date`, `Author`, `URL`, `TweetText`).
- Stratified split:
  - Test split: **15%**
  - Validation split: **15%** of the remaining train/validation pool

## Models

### Classical ML Baselines

- Features:
  - Word TF-IDF n-grams: (1, 2)
  - Character TF-IDF n-grams: (3, 5)
  - Combined with `FeatureUnion`
- Algorithms:
  - Logistic Regression (`class_weight="balanced"`)
  - LinearSVC (`class_weight="balanced"`)

### Transformer Fine-Tuning (DistilBERT)

- Base model: `distilbert-base-uncased`
- Tokenization:
  - `MAXLEN = 128`
- Training (Hugging Face `Trainer`):
  - Learning rate: `2e-5`
  - Train batch size: `16`
  - Eval batch size: `32`
  - Epochs: `5`
  - Best model selected by **F1**

## Results

### Logistic Regression (TF-IDF)

- Validation accuracy: **~0.971**
- Test accuracy: **~0.946**

### LinearSVC (TF-IDF)

- Validation accuracy: **~0.967**
- Test accuracy: **~0.971**

## How to Run

```bash
pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn
```
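The preprocessing steps described above (backfilling `Content` from `TweetText`, then dropping unused columns) can be sketched with pandas; the toy DataFrame here is illustrative, not the real dataset:

```python
import pandas as pd

# Toy frame mimicking the dataset's columns (values are illustrative).
df = pd.DataFrame({
    "Content": ["Win a free prize now!", None, "See you at lunch"],
    "TweetText": ["Win a free prize now!", "Click the link", None],
    "SpamHam": ["spam", "spam", "ham"],
    "Time": ["12:00", "13:00", "14:00"],
    "Author": ["a", "b", "c"],
})

# 1) Fill missing Content from TweetText when available (aligned by index).
df["Content"] = df["Content"].fillna(df["TweetText"])

# 2) Drop unused columns; errors="ignore" skips any that are absent here.
df = df.drop(
    columns=["Time", "Date2", "Date", "Author", "URL", "TweetText"],
    errors="ignore",
)

print(df.columns.tolist())
```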
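The two-stage stratified split above (15% held out for test, then 15% of the remainder for validation) can be sketched with scikit-learn; the labeled texts below are synthetic placeholders:

```python
from sklearn.model_selection import train_test_split

# Synthetic labeled texts (0 = ham, 1 = spam), just to demonstrate the split.
texts = [f"message {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# 1) Hold out 15% as the test set, stratified on the labels.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=42
)

# 2) Carve 15% of the remaining pool out as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, stratify=y_trainval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```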
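A minimal version of the TF-IDF baseline (word 1–2-grams plus character 3–5-grams combined via `FeatureUnion`, feeding a balanced logistic regression) might look like the sketch below; the four-sentence corpus is illustrative, and the exact vectorizer settings in the notebook may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Word- and character-level TF-IDF features, concatenated side by side.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
])

clf = Pipeline([
    ("features", features),
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tiny illustrative corpus (0 = ham, 1 = spam).
X = ["free prize click now", "win cash now", "lunch at noon?", "see you tomorrow"]
y = [1, 1, 0, 0]
clf.fit(X, y)

pred = clf.predict(["claim your free cash prize"])
print(pred)
```

The same pipeline structure works for the `LinearSVC` variant by swapping the final estimator.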
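The fine-tuning configuration listed above can be sketched with the Hugging Face `Trainer`. This is a config sketch, not a verbatim copy of the notebook: `train_ds` / `val_ds` are placeholders for tokenized datasets, and a `compute_metrics` function returning an `"f1"` key is assumed so that best-model selection by F1 works:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MAXLEN = 128
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate each text to MAXLEN subword tokens.
    return tokenizer(batch["Content"], truncation=True, max_length=MAXLEN)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="distilbert-spam",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    eval_strategy="epoch",       # named `evaluation_strategy` on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # best checkpoint selected by F1
)

# `train_ds`, `val_ds`, and `compute_metrics` are assumed to be defined.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,
                  tokenizer=tokenizer, compute_metrics=compute_metrics)
# trainer.train()
```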