Abdallah-Thuieb committed on
Commit b66bfeb · verified · 1 Parent(s): 13ddc4a

Update README.md

Files changed (1):
  1. README.md (+59 -94)
README.md CHANGED
@@ -1,115 +1,80 @@
- DistilBERT Spam Classification (Binary)
- An end-to-end project for spam vs ham text classification, including dataset cleaning, classical ML baselines (TF‑IDF + linear models), and Transformer fine‑tuning using distilbert-base-uncased.
-
- Task
- Type: Text classification
- Labels: ham (0), spam (1)
-
- Dataset
- The notebook works with a dataset containing social posts and labels:
- Main columns:
- Content (training text)
- SpamHam (label: spam/ham, with some missing NaN entries)
- Total rows: 2241
- spam: 830, ham: 776, NaN: 635
- Extra feature engineering:
- URLSource derived from URL patterns (e.g., twitter / youtube / other)
-
- Preprocessing
- Missing Content values are filled using TweetText when available.
- Unused columns are removed (e.g., Time, Date2, Date, Author, URL, TweetText).
- Train/Val/Test split (stratified):
- Test split: 15%
- Validation split: 15% from the remaining train/val set
-
- Models
- Classical ML Baselines
- Features:
- Word TF‑IDF n‑grams: (1, 2)
- Character TF‑IDF n‑grams: (3, 5)
- Combined with FeatureUnion
- Algorithms:
- Logistic Regression (class_weight="balanced")
- LinearSVC (class_weight="balanced")
-
- Transformer Fine-Tuning (DistilBERT)
- Base model: distilbert-base-uncased
- Tokenization:
- MAXLEN = 128
- Training setup (Trainer):
- Learning rate: 2e-5
- Train batch size: 16
- Eval batch size: 32
- Epochs: 5
- Best model selected by F1
-
- Results (from the notebook)
- Logistic Regression (TF‑IDF)
- Validation accuracy: ~0.971
- Test accuracy: ~0.946
- LinearSVC (TF‑IDF)
- Validation accuracy: ~0.967
- Test accuracy: ~0.971
-
- How to Run
- Install dependencies:
- bash
  pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn
- Then run the notebook end-to-end to reproduce preprocessing, training, evaluation, and model export.
-
- Saved Artifacts
- The notebook saves the fine-tuned model and tokenizer files (e.g., model.safetensors, config.json, tokenizer.json, tokenizer_config.json, vocab.txt).
-
- Notes / Limitations
- Performance depends on the dataset distribution and labeling quality; some rows in the source file are unlabeled (NaN) and are excluded from supervised training.
 
+ ---
+ language: en
+ license: mit
+ pipeline_tag: text-classification
+ library_name: transformers
+ tags:
+ - spam-detection
+ - text-classification
+ - distilbert
+ - nlp
+ ---
 
+ # DistilBERT Spam Classification (Binary)
+
+ An end-to-end project for **spam vs ham** text classification, including dataset cleaning, classical ML baselines (TF‑IDF + linear models), and Transformer fine‑tuning using `distilbert-base-uncased`.
 
+ ## Task
+
+ - **Type:** Text classification
+ - **Labels:** `ham` (0), `spam` (1)
 
+ ## Dataset
+
+ - Main columns:
+   - `Content` (training text)
+   - `SpamHam` (label: spam/ham, with some missing `NaN` entries)
+ - Dataset snapshot shown in the notebook:
+   - Total rows: **2241**
+   - `spam`: **830**, `ham`: **776**, `NaN`: **635**
+ - Extra feature engineering:
+   - `URLSource` derived from URL patterns (twitter / youtube / other)
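The `URLSource` derivation described above can be sketched with a small helper. The notebook's actual URL patterns are not shown in this diff, so the regexes below are illustrative assumptions:

```python
import re

def url_source(url):
    """Map a URL to a coarse source label: twitter / youtube / other.

    The specific patterns here are assumptions for illustration, not the
    notebook's actual implementation.
    """
    if not isinstance(url, str):
        return "other"  # NaN / missing URLs fall through to "other"
    if re.search(r"(?:twitter\.com|t\.co)/", url):
        return "twitter"
    if re.search(r"(?:youtube\.com|youtu\.be)/", url):
        return "youtube"
    return "other"
```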
 
+ ## Preprocessing
+
+ - Fill missing `Content` values using `TweetText` when available.
+ - Remove unused columns (e.g., `Time`, `Date2`, `Date`, `Author`, `URL`, `TweetText`).
+ - Stratified split:
+   - Test split: **15%**
+   - Validation split: **15%** from the remaining train/val set
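The fill-then-split flow above can be sketched as follows. This is a minimal illustration on toy data: the column names come from the README, while the toy rows and `random_state` are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset, using the column names above.
df = pd.DataFrame({
    "Content":   [None, "win a prize now", "hello friend", "free $$$ offer",
                  "see you at 5", "claim your reward", "lunch today?",
                  "urgent: act now", "ok thanks", "cheap meds here"],
    "TweetText": ["check this out"] + [None] * 9,
    "SpamHam":   ["ham", "spam", "ham", "spam", "ham",
                  "spam", "ham", "spam", "ham", "spam"],
})

# Fill missing Content from TweetText, then drop the helper column.
df["Content"] = df["Content"].fillna(df["TweetText"])
df = df.drop(columns=["TweetText"])

# 15% held out for test, then 15% of the remainder for validation,
# both stratified on the label.
train_val, test = train_test_split(
    df, test_size=0.15, stratify=df["SpamHam"], random_state=42)
train, val = train_test_split(
    train_val, test_size=0.15, stratify=train_val["SpamHam"], random_state=42)
```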
 
+ ## Models
+
+ ### Classical ML Baselines
+
+ - Features:
+   - Word TF‑IDF n‑grams: (1, 2)
+   - Character TF‑IDF n‑grams: (3, 5)
+   - Combined with `FeatureUnion`
+ - Algorithms:
+   - Logistic Regression (`class_weight="balanced"`)
+   - LinearSVC (`class_weight="balanced"`)
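The baseline above can be sketched as a scikit-learn pipeline. Only the n‑gram ranges and `class_weight` come from the README; other settings (e.g. `char_wb` vs `char`, `max_iter`) are assumptions:

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Word (1,2)-gram and character (3,5)-gram TF-IDF features, combined.
features = FeatureUnion([
    ("word_tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char_tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
])

baseline = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tiny illustrative corpus (1 = spam, 0 = ham).
texts = ["free prize click now", "win money fast today",
         "see you at lunch", "meeting moved to noon"]
labels = [1, 1, 0, 0]
baseline.fit(texts, labels)
```

Swapping `LogisticRegression` for `LinearSVC` gives the second baseline with the same feature union.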
 
+ ### Transformer Fine-Tuning (DistilBERT)
+
+ - Base model: `distilbert-base-uncased`
+ - Tokenization:
+   - `MAXLEN = 128`
+ - Training (`Trainer`):
+   - Learning rate: `2e-5`
+   - Train batch size: `16`
+   - Eval batch size: `32`
+   - Epochs: `5`
+   - Best model selected by **F1**
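Selecting the best checkpoint by F1 implies a metrics hook passed to `Trainer` via `compute_metrics`. A minimal sketch (the notebook's exact implementation is not shown, so the function name and metric choices are assumptions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Trainer-style metrics hook: argmax over logits, then accuracy and F1.

    With metric_for_best_model="f1" and load_best_model_at_end=True, the
    "f1" key returned here is what picks the best checkpoint.
    """
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),  # binary F1 on the positive (spam) class
    }
```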
 
+ ## Results
+
+ ### Logistic Regression (TF‑IDF)
+
+ - Validation accuracy: **~0.971**
+ - Test accuracy: **~0.946**
+
+ ### LinearSVC (TF‑IDF)
+
+ - Validation accuracy: **~0.967**
+ - Test accuracy: **~0.971**
 
+ ## How to Run
+
+ ```bash
  pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn