---
language: en
license: mit
pipeline_tag: text-classification
library_name: transformers
tags:
- spam-detection
- text-classification
- distilbert
- nlp
---

# DistilBERT Spam Classification (Binary)

An end-to-end project for **spam vs ham** text classification: dataset cleaning, classical ML baselines (TF‑IDF + linear models), and Transformer fine‑tuning with `distilbert-base-uncased`.

## Task

- **Type:** Binary text classification
- **Labels:** `ham` (0), `spam` (1)

## Dataset

- Main columns:
  - `Content` (training text)
  - `SpamHam` (label: `spam`/`ham`, with some missing `NaN` entries)
- Dataset snapshot shown in the notebook:
  - Total rows: **2241**
  - `spam`: **830**, `ham`: **776**, `NaN`: **635**
- Extra feature engineering:
  - `URLSource` derived from URL patterns (twitter / youtube / other)
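
The `URLSource` derivation could look like the following minimal sketch. The helper name `url_source` and the exact domain patterns are illustrative assumptions, not taken from the notebook:

```python
def url_source(url):
    """Classify a URL into twitter / youtube / other by domain pattern.

    Hypothetical helper; the notebook's actual pattern rules may differ.
    """
    if not isinstance(url, str):  # NaN / missing URLs fall through to "other"
        return "other"
    u = url.lower()
    if "twitter.com" in u:
        return "twitter"
    if "youtube.com" in u or "youtu.be" in u:
        return "youtube"
    return "other"
```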

## Preprocessing

- Fill missing `Content` values using `TweetText` when available.
- Remove unused columns (e.g., `Time`, `Date2`, `Date`, `Author`, `URL`, `TweetText`).
- Stratified split:
  - Test split: **15%**
  - Validation split: **15%** from the remaining train/val set
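
The preprocessing steps above can be sketched with pandas and scikit-learn. This is a minimal sketch, not the notebook's exact code; the function name `make_splits` and the random seed are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def make_splits(df: pd.DataFrame, seed: int = 42):
    """Fill Content from TweetText, drop unused columns, return stratified splits."""
    df = df.copy()
    # Fill missing Content values from TweetText when available.
    df["Content"] = df["Content"].fillna(df["TweetText"])
    # Remove unused columns.
    df = df.drop(
        columns=["Time", "Date2", "Date", "Author", "URL", "TweetText"],
        errors="ignore",
    )
    # Drop rows still missing text or a label.
    df = df.dropna(subset=["Content", "SpamHam"])
    # 15% test, then 15% of the remaining train/val set for validation,
    # both stratified by label.
    trainval, test = train_test_split(
        df, test_size=0.15, stratify=df["SpamHam"], random_state=seed
    )
    train, val = train_test_split(
        trainval, test_size=0.15, stratify=trainval["SpamHam"], random_state=seed
    )
    return train, val, test
```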

## Models

### Classical ML Baselines

- Features:
  - Word TF‑IDF n‑grams: (1, 2)
  - Character TF‑IDF n‑grams: (3, 5)
  - Combined with `FeatureUnion`
- Algorithms:
  - Logistic Regression (`class_weight="balanced"`)
  - LinearSVC (`class_weight="balanced"`)
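
A minimal sketch of such a baseline pipeline, assuming a `make_baseline` factory (the function name and the `char` analyzer choice are assumptions; the notebook may use `char_wb` or different vectorizer options):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

def make_baseline(algorithm: str = "logreg") -> Pipeline:
    """TF-IDF (word 1-2 grams + char 3-5 grams) feeding a balanced linear model."""
    features = FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ])
    if algorithm == "logreg":
        clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    else:
        clf = LinearSVC(class_weight="balanced")
    return Pipeline([("features", features), ("clf", clf)])
```

Usage: `make_baseline().fit(texts, labels)` then `.predict(new_texts)`.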

### Transformer Fine-Tuning (DistilBERT)

- Base model: `distilbert-base-uncased`
- Tokenization:
  - `MAXLEN = 128`
- Training (`Trainer`):
  - Learning rate: `2e-5`
  - Train batch size: `16`
  - Eval batch size: `32`
  - Epochs: `5`
  - Best model selected by **F1**
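
A sketch of how these hyperparameters could wire into the `Trainer` API. Only the hyperparameter values come from the notebook; the function name `fine_tune`, the `output_dir`, the epoch-level eval/save strategy, and the `Content` column name in `tokenize` are assumptions:

```python
MAXLEN = 128  # max token length at tokenization time

# Hyperparameters as reported above.
TRAIN_CONFIG = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 5,
    "metric_for_best_model": "f1",  # best checkpoint selected by F1
}

def fine_tune(train_ds, val_ds, model_name="distilbert-base-uncased"):
    """Sketch of the fine-tuning loop; requires transformers/datasets installed."""
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2  # ham (0), spam (1)
    )
    args = TrainingArguments(
        output_dir="distilbert-spam",  # placeholder path
        eval_strategy="epoch",   # "evaluation_strategy" in older transformers
        save_strategy="epoch",
        load_best_model_at_end=True,
        **TRAIN_CONFIG,
    )

    def tokenize(batch):
        return tokenizer(batch["Content"], truncation=True, max_length=MAXLEN)

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds.map(tokenize, batched=True),
        eval_dataset=val_ds.map(tokenize, batched=True),
    )
    trainer.train()
    return trainer
```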

## Results

### Logistic Regression (TF‑IDF)

- Validation accuracy: **~0.971**
- Test accuracy: **~0.946**

### LinearSVC (TF‑IDF)

- Validation accuracy: **~0.967**
- Test accuracy: **~0.971**

## How to Run

```bash
pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn
```