---
language: en
license: mit
pipeline_tag: text-classification
library_name: transformers
tags:
- spam-detection
- text-classification
- distilbert
- nlp
---

# DistilBERT Spam Classification (Binary)

An end-to-end project for **spam vs ham** text classification, including dataset cleaning, classical ML baselines (TF-IDF + linear models), and Transformer fine-tuning using `distilbert-base-uncased`.

## Task

- **Type:** Text classification
- **Labels:** `ham` (0), `spam` (1)

## Dataset

- Main columns:
  - `Content` (training text)
  - `SpamHam` (label: spam/ham, with some missing `NaN` entries)
- Dataset snapshot shown in the notebook:
  - Total rows: **2241**
  - `spam`: **830**, `ham`: **776**, `NaN`: **635**
- Extra feature engineering:
  - `URLSource` derived from URL patterns (twitter / youtube / other)

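The `URLSource` bucketing can be sketched as a simple pattern check; the function name and exact patterns below are assumptions for illustration, not the notebook's code:

```python
import re

def url_source(url):
    """Bucket a URL into a coarse source: twitter / youtube / other."""
    if not isinstance(url, str):
        return "other"  # missing/NaN URLs fall back to "other"
    u = url.lower()
    if re.search(r"twitter\.com|t\.co/", u):
        return "twitter"
    if re.search(r"youtube\.com|youtu\.be", u):
        return "youtube"
    return "other"

print(url_source("https://twitter.com/user/status/1"))  # twitter
```
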
## Preprocessing

- Fill missing `Content` values using `TweetText` when available.
- Remove unused columns (e.g., `Time`, `Date2`, `Date`, `Author`, `URL`, `TweetText`).
- Stratified split:
  - Test split: **15%**
  - Validation split: **15%** of the remaining train/val set

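The two-stage stratified split can be sketched with scikit-learn on toy data; variable names and the toy corpus are assumptions, and the notebook's exact call may differ:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the labeled rows (Content text + 0/1 labels).
texts = [f"msg {i}" for i in range(40)]
labels = [i % 2 for i in range(40)]  # balanced ham/spam for the sketch

# 15% held out as the test set, stratified on the label.
X_trval, X_test, y_trval, y_test = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=42
)
# 15% of the remaining train/val pool becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trval, y_trval, test_size=0.15, stratify=y_trval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```
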
## Models

### Classical ML Baselines

- Features:
  - Word TF-IDF n-grams: (1, 2)
  - Character TF-IDF n-grams: (3, 5)
  - Combined with `FeatureUnion`
- Algorithms:
  - Logistic Regression (`class_weight="balanced"`)
  - LinearSVC (`class_weight="balanced"`)

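A minimal sketch of the baseline feature/classifier setup on a toy corpus; any parameter not listed above (e.g., `max_iter`, the toy data) is an assumption:

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Word + character TF-IDF features, combined with FeatureUnion.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
])

clf = Pipeline([
    ("features", features),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tiny toy corpus just to show the pipeline runs end-to-end.
X = ["win a free prize now", "click this link to claim cash",
     "see you at lunch", "meeting moved to 3pm"]
y = [1, 1, 0, 0]  # 1 = spam, 0 = ham
clf.fit(X, y)
print(clf.predict(["free cash prize"]))
```

Swapping `LogisticRegression` for `LinearSVC` in the `"model"` step gives the second baseline.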
### Transformer Fine-Tuning (DistilBERT)

- Base model: `distilbert-base-uncased`
- Tokenization: `MAXLEN = 128`
- Training (`Trainer`):
  - Learning rate: `2e-5`
  - Train batch size: `16`
  - Eval batch size: `32`
  - Epochs: `5`
  - Best model selected by **F1**

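The Trainer configuration above translates roughly to the following sketch; the output path is a placeholder, the strategy arguments are assumptions, and selecting by F1 additionally requires a `compute_metrics` function that returns an `"f1"` key:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                  # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    eval_strategy="epoch",             # evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",        # best checkpoint selected by F1
)
```
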
## Results

### Logistic Regression (TF-IDF)

- Validation accuracy: **~0.971**
- Test accuracy: **~0.946**

### LinearSVC (TF-IDF)

- Validation accuracy: **~0.967**
- Test accuracy: **~0.971**

## How to Run

Install dependencies:

```bash
pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn
```

Then run the notebook end-to-end to reproduce preprocessing, training, evaluation, and model export.

## Saved Artifacts

The notebook saves the fine-tuned model and tokenizer files (e.g., `model.safetensors`, `config.json`, `tokenizer.json`, `tokenizer_config.json`, `vocab.txt`).

## Notes / Limitations

- Performance depends on the dataset distribution and labeling quality; some rows in the source file are unlabeled (`NaN`) and are excluded from supervised training.