Abdallah-Thuieb committed on
Commit b66bfeb · verified · 1 Parent(s): 13ddc4a

Update README.md

Files changed (1):
  1. README.md (+59 -94)
README.md CHANGED
@@ -1,115 +1,80 @@
- DistilBERT Spam Classification (Binary)
- An end-to-end project for spam vs ham text classification, including dataset cleaning, classical ML baselines (TF‑IDF + linear models), and Transformer fine‑tuning using distilbert-base-uncased.
-
- Task
- Type: Text classification
- Labels: ham (0), spam (1)
-
- Dataset
- The notebook works with a dataset containing social posts and labels:
- Main columns:
- Content (training text)
- SpamHam (label: spam/ham, with some missing NaN entries)
- Total rows: 2241
- spam: 830, ham: 776, NaN: 635
- Extra feature engineering:
- URLSource derived from URL patterns (e.g., twitter / youtube / other)
-
- Preprocessing
- Missing Content values are filled using TweetText when available.
- Unused columns are removed (e.g., Time, Date2, Date, Author, URL, TweetText).
- Train/Val/Test split (stratified):
- Test split: 15%
- Validation split: 15% from the remaining train/val set
-
- Models
- Classical ML Baselines
- Features:
- Word TF‑IDF n‑grams: (1, 2)
- Character TF‑IDF n‑grams: (3, 5)
- Combined with FeatureUnion
- Algorithms:
- Logistic Regression (class_weight="balanced")
- LinearSVC (class_weight="balanced")
-
- Transformer Fine-Tuning (DistilBERT)
- Base model: distilbert-base-uncased
- Tokenization:
- MAXLEN = 128
- Training setup (Trainer):
- Learning rate: 2e-5
- Train batch size: 16
- Eval batch size: 32
- Epochs: 5
- Best model selected by F1
-
- Results (from the notebook)
- Logistic Regression (TF‑IDF)
- Validation accuracy: ~0.971
- Test accuracy: ~0.946
- LinearSVC (TF‑IDF)
- Validation accuracy: ~0.967
- Test accuracy: ~0.971
-
- How to Run
- Install dependencies:
- bash
  pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn
- Then run the notebook end-to-end to reproduce preprocessing, training, evaluation, and model export.
-
- Saved Artifacts
- The notebook saves the fine-tuned model and tokenizer files (e.g., model.safetensors, config.json, tokenizer.json, tokenizer_config.json, vocab.txt).
-
- Notes / Limitations
- Performance depends on the dataset distribution and labeling quality; some rows in the source file are unlabeled (NaN) and are excluded from supervised training.
 
+ ---
+ language: en
+ license: mit
+ pipeline_tag: text-classification
+ library_name: transformers
+ tags:
+ - spam-detection
+ - text-classification
+ - distilbert
+ - nlp
+ ---
 
+ # DistilBERT Spam Classification (Binary)
+
+ An end-to-end project for **spam vs ham** text classification, including dataset cleaning, classical ML baselines (TF‑IDF + linear models), and Transformer fine‑tuning using `distilbert-base-uncased`.
 
+ ## Task
+
+ - **Type:** Text classification
+ - **Labels:** `ham` (0), `spam` (1)
 
+ ## Dataset
+
+ - Main columns:
+   - `Content` (training text)
+   - `SpamHam` (label: spam/ham, with some missing `NaN` entries)
+ - Dataset snapshot shown in the notebook:
+   - Total rows: **2241**
+   - `spam`: **830**, `ham`: **776**, `NaN`: **635**
+ - Extra feature engineering:
+   - `URLSource` derived from URL patterns (twitter / youtube / other)
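The `URLSource` derivation described above can be sketched with a small helper. The notebook's actual URL patterns are not shown in this diff, so the regexes below are illustrative assumptions:

```python
import re

def url_source(url):
    """Map a URL to a coarse source label: twitter / youtube / other.

    The specific patterns here are assumptions for illustration, not the
    notebook's actual implementation.
    """
    if not isinstance(url, str):
        return "other"  # NaN / missing URLs fall through to "other"
    if re.search(r"(?:twitter\.com|t\.co)/", url):
        return "twitter"
    if re.search(r"(?:youtube\.com|youtu\.be)/", url):
        return "youtube"
    return "other"
```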
 
+ ## Preprocessing
+
+ - Fill missing `Content` values using `TweetText` when available.
+ - Remove unused columns (e.g., `Time`, `Date2`, `Date`, `Author`, `URL`, `TweetText`).
+ - Stratified split:
+   - Test split: **15%**
+   - Validation split: **15%** from the remaining train/val set
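The fill-then-split flow above can be sketched as follows. This is a minimal illustration on toy data: the column names come from the README, while the toy rows and `random_state` are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset, using the column names above.
df = pd.DataFrame({
    "Content":   [None, "win a prize now", "hello friend", "free $$$ offer",
                  "see you at 5", "claim your reward", "lunch today?",
                  "urgent: act now", "ok thanks", "cheap meds here"],
    "TweetText": ["check this out"] + [None] * 9,
    "SpamHam":   ["ham", "spam", "ham", "spam", "ham",
                  "spam", "ham", "spam", "ham", "spam"],
})

# Fill missing Content from TweetText, then drop the helper column.
df["Content"] = df["Content"].fillna(df["TweetText"])
df = df.drop(columns=["TweetText"])

# 15% held out for test, then 15% of the remainder for validation,
# both stratified on the label.
train_val, test = train_test_split(
    df, test_size=0.15, stratify=df["SpamHam"], random_state=42)
train, val = train_test_split(
    train_val, test_size=0.15, stratify=train_val["SpamHam"], random_state=42)
```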
 
+ ## Models
+
+ ### Classical ML Baselines
+
+ - Features:
+   - Word TF‑IDF n‑grams: (1, 2)
+   - Character TF‑IDF n‑grams: (3, 5)
+   - Combined with `FeatureUnion`
+ - Algorithms:
+   - Logistic Regression (`class_weight="balanced"`)
+   - LinearSVC (`class_weight="balanced"`)
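The baseline above can be sketched as a scikit-learn pipeline. Only the n‑gram ranges and `class_weight` come from the README; other settings (e.g. `char_wb` vs `char`, `max_iter`) are assumptions:

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Word (1,2)-gram and character (3,5)-gram TF-IDF features, combined.
features = FeatureUnion([
    ("word_tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char_tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
])

baseline = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tiny illustrative corpus (1 = spam, 0 = ham).
texts = ["free prize click now", "win money fast today",
         "see you at lunch", "meeting moved to noon"]
labels = [1, 1, 0, 0]
baseline.fit(texts, labels)
```

Swapping `LogisticRegression` for `LinearSVC` gives the second baseline with the same feature union.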
 
+ ### Transformer Fine-Tuning (DistilBERT)
+
+ - Base model: `distilbert-base-uncased`
+ - Tokenization:
+   - `MAXLEN = 128`
+ - Training (`Trainer`):
+   - Learning rate: `2e-5`
+   - Train batch size: `16`
+   - Eval batch size: `32`
+   - Epochs: `5`
+   - Best model selected by **F1**
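Selecting the best checkpoint by F1 implies a metrics hook passed to `Trainer` via `compute_metrics`. A minimal sketch (the notebook's exact implementation is not shown, so the function name and metric choices are assumptions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Trainer-style metrics hook: argmax over logits, then accuracy and F1.

    With metric_for_best_model="f1" and load_best_model_at_end=True, the
    "f1" key returned here is what picks the best checkpoint.
    """
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),  # binary F1 on the positive (spam) class
    }
```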
 
+ ## Results
+
+ ### Logistic Regression (TF‑IDF)
+
+ - Validation accuracy: **~0.971**
+ - Test accuracy: **~0.946**
+
+ ### LinearSVC (TF‑IDF)
+
+ - Validation accuracy: **~0.967**
+ - Test accuracy: **~0.971**
 
+ ## How to Run
+
+ ```bash
  pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn