Text Classification
Transformers
Safetensors
English
distilbert
spam-detection
nlp
text-embeddings-inference
How to use Abdallah-Thuieb/distilbert-spam-classification-binary with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Abdallah-Thuieb/distilbert-spam-classification-binary")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Abdallah-Thuieb/distilbert-spam-classification-binary")
model = AutoModelForSequenceClassification.from_pretrained("Abdallah-Thuieb/distilbert-spam-classification-binary")
```
DistilBERT Spam Classification (Binary)
An end-to-end project for spam vs. ham text classification, covering dataset cleaning, classical ML baselines (TF‑IDF + linear models), and Transformer fine‑tuning of `distilbert-base-uncased`.
Task
- Type: Text classification
- Labels: `ham` (0), `spam` (1)
Dataset
- Main columns:
  - `Content` (training text)
  - `SpamHam` (label: spam/ham, with some missing `NaN` entries)
- Dataset snapshot shown in the notebook:
  - Total rows: 2241
  - `spam`: 830, `ham`: 776, `NaN`: 635
- Extra feature engineering:
  - `URLSource` derived from URL patterns (twitter / youtube / other)
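The `URLSource` feature could be derived along the following lines. This is a minimal sketch, not the notebook's code: the helper name `url_source` and the hostname rules (including treating `youtu.be` as YouTube) are assumptions made for illustration.

```python
from urllib.parse import urlparse


def url_source(url):
    """Map a URL to a coarse source bucket: 'twitter', 'youtube', or 'other'.

    Hypothetical helper: the matching rules are assumptions, not the
    notebook's exact patterns.
    """
    if not isinstance(url, str) or not url.strip():
        return "other"  # covers None, NaN, and empty strings
    url = url.strip()
    # urlparse only fills in the host when a scheme (or leading //) is present.
    if "://" not in url:
        url = "//" + url
    host = (urlparse(url).hostname or "").lower()
    if host == "twitter.com" or host.endswith(".twitter.com"):
        return "twitter"
    if host in ("youtube.com", "youtu.be") or host.endswith(".youtube.com"):
        return "youtube"
    return "other"
```

With pandas this would be applied as `df["URLSource"] = df["URL"].map(url_source)`, and it has to run before the `URL` column is dropped during preprocessing.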
Preprocessing
- Fill missing `Content` values using `TweetText` when available.
- Remove unused columns (e.g., `Time`, `Date2`, `Date`, `Author`, `URL`, `TweetText`).
- Stratified split:
  - Test split: 15%
  - Validation split: 15% from the remaining train/val set
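The preprocessing steps above can be sketched with pandas and scikit-learn. Several details are assumptions rather than facts from the notebook: rows with a missing label are dropped before splitting, the validation share is taken as 15% of the remaining train/val rows, and the function names and `seed=42` are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

UNUSED_COLUMNS = ["Time", "Date2", "Date", "Author", "URL", "TweetText"]


def preprocess(df):
    """Fill missing Content from TweetText, then drop the unused columns."""
    df = df.copy()
    if "TweetText" in df.columns:
        df["Content"] = df["Content"].fillna(df["TweetText"])
    return df.drop(columns=UNUSED_COLUMNS, errors="ignore")


def stratified_split(df, label_col="SpamHam", seed=42):
    """Return (train, val, test) with class proportions preserved in each part."""
    # Assumption: rows without text or without a label are not used.
    df = df.dropna(subset=["Content", label_col])
    train_val, test = train_test_split(
        df, test_size=0.15, stratify=df[label_col], random_state=seed
    )
    # Assumption: the validation split is 15% of the remaining train/val rows.
    train, val = train_test_split(
        train_val, test_size=0.15, stratify=train_val[label_col], random_state=seed
    )
    return train, val, test
```

Under these assumptions the final proportions are roughly 72% train, 13% validation, and 15% test of the labeled rows.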
Models
Classical ML Baselines
- Features:
  - Word TF‑IDF n‑grams: (1, 2)
  - Character TF‑IDF n‑grams: (3, 5)
  - Combined with `FeatureUnion`
- Algorithms:
  - Logistic Regression (`class_weight="balanced"`)
  - LinearSVC (`class_weight="balanced"`)
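A baseline of this shape can be assembled in scikit-learn as shown below. The n-gram ranges, `FeatureUnion`, and `class_weight="balanced"` come from the list above; the helper names, the plain `"char"` analyzer (the notebook may use `"char_wb"`), and `max_iter=1000` are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


def make_features():
    """Word (1, 2) and character (3, 5) TF-IDF n-grams, concatenated column-wise."""
    return FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        # Assumption: plain character n-grams; "char_wb" is a common alternative.
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ])


def make_baselines():
    """Return both linear baselines, each wrapped with its own feature extractor."""
    return {
        "logreg": Pipeline([
            ("features", make_features()),
            # max_iter is an assumption, raised to avoid convergence warnings.
            ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
        ]),
        "linearsvc": Pipeline([
            ("features", make_features()),
            ("clf", LinearSVC(class_weight="balanced")),
        ]),
    }
```

Each pipeline is then fitted on the training texts only, for example `make_baselines()["logreg"].fit(train["Content"], train["SpamHam"])`, so that the TF‑IDF vocabulary never sees validation or test rows.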
Transformer Fine-Tuning (DistilBERT)
- Base model: `distilbert-base-uncased`
- Tokenization: `MAXLEN = 128`
- Training (Trainer):
  - Learning rate: `2e-5`
  - Train batch size: `16`
  - Eval batch size: `32`
  - Epochs: `5`
  - Best model selected by F1
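The settings above map onto the Transformers `Trainer` roughly as follows. This is a sketch under stated assumptions, not the notebook's code: `output_dir`, per-epoch evaluation and checkpointing, the `id2label` mapping, and the names `TRAINING_KWARGS` and `build_trainer` are illustrative, and F1 is computed with spam (1) as the positive class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

MODEL_NAME = "distilbert-base-uncased"
MAXLEN = 128

# Keyword arguments for transformers.TrainingArguments.
TRAINING_KWARGS = dict(
    output_dir="distilbert-spam",    # hypothetical path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    eval_strategy="epoch",           # named "evaluation_strategy" in older releases
    save_strategy="epoch",           # must match eval_strategy to reload the best model
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # matches the "f1" key returned below
)


def compute_metrics(eval_pred):
    """Accuracy and binary F1 from (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
    }


def build_trainer(train_ds, val_ds):
    """Assemble a Trainer; datasets must already be tokenized and padded to MAXLEN."""
    # Imported here so the metric code above works without transformers/torch.
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=2,
        id2label={0: "ham", 1: "spam"},
        label2id={"ham": 0, "spam": 1},
    )
    return Trainer(
        model=model,
        args=TrainingArguments(**TRAINING_KWARGS),
        train_dataset=train_ds,
        eval_dataset=val_ds,
        compute_metrics=compute_metrics,
    )
```

Tokenizing with `tokenizer(texts, truncation=True, padding="max_length", max_length=MAXLEN)` keeps every example at a fixed length, so the default collator can batch them without extra configuration.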
Results
Logistic Regression (TF‑IDF)
- Validation accuracy: ~0.971
- Test accuracy: ~0.946
LinearSVC (TF‑IDF)
- Validation accuracy: ~0.967
- Test accuracy: ~0.971
How to Run
pip install -q --upgrade transformers accelerate datasets ipywidgets tqdm scikit-learn