Spam Detection — Arabic (Naive Bayes)

A spam/ham text classifier for Arabic messages, built with a custom Arabic-aware preprocessing pipeline (tatweel/tashkeel stripping, tokenization, stopword removal) and TF-IDF features feeding into a Multinomial Naive Bayes classifier.
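The preprocessing steps named above can be sketched with the standard library alone. This is a minimal illustration, not the model's actual transformer. It swaps pyarabic for regular expressions over the Unicode code points for tashkeel (U+064B to U+0652) and tatweel (U+0640), treats hashtag removal as dropping the `#` symbol, and uses a tiny made-up stopword list. The names `preprocess` and `STOPWORDS` are hypothetical.

```python
import re
from typing import List

# Tashkeel (Arabic diacritics) occupy U+064B..U+0652; tatweel is U+0640.
TASHKEEL = re.compile(r"[\u064B-\u0652]")
TATWEEL = re.compile(r"\u0640")
# Anything that is neither a word character nor whitespace (punctuation, '#', ...).
NON_WORD = re.compile(r"[^\w\s]")

# Illustrative stopword list only (an assumption, not the model's real list).
STOPWORDS = {"في", "من", "على", "إلى", "عن", "هذا", "لقد"}


def preprocess(text: str) -> List[str]:
    """Strip diacritics and elongation, drop punctuation, tokenize, remove stopwords."""
    # Diacritics are removed first so they cannot split a word in the next step.
    text = TASHKEEL.sub("", text)
    text = TATWEEL.sub("", text)
    text = NON_WORD.sub(" ", text)
    tokens = text.split()  # whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("مبروك! لقد ربحت جـــائزة #مجانية"))
```

Stripping tatweel and tashkeel before vectorization means that decorated spellings of the same word map to a single TF-IDF feature.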

Model Details

Architecture: TF-IDF + Multinomial Naive Bayes (scikit-learn Pipeline)
Preprocessing: Custom transformer — hashtag/punctuation removal, tatweel (تطويل) and tashkeel (تشكيل) stripping via pyarabic, tokenization, Arabic stopword removal
Hyperparameters: Smoothing parameter alpha tuned via GridSearchCV
Accuracy: 97.6% on held-out test set
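The architecture listed above can be approximated as follows. This is a sketch under stated assumptions, not the training script behind the published model: the `ArabicCleaner` class is a hypothetical stand-in for the custom transformer (it only strips tashkeel and tatweel with a regular expression instead of pyarabic), the eight messages are toy data, and the alpha grid values are chosen arbitrarily for illustration.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tashkeel (U+064B..U+0652) plus tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


class ArabicCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the model's custom preprocessing transformer."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return [DIACRITICS.sub("", text) for text in X]


# Toy data for illustration only; 1 = spam, 0 = ham.
texts = [
    "اربح جائزة مجانية الآن",
    "عرض خاص اضغط هنا",
    "مبروك ربحت جائزة",
    "اضغط هنا واربح مجانا",
    "هل نلتقي غدا في المكتب",
    "شكرا على المساعدة",
    "سأتصل بك بعد الاجتماع",
    "موعد الاجتماع غدا صباحا",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("clean", ArabicCleaner()),
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Step-name prefix "nb__" routes the parameter to the MultinomialNB step.
param_grid = {"nb__alpha": [0.1, 0.5, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["اضغط هنا واربح جائزة"]))
```

Keeping the cleaner inside the Pipeline means the same preprocessing runs at training and prediction time, so callers can pass raw strings to `predict`.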

Intended Use

Binary spam classification for Arabic text messages/emails. Part of a multilingual spam detection system that automatically routes text to a language-specific model (English or Arabic) based on detected language (via langdetect).
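The routing step described above might look like the sketch below. It is illustrative only: `route`, `detect_language_stub`, and the `models` dictionary are hypothetical names, the two "models" are placeholder callables rather than trained classifiers, and the stub detector (a check for characters in the Arabic Unicode block) stands in for langdetect so the example runs without extra dependencies.

```python
from typing import Callable, Dict, Tuple


def detect_language_stub(text: str) -> str:
    """Stand-in detector: 'ar' if any character is in U+0600..U+06FF, else 'en'."""
    return "ar" if any("\u0600" <= ch <= "\u06FF" for ch in text) else "en"


def route(
    text: str,
    models: Dict[str, Callable[[str], str]],
    detect: Callable[[str], str] = detect_language_stub,
    default: str = "en",
) -> Tuple[str, str]:
    """Pick a language-specific model and return (model key used, model output)."""
    lang = detect(text)
    # Fall back to the default model when the detected language has no model.
    key = lang if lang in models else default
    return key, models[key](text)


# Placeholder callables; a real system would wrap the loaded pipelines here.
models = {
    "ar": lambda text: "handled by Arabic model",
    "en": lambda text: "handled by English model",
}

if __name__ == "__main__":
    print(route("مبروك! لقد ربحت جائزة مجانية", models))
    print(route("Congratulations, you won a free prize", models))
```

To use langdetect instead of the stub, pass its `detect` function as the `detect` argument, for example `route(text, models, detect=langdetect.detect)`.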

How to Use

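import joblib

# Load the serialized scikit-learn Pipeline (preprocessing + TF-IDF + Naive Bayes).
# The custom preprocessing transformer's class must be importable for unpickling to succeed.
model = joblib.load("spam_ar_nb.joblib")

# predict() takes a list of raw strings and returns one label per string.
prediction = model.predict(["مبروك! لقد ربحت جائزة مجانية، اضغط هنا الآن"])
print(prediction)  # 1 = spam, 0 = ham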

Evaluation results

Accuracy on Arabic Spam Dataset (self-reported): 0.976