Spam Detection — Arabic (Naive Bayes)
A spam/ham text classifier for Arabic messages, built with a custom Arabic-aware preprocessing pipeline (tatweel/tashkeel stripping, tokenization, stopword removal) and TF-IDF features feeding into a Multinomial Naive Bayes classifier.
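The preprocessing steps named above can be sketched in plain Python. This is a minimal illustration, not the transformer shipped with the model: it uses regular expressions as a stand-in for pyarabic (whose `strip_tashkeel` and `strip_tatweel` helpers cover the same two stripping steps), and the `preprocess` function and the tiny `STOPWORDS` set are hypothetical names chosen for the example.

```python
import re

# Tashkeel (short-vowel marks) U+064B..U+0652, plus superscript alef U+0670.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")
# Tatweel (kashida) is the elongation character U+0640.
TATWEEL = "\u0640"
# Hypothetical, deliberately tiny stopword list, for illustration only.
STOPWORDS = {"في", "من", "على", "إلى", "عن", "هذا", "لقد"}


def preprocess(text, stopwords=STOPWORDS):
    """Return a list of cleaned tokens for one Arabic message."""
    text = TASHKEEL.sub("", text)         # strip diacritics first
    text = text.replace(TATWEEL, "")      # strip elongation characters
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation, including '#'
    tokens = text.split()                 # whitespace tokenization
    return [t for t in tokens if t not in stopwords]
```

Diacritics are removed before punctuation because combining marks are not word characters, so stripping punctuation first would split a vocalized word into fragments.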
Model Details
- Architecture: TF-IDF + Multinomial Naive Bayes (scikit-learn Pipeline)
- Preprocessing: Custom transformer — hashtag/punctuation removal, tatweel (تطويل) and tashkeel (تشكيل) stripping via pyarabic, tokenization, Arabic stopword removal
- Hyperparameters: Naive Bayes smoothing parameter (alpha) tuned via GridSearchCV
- Accuracy: 97.6% on a held-out test set
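The architecture and tuning setup listed above could look roughly like the following. This is a sketch under stated assumptions rather than the actual training script: the toy `texts` and `labels`, the step names `tfidf` and `nb`, the alpha grid, and `cv=2` are all hypothetical, and the custom Arabic preprocessing step is omitted for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = spam, 0 = ham); the real dataset is not shown here.
texts = [
    "ربحت جائزة مجانية اضغط هنا",
    "عرض خاص اربح المال الآن",
    "جائزة نقدية مجانية اتصل الآن",
    "اضغط هنا للحصول على عرض مجاني",
    "هل نلتقي غدا في المكتب",
    "شكرا على المساعدة اليوم",
    "سأتصل بك بعد الاجتماع",
    "موعد الاجتماع غدا صباحا",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF features feeding a Multinomial Naive Bayes classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Grid over the smoothing parameter; the values are assumptions for this sketch.
param_grid = {"nb__alpha": [0.1, 0.5, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
```

After fitting, `search.best_estimator_` is the refitted pipeline, which is the kind of object one would persist with `joblib.dump`.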
Intended Use
Binary spam classification for Arabic text messages and emails. This model is part of a multilingual spam detection system that detects the language of each text with langdetect and routes it to the matching language-specific model (English or Arabic).
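The routing step can be sketched as a small dispatcher. The function name `route_and_predict`, the `models` dictionary keyed by language code, and the fallback to English for unsupported or undetectable languages are assumptions made for this example; the card does not say how such cases are handled.

```python
def route_and_predict(text, models, detect, default_lang="en"):
    """Pick a model by detected language and return (language, label).

    `models` maps a language code ("ar", "en") to a fitted classifier;
    `detect` is any callable that returns a language code for a string.
    """
    try:
        lang = detect(text)
    except Exception:
        # langdetect raises on text with no usable features (e.g. only digits).
        lang = default_lang
    if lang not in models:
        # Assumption: unsupported languages fall back to the default model.
        lang = default_lang
    label = models[lang].predict([text])[0]
    return lang, int(label)
```

With langdetect installed, the detector would be passed in as `from langdetect import detect`, and `models` would hold the loaded Arabic and English pipelines. Keeping the detector injectable also makes the router easy to test with a stub.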
How to Use
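import joblib
# Unpickling requires the custom preprocessing transformer's class to be importable,
# along with its dependencies (scikit-learn, pyarabic).
model = joblib.load("spam_ar_nb.joblib")
# predict() takes a list of raw strings and returns one label per string.
prediction = model.predict(["مبروك! لقد ربحت جائزة مجانية، اضغط هنا الآن"])
print(prediction)  # 1 = spam, 0 = ham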
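When a score is more useful than a hard label, the class probabilities can be read from the same object. The helper below is a hypothetical convenience function, assuming the loaded model is a scikit-learn pipeline ending in Multinomial Naive Bayes (which exposes `predict_proba` and `classes_`) and that label 1 means spam, as stated above.

```python
def spam_probabilities(model, texts, spam_label=1):
    """Return the predicted spam probability for each text."""
    # classes_ gives the column order of predict_proba's output.
    classes = list(model.classes_)
    spam_col = classes.index(spam_label)
    return [float(row[spam_col]) for row in model.predict_proba(texts)]
```

Calling `spam_probabilities(model, ["..."])` returns one float per input, which can then be compared against whatever threshold suits the application.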
Evaluation results
- Accuracy on Arabic Spam Dataset (self-reported): 0.976