Spam Detection — Arabic (Naive Bayes)

A spam/ham text classifier for Arabic messages, built with a custom Arabic-aware preprocessing pipeline (tatweel/tashkeel stripping, tokenization, stopword removal) and TF-IDF features feeding into a Multinomial Naive Bayes classifier.
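The preprocessing steps named above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the model's actual transformer: the regular expressions stand in for the pyarabic calls the model uses, the stopword set is a tiny hypothetical sample, and the choice to drop only the `#` sign (keeping the hashtag word) is an assumption.

```python
import re

# Arabic diacritics (tashkeel), fathatan through sukun: U+064B to U+0652.
# Regex stand-in for the pyarabic-based stripping described in the card.
TASHKEEL = re.compile(r"[\u064B-\u0652]")
TATWEEL = "\u0640"  # elongation character (kashida)

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"في", "من", "على", "إلى", "عن", "هذا", "لقد"}


def preprocess(text):
    """Return a list of cleaned tokens for one message."""
    text = TASHKEEL.sub("", text)          # strip diacritics
    text = text.replace(TATWEEL, "")       # strip elongation
    # Replace punctuation (Arabic and Latin), '#', and '_' with spaces.
    text = re.sub(r"[^\w\s]|_", " ", text)
    tokens = text.split()                  # whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]
```

Diacritics are stripped before punctuation removal because combining marks do not match `\w` and would otherwise split words apart.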

Model Details

  • Architecture: TF-IDF + Multinomial Naive Bayes (scikit-learn Pipeline)
  • Preprocessing: Custom transformer — hashtag/punctuation removal, tatweel (تطويل) and tashkeel (تشكيل) stripping via pyarabic, tokenization, Arabic stopword removal
  • Hyperparameters: Naive Bayes smoothing parameter (alpha) tuned via GridSearchCV
  • Accuracy: 97.6% on held-out test set
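For readers who want to see how such a setup fits together, here is a minimal sketch of a TF-IDF + Multinomial Naive Bayes pipeline with a grid search over alpha. The toy messages, the alpha grid, the step names, and `cv=2` are all assumptions made for illustration; the custom Arabic preprocessing transformer is omitted, and none of this reproduces the actual training data or the reported accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy placeholder data, not the model's training set. 1 = spam, 0 = ham.
texts = [
    "ربحت جائزة مجانية اضغط هنا",
    "عرض خاص اربح المال الآن",
    "جائزة نقدية مجانية اتصل الآن",
    "اضغط هنا واحصل على هدية مجانية",
    "سأصل إلى المكتب بعد ساعة",
    "هل نلتقي غدا في الجامعة",
    "شكرا على مساعدتك أمس",
    "الاجتماع تأجل إلى يوم الخميس",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The card's custom preprocessing step would sit before "tfidf"; omitted here.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Hypothetical alpha grid; the values actually searched are not documented.
search = GridSearchCV(
    pipeline,
    param_grid={"nb__alpha": [0.1, 0.5, 1.0]},
    cv=2,
    scoring="accuracy",
)
search.fit(texts, labels)

best_alpha = search.best_params_["nb__alpha"]
predictions = search.predict(["مبروك ربحت جائزة مجانية"])
```

Because `GridSearchCV` refits the best pipeline on all the data by default, `search` can be used directly for prediction afterwards.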

Intended Use

Binary spam classification for Arabic text messages and emails. The model is part of a multilingual spam detection system that detects the language of each text with langdetect and automatically routes it to the matching language-specific model (English or Arabic).
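The routing idea can be sketched as below. The function names, the `models` dictionary, and the decision to raise an error for unsupported languages are hypothetical; the card does not describe how the real system handles other languages. The detector is passed in as a callable so the sketch runs without langdetect installed.

```python
def detect_with_langdetect(text):
    """Return a language code such as 'ar' or 'en' using langdetect."""
    # Third-party dependency, imported lazily: pip install langdetect
    from langdetect import detect
    return detect(text)


def classify(text, models, detect_language=detect_with_langdetect):
    """Route `text` to the model for its detected language.

    `models` maps a language code to a fitted classifier with a
    scikit-learn style predict() method (hypothetical structure).
    Returns (language_code, predicted_label).
    """
    lang = detect_language(text)
    if lang not in models:
        # Assumption: unsupported languages are rejected rather than guessed.
        raise ValueError(f"No spam model available for language: {lang!r}")
    label = models[lang].predict([text])[0]
    return lang, label
```

A call might look like `classify(message, {"ar": arabic_model, "en": english_model})`, where both models are loaded with joblib beforehand.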

How to Use

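import joblib

# Load the serialized scikit-learn pipeline (preprocessing + TF-IDF + Naive Bayes).
model = joblib.load("spam_ar_nb.joblib")

# predict() expects a list of raw strings and returns one label per string.
prediction = model.predict(["مبروك! لقد ربحت جائزة مجانية، اضغط هنا الآن"])
print(prediction)  # 1 = spam, 0 = ham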