Arabic Restaurant Complaints Classifier

Single-label classification of Arabic restaurant complaints into 8 actionable categories. Specialized for Saudi-Gulf dialect; trained on ~98K reviews from Saudi delivery platforms (HungerStation, Jahez, Mrsool, Talabat) plus production data.

This model card covers the production 8-class model (deployed) and the v5 ambience experiment (research-stage 9-class model with a re-introduced الجو والمكان class).

Live demo: https://huggingface.co/spaces/FerasMad/arabic-complaints-classifier
Source: https://github.com/FerasMad/NLP-complaints-system
Companion dataset (v5 work): https://huggingface.co/datasets/FerasMad/arabic-restaurant-ambience

Categories

| Arabic | English | Test F1 |
|---|---|---|
| جودة الطعام | Food quality | 96.2% |
| خدمة الموظفين | Staff service | 95.8% |
| النظافة | Cleanliness | 95.3% |
| السعر والقيمة | Price / value | 94.7% |
| وقت الانتظار | Wait time | 91.9% |
| التوصيل | Delivery | 90.1% |
| دقة الطلب | Order accuracy | 86.7% |
| عامة | General (no specific aspect) | 84.9% |

Performance

Held-out test set of 13,986 real reviews, never seen in training:

| Metric | Value |
|---|---|
| Accuracy | 95.05% |
| Weighted F1 | 95.08% |
| Macro F1 | 92.03% |
| Min class F1 (عامة) | 84.84% |
| Bootstrap 95% CI (accuracy) | [94.70%, 95.41%] |
| ECE (after temperature scaling, T=1.523) | 0.014 |

The keyword-rescue layer in the deployed Space trades 0.93% test accuracy for a 15-point gain on a 34-case behavioral audit (85% → 100%). See the GitHub repo for the rescue logic and audit set.
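The actual rescue logic lives in the GitHub repo; as an illustration of the idea, here is a minimal sketch. The keyword map, the `rescue` name, and the 0.80 confidence threshold are all hypothetical, not the repo's values — the point is only that a high-precision keyword can override a low-confidence model prediction.

```python
# Hypothetical sketch of a keyword-rescue layer: when model confidence is
# low, let a small map of high-precision keywords override the prediction.
RESCUE_KEYWORDS = {
    "وسخ": "النظافة",          # "dirty" -> Cleanliness (illustrative entries)
    "متأخر": "وقت الانتظار",    # "late"  -> Wait time
}

def rescue(text: str, predicted: str, confidence: float,
           threshold: float = 0.80) -> str:
    """Return a rescued label if a keyword fires and confidence is low."""
    if confidence >= threshold:
        return predicted            # trust confident model predictions
    for kw, label in RESCUE_KEYWORDS.items():
        if kw in text:
            return label            # keyword overrides a shaky prediction
    return predicted
```

The trade-off reported above falls out of this structure: the override wins on targeted audit cases but occasionally fires on test reviews where the model was right.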

Intended use

Triage and routing of Arabic restaurant feedback for Saudi/Gulf operators. Use cases:

  • Customer-feedback dashboards: tag each complaint and route to the right team
  • Product analytics: aggregate complaint volume per category over time
  • Quality programs: prioritize improvement areas by complaint share

Out-of-scope / known limitations

  • Single-label only. The model picks one category. Multi-aspect complaints ("food was cold AND staff was rude") are not natively decomposed — the deployed Space adds a heuristic display layer for these. A true multi-label retrain is the next planned improvement.
  • Dialect bias. Trained almost entirely on Saudi/Gulf dialect. Cross-dialect canary scores (out of ~50 single-aspect probes per dialect):
    • Saudi: ~67%
    • Levantine: ~60%
    • Egyptian / MSA: ~50%
    • For non-Gulf dialects, treat predictions as advisory.
  • Domain bound to restaurants. The model was trained on restaurant complaints only. Don't apply to other product categories without retraining.
  • No ambience category in production model. v3 had a "الجو والمكان" class; v4 dropped it after a manual audit found ~99% of the gold labels were noise. The production 8-class model does not predict ambience. A 9-class v5 experiment exists that re-introduces ambience — see "V5 ambience experiment" section below. The v5 model achieves 89.22% ambience F1 on a hand-crafted adversarial fixture but is research-stage, not deployed.
  • No abstain mechanism in the raw model. The deployed Space adds short-input abstain (length < 3 chars) and reframes "عامة" predictions as "no specific aspect detected." If you call the model directly, you won't get those guards.
  • PII handling is the caller's responsibility. The model has no built-in PII scrubbing.
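If you call the model directly, the Space's two guards are easy to replicate. A minimal sketch — the `guarded_predict` name and the exact abstain/reframe strings are illustrative, and `predict_fn` stands in for any text → label callable:

```python
def guarded_predict(text: str, predict_fn) -> str:
    """Replicate the Space's guards around a raw text -> label classifier:
    abstain on very short input, and reframe the generic class."""
    if len(text.strip()) < 3:       # short-input abstain, as in the Space
        return "abstain: input too short"
    label = predict_fn(text)
    if label == "عامة":             # "General" -> softer phrasing
        return "no specific aspect detected"
    return label
```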

How to use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

REPO = "FerasMad/arabic-complaints-classifier"
LABELS = [
    "التوصيل", "السعر والقيمة", "النظافة", "جودة الطعام",
    "خدمة الموظفين", "دقة الطلب", "عامة", "وقت الانتظار",
]

tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSequenceClassification.from_pretrained(REPO).eval()

text = "الاكل بايخ ومالح والطبخ مو متقن"  # "the food is bland and salty, and the cooking isn't done well"
inputs = tok(text, return_tensors="pt", truncation=True, max_length=192)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1)[0]
top_idx = int(probs.argmax())
print(f"{LABELS[top_idx]}: {probs[top_idx]:.2%}")
# جودة الطعام: 98.4%

For temperature-scaled probabilities, divide logits by T = 1.523 before softmax. For the keyword-rescue layer and aspect-extraction interpretability, use the deployed Space or copy hf_space/app.py from the GitHub repo.
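The temperature step can be written as a small helper (T = 1.523 comes from the Training details below; the function name is ours):

```python
import torch

T = 1.523  # temperature fit on the validation set by NLL minimization

def calibrated_probs(logits: torch.Tensor, temperature: float = T) -> torch.Tensor:
    """Divide logits by T before softmax. The argmax (and hence the
    predicted label) is unchanged; confidences shrink toward uniform."""
    return (logits / temperature).softmax(-1)
```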

Training details

  • Base model: CAMeL-Lab/bert-base-arabic-camelbert-mix
  • Architecture: BERT base (110M params) + classification head over 8 labels
  • Sequence length: 192 tokens
  • Training data: ~98K Arabic complaints from Saudi delivery platforms + production data, manually labeled by the AI Club NLP team
  • Augmentation: EDA (Easy Data Augmentation) for under-represented classes (دقة الطلب, عامة) only
  • Calibration: Temperature scaling with T=1.523, fit on the validation set by NLL minimization
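For intuition, EDA's two lexicon-free operations (random swap and random deletion) can be sketched as below. This is an illustration of the augmentation family, not the project's actual pipeline; the function name and parameters are hypothetical.

```python
import random

def eda_swap_delete(text: str, n_swaps: int = 1, p_delete: float = 0.1,
                    seed: int = 0) -> str:
    """Two EDA operations that need no synonym lexicon: randomly swap a
    pair of words, then drop each word with probability p_delete."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_swaps):
        if len(words) >= 2:
            i, j = rng.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
    kept = [w for w in words if rng.random() > p_delete]
    return " ".join(kept or words)   # never return an empty string
```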

The deployed system uses an ensemble of 4 BERT variants (CAMeLBERT-mix × 2 seeds, MARBERT, AraBERTv02) — see the GitHub repo. This single model is the lightest deployable variant and is what powers the public HF Space (free-tier memory constraint).
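The card does not spell out how the 4-model ensemble combines its members; softmax averaging — which the v5 section below names for its Run 4 + Run 5 ensemble — is one standard scheme and can be sketched as:

```python
import torch

def ensemble_probs(all_logits: list[torch.Tensor]) -> torch.Tensor:
    """Softmax-average: convert each member's logits to probabilities,
    then take the element-wise mean across members."""
    return torch.stack([l.softmax(-1) for l in all_logits]).mean(dim=0)
```

Averaging after softmax (rather than averaging raw logits) keeps each member's vote bounded, so one over-confident model cannot dominate.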

Evaluation methodology

  • Held-out test set of 13,986 reviews, sampled per-source and stratified per-class
  • Bootstrap 95% CI computed over 1000 resamples
  • Calibration assessed via ECE on the test set
  • Cross-dialect canary set written by hand to probe dialect generalization
  • Behavioral audit: 34 hand-written test cases (single + multi-aspect) covering all 8 categories — this is what the deployed rescue layer optimizes for
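The percentile-bootstrap CI from the list above can be sketched as follows (function name and stdlib `random` are our choices; the reported CI was of course computed on the actual 13,986-review test set):

```python
import random

def bootstrap_accuracy_ci(correct: list, n_boot: int = 1000,
                          alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap: resample the per-example correctness vector
    with replacement n_boot times, then read off the alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = accs[int(n_boot * alpha / 2)]
    hi = accs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```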

V5 ambience experiment (9-class, research-stage)

After the 8-class production model shipped, the ambience class (dropped in v4) was re-attempted from scratch with new infrastructure: PySarf morphology, weak-labeled real data from public HF datasets, MLM continued pre-training on 45K Arabic restaurant + hotel reviews, focal loss, and threshold-tuned inference. Three models exist (all research-stage, not deployed):

| Model | Adversarial ambience F1 | Real-world test ambience F1 | Best for |
|---|---|---|---|
| Run 4 (models/single_ambience_v1_pretrained) | 89.22% | 69.19% | Adversarial benchmark |
| Run 5 (models/single_ambience_v1_pretrained_v2) | 88.24% | 94.09% | Real-world deployment |
| Run 4 + Run 5 ensemble (softmax averaging) | 88.15% | ~80-85% expected | Best real-world balance |

All three apply a post-hoc decision threshold (ambience_threshold ∈ [0.005, 0.02]) and an abstain wrapper (length gate + restaurant-domain OOD gate) at inference.
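One plausible reading of that decision rule (the function name and dict-shaped probabilities here are hypothetical — the actual inference code is in the repo): predict ambience whenever its probability clears the deliberately low threshold, otherwise take the argmax over the remaining classes.

```python
def predict_with_ambience_threshold(probs: dict, threshold: float = 0.01) -> str:
    """Post-hoc threshold rule for a rare class: a low bar for ambience
    boosts its recall without retraining the classifier."""
    if probs.get("الجو والمكان", 0.0) >= threshold:
        return "الجو والمكان"
    rest = {k: v for k, v in probs.items() if k != "الجو والمكان"}
    return max(rest, key=rest.get)
```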

V5 evaluation methodology

  • Adversarial fixture: 183 hand-written cases across 13 attack types (clean, boundary, negation, sarcasm, mixed dialect, multi-aspect, short, very_short, long, out_of_domain, typo, emoji, adversarial). Released as part of the companion dataset.
  • Real-world val/test: 15% / 15% stratified splits of the combined baseline (95K rows) + 5,000 weak-labeled HARD ambience candidates.

V5 ambience class metrics (Run 4, threshold 0.01 + abstain)

| Metric | Value |
|---|---|
| Ambience precision | 86.67% |
| Ambience recall | 91.92% |
| Ambience F1 | 89.22% |
| Per-attack-type pass rate (worst) | out_of_domain: 50% |
| Per-attack-type pass rate (best) | sarcasm, emoji, short, very_short: 100% |

V5 known limitations

  • Multi-aspect cases stuck at 40% — single-label classification can't decompose "الاكل بارد والمكيف خربان" ("the food is cold and the AC is broken"). Multi-label retraining is the only structural fix.
  • Out-of-domain handling at 50% — abstain wrapper catches half. A dedicated OOD classifier as a preprocessing gate would close more.
  • Trained on weak-labeled real data, not gold-labeled. The 1,256 qaym + HARD ambience candidates have ~50-80% precision per sampling. Hand-labeling would push F1 higher.

Reproducing v5

The full run history (waves 7-10) is documented in:

  • docs/V5_TRAINING_RUN_1_RESULTS.md — initial baseline (78.57% F1)
  • docs/V5_TRAINING_RUN_2_THRESHOLD_TUNING.md — threshold to 85.56%
  • docs/V5_TRAINING_RUN_3_REAL_DATA_AND_ABSTAIN.md — 24K real candidates harvested
  • docs/V5_TRAINING_RUN_4_PRETRAINED.md — MLM + weak labels = 89.22%
  • docs/V5_TRAINING_RUN_5_HARD_DATA.md — 5x data, mixed signal

Credits

Built by the NLP team at AI Club:

  • Feras Madkhali — team lead, training (Phase 3), evaluation, deployment
  • Lana — text cleaning pipeline (clean())
  • Khowla — labeling and split decisions
  • Rima & Mohammed — schema design, baseline trials
  • Meshal — deployment side

Special thanks to Rashidbm for PySarf — the Arabic morphology engine used in the deployed Space's aspect-extraction layer.

License

MIT. Upstream model licenses (CAMeLBERT, MARBERT, AraBERTv02) apply to the base architecture — see the GitHub repo's NOTICES file.

Citation

@misc{madkhali2026arcomplaints,
  title  = {Arabic Restaurant Complaints Classifier},
  author = {Madkhali, Feras and the AI Club NLP Team},
  year   = {2026},
  url    = {https://github.com/FerasMad/NLP-complaints-system},
  note   = {Saudi-Gulf dialect specialization. 8-class production model
            + 9-class v5 ambience experiment (89.22% adversarial F1).
            Companion dataset at huggingface.co/datasets/FerasMad/arabic-restaurant-ambience}
}