# Arabic Restaurant Complaints Classifier
Single-label classification of Arabic restaurant complaints into 8 actionable categories. Specialized for Saudi-Gulf dialect; trained on ~98K reviews from Saudi delivery platforms (HungerStation, Jahez, Mrsool, Talabat) plus production data.
This model card covers the production 8-class model (deployed) and the v5 ambience experiment (research-stage 9-class model with a re-introduced الجو والمكان class).
- Live demo: https://huggingface.co/spaces/FerasMad/arabic-complaints-classifier
- Source: https://github.com/FerasMad/NLP-complaints-system
- Companion dataset (v5 work): https://huggingface.co/datasets/FerasMad/arabic-restaurant-ambience
## Categories
| Arabic | English | Test F1 |
|---|---|---|
| جودة الطعام | Food quality | 96.2% |
| خدمة الموظفين | Staff service | 95.8% |
| النظافة | Cleanliness | 95.3% |
| السعر والقيمة | Price / value | 94.7% |
| وقت الانتظار | Wait time | 91.9% |
| التوصيل | Delivery | 90.1% |
| دقة الطلب | Order accuracy | 86.7% |
| عامة | General (no specific aspect) | 84.9% |
## Performance
Held-out test set of 13,986 real reviews, never seen in training:
| Metric | Value |
|---|---|
| Accuracy | 95.05% |
| Weighted F1 | 95.08% |
| Macro F1 | 92.03% |
| Min class F1 (عامة) | 84.84% |
| Bootstrap 95% CI (accuracy) | [94.70%, 95.41%] |
| ECE (after temperature scaling, T=1.523) | 0.014 |
The keyword-rescue layer in the deployed Space trades 0.93% test accuracy for a 15-point gain on a 34-case behavioral audit (85% → 100%). See the GitHub repo for the rescue logic and audit set.
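The rescue idea can be sketched as follows (the keyword lists, the confidence threshold, and the function name are illustrative assumptions; the real logic lives in hf_space/app.py in the GitHub repo):

```python
# Hedged sketch of a keyword-rescue layer. Keyword lists and the 0.70
# threshold are illustrative, not the deployed values.
RESCUE_KEYWORDS = {
    "التوصيل": ["مندوب", "توصيل"],   # delivery-related cues
    "النظافة": ["وسخ", "نظافة"],     # cleanliness cues
}

def rescue(text: str, model_label: str, confidence: float,
           threshold: float = 0.70) -> str:
    """Override a low-confidence prediction when a strong keyword appears."""
    if confidence >= threshold:
        return model_label            # trust the model when it is confident
    for label, keywords in RESCUE_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return label              # rescue: route by surface keyword
    return model_label

print(rescue("المندوب تأخر كثير", "عامة", 0.55))  # → التوصيل
```

The point of the audit trade-off: a rule like this slightly hurts aggregate test accuracy but repairs the systematic misses the behavioral audit targets.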
## Intended use
Triage and routing of Arabic restaurant feedback for Saudi/Gulf operators. Use cases:
- Customer-feedback dashboards: tag each complaint and route to the right team
- Product analytics: aggregate complaint volume per category over time
- Quality programs: prioritize improvement areas by complaint share
## Out-of-scope / known limitations
- Single-label only. The model picks one category. Multi-aspect complaints ("food was cold AND staff was rude") are not natively decomposed — the deployed Space adds a heuristic display layer for these. A true multi-label retrain is the next planned improvement.
- Dialect bias. Trained almost entirely on Saudi/Gulf dialect. Cross-dialect canary scores (out of ~50 single-aspect probes per dialect):
  - Saudi: ~67%
  - Levantine: ~60%
  - Egyptian / MSA: ~50%
- For non-Gulf dialects, treat predictions as advisory.
- Domain bound to restaurants. The model was trained on restaurant complaints only. Don't apply to other product categories without retraining.
- No ambience category in production model. v3 had a "الجو والمكان" class; v4 dropped it after a manual audit found ~99% of the gold labels were noise. The production 8-class model does not predict ambience. A 9-class v5 experiment exists that re-introduces ambience — see "V5 ambience experiment" section below. The v5 model achieves 89.22% ambience F1 on a hand-crafted adversarial fixture but is research-stage, not deployed.
- No abstain mechanism in the raw model. The deployed Space adds short-input abstain (length < 3 chars) and reframes "عامة" predictions as "no specific aspect detected." If you call the model directly, you won't get those guards.
- PII handling is the caller's responsibility. The model has no built-in PII scrubbing.
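A minimal sketch of the guards the deployed Space adds around the raw model (the 3-character cutoff and the عامة reframing come from this card; the function name, wording, and generic predict callable are illustrative):

```python
# Hedged sketch of the Space-side guards missing from direct model calls.
def guarded_predict(text: str, predict_fn) -> str:
    """Wrap a raw label-predicting function with the Space's two guards."""
    text = text.strip()
    if len(text) < 3:                       # short-input abstain
        return "abstain: input too short"
    label = predict_fn(text)
    if label == "عامة":                      # reframe the catch-all class
        return "no specific aspect detected"
    return label

# Usage with a stand-in predictor:
print(guarded_predict("ok", lambda t: "عامة"))  # abstains: input under 3 chars
```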
## How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

REPO = "FerasMad/arabic-complaints-classifier"
LABELS = [
    "التوصيل", "السعر والقيمة", "النظافة", "جودة الطعام",
    "خدمة الموظفين", "دقة الطلب", "عامة", "وقت الانتظار",
]

tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSequenceClassification.from_pretrained(REPO).eval()

text = "الاكل بايخ ومالح والطبخ مو متقن"
inputs = tok(text, return_tensors="pt", truncation=True, max_length=192)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1)[0]

top_idx = int(probs.argmax())
print(f"{LABELS[top_idx]}: {probs[top_idx]:.2%}")
# جودة الطعام: 98.4%
```
For temperature-scaled probabilities, divide logits by T = 1.523 before softmax. For the keyword-rescue layer and aspect-extraction interpretability, use the deployed Space or copy hf_space/app.py from the GitHub repo.
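The scaling step itself is a one-liner; a plain-Python sketch with made-up logits (the value of T comes from this card, the logits do not):

```python
import math

T = 1.523  # temperature fit on the validation set (from this model card)

def scaled_softmax(logits, temperature=T):
    """Softmax over logits / T; T > 1 flattens overconfident distributions."""
    z = [l / temperature for l in logits]
    m = max(z)                                 # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

raw = [4.0, 1.0, 0.5]          # illustrative logits, not real model output
calibrated = scaled_softmax(raw)
# calibrated top probability is lower than the raw (T=1) softmax's
```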
## Training details
- Base model: CAMeL-Lab/bert-base-arabic-camelbert-mix
- Architecture: BERT base (110M params) + classification head over 8 labels
- Sequence length: 192 tokens
- Training data: ~98K Arabic complaints from Saudi delivery platforms + production data, manually labeled by the AI Club NLP team
- Augmentation: EDA (Easy Data Augmentation) for under-represented classes (دقة الطلب, عامة) only
- Calibration: temperature scaling with T = 1.523, fit on the validation set by NLL minimization
The deployed system uses an ensemble of 4 BERT variants (CAMeLBERT-mix × 2 seeds, MARBERT, AraBERTv02) — see the GitHub repo. This single model is the lightest deployable variant and is what powers the public HF Space (free-tier memory constraint).
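Fitting T by NLL minimization can be sketched with a simple grid search (the production fit likely uses a gradient-based optimizer; this version and its synthetic validation data are illustrative):

```python
import math

def nll(logits_batch, labels, T):
    """Average negative log-likelihood of the true labels at temperature T."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        z = [l / T for l in logits]
        m = max(z)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
        total += log_sum_exp - z[y]            # -log p(y) via log-sum-exp
    return total / len(labels)

def fit_temperature(logits_batch, labels):
    """Grid-search the T minimizing validation NLL (step 0.001 on [1, 3))."""
    grid = [1 + 0.001 * i for i in range(2000)]
    return min(grid, key=lambda T: nll(logits_batch, labels, T))

# Tiny synthetic "validation set": overconfident logits, one of four wrong,
# so NLL prefers some flattening (T > 1).
val_logits = [[3.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 0.0]]
val_labels = [0, 1, 1, 0]
T = fit_temperature(val_logits, val_labels)
```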
## Evaluation methodology
- Held-out test set of 13,986 reviews, sampled per-source and stratified per-class
- Bootstrap 95% CI computed over 1000 resamples
- Calibration assessed via ECE on the test set
- Cross-dialect canary set written by hand to probe dialect generalization
- Behavioral audit: 34 hand-written test cases (single + multi-aspect) covering all 8 categories — this is what the deployed rescue layer optimizes for
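The percentile bootstrap from the list above can be sketched in a few lines (the correctness flags here are synthetic, built to match the reported 95.05% accuracy; the real CI uses the actual per-example results):

```python
import random

def bootstrap_ci(correct_flags, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct_flags)
    accs = sorted(
        sum(rng.choices(correct_flags, k=n)) / n for _ in range(n_resamples)
    )
    lo = accs[int(n_resamples * alpha / 2)]          # 2.5th percentile
    hi = accs[int(n_resamples * (1 - alpha / 2)) - 1]  # 97.5th percentile
    return lo, hi

flags = [1] * 9505 + [0] * 495     # illustrative: 95.05% point accuracy
lo, hi = bootstrap_ci(flags)       # interval brackets the point estimate
```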
## V5 ambience experiment (9-class, research-stage)
After the 8-class production model shipped, the v3-dropped ambience class was re-attempted from scratch with new infrastructure: PySarf morphology, weak-labeled real data from public HF datasets, MLM continued pre-training on 45K Arabic restaurant + hotel reviews, focal loss, and threshold-tuned inference. Three models exist (all research-stage, not deployed):
| Model | Adversarial Ambience F1 | Real-world test Ambience F1 | Best for |
|---|---|---|---|
| Run 4 (models/single_ambience_v1_pretrained) | 89.22% | 69.19% | Adversarial benchmark |
| Run 5 (models/single_ambience_v1_pretrained_v2) | 88.24% | 94.09% | Real-world deployment |
| Run 4 + Run 5 ensemble (softmax averaging) | 88.15% | ~80-85% (expected) | Best real-world balance |
All three apply a post-hoc decision threshold (ambience_threshold ∈ [0.005, 0.02]) and an abstain wrapper (length gate + restaurant-domain OOD gate) at inference.
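A hedged sketch of that inference wrapper (the function name, the keyword-based OOD gate, and the gating order are assumptions; only the thresholds and gate names come from this card):

```python
# Illustrative v5 inference wrapper: length gate, crude domain gate,
# then a low decision threshold that favors the rare ambience class.
def v5_predict(text, probs, ambience_threshold=0.01, min_len=3):
    """probs: dict mapping label -> probability from the 9-class model."""
    if len(text.strip()) < min_len:
        return "abstain: too short"
    # Stand-in restaurant-domain OOD gate (real gate is an assumption here).
    if not any(kw in text for kw in ("مطعم", "اكل", "طلب", "توصيل")):
        return "abstain: out of restaurant domain"
    # Threshold rule: predict ambience whenever its probability clears a
    # low bar, even if another class holds the argmax.
    if probs.get("الجو والمكان", 0.0) >= ambience_threshold:
        return "الجو والمكان"
    return max(probs, key=probs.get)

print(v5_predict("الاكل زين بس المكان ضيق",
                 {"جودة الطعام": 0.50, "الجو والمكان": 0.03}))  # → الجو والمكان
```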
### V5 evaluation methodology
- Adversarial fixture: 183 hand-written cases across 13 attack types (clean, boundary, negation, sarcasm, mixed dialect, multi-aspect, short, very_short, long, out_of_domain, typo, emoji, adversarial). Released as part of the companion dataset.
- Real-world val/test: 15% / 15% stratified splits of the combined baseline (95K rows) + 5,000 weak-labeled HARD ambience candidates.
### V5 ambience class metrics (Run 4, threshold 0.01 + abstain)
| Metric | Value |
|---|---|
| Ambience precision | 86.67% |
| Ambience recall | 91.92% |
| Ambience F1 | 89.22% |
| Per-attack-type pass rate (worst) | out_of_domain (50%) |
| Per-attack-type pass rate (best) | sarcasm, emoji, short, very_short (each 100%) |
### V5 known limitations
- Multi-aspect cases stuck at 40% — single-label classification can't decompose "الاكل بارد والمكيف خربان". Multi-label retraining is the only structural fix.
- Out-of-domain handling at 50% — abstain wrapper catches half. A dedicated OOD classifier as a preprocessing gate would close more.
- Trained on weak-labeled real data, not gold labels. Sampling audits put the 1,256 qaym + HARD ambience candidates at ~50-80% label precision. Hand-labeling would push F1 higher.
### Reproducing v5
Full waves 7-10 documented at:
- docs/V5_TRAINING_RUN_1_RESULTS.md — initial baseline (78.57% F1)
- docs/V5_TRAINING_RUN_2_THRESHOLD_TUNING.md — threshold to 85.56%
- docs/V5_TRAINING_RUN_3_REAL_DATA_AND_ABSTAIN.md — 24K real candidates harvested
- docs/V5_TRAINING_RUN_4_PRETRAINED.md — MLM + weak labels = 89.22%
- docs/V5_TRAINING_RUN_5_HARD_DATA.md — 5x data, mixed signal
## Credits
Built by the NLP team at AI Club:
- Feras Madkhali — team lead, training (Phase 3), evaluation, deployment
- Lana — text cleaning pipeline (clean())
- Khowla — labeling and split decisions
- Rima & Mohammed — schema design, baseline trials
- Meshal — deployment side
Special thanks to Rashidbm for PySarf — the Arabic morphology engine used in the deployed Space's aspect-extraction layer.
## License
MIT. Upstream model licenses (CAMeLBERT, MARBERT, AraBERTv02) apply to the base architecture — see the GitHub repo's NOTICES file.
## Citation

```bibtex
@misc{madkhali2026arcomplaints,
  title  = {Arabic Restaurant Complaints Classifier},
  author = {Madkhali, Feras and the AI Club NLP Team},
  year   = {2026},
  url    = {https://github.com/FerasMad/NLP-complaints-system},
  note   = {Saudi-Gulf dialect specialization. 8-class production model
            + 9-class v5 ambience experiment (89.22% adversarial F1).
            Companion dataset at huggingface.co/datasets/FerasMad/arabic-restaurant-ambience}
}
```