---
license: apache-2.0
language:
- ar
- fr
- en
metrics:
- accuracy
- f1
base_model:
- SI2M-Lab/DarijaBERT-arabizi
pipeline_tag: text-classification
tags:
- darija
- arabizi
- morocco
- bert
---


# Darija Toxicity Classifier 🇲🇦

A transformer-based NLP model for detecting toxic content in Moroccan Darija and Arabizi.

This model is designed to handle the linguistic complexity of the Moroccan dialect, including Arabizi (Arabic written in Latin characters, with digits standing in for certain Arabic letters), for example:
* `3` → ع
* `7` → ح
* `9` → ق

It also supports code-switched text mixing Darija, Arabic, French, English, and Tamazight.
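
The digit-to-letter correspondences listed above can be written as a small lookup table. This is only an illustrative sketch: the helper name `digits_to_arabic` is hypothetical, the table covers just the three digits shown in this card, and nothing here is claimed to be part of the model's own preprocessing.

```python
# Digit-to-letter correspondences taken from the list in this card.
# Real Arabizi uses more substitutions; only these three are documented here.
ARABIZI_DIGITS = {"3": "ع", "7": "ح", "9": "ق"}

# Precompute a translation table so each character is replaced in one pass.
_TABLE = str.maketrans(ARABIZI_DIGITS)


def digits_to_arabic(text: str) -> str:
    """Replace Arabizi digits with the Arabic letters they stand for.

    Latin letters are left untouched, so the result is a mixed-script
    string meant for inspection, not a full transliteration.
    """
    return text.translate(_TABLE)


if __name__ == "__main__":
    for word in ("sa7a", "3likom"):
        print(word, "->", digits_to_arabic(word))
```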

---

## 📌 Model Overview

| Property | Value |
|----------|-------|
| **Model ID** | `0khacha/darija-toxicity-classifier` |
| **Architecture** | Fine-tuned from `SI2M-Lab/DarijaBERT-arabizi` |
| **Task** | Binary Sequence Classification (Safe / Toxic) |
| **Framework** | Hugging Face Transformers |
| **Training Data** | 16,000+ labeled Moroccan Darija/Arabizi samples |

---

## 🚀 Quick Inference (Transformers)

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="0khacha/darija-toxicity-classifier"
)

result = classifier("salam khouya")
print(result)
# Output: [{'label': 'SAFE', 'score': 0.9845}]
```

---

## 🧠 What Makes This Model Special?

### 🌍 Dialect-Aware
Built specifically for Moroccan linguistic patterns rather than generic Arabic.

### 🔢 Arabizi Handling
Understands numeric character substitutions like:
* `in3al`
* `sa7a`
* `3likom`

### 🧹 Custom Preprocessing
The model was trained with specialized normalization:
* Lowercasing
* Joining characters split by dashes or underscores (`w-a-l-o` → `walo`)
* Joining characters split by spaces (`n 3 a l` → `n3al`)
* Reducing elongation (`heeeey` → `hey`)
* Collapsing repeated whitespace
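
The normalization steps listed above could be approximated with a few regular expressions. This is a minimal sketch, not the author's actual preprocessing code. The function name `normalize`, the step order, the requirement that at least three single characters be split before they are joined, and the rule that a letter repeated three or more times collapses to one are all assumptions chosen to reproduce the examples in the list.

```python
import re

# One letter or digit (Unicode-aware), excluding the underscore.
_CHAR = r"[^\W_]"

# Runs of single characters split by dashes/underscores, e.g. "w-a-l-o".
# Assumption: at least three characters, so "t-shirt" is left alone.
_SPLIT_BY_PUNCT = re.compile(
    rf"(?<!{_CHAR})(?:{_CHAR}[-_]){{2,}}{_CHAR}(?!{_CHAR})"
)
# Runs of single characters split by single spaces, e.g. "n 3 a l".
_SPLIT_BY_SPACE = re.compile(
    rf"(?<!{_CHAR})(?:{_CHAR} ){{2,}}{_CHAR}(?!{_CHAR})"
)
# A letter repeated three or more times, e.g. the "eeee" in "heeeey".
# Digits are excluded so numbers such as "1000" are not altered.
_ELONGATION = re.compile(r"([^\W\d_])\1{2,}")
_WHITESPACE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Apply the five normalization steps in the order they are listed."""
    text = text.lower()
    text = _SPLIT_BY_PUNCT.sub(
        lambda m: m.group().replace("-", "").replace("_", ""), text
    )
    text = _SPLIT_BY_SPACE.sub(lambda m: m.group().replace(" ", ""), text)
    text = _ELONGATION.sub(r"\1", text)
    return _WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    for raw in ("w-a-l-o", "n 3 a l", "heeeey", "  Salam   khouya "):
        print(repr(raw), "->", repr(normalize(raw)))
```

If the model expects the same normalization at inference time, applying a function like this before calling the classifier would keep inputs closer to the training distribution, though the exact rules would need to match the original pipeline.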

---

## 📊 Performance

| Metric | Score |
|--------|-------|
| **Accuracy** | ~94% |
| **F1-Score** | ~93% |
| **Inference Latency (GPU)** | ~50 ms |

> **Note:** Performance may vary depending on hardware and deployment setup.

---

## 📖 Example Predictions

### Example 1: Safe Content

**Input:**
```python
"bghit nakol"
```

**Output:**
```text
Safe (98.45%)
```

### Example 2: Toxic Content

**Input:**
```python
"rak stupid"
```

**Output:**
```text
Toxic
```
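
The `Safe (98.45%)` display format used in these examples can be derived from the raw pipeline output shown in the Quick Inference section (a dict with `label` and `score` keys). The helper below is a hypothetical sketch of that conversion; the name `format_prediction` is not part of the model or of Transformers.

```python
def format_prediction(prediction: dict) -> str:
    """Turn {'label': 'SAFE', 'score': 0.9845} into 'Safe (98.45%)'."""
    label = prediction["label"].capitalize()  # 'SAFE' -> 'Safe'
    # The '%' format spec multiplies by 100 and appends the percent sign.
    return f"{label} ({prediction['score']:.2%})"


if __name__ == "__main__":
    print(format_prediction({"label": "SAFE", "score": 0.9845}))
    # prints: Safe (98.45%)
```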

---

## ⚠️ Limitations

* May struggle with extremely rare slang
* Context-dependent toxicity (sarcasm) may reduce accuracy
* Not intended for legal decisions or fully automated moderation; keep a human reviewer in the loop
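
One way to respect the human-review limitation above is to auto-approve only confident safe predictions and send everything else to a reviewer. The sketch below assumes predictions arrive as dicts in the pipeline's `label`/`score` format, and that the safe label is `SAFE` as in the Quick Inference output. The function name `needs_human_review` and the `0.9` threshold are hypothetical and would need tuning on real data.

```python
def needs_human_review(prediction: dict, threshold: float = 0.9) -> bool:
    """Return True unless the prediction is a confident SAFE.

    Assumptions (not confirmed by the model card):
    - the safe class is labelled 'SAFE';
    - any other label is treated as potentially toxic;
    - 'threshold' is an illustrative cut-off, not a recommended value.
    """
    is_safe = prediction["label"].upper() == "SAFE"
    is_confident = prediction["score"] >= threshold
    # Toxic or low-confidence predictions always go to a person.
    return not (is_safe and is_confident)


if __name__ == "__main__":
    samples = [
        {"label": "SAFE", "score": 0.9845},
        {"label": "SAFE", "score": 0.62},
        {"label": "TOXIC", "score": 0.97},
    ]
    for sample in samples:
        print(sample, "-> review:", needs_human_review(sample))
```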

---

## 🔒 Dataset & Privacy

The training dataset is not publicly available for privacy and ethical reasons.

For research collaboration: 📩 [mohamedkhacha99@gmail.com](mailto:mohamedkhacha99@gmail.com)

---

## 📜 License

Apache License 2.0

---

## 🙏 Acknowledgments

* **DarijaBERT team** at SI2M-Lab
* **Hugging Face** Transformers ecosystem
* **PyTorch**
* The **Moroccan NLP community**

---

## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@misc{darija-toxicity-classifier,
  author = {Khacha, Mohamed},
  title = {Darija Toxicity Classifier},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/0khacha/darija-toxicity-classifier}
}
```

---

## 🤝 Contributing

Contributions, issues, and feature requests are welcome!

Feel free to open a thread on the [discussions page](https://huggingface.co/0khacha/darija-toxicity-classifier/discussions).

---

**Made with ❤️ for the Moroccan NLP community**