---
license: apache-2.0
language:
- ar
- fr
- en
metrics:
- accuracy
- f1
base_model:
- SI2M-Lab/DarijaBERT-arabizi
pipeline_tag: text-classification
tags:
- darija
- arabizi
- morocco
- bert
---


# Darija Toxicity Classifier 🇲🇦

A transformer-based NLP model for detecting toxic content in Moroccan Darija and Arabizi.

This model is specifically designed to handle the linguistic complexity of the Moroccan dialect, including Arabizi (Arabic written in Latin characters, with digits standing in for some Arabic letters), for example:
* `3` → ع
* `7` → ح
* `9` → ق

It also supports code-switched text mixing Darija, Arabic, French, English, and Tamazight.
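
As a quick illustration of the digit substitutions listed above, the sketch below looks up the Arabic letter behind each digit in an Arabizi string. The names `ARABIZI_DIGITS` and `find_digit_letters` are hypothetical, and the table covers only the three digits named in this card. The classifier takes Arabizi text directly, so this lookup is for illustration and is not a required preprocessing step.

```python
from typing import List, Tuple

# Only the three substitutions named in this model card (not a full Arabizi table).
ARABIZI_DIGITS = {"3": "ع", "7": "ح", "9": "ق"}


def find_digit_letters(text: str) -> List[Tuple[str, str]]:
    """Return (digit, Arabic letter) pairs for each mapped digit in text, in order."""
    return [(ch, ARABIZI_DIGITS[ch]) for ch in text if ch in ARABIZI_DIGITS]


print(find_digit_letters("sa7a 3likom"))
```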

---

## 📌 Model Overview

| Property | Value |
|----------|-------|
| **Model ID** | `0khacha/darija-toxicity-classifier` |
| **Architecture** | Fine-tuned from `SI2M-Lab/DarijaBERT-arabizi` |
| **Task** | Binary Sequence Classification (Safe / Toxic) |
| **Framework** | Hugging Face Transformers |
| **Training Data** | 16,000+ labeled Moroccan Darija/Arabizi samples |

---

## 🚀 Quick Inference (Transformers)

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="0khacha/darija-toxicity-classifier"
)

result = classifier("salam khouya")
print(result)
# Output: [{'label': 'SAFE', 'score': 0.9845}]
```

---

## 🧠 What Makes This Model Special?

### 🌍 Dialect-Aware
Built specifically for Moroccan linguistic patterns β€” not generic Arabic.

### πŸ”’ Arabizi Handling
Understands numeric character substitutions like:
* `in3al`
* `sa7a`
* `3likom`

### 🧹 Custom Preprocessing
The model was trained with specialized normalization:
* Lowercasing
* Removing dash/underscore splitting (`w-a-l-o` β†’ `walo`)
* Fixing spaced characters (`n 3 a l` β†’ `n3al`)
* Reducing elongation (`heeeey` β†’ `hey`)
* Whitespace normalization
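
The steps above can be sketched as a small normalization function. This is an approximation for illustration, not the training pipeline itself: the function name `normalize`, the regular expressions, and the thresholds (at least three split characters, letter runs of three or more) are assumptions chosen so that the card's own examples come out as shown.

```python
import re

# A letter or digit (word character excluding underscore).
_ALNUM = r"[^\W_]"

# Single characters joined by dashes/underscores, e.g. "w-a-l-o" (assumed: 3+ characters).
_SPLIT_RUN = re.compile(rf"(?<!{_ALNUM}){_ALNUM}(?:[-_]{_ALNUM}){{2,}}(?!{_ALNUM})")
# Single characters separated by whitespace, e.g. "n 3 a l" (assumed: 3+ characters).
_SPACED_RUN = re.compile(r"(?<!\S)\S(?:\s\S){2,}(?!\S)")
# The same letter repeated three or more times, e.g. "eeee" (digits are left alone).
_ELONGATION = re.compile(r"([^\W\d_])\1{2,}")


def normalize(text: str) -> str:
    """Approximate the normalization steps listed in this model card."""
    text = text.lower()
    text = _SPLIT_RUN.sub(lambda m: re.sub(r"[-_]", "", m.group()), text)
    text = _SPACED_RUN.sub(lambda m: re.sub(r"\s", "", m.group()), text)
    text = _ELONGATION.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()


for sample in ["W-A-L-O", "n 3 a l", "heeeey", "  salam   khouya "]:
    print(repr(sample), "->", repr(normalize(sample)))
```

Applying the same normalization at inference time that was used in training generally keeps inputs closer to what the model saw, so a function like this would typically run before the text is passed to the classifier.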

---

## 📊 Performance

| Metric | Score |
|--------|-------|
| **Accuracy** | ~94% |
| **F1-Score** | ~93% |
| **Inference Latency (GPU)** | ~50 ms |

> **Note:** Performance may vary depending on hardware and deployment setup.
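
For readers who want to compute the same two metrics on their own labeled data, here is a minimal sketch of accuracy and F1 from binary confusion counts. It treats Toxic as the positive class, which is an assumption: the card does not state which class or averaging scheme the reported F1 uses. The counts in the example are made up for illustration and are not this model's results.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 40 true positives, 10 false positives,
# 10 false negatives, 40 true negatives.
print(f"accuracy = {accuracy(40, 10, 10, 40):.2f}")
print(f"f1       = {f1_score(40, 10, 10):.2f}")
```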

---

## 📖 Example Predictions

### Example 1: Safe Content

**Input:**
```python
"bghit nakol"
```

**Output:**
```text
Safe (98.45%)
```

### Example 2: Toxic Content

**Input:**
```python
"rak stupid"
```

**Output:**
```text
Toxic
```
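
The outputs above use a `Label (percent)` form, while the Transformers pipeline returns dictionaries such as `{'label': 'SAFE', 'score': 0.9845}`. A small helper can convert one into the other. The name `format_prediction` is hypothetical, and the sketch assumes only that each prediction has a `label` string and a `score` between 0 and 1.

```python
def format_prediction(pred: dict) -> str:
    """Render a pipeline prediction dict as 'Label (xx.xx%)'."""
    label = pred["label"].capitalize()  # 'SAFE' -> 'Safe'
    return f"{label} ({pred['score']:.2%})"


# The dict below mirrors the Quick Inference output shown in this card.
print(format_prediction({"label": "SAFE", "score": 0.9845}))
# Output: Safe (98.45%)
```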

---

## ⚠️ Limitations

* May struggle with extremely rare slang
* Context-dependent toxicity, such as sarcasm, may be misclassified
* Not intended for legal decisions or for fully automated moderation without human review
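
One way to keep a human in the loop, as the last point recommends, is to act automatically only on confident predictions and queue the rest for review. The sketch below is one possible policy, not part of the model. The function name `route`, the `0.90` threshold, and the `TOXIC` label string are assumptions: only `SAFE` appears in the sample output in this card, so check the label names your pipeline actually returns.

```python
def route(pred: dict, threshold: float = 0.90) -> str:
    """Decide what to do with one pipeline prediction.

    Returns 'human_review' for low-confidence predictions, 'flag' for
    confident toxic ones, and 'allow' otherwise.
    """
    if pred["score"] < threshold:
        return "human_review"
    # 'TOXIC' is an assumed label name; adjust it to the model's real labels.
    return "flag" if pred["label"].upper() == "TOXIC" else "allow"


# Hypothetical predictions in the pipeline's output format.
for pred in [
    {"label": "SAFE", "score": 0.98},
    {"label": "TOXIC", "score": 0.97},
    {"label": "TOXIC", "score": 0.62},
]:
    print(pred, "->", route(pred))
```

Even confident `flag` decisions may warrant periodic human spot checks, since the threshold trades review workload against the risk of wrong automatic actions.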

---

## 🔒 Dataset & Privacy

The training dataset is not publicly available for privacy and ethical reasons.

For research collaboration: 📩 [mohamedkhacha99@gmail.com](mailto:mohamedkhacha99@gmail.com)

---

## 📜 License

MIT License

---

## 🙏 Acknowledgments

* **DarijaBERT team** at SI2M-Lab
* **Hugging Face** Transformers ecosystem
* **PyTorch**
* The **Moroccan NLP community**

---

## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@misc{darija-toxicity-classifier,
  author = {Khacha, Mohamed},
  title = {Darija Toxicity Classifier},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/0khacha/darija-toxicity-classifier}
}
```

---

## 🀝 Contributing

Contributions, issues, and feature requests are welcome!

Feel free to open or browse threads on the [discussions page](https://huggingface.co/0khacha/darija-toxicity-classifier/discussions).

---

**Made with ❤️ for the Moroccan NLP community**