---
license: apache-2.0
language:
- ar
- fr
- en
metrics:
- accuracy
- f1
base_model:
- SI2M-Lab/DarijaBERT-arabizi
pipeline_tag: text-classification
tags:
- darija
- arabizi
- morocco
- bert
---
# Darija Toxicity Classifier 🇲🇦
A transformer-based NLP model for detecting toxic content in Moroccan Darija and Arabizi.
This model is specifically designed to handle the linguistic complexity of the Moroccan dialect, including Arabizi (Arabic written in Latin characters, with digits standing in for some Arabic letters), such as:
* `3` → ع
* `7` → ح
* `9` → ق
It also supports code-switched text mixing Darija, Arabic, French, English, and Tamazight.
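
The digit convention listed above can be illustrated with a small lookup table. This is only a sketch of the writing convention: the helper name `digits_to_arabic` is hypothetical, the table covers just the three pairs shown in this card, and nothing here is claimed to be part of the model's own pipeline.

```python
# Illustrative only: the three digit-to-letter pairs listed in this card.
ARABIZI_DIGITS = {"3": "ع", "7": "ح", "9": "ق"}

# Build the translation table once so repeated calls stay cheap.
_DIGIT_TABLE = str.maketrans(ARABIZI_DIGITS)


def digits_to_arabic(text: str) -> str:
    """Replace Arabizi digits with the Arabic letters they stand for.

    Latin letters are left untouched, so the result is a mixed-script
    string meant for inspection, not a full transliteration.
    """
    return text.translate(_DIGIT_TABLE)


if __name__ == "__main__":
    for word in ("sa7a", "3likom", "salam"):
        print(word, "->", digits_to_arabic(word))
```

The classifier takes raw Arabizi text as input, so a mapping like this is useful mainly for reading or debugging samples.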
---
## Model Overview
| Property | Value |
|----------|-------|
| **Model ID** | `0khacha/darija-toxicity-classifier` |
| **Architecture** | Fine-tuned from `SI2M-Lab/DarijaBERT-arabizi` |
| **Task** | Binary Sequence Classification (Safe / Toxic) |
| **Framework** | Hugging Face Transformers |
| **Training Data** | 16,000+ labeled Moroccan Darija/Arabizi samples |
---
## Quick Inference (Transformers)
```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline(
    "text-classification",
    model="0khacha/darija-toxicity-classifier",
)

result = classifier("salam khouya")
print(result)
# Example output: [{'label': 'SAFE', 'score': 0.9845}]
```
---
## What Makes This Model Special?
### Dialect-Aware
Built specifically for Moroccan linguistic patterns, not generic Arabic.
### Arabizi Handling
Understands numeric character substitutions like:
* `in3al`
* `sa7a`
* `3likom`
### Custom Preprocessing
The model was trained with specialized normalization:
* Lowercasing
* Removing dash/underscore splitting (`w-a-l-o` → `walo`)
* Fixing spaced characters (`n 3 a l` → `n3al`)
* Reducing elongation (`heeeey` → `hey`)
* Whitespace normalization
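
The normalization steps above can be sketched with a few regular expressions. This is a minimal approximation, not the author's training code: the function name `normalize`, the exact patterns, and the thresholds (three or more spaced characters, three or more repeated letters) are all assumptions chosen to reproduce the examples in the list.

```python
import re

# Single alphanumeric characters joined by dashes/underscores, e.g. "w-a-l-o".
# (Assumed pattern: ordinary hyphenated words such as "well-known" are kept.)
_SPLIT_BY_PUNCT = re.compile(r"(?<![^\W_])[^\W_](?:[-_][^\W_])+(?![^\W_])")

# Three or more single characters separated by whitespace, e.g. "n 3 a l".
_SPLIT_BY_SPACE = re.compile(r"(?<![^\W_])[^\W_](?:\s+[^\W_]){2,}(?![^\W_])")

# A letter repeated three or more times in a row, e.g. the "eeee" in "heeeey".
_ELONGATION = re.compile(r"([^\W\d_])\1{2,}")


def normalize(text: str) -> str:
    """Apply the five normalization steps in the order listed in this card."""
    text = text.lower()
    # Join characters that were split by dashes or underscores.
    text = _SPLIT_BY_PUNCT.sub(lambda m: re.sub(r"[-_]", "", m.group()), text)
    # Join characters that were split by spaces.
    text = _SPLIT_BY_SPACE.sub(lambda m: re.sub(r"\s+", "", m.group()), text)
    # Collapse elongated letters to a single occurrence.
    text = _ELONGATION.sub(r"\1", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


if __name__ == "__main__":
    for sample in ("W-A-L-O", "n 3 a l", "heeeey", "  Salam   khouya "):
        print(repr(sample), "->", repr(normalize(sample)))
```

If inputs are normalized before inference, applying the same steps the model saw during training is likely to matter; since the exact training rules are not published here, treat this sketch as a starting point only.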
---
## Performance
| Metric | Score |
|--------|-------|
| **Accuracy** | ~94% |
| **F1-Score** | ~93% |
| **Inference Latency (GPU)** | ~50 ms |
> **Note:** Performance may vary depending on hardware and deployment setup.
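
For readers who want to reproduce these two metrics on their own labeled data, the sketch below shows how accuracy and F1 are computed from confusion-matrix counts. The counts in the usage example are made up for illustration and are not this model's evaluation results; treating "toxic" as the positive class is also an assumption.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts for 100 samples, with "toxic" as the positive class.
    tp, fp, fn, tn = 45, 3, 4, 48
    print(f"accuracy = {accuracy(tp, fp, fn, tn):.4f}")
    print(f"f1       = {f1_score(tp, fp, fn):.4f}")
```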
---
## Example Predictions
### Example 1: Safe Content
**Input:**
```python
"bghit nakol"
```
**Output:**
```text
Safe (98.45%)
```
### Example 2: Toxic Content
**Input:**
```python
"rak stupid"
```
**Output:**
```text
Toxic
```
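
The pipeline returns a list of dictionaries rather than the display strings shown above. A minimal sketch of the conversion follows, assuming the `{'label': ..., 'score': ...}` shape from the Quick Inference output; the helper name `format_prediction` is hypothetical.

```python
def format_prediction(prediction: dict) -> str:
    """Turn one pipeline result into a string such as 'Safe (98.45%)'."""
    label = prediction["label"].capitalize()
    # The ".2%" format multiplies by 100 and keeps two decimal places.
    return f"{label} ({prediction['score']:.2%})"


if __name__ == "__main__":
    # Same shape as the Quick Inference output earlier in this card.
    print(format_prediction({"label": "SAFE", "score": 0.9845}))
    # prints "Safe (98.45%)"
```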
---
## Limitations
* May struggle with extremely rare slang
* Context-dependent toxicity (sarcasm) may reduce accuracy
* Not intended for legal or automated moderation without human review
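
One way to keep a human in the loop, as the last point recommends, is to auto-accept only confident "safe" predictions and send everything else to a reviewer. The sketch below is an assumption-laden illustration: the function name `route`, the `0.90` threshold, and the `TOXIC` label string are hypothetical (only `SAFE` appears in this card's pipeline output), and the threshold would need tuning on real data.

```python
def route(prediction: dict, safe_threshold: float = 0.90) -> str:
    """Decide what to do with one pipeline result.

    Returns "allow" only for confident SAFE predictions; toxic or
    low-confidence predictions go to "human_review" instead of being
    acted on automatically.
    """
    is_safe = prediction["label"].upper() == "SAFE"
    if is_safe and prediction["score"] >= safe_threshold:
        return "allow"
    return "human_review"


if __name__ == "__main__":
    samples = [
        {"label": "SAFE", "score": 0.9845},
        {"label": "SAFE", "score": 0.60},
        {"label": "TOXIC", "score": 0.99},  # assumed label name
    ]
    for sample in samples:
        print(sample, "->", route(sample))
```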
---
## Dataset & Privacy
The training dataset is not publicly available for privacy and ethical reasons.
For research collaboration: 📩 [mohamedkhacha99@gmail.com](mailto:mohamedkhacha99@gmail.com)
---
## License
MIT License
---
## Acknowledgments
* **DarijaBERT team** at SI2M-Lab
* **Hugging Face** Transformers ecosystem
* **PyTorch**
* The **Moroccan NLP community**
---
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{darija-toxicity-classifier,
  author    = {Khacha, Mohamed},
  title     = {Darija Toxicity Classifier},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/0khacha/darija-toxicity-classifier}
}
```
---
## Contributing
Contributions, issues, and feature requests are welcome!
Feel free to check the [discussions page](https://huggingface.co/0khacha/darija-toxicity-classifier/discussions).
---
**Made with ❤️ for the Moroccan NLP community**