---
language:
- ar
license: apache-2.0
base_model: thejosango/nuha-mlm
tags:
- bert
- text-classification
- hate-speech
- gender-based-violence
- arabic
- multiclass-classification
- onnx
- pilot
datasets:
- thejosango/nuha-dataset
metrics:
- f1
- precision
- recall
model-index:
- name: nuha
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Jordanian NUHA Dataset
      type: thejosango/nuha-dataset
      config: methodology
      split: validation
    metrics:
    - type: f1
      value: 0.5363
      name: F1
    - type: precision
      value: 0.6660
      name: Precision
    - type: recall
      value: 0.5188
      name: Recall
---

# nuha

## Model Summary

`nuha` is a **lightweight, ONNX-optimised** Arabic text classifier that sorts Jordanian social media comments into three classes defined by the NUHA methodology for online gender-based violence (OGBV). It fine-tunes [`nuha-mlm`](https://huggingface.co/thejosango/nuha-mlm), a domain-adapted Arabic BERT, using a reduced 4-layer architecture for efficient CPU inference, and is exported to ONNX. It shares its classification task and labels with [`nuha-multiclass`](https://huggingface.co/thejosango/nuha-multiclass) but is optimised for production deployment. This is the model that powers the NUHA analysis platform.

| Label | Meaning |
|---|---|
| `Not Online Violence` | Comments that are not hate speech |
| `Offensive Language` | Hate speech characterised by irony or sarcasm |
| `Gender Based Violence` | Direct hate speech targeting gender, the primary focus of NUHA |

This model was developed as part of a **pilot proof-of-concept** for the NUHA project by the [Jordan Open Source Association (JOSA)](https://josa.ngo).

For the full-depth (12-layer) version of this classifier, see [`nuha-multiclass`](https://huggingface.co/thejosango/nuha-multiclass).

## Uses

### Direct Use

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Load the ONNX model and its tokenizer from the Hub.
model = ORTModelForSequenceClassification.from_pretrained("thejosango/nuha")
tokenizer = AutoTokenizer.from_pretrained("thejosango/nuha")

# Wrap both in a standard text-classification pipeline.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

result = classifier("اخرسي يا غبية")
print(result)
# [{'label': 'Gender Based Violence', 'score': ...}]
```

For batch inference:

```python
comments = ["يعطيكم العافية", "أنتِ ساحرة", "اخرسي يا غبية"]
results = classifier(comments)

# Print one line per comment: predicted label, confidence, original text.
for comment, result in zip(comments, results):
    print(f"{result['label']} ({result['score']:.2f}): {comment}")
```

### Using the PyTorch Version

If you need the full PyTorch model (for fine-tuning or non-ONNX inference), use [`nuha-multiclass`](https://huggingface.co/thejosango/nuha-multiclass) directly.

### Out-of-Scope Use

- **Other Arabic dialects**: The model was trained primarily on Jordanian Arabic. Performance on Egyptian, Gulf, or Modern Standard Arabic has not been validated.
- **Other hate speech targets**: NUHA is calibrated for online gender-based violence. It is not designed to detect hate speech targeting race, religion, or other demographics.
- **High-stakes automated decisions**: Given the moderate performance (F1 ≈ 0.54) and the pilot nature of this work, the model should not act as the sole decision-maker in content moderation systems. Pair it with human review.
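
One way to keep a human in the loop is to route predictions to a review queue instead of acting on them automatically. The sketch below is only an illustration: the `triage` helper, the `needs_review` field and the `0.80` threshold are hypothetical and not part of the NUHA platform. It assumes the `{'label', 'score'}` dictionaries returned by the pipeline shown under Direct Use.

```python
# Hypothetical cut-off; tune it on held-out data before relying on it.
REVIEW_THRESHOLD = 0.80

# Label name taken from the model card's label table.
BENIGN_LABEL = "Not Online Violence"


def triage(predictions, threshold=REVIEW_THRESHOLD):
    """Mark which pipeline outputs should be checked by a human reviewer.

    A prediction is sent to review when the model flags the comment
    (any label other than BENIGN_LABEL) or when its confidence score
    falls below the threshold.
    """
    decisions = []
    for pred in predictions:
        flagged = pred["label"] != BENIGN_LABEL
        low_confidence = pred["score"] < threshold
        decisions.append({**pred, "needs_review": flagged or low_confidence})
    return decisions
```

With this policy only confident `Not Online Violence` predictions bypass review. Every flagged comment is seen by a person, which suits a pilot model with moderate F1.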

## Preprocessing

At inference time, apply the following normalisation to the input text before passing it to the model:

1. Replace URLs with the `[رابط]` token
2. Replace @mentions with the `[مستخدم]` token
3. Replace email addresses with the `[بريد]` token
4. Remove numbers
5. Remove punctuation
6. Remove Arabic diacritics (harakat)
7. Normalise whitespace
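
The normalisation steps above could be sketched roughly as follows. The card does not specify the regular expressions, the punctuation set, or the exact order of operations, so everything here is an assumption: the `normalise` name is hypothetical, emails are matched before @mentions so an address is not split at its `@`, numbers and punctuation are replaced by a space, and temporary private-use sentinels stop the bracketed tokens from being stripped along with other punctuation.

```python
import re
import unicodedata

# Placeholder tokens named in the model card.
URL_TOKEN = "[رابط]"
EMAIL_TOKEN = "[بريد]"
MENTION_TOKEN = "[مستخدم]"

# Private-use characters stand in for the tokens until punctuation is removed.
_SENTINELS = {"\ue000": URL_TOKEN, "\ue001": EMAIL_TOKEN, "\ue002": MENTION_TOKEN}

# Assumed patterns; the project's own patterns may differ.
_URL_RE = re.compile(r"(?:https?://|www\.)\S+")
_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
_MENTION_RE = re.compile(r"@\w+")
_DIGIT_RE = re.compile(r"\d+")  # also matches Arabic-Indic digits
_DIACRITICS_RE = re.compile(r"[\u064B-\u065F\u0670]")  # harakat and related marks


def normalise(text: str) -> str:
    """Apply the seven normalisation steps to one comment (illustrative only)."""
    text = _URL_RE.sub(" \ue000 ", text)      # 1. URLs
    text = _EMAIL_RE.sub(" \ue001 ", text)    # 3. emails (before mentions)
    text = _MENTION_RE.sub(" \ue002 ", text)  # 2. @mentions
    text = _DIGIT_RE.sub(" ", text)           # 4. numbers
    # 5. punctuation: any character in a Unicode "P" category
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    text = _DIACRITICS_RE.sub("", text)       # 6. diacritics
    for sentinel, token in _SENTINELS.items():
        text = text.replace(sentinel, token)
    return " ".join(text.split())             # 7. whitespace
```

A typical call would be `classifier(normalise(comment))`, reusing the `classifier` pipeline from the Direct Use example.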

## Evaluation Results

Evaluated on the validation split of [`thejosango/nuha-dataset`](https://huggingface.co/datasets/thejosango/nuha-dataset) (methodology configuration):

| Metric | Value |
|---|---|
| F1 (macro) | 0.5363 |
| Precision | 0.6660 |
| Recall | 0.5188 |

See [`nuha-multiclass`](https://huggingface.co/thejosango/nuha-multiclass) for full training details and evaluation discussion.
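
For readers unfamiliar with macro averaging, the sketch below shows how such scores are typically computed: precision, recall and F1 are calculated per class and then averaged with equal weight. It assumes precision and recall in the table are also macro-averaged, which the card does not state. The `macro_scores` helper is hypothetical and is not the project's evaluation code.

```python
from collections import Counter

# Label names from the model card.
LABELS = ["Not Online Violence", "Offensive Language", "Gender Based Violence"]


def macro_scores(y_true, y_pred, labels=LABELS):
    """Return (precision, recall, f1), each averaged equally over the labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Because F1 is averaged per class, macro F1 is generally not the harmonic mean of macro precision and macro recall. This is consistent with the reported F1 being lower than the harmonic mean of the reported precision and recall.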

---

*This model was developed as part of an initial pilot study. Performance metrics reflect the complexity of the task and the proof-of-concept nature of this system.*