	---
	language:
	- ar
	license: apache-2.0
	base_model: thejosango/nuha-mlm
	tags:
	- bert
	- text-classification
	- hate-speech
	- gender-based-violence
	- arabic
	- multiclass-classification
	- onnx
	- pilot
	datasets:
	- thejosango/nuha-dataset
	metrics:
	- f1
	- precision
	- recall
	model-index:
	- name: nuha
	  results:
	  - task:
	      type: text-classification
	      name: Text Classification
	    dataset:
	      name: Jordanian NUHA Dataset
	      type: thejosango/nuha-dataset
	      config: methodology
	      split: validation
	    metrics:
	    - type: f1
	      value: 0.5363
	      name: F1
	    - type: precision
	      value: 0.6660
	      name: Precision
	    - type: recall
	      value: 0.5188
	      name: Recall
	---

	# nuha

	## Model Summary

	`nuha` is a lightweight, ONNX-optimised Arabic text classifier that categorises Jordanian social media comments into three classes based on the NUHA methodology for online gender-based violence (OGBV). It is fine-tuned from [`nuha-mlm`](https://huggingface.co/thejosango/nuha-mlm), a domain-adapted Arabic BERT, using a reduced 4-layer architecture for efficient CPU inference, and is exported to ONNX. It shares its classification task and labels with [`nuha-multiclass`](https://huggingface.co/thejosango/nuha-multiclass) but is optimised for production deployment, and it is the model that powers the NUHA analysis platform.

	| Label | Meaning |
	|---|---|
	| `Not Online Violence` | Comments that are not hate speech |
	| `Offensive Language` | Hate speech characterised by irony or sarcasm |
	| `Gender Based Violence` | Direct hate speech targeting gender (the primary focus of NUHA) |

	This model was developed as part of a pilot proof-of-concept for the NUHA project by the [Jordan Open Source Association (JOSA)](https://josa.ngo).

	For the full-depth (12-layer) version of this classifier, see [`nuha-multiclass`](https://huggingface.co/thejosango/nuha-multiclass).

	## Uses

	### Direct Use

	```python
	from optimum.onnxruntime import ORTModelForSequenceClassification
	from transformers import AutoTokenizer, pipeline

	model = ORTModelForSequenceClassification.from_pretrained("thejosango/nuha")
	tokenizer = AutoTokenizer.from_pretrained("thejosango/nuha")
	classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

	result = classifier("اخرسي يا غبية")
	print(result)
	# [{'label': 'Gender Based Violence', 'score': ...}]
	```

	For batch inference:

	```python
	comments = ["يعطيكم العافية", "أنتِ ساحرة", "اخرسي يا غبية"]
	results = classifier(comments)
	for comment, result in zip(comments, results):
	    print(f"{result['label']} ({result['score']:.2f}): {comment}")
	```

	### Using the PyTorch Version

	If you need the full PyTorch model (for fine-tuning or non-ONNX inference), use [`nuha-multiclass`](https://huggingface.co/thejosango/nuha-multiclass) directly.

	### Out-of-Scope Use

	- Other Arabic dialects: The model was trained primarily on Jordanian Arabic. Performance on Egyptian, Gulf, or Modern Standard Arabic is not validated.
	- Other hate speech targets: NUHA is calibrated for online gender-based violence. It is not designed to detect hate speech targeting race, religion, or other demographics.
	- High-stakes automated decisions: Given the moderate performance (macro F1 ≈ 0.54) and the pilot nature of this work, the model should not be the sole decision-maker in content moderation systems; its predictions should be subject to human review.
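
One way to keep a human in the loop is to auto-accept only confident `Not Online Violence` predictions and queue everything else for review. The sketch below is illustrative only: the `route` helper, the `REVIEW_THRESHOLD` value, and the routing policy itself are assumptions, not part of the NUHA platform. It operates on the dictionaries returned by the `text-classification` pipeline.

```python
# Hypothetical cut-off; not a value published in this model card.
REVIEW_THRESHOLD = 0.80


def route(prediction, threshold=REVIEW_THRESHOLD):
    """Decide what to do with one pipeline prediction.

    `prediction` is a dict with 'label' and 'score' keys, as returned by
    the text-classification pipeline. Only confident benign predictions
    are passed automatically; flagged or uncertain ones go to a reviewer.
    """
    label = prediction["label"]
    score = prediction["score"]
    if label == "Not Online Violence" and score >= threshold:
        return "auto-pass"
    return "human-review"
```

Under this policy the model never takes action against a comment on its own; it only decides which comments a reviewer needs to see. Any threshold would need to be tuned on held-out data.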

	## Preprocessing

	At inference time, apply the following normalisation to input text before passing it to the model:

	1. Replace URLs with the `[رابط]` token
	2. Replace @mentions with the `[مستخدم]` token
	3. Replace email addresses with the `[بريد]` token
	4. Remove numbers
	5. Remove punctuation
	6. Remove Arabic diacritics (harakat)
	7. Normalise whitespace
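
The steps above can be sketched in Python as follows. This is a minimal approximation, not the project's actual preprocessing code: the card does not publish its regular expressions, so every pattern, the punctuation set, and the `normalise` and `_clean` names are assumptions. The sketch protects the placeholder tokens while stripping punctuation, on the assumption that their square brackets should survive step 5.

```python
import re
import string

# Placeholder tokens taken from the list above.
URL_TOKEN, MENTION_TOKEN, EMAIL_TOKEN = "[رابط]", "[مستخدم]", "[بريد]"

# Assumed patterns; the real pipeline may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# '@' must not follow a word character or dot, so emails are left for EMAIL_RE.
MENTION_RE = re.compile(r"(?<![\w.])@\w+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# In Python 3, \d also matches Arabic-Indic digits.
NUMBER_RE = re.compile(r"\d+")
# Tanwin, short vowels, shadda, sukun, and superscript alef.
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")
# ASCII punctuation plus a few common Arabic and typographic marks.
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "،؛؟«»…“”‘’")
PLACEHOLDER_RE = re.compile(
    "(" + "|".join(re.escape(t) for t in (URL_TOKEN, MENTION_TOKEN, EMAIL_TOKEN)) + ")"
)


def _clean(fragment):
    """Steps 4-6: drop numbers, punctuation, and diacritics."""
    fragment = NUMBER_RE.sub("", fragment)
    fragment = fragment.translate(PUNCT_TABLE)
    return DIACRITICS_RE.sub("", fragment)


def normalise(text):
    """Apply the seven normalisation steps in the listed order."""
    text = URL_RE.sub(URL_TOKEN, text)          # step 1
    text = MENTION_RE.sub(MENTION_TOKEN, text)  # step 2
    text = EMAIL_RE.sub(EMAIL_TOKEN, text)      # step 3
    # Split around placeholders (odd indices) so their brackets are kept.
    parts = PLACEHOLDER_RE.split(text)
    text = "".join(p if i % 2 else _clean(p) for i, p in enumerate(parts))
    return " ".join(text.split())               # step 7
```

With this helper, inputs would be passed through `normalise` before classification, for example `classifier(normalise(comment))`. If the training pipeline stripped the brackets from the placeholder tokens, the splitting step should be dropped so that inference matches training.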

	## Evaluation Results

	Evaluated on the validation split of [`thejosango/nuha-dataset`](https://huggingface.co/datasets/thejosango/nuha-dataset) (methodology configuration):

	| Metric | Value |
	|---|---|
	| F1 (macro) | 0.5363 |
	| Precision | 0.6660 |
	| Recall | 0.5188 |
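
For readers unfamiliar with macro averaging, the sketch below shows how such scores are typically computed: each metric is calculated per class and then averaged with equal weight across the three labels. It assumes precision and recall are macro-averaged too, which the table states only for F1. The `macro_scores` helper and the toy labels are illustrative and are not drawn from the NUHA dataset.

```python
from collections import Counter

LABELS = ["Not Online Violence", "Offensive Language", "Gender Based Violence"]


def macro_scores(y_true, y_pred, labels=LABELS):
    """Return macro-averaged precision, recall, and F1 over `labels`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    def ratio(num, den):
        # Treat undefined ratios (empty denominator) as 0.0.
        return num / den if den else 0.0

    per_class = []
    for label in labels:
        p = ratio(tp[label], tp[label] + fp[label])
        r = ratio(tp[label], tp[label] + fn[label])
        per_class.append((p, r, ratio(2 * p * r, p + r)))

    n = len(labels)
    precision, recall, f1 = (sum(col) / n for col in zip(*per_class))
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example (made-up labels, for illustration only).
N, O, G = LABELS
scores = macro_scores([N, N, O, O, G, G], [N, N, N, O, G, N])
```

Because macro F1 is the mean of per-class F1 scores, it generally differs from the harmonic mean of macro precision and macro recall. If the table's precision and recall are indeed macro-averaged, that would explain why 0.5363 is lower than the harmonic mean of 0.6660 and 0.5188 (about 0.583).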

	See [`nuha-multiclass`](https://huggingface.co/thejosango/nuha-multiclass) for full training details and evaluation discussion.

	---

	This model was developed as part of an initial pilot study. Performance metrics reflect the complexity of the task and the proof-of-concept nature of this system.