	---
	license: apache-2.0
	datasets:
	- mopatik/setswana-offensive-977
	language:
	- tn
	metrics:
	- accuracy
	- f1
	- matthews_correlation
	- recall
	base_model:
	- Davlan/afro-xlmr-base
	pipeline_tag: text-classification
	---

	# Afro-XLM-R Fine-Tuned for Setswana Offensive Language Detection

	## 1. Model Summary
	This repository contains a fine-tuned version of Afro-XLM-R, a multilingual transformer model optimised for African languages.
	The model has been fine-tuned to classify Setswana text into:

	- 0 – Non-offensive
	- 1 – Offensive

	Afro-XLM-R provides a multilingual baseline to benchmark performance against monolingual Setswana models such as PuoBERTa.
	Its cross-lingual capabilities make it particularly useful when dealing with:
	- Code-switching
	- Multilingual social media content
	- Words borrowed between English and Setswana

	---

	## 2. Intended Use

	### Primary Use Cases
	- Detection of offensive, abusive, or harmful expressions in Setswana text.
	- Digital forensic analysis of Facebook, WhatsApp, and other social media content.
	- Research in low-resource NLP for African languages.
	- Benchmarking multilingual vs monolingual transformer performance.

	### Not Intended For
	- Fully automated decision systems without human oversight.
	- Legal conclusions or disciplinary outcomes without expert forensic interpretation.
	- Non-Setswana text unless validated.

	---

	## 3. Dataset Description

	A curated dataset of 977 Setswana social media text samples was used.

	### Class Distribution
	- Offensive: 477
	- Non-offensive: 500

	### Annotation Notes
	- Offensive content includes insults, cyberbullying, hate speech, threats, and abusive slang.
	- Semantic triggers were used during training for improved sensitivity to Setswana insult constructions.
	- The test split is tag-free to reflect real-world forensic environments.

	### Ethical Handling
	- All posts were sourced from publicly available content.
	- Identifiable information was removed.
	- This dataset is not automatically redistributed as part of the model.

	---

	## 4. Training Procedure

	### Model Architecture
	- Base model: Afro-XLM-R
	- Backbone: XLM-RoBERTa
	- Pretraining data: multilingual, African-centric corpus
	- ~270M parameters (depending on variant)

	### Training Hyperparameters
	- Epochs: 10
	- Batch size: 16 (training), 64 (evaluation)
	- Optimizer: AdamW
	- Learning rate: 1e-5
	- Weight decay: 0.01
	- Loss function: class-weighted cross entropy
	- Weights = `[1.0, 2.0]` (non-offensive, offensive)
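
To make the effect of the class weights concrete, the following is a minimal pure-Python sketch of class-weighted cross entropy. It assumes the weighted-mean reduction that PyTorch's `torch.nn.CrossEntropyLoss(weight=...)` applies by default; the training script itself is not shown in this card, so the helper names and the example logits are hypothetical.

```python
import math

# Class weights as listed above: index 0 = non-offensive, index 1 = offensive.
CLASS_WEIGHTS = [1.0, 2.0]


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of per-sample negative log-likelihoods.

    Each sample's loss is scaled by the weight of its true class, and the
    sum is divided by the total weight (assumed reduction, see lead-in).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, label in zip(batch_logits, labels):
        prob_true_class = softmax(logits)[label]
        weighted_sum += weights[label] * -math.log(prob_true_class)
        weight_total += weights[label]
    return weighted_sum / weight_total


# Hypothetical batch: the model leans "non-offensive" for both samples,
# but the second sample is actually offensive (a false negative).
batch_logits = [[2.0, 0.0], [2.0, 0.0]]
labels = [0, 1]

unweighted = weighted_cross_entropy(batch_logits, labels, weights=[1.0, 1.0])
weighted = weighted_cross_entropy(batch_logits, labels)

print(f"unweighted loss: {unweighted:.4f}")
print(f"weighted loss:   {weighted:.4f}")
```

With these weights a missed offensive sample contributes twice as much to the loss as a misclassified non-offensive one, which pushes training toward higher recall on the offensive class.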

	### Hardware
	- Trained using Google Colab GPU (T4/A100 depending on session).

	---

	## 5. Evaluation Methodology

	The data were split and used as follows:

	- 80% training
	- 20% held-out test set
	- 5-fold stratified cross-validation used during model selection.
	- No semantic triggers or augmentations present in the test set.
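
The split described above could be reproduced along the following lines. This is a standard-library sketch, not the authors' code: the seed, the rounding of the test size (rounded up here), and the function names are assumptions, and the card does not say whether the hold-out split itself was stratified, so only the cross-validation folds are stratified in this example.

```python
import math
import random
from collections import defaultdict


def holdout_split(n_samples, test_fraction=0.2, seed=42):
    """Shuffle indices and hold out a fraction of them as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)  # assumption: round up
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def stratified_folds(indices, labels, k=5, seed=42):
    """Deal each class's indices round-robin into k folds.

    Every fold ends up with roughly the same class proportions as the
    full training set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        for position, idx in enumerate(class_indices):
            folds[position % k].append(idx)
    return folds


# Label vector matching the class distribution in this card:
# 477 offensive (1) and 500 non-offensive (0) samples.
labels = [1] * 477 + [0] * 500

train_idx, test_idx = holdout_split(len(labels))
folds = stratified_folds(train_idx, labels)

print(f"train: {len(train_idx)}, test: {len(test_idx)}")
for fold_number, fold in enumerate(folds, start=1):
    n_offensive = sum(labels[i] for i in fold)
    print(f"fold {fold_number}: {len(fold)} samples, {n_offensive} offensive")
```

During model selection, each fold would serve once as the validation set while the other four are used for training; the held-out test indices are never touched until the final evaluation.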

	Evaluation uses the following metrics:

	- Accuracy
	- Macro F1
	- Recall for offensive class
	- Matthews Correlation Coefficient (MCC)
	- ROC-AUC
	- Inference runtime and throughput (samples and steps per second)
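
Of the metrics listed, ROC-AUC is the least intuitive: it is the probability that a randomly chosen offensive sample receives a higher offensive-class score than a randomly chosen non-offensive one, with ties counted as half. A minimal sketch of that definition follows; the labels and scores are made-up illustration values, not outputs of this model.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores.

    Quadratic in the number of samples, which is fine for small test sets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical offensive-class probabilities for six samples.
example_labels = [1, 1, 1, 0, 0, 0]
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]

auc = roc_auc(example_labels, example_scores)
print(f"ROC-AUC: {auc:.4f}")
```

Because ROC-AUC depends only on the ranking of scores, it is unaffected by the decision threshold, unlike accuracy, recall, and F1.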

	---

	## 6. Test Set Results (Final Model)

	| Metric | Value |
	|--------|--------|
	| Accuracy | 0.8622 |
	| Macro F1-score | 0.8603 |
	| Recall (Offensive = 1) | 0.8111 |
	| MCC | 0.7229 |
	| ROC-AUC | 0.9015 |
	| Loss | 0.3895 |
	| Runtime (seconds) | 1.1634 |
	| Samples per second | 168.468 |
	| Steps per second | 3.438 |
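
The reported figures can be cross-checked against each other with a few lines of arithmetic. The sketch below first recovers the test-set size from runtime and throughput, then recomputes the headline metrics from a confusion matrix. The confusion-matrix counts are an inference made for this example: they are one set of counts that reproduces the reported values, and they are not published in this card.

```python
import math

# Figures copied from the results table.
runtime_s = 1.1634
samples_per_s = 168.468
steps_per_s = 3.438
eval_batch_size = 64  # evaluation batch size from the training section

# Test-set size and number of evaluation batches implied by the throughput.
n_test = round(runtime_s * samples_per_s)
n_steps = round(runtime_s * steps_per_s)

# ASSUMPTION: inferred counts for the offensive class (positive = 1).
# They are consistent with the reported metrics but are not reported data.
tp, fn, fp, tn = 73, 17, 10, 96

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall_offensive = tp / (tp + fn)
f1_offensive = 2 * tp / (2 * tp + fp + fn)
f1_non_offensive = 2 * tn / (2 * tn + fp + fn)
macro_f1 = (f1_offensive + f1_non_offensive) / 2
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"test samples: {n_test}, evaluation steps: {n_steps}")
print(f"accuracy: {accuracy:.4f}")
print(f"macro F1: {macro_f1:.4f}")
print(f"recall (offensive): {recall_offensive:.4f}")
print(f"MCC: {mcc:.4f}")
```

Under these assumed counts, the model would miss 17 of 90 offensive test samples and flag 10 of 106 non-offensive ones, which gives a concrete sense of what the recall and MCC values mean in practice.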

	### Interpretation
	- The ROC-AUC of 0.90 demonstrates strong separation between offensive and non-offensive classes.
	- MCC = 0.7229 indicates strong classification reliability in mildly imbalanced data.
	- Recall(1) = 0.8111 means the model captures most harmful/offensive cases, which is useful for forensic workflows where false negatives are costly.
	- Inference is slightly slower than with PuoBERTa, owing to the larger model size and multilingual embedding space.

	Overall, Afro-XLM-R performs strongly as a multilingual baseline for Setswana offensive-language detection.

	---

	## 7. How to Use the Model

	### Python Inference Example
	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	model_name = "mopatik/Afro-XLM-R-offensive-detection-v1"

	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Ensure model is in evaluation mode
	model.eval()

	# Sample text (replace with your actual text)
	#sample_text = "o seso tota" # (you are insanely stupid) Example Setswana text
	sample_text = "modimo a le segofatse" # (God bless you all) Example Setswana text

	# Tokenize and prepare input
	inputs = tokenizer(
	    sample_text,
	    padding='max_length',
	    truncation=True,
	    max_length=128,
	    return_tensors="pt"
	)

	# Make prediction (no gradients needed at inference time)
	with torch.no_grad():
	    outputs = model(**inputs)
	    probs = torch.softmax(outputs.logits, dim=1)
	    predicted_class = torch.argmax(probs, dim=1).item()

	# Get class label and confidence
	class_names = ["Non-offensive", "Offensive"]
	confidence = probs[0][predicted_class].item()

	print(f"Text: {sample_text}")
	print(f"Predicted class: {class_names[predicted_class]} (confidence: {confidence:.2%})")
	print(f"Class probabilities: {dict(zip(class_names, [f'{p:.2%}' for p in probs[0].tolist()]))}")
	```