---
license: apache-2.0
datasets:
- mopatik/setswana-offensive-977
language:
- tn
metrics:
- accuracy
- f1
- matthews_correlation
- recall
base_model:
- Davlan/afro-xlmr-base
pipeline_tag: text-classification
---

# Afro-XLM-R Fine-Tuned for Setswana Offensive Language Detection

## 1. Model Summary
This repository contains a fine-tuned version of **Afro-XLM-R**, a multilingual transformer model optimised for African languages.
The model has been fine-tuned to classify Setswana text into two classes:

- **0 – Non-offensive**
- **1 – Offensive**

Afro-XLM-R provides a multilingual baseline for benchmarking against monolingual Setswana models such as PuoBERTa.
Its cross-lingual capabilities make it particularly useful when handling:
- Code-switching
- Multilingual social media content
- English words borrowed into Setswana text

---

## 2. Intended Use

### **Primary Use Cases**
- Detection of offensive, abusive, or harmful expressions in Setswana text.
- Digital forensic analysis of Facebook, WhatsApp, and other social media content.
- Research in low-resource NLP for African languages.
- Benchmarking multilingual vs. monolingual transformer performance.

### **Not Intended For**
- Fully automated decision systems without human oversight.
- Legal conclusions or disciplinary outcomes without expert forensic interpretation.
- Non-Setswana text, unless validated for that language.

---

## 3. Dataset Description

A curated dataset of **977 Setswana social media text samples** was used.

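The dataset is published on the Hugging Face Hub under the ID listed in the metadata above. The snippet below is a minimal loading sketch; the split name and column layout are assumptions and should be checked against the dataset card for `mopatik/setswana-offensive-977`.

```python
from datasets import load_dataset

# Dataset ID taken from the model card metadata.
ds = load_dataset("mopatik/setswana-offensive-977")

print(ds)                  # inspect which splits are available
example = ds["train"][0]   # assumes a "train" split exists
print(example)             # expected: a text field plus a binary label (0/1)
```
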
### **Class Distribution**
- **Offensive:** 477
- **Non-offensive:** 500

### **Annotation Notes**
- Offensive content includes insults, cyberbullying, hate speech, threats, and abusive slang.
- Semantic triggers were used during training for improved sensitivity to Setswana insult constructions.
- The test split is **tag-free** to reflect real-world forensic environments.

### **Ethical Handling**
- All posts were sourced from publicly available content.
- Identifiable information was removed.
- This dataset is **not automatically redistributed** as part of the model.

---

## 4. Training Procedure

### **Model Architecture**
- Base model: **Afro-XLM-R**
- Backbone: XLM-RoBERTa
- Pretrained on a multilingual, African-centric corpus
- ~270M parameters (depending on variant)

### **Training Hyperparameters**
- Epochs: **10**
- Batch size: **16 (training), 64 (evaluation)**
- Optimizer: **AdamW**
- Learning rate: **1e-5**
- Weight decay: **0.01**
- Loss function: **class-weighted cross-entropy**
- Class weights: `[1.0, 2.0]` (non-offensive, offensive); see the sketch below

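Since the class-weighted loss is the main customisation, the sketch below shows one way such a loss can be plugged into the Hugging Face `Trainer`, using the hyperparameters listed above. This is an illustration of the technique, not the original training script; the output path and variable names are placeholders.

```python
import torch
from torch import nn
from transformers import Trainer, TrainingArguments

class WeightedLossTrainer(Trainer):
    """Trainer variant that applies class-weighted cross-entropy (weights [1.0, 2.0])."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = nn.CrossEntropyLoss(
            weight=torch.tensor([1.0, 2.0], device=logits.device)
        )
        loss = loss_fct(logits.view(-1, 2), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

# Hyperparameters mirroring the list above; AdamW is the Trainer default optimizer.
training_args = TrainingArguments(
    output_dir="afro-xlmr-setswana-offensive",  # illustrative output path
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=1e-5,
    weight_decay=0.01,
)

# trainer = WeightedLossTrainer(model=model, args=training_args,
#                               train_dataset=train_ds, eval_dataset=eval_ds)
```
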
### **Hardware**
- Trained on a Google Colab GPU (T4 or A100, depending on session).

---

## 5. Evaluation Methodology

The dataset was split as follows (a split sketch is shown after the list):

- **80% training**
- **20% held-out test set**
- 5-fold stratified cross-validation was used during model selection.
- No semantic triggers or augmentations are present in the test set.

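For illustration, an 80/20 stratified hold-out split and 5-fold stratified cross-validation can be reproduced with scikit-learn roughly as below. The split name and column names (`text`, `label`) are assumptions, and this is not the original experiment script.

```python
from datasets import load_dataset
from sklearn.model_selection import StratifiedKFold, train_test_split

# Column and split names are assumptions; check the dataset card.
ds = load_dataset("mopatik/setswana-offensive-977", split="train")
texts, labels = ds["text"], ds["label"]

# 80/20 stratified hold-out split.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# 5-fold stratified cross-validation over the training portion (model selection).
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(train_texts, train_labels)):
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} validation")
```
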
Evaluation uses the following metrics (a `compute_metrics` sketch follows the list):

- Accuracy
- Macro F1
- Recall for the offensive class
- Matthews Correlation Coefficient (MCC)
- ROC-AUC
- Runtime speed

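A `compute_metrics` function covering these metrics (except runtime, which the `Trainer` reports on its own) could look like the sketch below. It is illustrative rather than the exact evaluation code behind the reported numbers.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             recall_score, roc_auc_score)

def compute_metrics(eval_pred):
    """Accuracy, macro F1, offensive-class recall, MCC, and ROC-AUC."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Stable softmax to get the probability of the offensive class (index 1).
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
        "recall_offensive": recall_score(labels, preds, pos_label=1),
        "mcc": matthews_corrcoef(labels, preds),
        "roc_auc": roc_auc_score(labels, probs[:, 1]),
    }
```
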
---

## 6. Test Set Results (Final Model)

| Metric | Value |
|--------|--------|
| **Accuracy** | 0.8622 |
| **Macro F1-score** | 0.8603 |
| **Recall (Offensive = 1)** | 0.8111 |
| **MCC** | 0.7229 |
| **ROC-AUC** | 0.9015 |
| **Loss** | 0.3895 |
| **Runtime (seconds)** | 1.1634 |
| **Samples per second** | 168.468 |
| **Steps per second** | 3.438 |

### Interpretation
- The **ROC-AUC of 0.90** demonstrates strong separation between the offensive and non-offensive classes.
- **MCC = 0.7229** indicates reliable classification on mildly imbalanced data.
- **Recall (offensive) = 0.8111** means the model captures most harmful/offensive cases, which matters in forensic workflows where false negatives are costly.
- Inference is slightly slower than with PuoBERTa, due to the larger model size and multilingual embedding space.

Overall, Afro-XLM-R performs strongly as a multilingual baseline for Setswana offensive-language detection.

---

## 7. How to Use the Model

### **Python Inference Example**
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "mopatik/Afro-XLM-R-offensive-detection-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "O seso tota"
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=1)

print("Probabilities:", probs)
print("Predicted class:", torch.argmax(probs).item())  # 0 = non-offensive, 1 = offensive
```
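
For quick experiments, the `pipeline` helper wraps the same steps in one call. Note that the returned label strings depend on the checkpoint's `id2label` mapping, so they may appear as generic `LABEL_0`/`LABEL_1`; the printed output below is illustrative only.

```python
from transformers import pipeline

# Text-classification pipeline wrapping the same checkpoint.
classifier = pipeline(
    "text-classification",
    model="mopatik/Afro-XLM-R-offensive-detection-v1",
)

print(classifier("O seso tota"))
# e.g. [{'label': 'LABEL_1', 'score': 0.87}]  (illustrative output)
```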