---
license: apache-2.0
datasets:
- mopatik/setswana-offensive-977
language:
- tn
metrics:
- accuracy
- f1
- matthews_correlation
- recall
base_model:
- Davlan/afro-xlmr-base
pipeline_tag: text-classification
---

# Afro-XLM-R Fine-Tuned for Setswana Offensive Language Detection

## 1. Model Summary
This repository contains a fine-tuned version of **Afro-XLM-R**, a multilingual transformer model optimised for African languages.
The model has been fine-tuned to classify Setswana text into:

- **0 – Non-offensive**
- **1 – Offensive**

Afro-XLM-R serves as a multilingual baseline for benchmarking against monolingual Setswana models such as PuoBERTa.
Its cross-lingual capabilities make it particularly useful when dealing with:
- Code-switching
- Multilingual social media content
- Borrowed words from English/Setswana

---

## 2. Intended Use

### **Primary Use Cases**
- Detection of offensive, abusive, or harmful expressions in Setswana text.
- Digital forensic analysis of Facebook, WhatsApp, and other social media content.
- Research in low-resource NLP for African languages.
- Benchmarking multilingual vs monolingual transformer performance.

### **Not Intended For**
- Fully automated decision systems without human oversight.
- Legal conclusions or disciplinary outcomes without expert forensic interpretation.
- Non-Setswana text unless validated.

---

## 3. Dataset Description

A curated dataset of **977 Setswana social media text samples** was used.

### **Class Distribution**
- **Offensive:** 477
- **Non-offensive:** 500

### **Annotation Notes**
- Offensive content includes insults, cyberbullying, hate speech, threats, and abusive slang.
- Semantic triggers were used during training for improved sensitivity to Setswana insult constructions.
- The test split is **tag-free** to reflect real-world forensic environments.

### **Ethical Handling**
- All posts were sourced from publicly available content.
- Identifiable information was removed.
- This dataset is **not automatically redistributed** as part of the model.

---

## 4. Training Procedure

### **Model Architecture**
- Base model: **Afro-XLM-R**
- Backbone: XLM-RoBERTa
- Multilingual African-centric pretraining dataset
- ~270M parameters (depending on variant)

### **Training Hyperparameters**
- Epochs: **10**
- Batch size: **16 (training), 64 (evaluation)**
- Optimizer: **AdamW**
- Learning rate: **1e-5**
- Weight decay: **0.01**
- Loss function: **class-weighted cross entropy**
  - Weights = `[1.0, 2.0]` (non-offensive, offensive)
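
To make the class weighting concrete, the sketch below computes class-weighted cross entropy by hand for a tiny batch, using the `[1.0, 2.0]` weights listed above. The training code itself is not part of this card, so this is an illustration only: it assumes the weighted-mean reduction that `torch.nn.CrossEntropyLoss(weight=...)` applies by default (weighted per-sample losses divided by the sum of the target weights), and the logits are made-up values.

```python
import math

CLASS_WEIGHTS = [1.0, 2.0]  # (non-offensive, offensive), as listed above


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """Weighted-mean cross entropy over a batch of logit rows and integer labels."""
    weighted_losses = []
    target_weights = []
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp: subtract the largest logit first.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        nll = log_norm - row[target]  # negative log-likelihood of the true class
        weighted_losses.append(weights[target] * nll)
        target_weights.append(weights[target])
    # Normalise by the total target weight rather than by the batch size.
    return sum(weighted_losses) / sum(target_weights)


# Hypothetical logits for three posts; labels: 0 = non-offensive, 1 = offensive.
example_logits = [[2.0, -1.0], [0.5, 0.5], [-1.0, 1.5]]
example_targets = [0, 1, 1]
loss = weighted_cross_entropy(example_logits, example_targets)
print(f"weighted loss: {loss:.4f}")
```

With these weights, an error on an offensive example contributes twice as much to the loss as the same error on a non-offensive one, which pushes the model towards higher recall on the offensive class.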

### **Hardware**
- Trained using Google Colab GPU (T4/A100 depending on session).

---

## 5. Evaluation Methodology

The dataset is split as follows:

- **80% training**
- **20% held-out test set**
- 5-fold stratified cross-validation used during model selection.
- No semantic triggers or augmentations present in the test set.
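
The splitting code is not included in this card. As a minimal sketch of a stratified 80/20 hold-out, the example below uses only the standard library; the function name `stratified_holdout`, the seed, and the placeholder labels are assumptions for illustration. In practice a library routine such as scikit-learn's `train_test_split(..., stratify=labels)` is the more usual choice.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # hypothetical seed, fixed for reproducibility
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Placeholder labels matching the class counts in this card:
# 500 non-offensive (0) and 477 offensive (1).
labels = [0] * 500 + [1] * 477
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))
```

Because this sketch rounds per class, its test-set size can differ by a sample or so from what a library implementation produces.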

Evaluation uses the following metrics:

- Accuracy
- Macro F1
- Recall for offensive class
- Matthews Correlation Coefficient (MCC)
- ROC-AUC
- Runtime speed
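
For reference, the label-based metrics in this list can all be derived from the four confusion-matrix counts. The sketch below is a plain-Python illustration on made-up labels, not the evaluation code used for this card; ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, macro F1, offensive-class recall and MCC for labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def f1(hit, false_alarm, miss):
        denom = 2 * hit + false_alarm + miss
        return 2 * hit / denom if denom else 0.0

    f1_offensive = f1(tp, fp, fn)
    f1_non_offensive = f1(tn, fn, fp)  # roles swap when class 0 is the target
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "macro_f1": (f1_offensive + f1_non_offensive) / 2,
        "recall_offensive": tp / (tp + fn) if (tp + fn) else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }


# Made-up labels for illustration only (1 = offensive, 0 = non-offensive).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```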

---

## 6. Test Set Results (Final Model)

| Metric | Value |
|--------|--------|
| **Accuracy** | 0.8622 |
| **Macro F1-score** | 0.8603 |
| **Recall (Offensive = 1)** | 0.8111 |
| **MCC** | 0.7229 |
| **ROC-AUC** | 0.9015 |
| **Loss** | 0.3895 |
| **Runtime (seconds)** | 1.1634 |
| **Samples per second** | 168.468 |
| **Steps per second** | 3.438 |
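
The three runtime figures in the table are mutually consistent with the evaluation batch size of 64 given in Section 4. The quick check below shows the arithmetic; the card does not state the test-set size directly, so rounding to whole samples and whole batches is an assumption.

```python
import math

runtime_s = 1.1634        # from the results table
samples_per_s = 168.468   # from the results table
steps_per_s = 3.438       # from the results table
eval_batch_size = 64      # from the training hyperparameters

# Runtime multiplied by throughput recovers the number of evaluated samples.
n_samples = round(runtime_s * samples_per_s)
# One step is one batch, so round the batch count up to cover a partial last batch.
n_steps = math.ceil(n_samples / eval_batch_size)

print(n_samples)                      # 196, roughly 20% of the 977 samples
print(n_steps)                        # 4
print(round(n_steps / runtime_s, 3))  # 3.438, matching the reported steps per second
```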

### Interpretation
- The **ROC-AUC of 0.90** demonstrates strong separation between offensive and non-offensive classes.
- **MCC = 0.7229** indicates strong classification reliability in mildly imbalanced data.
- **Recall(1) = 0.8111** means the model captures most harmful/offensive cases, which is useful for forensic workflows where false negatives are costly.
- Inference is slightly slower than with PuoBERTa because of the larger model size and multilingual embedding space.

Overall, Afro-XLM-R performs strongly as a multilingual baseline for Setswana offensive-language detection.

---

## 7. How to Use the Model

### **Python Inference Example**
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "mopatik/Afro-XLM-R-offensive-detection-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Ensure model is in evaluation mode
model.eval()

# Sample text (replace with your actual text)
# sample_text = "o seso tota"  # (you are insanely stupid) Example Setswana text
sample_text = "modimo a le segofatse"  # (God bless you all) Example Setswana text

# Tokenize and prepare input
inputs = tokenizer(
    sample_text,
    padding='max_length',
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Make prediction (no gradients needed at inference time)
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    predicted_class = torch.argmax(probs, dim=1).item()

# Get class label and confidence
class_names = ["Non-offensive", "Offensive"]
confidence = probs[0][predicted_class].item()

print(f"Text: {sample_text}")
print(f"Predicted class: {class_names[predicted_class]} (confidence: {confidence:.2%})")
print(f"Class probabilities: {dict(zip(class_names, [f'{p:.2%}' for p in probs[0].tolist()]))}")