---
license: apache-2.0
datasets:
- mopatik/setswana-offensive-977
language:
- tn
metrics:
- accuracy
- f1
- matthews_correlation
- recall
base_model:
- Davlan/afro-xlmr-base
pipeline_tag: text-classification
---
# Afro-XLM-R Fine-Tuned for Setswana Offensive Language Detection
## 1. Model Summary
This repository contains a fine-tuned version of **Afro-XLM-R**, a multilingual transformer model optimised for African languages.
The model has been fine-tuned to classify Setswana text into:
- **0 – Non-offensive**
- **1 – Offensive**
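A minimal sketch of this label mapping in code (the dictionary names are illustrative, matching the class indices above):

```python
# Label mapping used throughout this card (index -> human-readable name)
id2label = {0: "Non-offensive", 1: "Offensive"}
label2id = {name: idx for idx, name in id2label.items()}
```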
Afro-XLM-R provides a multilingual baseline to benchmark performance against monolingual Setswana models such as PuoBERTa.
Its cross-lingual capabilities make it particularly useful when dealing with:
- Code-switching
- Multilingual social media content
- English loanwords mixed into Setswana text
---
## 2. Intended Use
### **Primary Use Cases**
- Detection of offensive, abusive, or harmful expressions in Setswana text.
- Digital forensic analysis of Facebook, WhatsApp, and other social media content.
- Research in low-resource NLP for African languages.
- Benchmarking multilingual vs monolingual transformer performance.
### **Not Intended For**
- Fully automated decision systems without human oversight.
- Legal conclusions or disciplinary outcomes without expert forensic interpretation.
- Non-Setswana text unless validated.
---
## 3. Dataset Description
A curated dataset of **977 Setswana social media text samples** was used.
### **Class Distribution**
- **Offensive:** 477
- **Non-offensive:** 500
### **Annotation Notes**
- Offensive content includes insults, cyberbullying, hate speech, threats, and abusive slang.
- Semantic triggers were used during training for improved sensitivity to Setswana insult constructions.
- The test split is **tag-free** to reflect real-world forensic environments.
### **Ethical Handling**
- All posts were sourced from publicly available content.
- Identifiable information was removed.
- This dataset is **not automatically redistributed** as part of the model.
---
## 4. Training Procedure
### **Model Architecture**
- Base model: **Afro-XLM-R**
- Backbone: XLM-RoBERTa
- Multilingual African-centric pretraining dataset
- ~270M parameters (depending on variant)
### **Training Hyperparameters**
- Epochs: **10**
- Batch size: **16 (training), 64 (evaluation)**
- Optimizer: **AdamW**
- Learning rate: **1e-5**
- Weight decay: **0.01**
- Loss function: **class-weighted cross entropy**
- Weights = `[1.0, 2.0]` (non-offensive, offensive)
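The class-weighted cross entropy above can be sketched in plain PyTorch; the weight tensor mirrors the `[1.0, 2.0]` values listed, while the logits and labels are purely illustrative:

```python
import torch
import torch.nn as nn

# Class weights from the training setup: [non-offensive, offensive] = [1.0, 2.0]
# The offensive class is weighted 2x to penalise missed offensive samples more heavily.
class_weights = torch.tensor([1.0, 2.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# Dummy logits for a batch of 2 examples (2 classes each) and their gold labels
logits = torch.tensor([[2.0, 0.5], [0.2, 1.5]])
labels = torch.tensor([0, 1])

loss = loss_fn(logits, labels)
```

With `weight` set, PyTorch computes a weighted mean over the batch, so errors on the offensive class contribute proportionally more to the gradient.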
### **Hardware**
- Trained using Google Colab GPU (T4/A100 depending on session).
---
## 5. Evaluation Methodology
The dataset was split as follows:
- **80% training**
- **20% held-out test set**
- 5-fold stratified cross-validation used during model selection.
- No semantic triggers or augmentations present in the test set.
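The split and cross-validation scheme above can be sketched with scikit-learn; the toy texts and labels below are stand-ins, not the actual dataset:

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy stand-in for the 977-sample dataset: texts and binary labels
texts = [f"sample {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# 80/20 stratified split, preserving the class ratio in the held-out test set
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# 5-fold stratified CV over the training portion for model selection
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(train_texts, train_labels)):
    pass  # fine-tune and validate on each fold here
```

Stratification matters here because the offensive/non-offensive ratio (477/500) should be preserved in every fold and in the test set.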
Evaluation uses the following metrics:
- Accuracy
- Macro F1
- Recall for offensive class
- Matthews Correlation Coefficient (MCC)
- ROC-AUC
- Runtime speed
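These metrics map directly onto scikit-learn functions; a minimal sketch with made-up predictions (`y_score` is the model's probability for the offensive class):

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             recall_score, roc_auc_score)

# Illustrative labels and predictions -- not from the actual test set
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.2, 0.9, 0.8, 0.4, 0.3, 0.7, 0.2]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "macro_f1": f1_score(y_true, y_pred, average="macro"),
    "recall_offensive": recall_score(y_true, y_pred, pos_label=1),
    "mcc": matthews_corrcoef(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),  # needs scores, not hard labels
}
```

Note that ROC-AUC is computed from the predicted probabilities, while the other metrics use the hard class predictions.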
---
## 6. Test Set Results (Final Model)
| Metric | Value |
|--------|--------|
| **Accuracy** | 0.8622 |
| **Macro F1-score** | 0.8603 |
| **Recall (Offensive = 1)** | 0.8111 |
| **MCC** | 0.7229 |
| **ROC-AUC** | 0.9015 |
| **Loss** | 0.3895 |
| **Runtime (seconds)** | 1.1634 |
| **Samples per second** | 168.468 |
| **Steps per second** | 3.438 |
### Interpretation
- The **ROC-AUC of 0.90** demonstrates strong separation between offensive and non-offensive classes.
- **MCC = 0.7229** indicates strong classification reliability in mildly imbalanced data.
- **Recall(1) = 0.8111** means the model captures most harmful/offensive cases — useful for forensic workflows where false negatives are costly.
- Inference is slightly slower than PuoBERTa's due to the larger model size and multilingual embedding space.
Overall, Afro-XLM-R performs strongly as a multilingual baseline for Setswana offensive-language detection.
---
## 7. How to Use the Model
### **Python Inference Example**
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "mopatik/Afro-XLM-R-offensive-detection-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Ensure model is in evaluation mode
model.eval()
# Sample text (replace with your actual text)
#sample_text = "o seso tota" # (you are insanely stupid) Example Setswana text
sample_text = "modimo a le segofatse" # (God bless you all) Example Setswana text
# Tokenize and prepare input
inputs = tokenizer(
sample_text,
padding='max_length',
truncation=True,
max_length=128,
return_tensors="pt"
)
# Make prediction
with torch.no_grad():
outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=1)
predicted_class = torch.argmax(probs, dim=1).item()
# Get class label and confidence
class_names = ["Non-offensive", "Offensive"]
confidence = probs[0][predicted_class].item()
print(f"Text: {sample_text}")
print(f"Predicted class: {class_names[predicted_class]} (confidence: {confidence:.2%})")
print(f"Class probabilities: {dict(zip(class_names, [f'{p:.2%}' for p in probs[0].tolist()]))}")
```