---
license: apache-2.0
datasets:
- mopatik/setswana-offensive-977
language:
- tn
metrics:
- accuracy
- f1
- matthews_correlation
- recall
base_model:
- Davlan/afro-xlmr-base
pipeline_tag: text-classification
---

# Afro-XLM-R Fine-Tuned for Setswana Offensive Language Detection

## 1. Model Summary
This repository contains a fine-tuned version of **Afro-XLM-R**, a multilingual transformer model optimised for African languages.  
The model has been fine-tuned to classify Setswana text into:

- **0 – Non-offensive**
- **1 – Offensive**

Afro-XLM-R provides a multilingual baseline to benchmark performance against monolingual Setswana models such as PuoBERTa.  
Its cross-lingual capabilities make it particularly useful when dealing with:  
- Code-switching  
- Multilingual social media content  
- Words borrowed between English and Setswana  

---

## 2. Intended Use

### **Primary Use Cases**
- Detection of offensive, abusive, or harmful expressions in Setswana text.
- Digital forensic analysis of Facebook, WhatsApp, and other social media content.
- Research in low-resource NLP for African languages.
- Benchmarking multilingual vs monolingual transformer performance.

### **Not Intended For**
- Fully automated decision systems without human oversight.
- Legal conclusions or disciplinary outcomes without expert forensic interpretation.
- Non-Setswana text, unless the model has first been validated on that language.

---

## 3. Dataset Description

A curated dataset of **977 Setswana social media text samples** was used.

### **Class Distribution**
- **Offensive:** 477  
- **Non-offensive:** 500  

### **Annotation Notes**
- Offensive content includes insults, cyberbullying, hate speech, threats, and abusive slang.
- Semantic triggers were used during training for improved sensitivity to Setswana insult constructions.
- The test split is **tag-free** to reflect real-world forensic environments.

### **Ethical Handling**
- All posts were sourced from publicly available content.
- Identifiable information was removed.
- This dataset is **not automatically redistributed** as part of the model.

---

## 4. Training Procedure

### **Model Architecture**
- Base model: **Afro-XLM-R**  
- Backbone: XLM-RoBERTa  
- Multilingual African-centric pretraining dataset  
- ~270M parameters (depending on variant)

### **Training Hyperparameters**
- Epochs: **10**  
- Batch size: **16 (training), 64 (evaluation)**  
- Optimizer: **AdamW**  
- Learning rate: **1e-5**  
- Weight decay: **0.01**  
- Loss function: **class-weighted cross entropy**  
  - Weights = `[1.0, 2.0]` (non-offensive, offensive)
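
The class-weighted loss listed above can be illustrated with a small, dependency-free sketch. It is not the training code used for this model, which the card does not include. The function names and the example logits are hypothetical; only the weights `[1.0, 2.0]` come from the list above. The weighted-mean reduction shown here matches the default behaviour of `torch.nn.CrossEntropyLoss` when a `weight` tensor is supplied.

```python
import math

# (non-offensive, offensive), as listed in the hyperparameters above
CLASS_WEIGHTS = [1.0, 2.0]


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, targets, weights=CLASS_WEIGHTS):
    """Weighted mean of per-sample negative log-likelihoods."""
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, target in zip(batch_logits, targets):
        prob_of_target = softmax(logits)[target]
        weighted_sum += weights[target] * -math.log(prob_of_target)
        weight_total += weights[target]
    return weighted_sum / weight_total


# Hypothetical logits: both samples are scored as "non-offensive",
# but the second one is actually offensive (a false negative).
batch_logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]

print("weighted:  ", weighted_cross_entropy(batch_logits, targets))
print("unweighted:", weighted_cross_entropy(batch_logits, targets, [1.0, 1.0]))
```

Because the missed offensive sample carries twice the weight, the weighted loss is higher than the unweighted one, which pushes training towards fewer false negatives on the offensive class.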

### **Hardware**
- Trained using Google Colab GPU (T4/A100 depending on session).

---

## 5. Evaluation Methodology

The data is split and validated as follows:

- **80% training**  
- **20% held-out test set**  
- 5-fold stratified cross-validation used during model selection.  
- No semantic triggers or augmentations present in the test set.
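
As a rough illustration of what stratified 5-fold assignment means, the sketch below deals the indices of each class round-robin into folds so that every fold keeps a similar class ratio. It is an assumption-laden example, not the authors' pipeline: `stratified_kfold` is a hypothetical helper, the labels are placeholders built from the class counts in Section 3, and the card does not say which portion of the data the cross-validation was run on.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=42):
    """Assign every sample index to one of k folds with similar class ratios."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Placeholder labels using the class counts from Section 3 (1 = offensive).
labels = [1] * 477 + [0] * 500
folds = stratified_kfold(labels, k=5)

for fold_id, held_out in enumerate(folds):
    n_offensive = sum(labels[i] for i in held_out)
    print(f"fold {fold_id}: {len(held_out)} samples, {n_offensive} offensive")
```

In each round, one fold serves as the validation set and the remaining four are used for training. In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used instead of a hand-written helper.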

Evaluation uses the following metrics:

- Accuracy  
- Macro F1  
- Recall for offensive class  
- Matthews Correlation Coefficient (MCC)  
- ROC-AUC  
- Runtime speed  
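
Most of the metrics above follow directly from confusion-matrix counts, as the sketch below shows. The function name `binary_metrics` is hypothetical, and the counts are an assumption: the card does not publish a confusion matrix, and these values were chosen only because they produce results that round to the accuracy, macro F1, recall, and MCC reported in Section 6. ROC-AUC is left out because it needs per-sample scores rather than counts.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Compute count-based metrics, treating class 1 (offensive) as positive."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall_offensive = tp / (tp + fn)

    # Per-class F1, then the unweighted (macro) average of the two.
    f1_offensive = 2 * tp / (2 * tp + fp + fn)
    f1_non_offensive = 2 * tn / (2 * tn + fp + fn)
    macro_f1 = (f1_offensive + f1_non_offensive) / 2

    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "macro_f1": macro_f1,
        "recall_offensive": recall_offensive,
        "mcc": mcc,
    }


# Assumed counts for illustration only (not reported by the authors).
metrics = binary_metrics(tp=73, fn=17, fp=10, tn=96)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Macro F1 averages the two per-class F1 scores with equal weight, so it is not dominated by the larger class, while MCC uses all four cells of the confusion matrix.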

---

## 6. Test Set Results (Final Model)

| Metric | Value |
|--------|--------|
| **Accuracy** | 0.8622 |
| **Macro F1-score** | 0.8603 |
| **Recall (Offensive = 1)** | 0.8111 |
| **MCC** | 0.7229 |
| **ROC-AUC** | 0.9015 |
| **Loss** | 0.3895 |
| **Runtime (seconds)** | 1.1634 |
| **Samples per second** | 168.468 |
| **Steps per second** | 3.438 |

### Interpretation
- The **ROC-AUC of 0.90** demonstrates strong separation between offensive and non-offensive classes.  
- **MCC = 0.7229** indicates strong classification reliability in mildly imbalanced data.  
- **Recall(1) = 0.8111** means the model captures most harmful/offensive cases — useful for forensic workflows where false negatives are costly.  
- Inference is slightly slower than with PuoBERTa because of the larger model and its multilingual embedding space.

Overall, Afro-XLM-R performs strongly as a multilingual baseline for Setswana offensive-language detection.

---

## 7. How to Use the Model

### **Python Inference Example**
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "mopatik/Afro-XLM-R-offensive-detection-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Ensure model is in evaluation mode
model.eval()

# Sample text (replace with your own)
# sample_text = "o seso tota"  # offensive example: "you are insanely stupid"
sample_text = "modimo a le segofatse"  # non-offensive example: "God bless you all"

# Tokenize and prepare input
inputs = tokenizer(
    sample_text,
    padding='max_length',
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    predicted_class = torch.argmax(probs, dim=1).item()  # argmax over the class dimension

# Get class label and confidence
class_names = ["Non-offensive", "Offensive"]
confidence = probs[0][predicted_class].item()

print(f"Text: {sample_text}")
print(f"Predicted class: {class_names[predicted_class]} (confidence: {confidence:.2%})")
print(f"Class probabilities: {dict(zip(class_names, [f'{p:.2%}' for p in probs[0].tolist()]))}")
```