---
license: apache-2.0
datasets:
- mopatik/setswana-offensive-977
language:
- tn
metrics:
- accuracy
- f1
- matthews_correlation
- recall
base_model:
- Davlan/afro-xlmr-base
pipeline_tag: text-classification
---
# Afro-XLM-R Fine-Tuned for Setswana Offensive Language Detection
## 1. Model Summary
This repository contains a fine-tuned version of **Afro-XLM-R**, a multilingual transformer model optimised for African languages.
The model has been fine-tuned to classify Setswana text into:
- **0 – Non-offensive**
- **1 – Offensive**
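A minimal sketch of this label mapping in code (the dictionary names are illustrative, matching the class indices above):

```python
# Label mapping used throughout this card (index -> human-readable name)
id2label = {0: "Non-offensive", 1: "Offensive"}
label2id = {name: idx for idx, name in id2label.items()}
```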
Afro-XLM-R provides a multilingual baseline to benchmark performance against monolingual Setswana models such as PuoBERTa.
Its cross-lingual capabilities make it particularly useful when dealing with:
- Code-switching
- Multilingual social media content
- English loanwords mixed into Setswana text
---
## 2. Intended Use
### **Primary Use Cases**
- Detection of offensive, abusive, or harmful expressions in Setswana text.
- Digital forensic analysis of Facebook, WhatsApp, and other social media content.
- Research in low-resource NLP for African languages.
- Benchmarking multilingual vs monolingual transformer performance.
### **Not Intended For**
- Fully automated decision systems without human oversight.
- Legal conclusions or disciplinary outcomes without expert forensic interpretation.
- Non-Setswana text unless validated.
---
## 3. Dataset Description
A curated dataset of **977 Setswana social media text samples** was used.
### **Class Distribution**
- **Offensive:** 477
- **Non-offensive:** 500
### **Annotation Notes**
- Offensive content includes insults, cyberbullying, hate speech, threats, and abusive slang.
- Semantic triggers were used during training for improved sensitivity to Setswana insult constructions.
- The test split is **tag-free** to reflect real-world forensic environments.
### **Ethical Handling**
- All posts were sourced from publicly available content.
- Identifiable information was removed.
- This dataset is **not automatically redistributed** as part of the model.
---
## 4. Training Procedure
### **Model Architecture**
- Base model: **Afro-XLM-R**
- Backbone: XLM-RoBERTa
- Multilingual African-centric pretraining dataset
- ~270M parameters (depending on variant)
### **Training Hyperparameters**
- Epochs: **10**
- Batch size: **16 (training), 64 (evaluation)**
- Optimizer: **AdamW**
- Learning rate: **1e-5**
- Weight decay: **0.01**
- Loss function: **class-weighted cross entropy**
- Weights = `[1.0, 2.0]` (non-offensive, offensive)
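The class-weighted cross entropy above can be sketched in plain PyTorch; the weight tensor mirrors the `[1.0, 2.0]` values listed, while the logits and labels are purely illustrative:

```python
import torch
import torch.nn as nn

# Class weights from the training setup: [non-offensive, offensive] = [1.0, 2.0]
# The offensive class is weighted 2x to penalise missed offensive samples more heavily.
class_weights = torch.tensor([1.0, 2.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# Dummy logits for a batch of 2 examples (2 classes each) and their gold labels
logits = torch.tensor([[2.0, 0.5], [0.2, 1.5]])
labels = torch.tensor([0, 1])

loss = loss_fn(logits, labels)
```

With `weight` set, PyTorch computes a weighted mean over the batch, so errors on the offensive class contribute proportionally more to the gradient.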
### **Hardware**
- Trained using Google Colab GPU (T4/A100 depending on session).
---
## 5. Evaluation Methodology
The dataset was split as follows:
- **80% training**
- **20% held-out test set**
- 5-fold stratified cross-validation used during model selection.
- No semantic triggers or augmentations present in the test set.
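The split and cross-validation scheme above can be sketched with scikit-learn; the toy texts and labels below are stand-ins, not the actual dataset:

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy stand-in for the 977-sample dataset: texts and binary labels
texts = [f"sample {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# 80/20 stratified split, preserving the class ratio in the held-out test set
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# 5-fold stratified CV over the training portion for model selection
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(train_texts, train_labels)):
    pass  # fine-tune and validate on each fold here
```

Stratification matters here because the offensive/non-offensive ratio (477/500) should be preserved in every fold and in the test set.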
Evaluation uses the following metrics:
- Accuracy
- Macro F1
- Recall for offensive class
- Matthews Correlation Coefficient (MCC)
- ROC-AUC
- Runtime speed
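These metrics map directly onto scikit-learn functions; a minimal sketch with made-up predictions (`y_score` is the model's probability for the offensive class):

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             recall_score, roc_auc_score)

# Illustrative labels and predictions -- not from the actual test set
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.2, 0.9, 0.8, 0.4, 0.3, 0.7, 0.2]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "macro_f1": f1_score(y_true, y_pred, average="macro"),
    "recall_offensive": recall_score(y_true, y_pred, pos_label=1),
    "mcc": matthews_corrcoef(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),  # needs scores, not hard labels
}
```

Note that ROC-AUC is computed from the predicted probabilities, while the other metrics use the hard class predictions.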
---
## 6. Test Set Results (Final Model)
| Metric | Value |
|--------|--------|
| **Accuracy** | 0.8622 |
| **Macro F1-score** | 0.8603 |
| **Recall (Offensive = 1)** | 0.8111 |
| **MCC** | 0.7229 |
| **ROC-AUC** | 0.9015 |
| **Loss** | 0.3895 |
| **Runtime (seconds)** | 1.1634 |
| **Samples per second** | 168.468 |
| **Steps per second** | 3.438 |
### Interpretation
- The **ROC-AUC of 0.90** demonstrates strong separation between offensive and non-offensive classes.
- **MCC = 0.7229** indicates strong classification reliability in mildly imbalanced data.
- **Recall(1) = 0.8111** means the model captures most harmful/offensive cases — useful for forensic workflows where false negatives are costly.
- Inference is slightly slower than PuoBERTa's due to the larger model size and multilingual embedding space.
Overall, Afro-XLM-R performs strongly as a multilingual baseline for Setswana offensive-language detection.
---
## 7. How to Use the Model
### **Python Inference Example**
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "mopatik/Afro-XLM-R-offensive-detection-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Ensure model is in evaluation mode
model.eval()
# Sample text (replace with your actual text)
#sample_text = "o seso tota" # (you are insanely stupid) Example Setswana text
sample_text = "modimo a le segofatse" # (God bless you all) Example Setswana text
# Tokenize and prepare input
inputs = tokenizer(
sample_text,
padding='max_length',
truncation=True,
max_length=128,
return_tensors="pt"
)
# Make prediction
with torch.no_grad():
outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=1)
predicted_class = torch.argmax(probs, dim=1).item()
# Get class label and confidence
class_names = ["Non-offensive", "Offensive"]
confidence = probs[0][predicted_class].item()
print(f"Text: {sample_text}")
print(f"Predicted class: {class_names[predicted_class]} (confidence: {confidence:.2%})")
print(f"Class probabilities: {dict(zip(class_names, [f'{p:.2%}' for p in probs[0].tolist()]))}")
```