---
license: apache-2.0
datasets:
- mopatik/setswana-offensive-977
language:
- tn
metrics:
- accuracy
- f1
- matthews_correlation
- recall
base_model:
- Davlan/afro-xlmr-base
pipeline_tag: text-classification
---

# Afro-XLM-R Fine-Tuned for Setswana Offensive Language Detection

## 1. Model Summary

This repository contains a fine-tuned version of **Afro-XLM-R**, a multilingual transformer model optimised for African languages. The model has been fine-tuned to classify Setswana text into:

- **0 – Non-offensive**
- **1 – Offensive**

Afro-XLM-R provides a multilingual baseline for benchmarking performance against monolingual Setswana models such as PuoBERTa. Its cross-lingual capabilities make it particularly useful for:

- Code-switching
- Multilingual social media content
- English loanwords mixed into Setswana text

---

## 2. Intended Use

### **Primary Use Cases**

- Detection of offensive, abusive, or harmful expressions in Setswana text.
- Digital forensic analysis of Facebook, WhatsApp, and other social media content.
- Research in low-resource NLP for African languages.
- Benchmarking multilingual vs. monolingual transformer performance.

### **Not Intended For**

- Fully automated decision systems without human oversight.
- Legal conclusions or disciplinary outcomes without expert forensic interpretation.
- Non-Setswana text, unless separately validated.

---

## 3. Dataset Description

A curated dataset of **977 Setswana social media text samples** was used.

### **Class Distribution**

- **Offensive:** 477
- **Non-offensive:** 500

### **Annotation Notes**

- Offensive content includes insults, cyberbullying, hate speech, threats, and abusive slang.
- Semantic triggers were used during training to improve sensitivity to Setswana insult constructions.
- The test split is **tag-free** to reflect real-world forensic environments.

### **Ethical Handling**

- All posts were sourced from publicly available content.
- Identifiable information was removed.
- This dataset is **not automatically redistributed** as part of the model.

---

## 4. Training Procedure

### **Model Architecture**

- Base model: **Afro-XLM-R**
- Backbone: XLM-RoBERTa
- Multilingual, African-centric pretraining corpus
- ~270M parameters (depending on variant)

### **Training Hyperparameters**

- Epochs: **10**
- Batch size: **16 (training), 64 (evaluation)**
- Optimizer: **AdamW**
- Learning rate: **1e-5**
- Weight decay: **0.01**
- Loss function: **class-weighted cross-entropy**
  - Weights = `[1.0, 2.0]` (non-offensive, offensive)

### **Hardware**

- Trained on a Google Colab GPU (T4/A100, depending on session).

---

## 5. Evaluation Methodology

The dataset split follows:

- **80% training**
- **20% held-out test set**
- 5-fold stratified cross-validation was used during model selection.
- No semantic triggers or augmentations are present in the test set.

Evaluation uses the following metrics:

- Accuracy
- Macro F1
- Recall for the offensive class
- Matthews Correlation Coefficient (MCC)
- ROC-AUC
- Runtime speed

---

## 6. Test Set Results (Final Model)

| Metric | Value |
|--------|--------|
| **Accuracy** | 0.8622 |
| **Macro F1-score** | 0.8603 |
| **Recall (Offensive = 1)** | 0.8111 |
| **MCC** | 0.7229 |
| **ROC-AUC** | 0.9015 |
| **Loss** | 0.3895 |
| **Runtime (seconds)** | 1.1634 |
| **Samples per second** | 168.468 |
| **Steps per second** | 3.438 |

### Interpretation

- The **ROC-AUC of 0.90** demonstrates strong separation between the offensive and non-offensive classes.
- **MCC = 0.7229** indicates strong classification reliability on mildly imbalanced data.
- **Recall(1) = 0.8111** means the model captures most harmful/offensive cases, which matters in forensic workflows where false negatives are costly.
- Inference is slightly slower than PuoBERTa's, owing to the larger model size and multilingual embedding space.

Overall, Afro-XLM-R performs strongly as a multilingual baseline for Setswana offensive-language detection.

---

## 7. How to Use the Model

### **Python Inference Example**

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "mopatik/Afro-XLM-R-offensive-detection-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Put the model in evaluation mode
model.eval()

# Sample text (replace with your actual text)
# sample_text = "o seso tota"  # ("you are insanely stupid") example Setswana text
sample_text = "modimo a le segofatse"  # ("God bless you all") example Setswana text

# Tokenize and prepare input
inputs = tokenizer(
    sample_text,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    predicted_class = torch.argmax(probs).item()

# Get class label and confidence
class_names = ["Non-offensive", "Offensive"]
confidence = probs[0][predicted_class].item()

print(f"Text: {sample_text}")
print(f"Predicted class: {class_names[predicted_class]} (confidence: {confidence:.2%})")
print(f"Class probabilities: {dict(zip(class_names, [f'{p:.2%}' for p in probs[0].tolist()]))}")
```
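### **Checking the Reported Metrics**

The headline numbers in Section 6 (accuracy, recall on the offensive class, and MCC) can be reproduced from any set of binary predictions with a few lines of plain Python. The sketch below is illustrative only: the `y_true`/`y_pred` arrays are toy values, not the actual test-set predictions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 = offensive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion counts."""
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den else 0.0

# Toy labels for illustration only (0 = non-offensive, 1 = offensive)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall_offensive = tp / (tp + fn)

print(f"Accuracy: {accuracy:.4f}")            # 0.7500 on the toy labels
print(f"Recall (offensive): {recall_offensive:.4f}")  # 0.7500 on the toy labels
print(f"MCC: {mcc(tp, tn, fp, fn):.4f}")      # 0.5000 on the toy labels
```

The same counts also yield macro F1 if needed; `scikit-learn`'s `accuracy_score`, `f1_score`, `recall_score`, and `matthews_corrcoef` compute identical values on real prediction arrays.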