mopatik committed on
Commit d6d057c · verified
1 Parent(s): 6edf694

Create README.md

Files changed (1)
  1. README.md +153 -0
README.md ADDED
@@ -0,0 +1,153 @@
---
license: apache-2.0
datasets:
- mopatik/setswana-offensive-977
language:
- tn
metrics:
- accuracy
- f1
- matthews_correlation
- recall
base_model:
- Davlan/afro-xlmr-base
pipeline_tag: text-classification
---

# Afro-XLM-R Fine-Tuned for Setswana Offensive Language Detection

## 1. Model Summary
This repository contains a fine-tuned version of **Afro-XLM-R** (`Davlan/afro-xlmr-base`), a multilingual transformer model optimised for African languages.
The model classifies Setswana text into two classes:

- **0 – Non-offensive**
- **1 – Offensive**

Afro-XLM-R serves as a multilingual baseline against which monolingual Setswana models such as PuoBERTa can be benchmarked.
Its cross-lingual capabilities make it particularly useful when dealing with:
- Code-switching
- Multilingual social media content
- Words borrowed between English and Setswana

---

## 2. Intended Use

### **Primary Use Cases**
- Detection of offensive, abusive, or harmful expressions in Setswana text.
- Digital forensic analysis of Facebook, WhatsApp, and other social media content.
- Research in low-resource NLP for African languages.
- Benchmarking multilingual vs monolingual transformer performance.

### **Not Intended For**
- Fully automated decision systems without human oversight.
- Legal conclusions or disciplinary outcomes without expert forensic interpretation.
- Non-Setswana text unless validated.

---

## 3. Dataset Description

A curated dataset of **977 Setswana social media text samples** was used.

### **Class Distribution**
- **Offensive:** 477
- **Non-offensive:** 500

### **Annotation Notes**
- Offensive content includes insults, cyberbullying, hate speech, threats, and abusive slang.
- Semantic triggers were used during training for improved sensitivity to Setswana insult constructions.
- The test split is **tag-free** to reflect real-world forensic environments.

### **Ethical Handling**
- All posts were sourced from publicly available content.
- Identifiable information was removed.
- This dataset is **not automatically redistributed** as part of the model.

---

## 4. Training Procedure

### **Model Architecture**
- Base model: **Afro-XLM-R**
- Backbone: XLM-RoBERTa
- Pretraining data: multilingual, African-centric corpus
- ~270M parameters (depending on variant)

### **Training Hyperparameters**
- Epochs: **10**
- Batch size: **16 (training), 64 (evaluation)**
- Optimizer: **AdamW**
- Learning rate: **1e-5**
- Weight decay: **0.01**
- Loss function: **class-weighted cross entropy**
  - Weights = `[1.0, 2.0]` (non-offensive, offensive)
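
The effect of the class weights listed above can be illustrated with a small standard-library sketch. It assumes the weighted-mean convention used by PyTorch's `torch.nn.CrossEntropyLoss(weight=...)` with its default reduction (weighted sum divided by the total weight of the targets). The README does not say which implementation was used for training, and the helper name and toy logits here are hypothetical.

```python
import math

# Class weights from the hyperparameter list: index 0 = non-offensive, 1 = offensive
CLASS_WEIGHTS = [1.0, 2.0]


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """Weighted mean of per-sample cross-entropy losses.

    logits  : list of [logit_class0, logit_class1] pairs
    targets : list of gold labels (0 or 1)
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sample_logits, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the two class logits
        max_logit = max(sample_logits)
        log_norm = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in sample_logits)
        )
        nll = log_norm - sample_logits[target]  # negative log-likelihood
        weighted_sum += weights[target] * nll
        weight_total += weights[target]
    # Normalise by the total weight, not by the number of samples
    return weighted_sum / weight_total


# Toy batch (hypothetical logits): the model leans towards class 0 for both
# samples, so the offensive sample (target 1) is the one it gets wrong.
toy_logits = [[2.0, 0.0], [2.0, 0.0]]
toy_targets = [0, 1]

weighted_loss = weighted_cross_entropy(toy_logits, toy_targets)
unweighted_loss = weighted_cross_entropy(toy_logits, toy_targets, weights=[1.0, 1.0])
print(f"weighted: {weighted_loss:.4f}  unweighted: {unweighted_loss:.4f}")
```

With these weights, a missed offensive sample contributes twice as much to the loss as a missed non-offensive one, which pushes training towards higher recall on the offensive class.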

### **Hardware**
- Trained using Google Colab GPU (T4/A100 depending on session).

---

## 5. Evaluation Methodology

The dataset was split as follows:

- **80% training**
- **20% held-out test set**
- 5-fold stratified cross-validation was used during model selection.
- The test set contains no semantic triggers or augmentations.
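
The split described above can be sketched with the standard library alone. This is an illustration under assumptions, not the procedure actually used: the README does not name the tooling, whether the 80/20 split itself was stratified is assumed here, and the function names and seed are hypothetical. In practice a library such as scikit-learn would normally handle this.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(indices, labels, k=5, seed=42):
    """Deal each class's indices round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index in indices:
        by_class[labels[index]].append(index)

    folds = [[] for _ in range(k)]
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        for position, index in enumerate(class_indices):
            folds[position % k].append(index)
    return folds


# Labels matching the reported class counts: 500 non-offensive (0), 477 offensive (1)
labels = [0] * 500 + [1] * 477
train_idx, test_idx = stratified_split(labels)
folds = stratified_folds(train_idx, labels)  # cross-validation on the training part only
```

Each fold serves once as the validation set while the other four are used for training; the held-out test indices are never touched during model selection.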

Evaluation uses the following metrics:

- Accuracy
- Macro F1
- Recall for the offensive class
- Matthews Correlation Coefficient (MCC)
- ROC-AUC
- Inference runtime and throughput
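
For reference, the threshold-based metrics in this list can all be derived from the four confusion-matrix counts. The sketch below uses only the standard library; the function name and the toy labels are hypothetical, and ROC-AUC is left out because it needs predicted probabilities rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, macro F1, offensive-class recall and MCC for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    # Macro F1 averages the per-class F1 scores with equal weight
    f1_offensive = f1(tp, fp, fn)
    f1_non_offensive = f1(tn, fn, fp)  # class 0 treated as the positive class

    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "macro_f1": (f1_offensive + f1_non_offensive) / 2,
        "recall_offensive": tp / (tp + fn) if (tp + fn) else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }


# Toy example (hypothetical labels, not taken from the test set)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

MCC uses all four counts at once, which is why it is often reported alongside accuracy when the classes are not perfectly balanced.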

---

## 6. Test Set Results (Final Model)

| Metric | Value |
|--------|--------|
| **Accuracy** | 0.8622 |
| **Macro F1-score** | 0.8603 |
| **Recall (Offensive = 1)** | 0.8111 |
| **MCC** | 0.7229 |
| **ROC-AUC** | 0.9015 |
| **Loss** | 0.3895 |
| **Runtime (seconds)** | 1.1634 |
| **Samples per second** | 168.468 |
| **Steps per second** | 3.438 |

### Interpretation
- The **ROC-AUC of 0.90** demonstrates strong separation between offensive and non-offensive classes.
- **MCC = 0.7229** indicates strong classification reliability in mildly imbalanced data.
- **Recall(1) = 0.8111** means the model captures most harmful/offensive cases, which is useful for forensic workflows where false negatives are costly.
- Inference is slightly slower than with PuoBERTa because of the larger model size and multilingual embedding space.

Overall, Afro-XLM-R performs strongly as a multilingual baseline for Setswana offensive-language detection.
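
As a quick sanity check, the runtime rows of the results table can be tied back to the evaluation batch size. The arithmetic below uses only reported figures; the assumption that each evaluation "step" is one batch of up to 64 samples is mine and is not stated in the README.

```python
import math

# Figures reported in the results table
runtime_seconds = 1.1634
samples_per_second = 168.468
steps_per_second = 3.438

# Evaluation batch size from the training hyperparameters
eval_batch_size = 64

# Throughput multiplied by runtime gives the number of samples and steps
test_samples = round(runtime_seconds * samples_per_second)
eval_steps = round(runtime_seconds * steps_per_second)

# Number of batches needed if each step processes up to 64 samples
batches_needed = math.ceil(test_samples / eval_batch_size)
print(test_samples, eval_steps, batches_needed)
```

Under that assumption the figures are mutually consistent: about 196 test samples, roughly 20% of the 977-sample dataset, processed in 4 batches.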

---

## 7. How to Use the Model

### **Python Inference Example**
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "mopatik/Afro-XLM-R-offensive-detection-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "O seso tota"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Gradients are not needed at inference time
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=1)

# Class 0 = non-offensive, class 1 = offensive
print("Probabilities:", probs)
print("Predicted class:", torch.argmax(probs, dim=1).item())
```