---
language:
- tn
license: cc-by-4.0
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- offensive-language-detection
- setswana
- low-resource-nlp
- digital-forensics
- explainable-ai
- rationale-learning
- masked-rationale-prediction
- puoberta
- lime
- s-lime
metrics:
- accuracy
- f1
- matthews_correlation
- roc_auc
---

# PuoBERTa-MRP for Setswana Offensive Content Detection

## Model Summary

This repository contains **PuoBERTa-MRP**, a rationale-aware fine-tuned version of **PuoBERTa** for binary offensive-content detection in Setswana.

The model classifies Setswana text into:

| Label ID | Label |
|---:|---|
| 0 | Non-offensive |
| 1 | Offensive |

The model was developed for research on **low-resource African language NLP**, **digital forensic investigation**, and **explainable offensive-language detection**. The MRP version extends the standard PuoBERTa fine-tuning setup by incorporating **Masked Rationale Prediction (MRP)** as a rationale-aware training and evaluation strategy.

In this work, *rationales* refer to semantically important offensive spans or trigger expressions that contribute to the offensive classification decision. These spans are used during model development to study whether the classifier relies on linguistically meaningful cues rather than shallow lexical shortcuts.

---

## What is MRP?

**MRP** stands for **Masked Rationale Prediction**.

The purpose of the MRP setup is to test and improve the relationship between:

- sentence-level offensive classification,
- annotated semantic trigger spans,
- masked or neutralised rationale regions,
- and explanation faithfulness.

In the MRP setting, annotated offensive rationales are used to create controlled training or diagnostic variants in which key offensive spans may be masked, removed, or neutralised. This allows the researcher to examine whether the model:

1. depends only on explicit offensive tokens;
2. uses broader contextual patterns;
3. remains robust when rationale-bearing terms are masked;
4. produces explanations aligned with annotated semantic triggers.

This makes the model useful not only for classification, but also for **forensic explainability analysis**.

---

## Research Motivation

Offensive-language detection in Setswana presents challenges that are not fully addressed by ordinary sentence-level classification. Offensive meaning may be expressed through:

- culturally specific insults,
- idiomatic expressions,
- indirect accusations,
- threats,
- phishing-related cues,
- sarcasm,
- dehumanising metaphors,
- and code-switched or non-standard orthography.

In small low-resource datasets, a model may overfit to obvious abusive terms while failing to capture broader discourse structures. MRP is introduced to investigate whether rationale masking can reveal or reduce such dependency.

The central research question is:

> Can rationale-aware masking improve the interpretability and robustness of Setswana offensive-language detection while preserving useful classification performance?

---

## Intended Use

This model is intended for:

- Setswana offensive-language detection research;
- cyberbullying and harassment detection experiments;
- digital forensic triage support;
- explainable AI experiments;
- LIME and S-LIME attribution analysis;
- masked rationale and counterfactual evaluation;
- benchmarking rationale-aware transformer models for low-resource languages.

It may be useful in research workflows where the goal is to analyse both:

- **what the model predicts**, and
- **why the model predicts it**.

---

## Out-of-Scope Use

This model should **not** be used for:

- fully automated legal decision-making;
- disciplinary action without human review;
- automated criminal attribution;
- autonomous social media moderation;
- profiling individuals or communities;
- deployment on non-Setswana text without validation.

The model is intended to support research and forensic triage, not replace human interpretation.

---

## Dataset Description

The model is based on a manually curated Setswana offensive-language corpus containing offensive and non-offensive examples.

The dataset follows a simple CSV structure compatible with common offensive-language NLP datasets such as OLID and HateCheck:

```csv
TEXT,TARGET
```

Where:

| Column | Description |
|---|---|
| `TEXT` | Setswana sentence or comment |
| `TARGET` | Class label: `Offensive` or `Non-offensive` |
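
A minimal sketch of reading this layout with the standard library and mapping the `TARGET` strings to the label IDs used by the model. The helper name `load_rows` is hypothetical, and the two sample rows reuse sentences from this card with labels assumed for illustration only.

```python
import csv
import io

# TARGET strings mapped to the model's label IDs.
LABEL_TO_ID = {"Non-offensive": 0, "Offensive": 1}

def load_rows(handle):
    """Read TEXT,TARGET rows and return a list of (text, label_id) pairs."""
    rows = []
    for row in csv.DictReader(handle):
        target = row["TARGET"].strip()
        if target not in LABEL_TO_ID:
            raise ValueError(f"Unexpected TARGET value: {target!r}")
        rows.append((row["TEXT"].strip(), LABEL_TO_ID[target]))
    return rows

# Illustrative in-memory file; a real run would open the released CSV instead.
sample = io.StringIO(
    "TEXT,TARGET\n"
    "Ke dumela gore re tshwanetse go bua sentle.,Non-offensive\n"
    "O tshwanetse go tlogela boaka,Offensive\n"
)
rows = load_rows(sample)
print(rows)
```

Rejecting unknown `TARGET` values early makes label typos visible before training rather than silently dropping or mislabelling rows.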

The broader corpus contains approximately:

| Class | Count |
|---|---:|
| Non-offensive | 500 |
| Offensive | 477 |
| Total | 977 |

If using the public merged release, verify the exact row count in the dataset card and release notes, as sanitised or release-ready versions may differ slightly from the internal experimental corpus.

---

## Rationale and Trigger Annotation

During dataset preparation, semantically important offensive spans were annotated as rationales or trigger regions.

These rationales may include:

- direct insults;
- vulgar expressions;
- harassment phrases;
- threat expressions;
- phishing or scam cues;
- dehumanising metaphors;
- culturally grounded abusive expressions.

Example rationale-style annotation:

```text
O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>
```

For MRP experiments, such spans can be converted into masked variants, for example:

```text
O tshwanetse go tlogela <MASK>
```

or neutralised variants, depending on the experiment design.
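
The conversion between annotated, tag-free, and masked text can be sketched with a regular expression. This assumes the `<TRIGGER>...</TRIGGER>` markup shown above with no nested tags; the function names are hypothetical and not part of the released code.

```python
import re

# Matches one annotated span, e.g. <TRIGGER>boaka</TRIGGER>.
TRIGGER_RE = re.compile(r"<TRIGGER>(.*?)</TRIGGER>", flags=re.DOTALL)

def strip_tags(annotated):
    """Return tag-free text: the markup is removed, the span text is kept."""
    return TRIGGER_RE.sub(lambda m: m.group(1), annotated)

def mask_triggers(annotated, mask_token="<MASK>"):
    """Return the masked variant: each annotated span becomes mask_token."""
    return TRIGGER_RE.sub(lambda m: mask_token, annotated)

def extract_triggers(annotated):
    """Return the list of annotated span texts, in order of appearance."""
    return TRIGGER_RE.findall(annotated)

annotated = "O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>"
print(strip_tags(annotated))
print(mask_triggers(annotated))
print(extract_triggers(annotated))
```

Passing a different `mask_token` (for example the tokenizer's own mask token) produces the model-facing variant, while `strip_tags` gives the tag-free text used for final evaluation.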

---

## Evaluation Setting

A key principle of this work is that the model should be assessed under realistic conditions.

Therefore, final evaluation should be performed on:

- tag-free text,
- unmasked ordinary inputs,
- and a held-out test set not used during training or tuning.

This avoids giving the model artificial markup during deployment-like testing.

The evaluation protocol consists of:

- 80/20 train-test split;
- 5-fold stratified cross-validation on the training partition;
- final evaluation on the untouched holdout test set;
- tag-free inference during final testing;
- rationale-aware analysis through masking and counterfactual evaluation.
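
The split-then-cross-validate structure can be sketched without external dependencies. This is an illustration only: the seed, the per-class rounding, and the function names are assumptions, and in practice scikit-learn's `train_test_split` and `StratifiedKFold` are the usual tools for the same job.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

def stratified_folds(indices, labels, k=5, seed=42):
    """Yield (fit_idx, val_idx) pairs; each class is dealt round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        fit = sorted(i for g in range(k) if g != f for i in folds[g])
        yield fit, val

# Label vector with the class counts reported in the dataset table.
labels = [0] * 500 + [1] * 477
train_idx, test_idx = stratified_split(labels)
folds = list(stratified_folds(train_idx, labels))
print(len(train_idx), len(test_idx), len(folds))
```

The holdout indices are fixed before any fold is built, so cross-validation and tuning never see the test partition.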

---

## Model Architecture

| Component | Details |
|---|---|
| Base model | PuoBERTa |
| Architecture family | RoBERTa |
| Task | Sequence classification |
| Language | Setswana |
| ISO language code | `tn` |
| Number of labels | 2 |
| Framework | Hugging Face Transformers |
| Backend | PyTorch |

---

## Training Configuration

The model was fine-tuned using a transformer sequence-classification setup.

Typical configuration:

| Parameter | Value |
|---|---:|
| Maximum sequence length | 128 |
| Optimizer | AdamW |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Training batch size | 16 |
| Evaluation batch size | 64 |
| Loss function | Class-weighted cross-entropy |
| Class weights | `[1.0, 2.0]` |
| Model selection focus | Offensive-class recall |

The offensive class was assigned a higher loss weight to reduce the risk of missing harmful instances.
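
To make the effect of the `[1.0, 2.0]` weights concrete, the following dependency-free sketch computes a class-weighted cross-entropy by hand on toy logits. The normalisation by the sum of target weights matches what `torch.nn.CrossEntropyLoss(weight=...)` does under its default mean reduction; whether the training run used exactly that class is an assumption.

```python
import math

CLASS_WEIGHTS = [1.0, 2.0]  # [Non-offensive, Offensive], as in the table above

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(batch_logits, targets, weights=CLASS_WEIGHTS):
    """Weighted mean of per-example negative log-likelihoods."""
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, y in zip(batch_logits, targets):
        nll = -math.log(softmax(logits)[y])
        weighted_sum += weights[y] * nll
        weight_total += weights[y]
    return weighted_sum / weight_total

# Toy batch: both examples are scored as Non-offensive, but the second
# one is actually Offensive (a missed harmful instance).
batch_logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]

unweighted = weighted_cross_entropy(batch_logits, targets, weights=[1.0, 1.0])
weighted = weighted_cross_entropy(batch_logits, targets)
print(f"unweighted = {unweighted:.4f}")
print(f"weighted   = {weighted:.4f}")
```

The weighted loss is larger because the missed Offensive example contributes twice as much as the correctly classified Non-offensive one, which pushes optimisation towards higher offensive-class recall.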

---

## MRP-Specific Training / Analysis Workflow

The MRP workflow may include the following steps:

1. Train or fine-tune the classifier on labelled Setswana text.
2. Use annotated semantic rationales to identify offensive spans.
3. Create masked-rationale variants of selected samples.
4. Evaluate prediction changes after masking.
5. Compare original and masked predictions.
6. Use LIME or S-LIME to inspect whether top-attributed tokens align with annotated rationales.
7. Analyse flip and non-flip cases to determine whether the model depends on explicit offensive tokens or broader contextual templates.

This workflow supports both predictive evaluation and forensic interpretability.

---

## Test Set Results

Insert the final MRP test-set metrics below once confirmed.

| Metric | Value |
|---|---:|
| Accuracy | 0.74 |
| Macro F1-score | 0.74 |
| Recall: Offensive class | 0.81 |
| MCC | `TO_BE_ADDED` |
| ROC-AUC | `TO_BE_ADDED` |
| Loss | 1.820457 |

Example format:

```text
accuracy = 0.xxxx
macro_f1 = 0.xxxx
recall_1 = 0.xxxx
mcc = 0.xxxx
roc_auc = 0.xxxx
```

Do not reuse metrics from the standard PuoBERTa or train-time trigger model unless they are from the exact MRP run.
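
For reference, most of the metrics in this format can be derived from the confusion counts alone. The sketch below uses toy labels that are not results from this model; the function name is hypothetical, and ROC-AUC is omitted because it needs predicted probabilities rather than hard labels.

```python
import math

def binary_metrics(y_true, y_pred):
    """Compute accuracy, macro F1, Offensive-class recall, and MCC."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    f1_offensive = f1(tp, fp, fn)
    f1_non_offensive = f1(tn, fn, fp)  # class 0 treated as the positive class
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "macro_f1": (f1_offensive + f1_non_offensive) / 2,
        "recall_1": tp / (tp + fn) if (tp + fn) else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }

# Toy labels for illustration only; NOT results from this model.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name} = {value:.4f}")
```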

---

## Explainability

This model is designed to support explainability experiments, especially:

- LIME;
- S-LIME;
- token-level attribution;
- masked-rationale comparison;
- counterfactual trigger neutralisation;
- rationale-alignment analysis.

In rationale-alignment analysis, the main question is whether the model’s most influential tokens overlap with human-annotated offensive rationales.

For example, if a human-annotated rationale is:

```text
<TRIGGER>o sematla</TRIGGER>
```

then a faithful explanation should assign strong attribution to the same phrase or semantically related parts of the sentence.
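
One simple way to quantify this overlap is to compare the top-k positively attributed tokens with the annotated rationale tokens. The sketch below assumes attributions arrive as `(token, weight)` pairs, which is the shape LIME's `as_list()` output takes; the weights shown are hypothetical, and the choice of k is an assumption.

```python
def top_k_tokens(attributions, k):
    """Return the k tokens with the largest positive (Offensive-supporting) weight."""
    positive = [(tok, w) for tok, w in attributions if w > 0]
    ranked = sorted(positive, key=lambda pair: pair[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

def rationale_overlap(attributions, rationale_tokens, k=2):
    """Return (precision, recall) of the top-k tokens against the rationale."""
    top = set(top_k_tokens(attributions, k))
    gold = set(rationale_tokens)
    hits = top & gold
    precision = len(hits) / len(top) if top else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical attribution weights for "O tshwanetse go tlogela boaka".
attributions = [
    ("O", 0.02),
    ("tshwanetse", 0.05),
    ("go", -0.01),
    ("tlogela", 0.10),
    ("boaka", 0.61),
]
precision, recall = rationale_overlap(attributions, ["boaka"], k=2)
print(f"precision@2 = {precision:.2f}, recall = {recall:.2f}")
```

High recall with low precision suggests the explanation covers the rationale but also spreads weight onto surrounding context; low recall suggests the model is relying on something other than the annotated span.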

---

## Interpreting Attribution Scores

For LIME and S-LIME outputs:

- Positive attribution scores support the **Offensive** class.
- Negative attribution scores support the **Non-offensive** class.
- Stable attributions across random seeds indicate more reliable explanations.
- Large changes after rationale masking may indicate strong dependence on the masked phrase.
- Non-flip cases may indicate that surrounding context still carries offensive meaning.

MRP is therefore useful for distinguishing between:

- lexical reliance,
- contextual reasoning,
- and potentially spurious shortcut learning.
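
Stability across random seeds can be summarised as the mean pairwise Jaccard similarity between the top-k token sets produced by each explanation run. This is one possible stability measure, not necessarily the one used in this work, and the token lists below are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def top_k_stability(runs):
    """Mean pairwise Jaccard similarity between top-k token sets, one per seed."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical top-2 tokens from three explanation runs with different seeds.
runs = [
    ["boaka", "tlogela"],
    ["boaka", "tlogela"],
    ["boaka", "go"],
]
stability = top_k_stability(runs)
print(f"stability = {stability:.4f}")
```

A value of 1.0 means every seed selected the same top-k tokens; values well below 1.0 indicate that individual explanations should be read with caution.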

---

## Counterfactual and Masking Analysis

The MRP model can be evaluated using counterfactual edits such as:

| Original Type | Counterfactual Operation |
|---|---|
| Offensive rationale present | Mask offensive span |
| Offensive rationale present | Replace with neutral paraphrase |
| Offensive rationale present | Remove trigger span |
| Context preserved | Re-evaluate prediction |

A prediction flip from Offensive to Non-offensive may suggest that the model relied strongly on the rationale span.

A non-flip may suggest that offensive meaning is also encoded in the surrounding context, such as accusatory templates or threat-like phrasing.
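
Flip and non-flip cases can be tallied from paired predictions on the original and masked inputs. The sketch below uses hypothetical prediction lists and a hypothetical helper name; it counts only samples that were predicted Offensive before masking.

```python
def flip_summary(original_preds, masked_preds):
    """Summarise Offensive (1) -> Non-offensive (0) flips after masking."""
    # Keep only samples predicted Offensive on the original text.
    pairs = [(o, m) for o, m in zip(original_preds, masked_preds) if o == 1]
    flips = sum(1 for _, m in pairs if m == 0)
    return {
        "offensive_before": len(pairs),
        "flips": flips,
        "non_flips": len(pairs) - flips,
        "flip_rate": flips / len(pairs) if pairs else 0.0,
    }

# Hypothetical predictions: index i is the same sample before and after masking.
original_preds = [1, 1, 1, 0, 1]
masked_preds = [0, 1, 0, 0, 0]

summary = flip_summary(original_preds, masked_preds)
print(summary)
```

The non-flip samples are the ones worth inspecting manually, since they are where surrounding context may still carry the offensive meaning.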

---

## How to Use the Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YOUR-USERNAME/YOUR-PUOBERTA-MRP-MODEL"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Ke dumela gore re tshwanetse go bua sentle."

inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128
)

with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
pred = torch.argmax(probs, dim=-1).item()

label_map = {
    0: "Non-offensive",
    1: "Offensive"
}

print("Prediction:", label_map[pred])
print("Probabilities:", probs.tolist())
```

---

## Optional: Masked Rationale Diagnostic Example

The following is a diagnostic workflow for research use only.

```python
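# Reuses `torch`, `tokenizer`, and `model` from the example above.
original_text = "O tshwanetse go tlogela boaka"
# Use the tokenizer's own mask token instead of hard-coding its spelling.
masked_text = f"O tshwanetse go tlogela {tokenizer.mask_token}"

texts = [original_text, masked_text]

# Tokenise both variants as one padded batch.
inputs = tokenizer(
    texts,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128
)

with torch.no_grad():
    outputs = model(**inputs)

# Row i holds [P(Non-offensive), P(Offensive)] for texts[i].
probs = torch.softmax(outputs.logits, dim=-1)

# A large drop in P(Offensive) after masking suggests reliance on the masked span.
for text, prob in zip(texts, probs):
    print(text)
    print(prob.tolist())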
```

Use this only if your tokenizer/model configuration supports the mask token appropriately.

---

## Limitations

The model has several limitations:

- The dataset is relatively small.
- The model is trained primarily for Setswana.
- It may be sensitive to spelling variation and informal orthography.
- It may struggle with sarcasm, irony, and implicit abuse.
- It may underperform on unseen slang or emerging online expressions.
- It performs binary classification only.
- It does not classify offensive subtypes such as hate speech, harassment, threat, or phishing separately.
- Rationale masking can help diagnosis, but it does not prove causal reasoning.

---

## Ethical Considerations

This model deals with offensive and potentially harmful language. It should be used carefully and only in appropriate research or forensic contexts.

Recommended safeguards:

- human-in-the-loop review;
- calibrated confidence thresholds;
- abstention for uncertain predictions;
- careful error analysis;
- avoidance of automated punitive action;
- compliance with data protection and cybercrime legislation;
- masking or sanitisation of offensive examples in public outputs.

The model should not be used as the sole basis for legal, disciplinary, or investigative conclusions.
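
The threshold and abstention safeguards can be combined into a simple triage rule. This is a minimal sketch: the 0.8 threshold is a hypothetical value that would need calibrating on held-out data, and the function name is illustrative.

```python
def triage(prob_offensive, threshold=0.8):
    """Map an Offensive-class probability to a decision, abstaining when uncertain."""
    if prob_offensive >= threshold:
        return "Offensive"
    if prob_offensive <= 1.0 - threshold:
        return "Non-offensive"
    # Neither class is confident enough: route to a human reviewer.
    return "Needs human review"

for p in (0.95, 0.55, 0.10):
    print(p, triage(p))
```

Even confident outputs remain triage signals under this rule; they prioritise material for a human analyst rather than replace the analyst's judgement.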

---

## Bias and Fairness Considerations

Potential sources of bias include:

- sampling bias from public social media content;
- underrepresentation of dialectal variants;
- limited coverage of emerging slang;
- ambiguity in culturally specific phrases;
- and label uncertainty in sarcastic or metaphorical cases.

Users should validate the model on their own target domain before applying it in practical settings.

---

## Reproducibility

Related reproducibility resources may include:

- training notebooks;
- MRP experiment notebooks;
- LIME/S-LIME explainability notebooks;
- scripts for generating tables and figures;
- sanitised output files;
- dataset card;
- model card;
- Zenodo release.

Associated GitHub repository:

```text
https://github.com/bkekgathetse/setswana-offensive-977
```

Associated Hugging Face dataset:

```text
ADD_DATASET_LINK_HERE
```

Associated Zenodo release:

```text
ADD_ZENODO_DOI_HERE
```

---

## Recommended Citation

```bibtex
@misc{kekgathetse2025puoberta_mrp,
  title={PuoBERTa-MRP for Setswana Offensive Content Detection},
  author={Kekgathetse, Bernerdict},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/YOUR-USERNAME/YOUR-PUOBERTA-MRP-MODEL}}
}
```

If this model is linked to a manuscript, cite the corresponding paper as well:

```bibtex
@article{kekgathetse2025setswanaoffensive,
  title={Developing Monolingual Setswana Datasets for Offensive Content Detection},
  author={Kekgathetse, Bernerdict},
  journal={To be updated},
  year={2025}
}
```

---

## License

Please refer to the license specified in this repository.

Recommended licensing structure:

- Code: MIT or Apache-2.0
- Documentation: CC-BY 4.0
- Dataset access: governed separately due to ethical considerations

---

## Contact

For academic queries, reproducibility questions, or collaboration requests, please refer to the associated GitHub repository or manuscript contact details.

---

## Model Card Notes

This model card describes the MRP version of the PuoBERTa offensive-content classifier. It should be updated with the exact final test metrics and repository links before public release.