---
language:
- tn
license: cc-by-4.0
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- offensive-language-detection
- setswana
- low-resource-nlp
- digital-forensics
- explainable-ai
- rationale-learning
- masked-rationale-prediction
- puoberta
- lime
- s-lime
metrics:
- accuracy
- f1
- matthews_correlation
- roc_auc
---

# PuoBERTa-MRP for Setswana Offensive Content Detection

## Model Summary

This repository contains **PuoBERTa-MRP**, a rationale-aware fine-tuned version of **PuoBERTa** for binary offensive-content detection in Setswana.

The model classifies Setswana text into:

| Label ID | Label |
|---:|---|
| 0 | Non-offensive |
| 1 | Offensive |

The model was developed for research on **low-resource African language NLP**, **digital forensic investigation**, and **explainable offensive-language detection**. The MRP version extends the standard PuoBERTa fine-tuning setup by incorporating **Masked Rationale Prediction (MRP)** as a rationale-aware training and evaluation strategy.

In this work, *rationales* refer to semantically important offensive spans or trigger expressions that contribute to the offensive classification decision. These spans are used during model development to study whether the classifier relies on linguistically meaningful cues rather than shallow lexical shortcuts.

---

## What is MRP?

**MRP** stands for **Masked Rationale Prediction**.

The purpose of the MRP setup is to test and improve the relationship between:

- sentence-level offensive classification,
- annotated semantic trigger spans,
- masked or neutralised rationale regions,
- and explanation faithfulness.

In the MRP setting, annotated offensive rationales are used to create controlled training or diagnostic variants in which key offensive spans may be masked, removed, or neutralised. This allows the researcher to examine whether the model:

1. depends only on explicit offensive tokens;
2. uses broader contextual patterns;
3. remains robust when rationale-bearing terms are masked;
4. produces explanations aligned with annotated semantic triggers.

This makes the model useful not only for classification, but also for **forensic explainability analysis**.

---

## Research Motivation

Offensive-language detection in Setswana presents challenges that are not fully addressed by ordinary sentence-level classification. Offensive meaning may be expressed through:

- culturally specific insults,
- idiomatic expressions,
- indirect accusations,
- threats,
- phishing-related cues,
- sarcasm,
- dehumanising metaphors,
- and code-switched or non-standard orthography.

In small low-resource datasets, a model may overfit to obvious abusive terms while failing to capture broader discourse structures. MRP is introduced to investigate whether rationale masking can reveal or reduce such dependency.

The central research question is:

> Can rationale-aware masking improve the interpretability and robustness of Setswana offensive-language detection while preserving useful classification performance?

---

## Intended Use

This model is intended for:

- Setswana offensive-language detection research;
- cyberbullying and harassment detection experiments;
- digital forensic triage support;
- explainable AI experiments;
- LIME and S-LIME attribution analysis;
- masked rationale and counterfactual evaluation;
- benchmarking rationale-aware transformer models for low-resource languages.

It may be useful in research workflows where the goal is to analyse both:

- **what the model predicts**, and
- **why the model predicts it**.

---

## Out-of-Scope Use

This model should **not** be used for:

- fully automated legal decision-making;
- disciplinary action without human review;
- automated criminal attribution;
- autonomous social media moderation;
- profiling individuals or communities;
- deployment on non-Setswana text without validation.

The model is intended to support research and forensic triage, not replace human interpretation.

---

## Dataset Description

The model is based on a manually curated Setswana offensive-language corpus containing offensive and non-offensive examples.

The dataset uses a simple two-column CSV layout (text plus label), in the style of offensive-language resources such as OLID and HateCheck:

```csv
TEXT,TARGET
```

Where:

| Column | Description |
|---|---|
| `TEXT` | Setswana sentence or comment |
| `TARGET` | Class label: `Offensive` or `Non-offensive` |
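As a minimal sketch of how this layout maps onto the label IDs from the Model Summary, the snippet below reads `TEXT,TARGET` rows with the standard library. The function name `load_examples` is hypothetical, and the two sample rows reuse sentences from this card with illustrative labels; they are not an excerpt from the released file.

```python
import csv
import io

# Label mapping taken from the table in the Model Summary section.
LABEL2ID = {"Non-offensive": 0, "Offensive": 1}


def load_examples(csv_file):
    """Read TEXT,TARGET rows and return (texts, label_ids)."""
    texts, labels = [], []
    for row in csv.DictReader(csv_file):
        target = row["TARGET"].strip()
        if target not in LABEL2ID:
            raise ValueError(f"Unexpected TARGET value: {target!r}")
        texts.append(row["TEXT"].strip())
        labels.append(LABEL2ID[target])
    return texts, labels


# Tiny in-memory stand-in for the real file (labels are illustrative).
sample = io.StringIO(
    "TEXT,TARGET\n"
    "Ke dumela gore re tshwanetse go bua sentle.,Non-offensive\n"
    "O tshwanetse go tlogela boaka,Offensive\n"
)
texts, labels = load_examples(sample)
print(labels)
```

With the real corpus, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) instead of the in-memory sample.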

The broader corpus contains approximately:

| Class | Count |
|---|---:|
| Non-offensive | 500 |
| Offensive | 477 |
| Total | 977 |

If using the public merged release, verify the exact row count in the dataset card and release notes, as sanitised or release-ready versions may differ slightly from the internal experimental corpus.

---

## Rationale and Trigger Annotation

During dataset preparation, semantically important offensive spans were annotated as rationales or trigger regions.

These rationales may include:

- direct insults;
- vulgar expressions;
- harassment phrases;
- threat expressions;
- phishing or scam cues;
- dehumanising metaphors;
- culturally grounded abusive expressions.

Example rationale-style annotation:

```text
O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>
```

For MRP experiments, such spans can be converted into masked variants, for example:

```text
O tshwanetse go tlogela <mask>
```

or neutralised variants, depending on the experiment design.
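The conversion from tagged annotations to tag-free and masked variants can be sketched with a regular expression. The helper names are hypothetical, and the default `<mask>` string is an assumption based on RoBERTa-style tokenizers; in practice the value of `tokenizer.mask_token` should be passed in.

```python
import re

# Matches one annotated rationale span, e.g. <TRIGGER>boaka</TRIGGER>.
TRIGGER_RE = re.compile(r"<TRIGGER>(.*?)</TRIGGER>")


def strip_tags(tagged):
    """Return the tag-free sentence, keeping the rationale words."""
    return TRIGGER_RE.sub(lambda m: m.group(1), tagged)


def mask_rationales(tagged, mask_token="<mask>"):
    """Replace every annotated rationale span with the mask token."""
    return TRIGGER_RE.sub(lambda m: mask_token, tagged)


def extract_rationales(tagged):
    """List the text of every annotated rationale span."""
    return TRIGGER_RE.findall(tagged)


tagged = "O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>"
print(strip_tags(tagged))
print(mask_rationales(tagged))
print(extract_rationales(tagged))
```

The tag-free form is what the model should see at inference time; the masked form is only for diagnostic comparison.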

---

## Evaluation Setting

A key principle of this work is that the model should be assessed under realistic conditions.

Therefore, final evaluation should be performed on:

- tag-free text,
- unmasked ordinary inputs,
- and a held-out test set not used during training or tuning.

This avoids giving the model artificial markup during deployment-like testing.

The evaluation protocol follows:

- 80/20 train-test split;
- 5-fold stratified cross-validation on the training partition;
- final evaluation on the untouched holdout test set;
- tag-free inference during final testing;
- rationale-aware analysis through masking and counterfactual evaluation.
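The split logic in this protocol can be illustrated with a dependency-free sketch. It is not the original experiment code: the seed, the rounding rule, and the function names are assumptions, and libraries such as scikit-learn provide equivalent stratified utilities. The key property shown is that cross-validation folds are drawn only from the training partition, so the holdout set stays untouched.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(indices, labels, n_folds=5, seed=42):
    """Yield (fit_idx, val_idx) pairs over the given indices only."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class round-robin so every fold keeps the class balance.
        for pos, idx in enumerate(members):
            folds[pos % n_folds].append(idx)
    for k in range(n_folds):
        val_idx = sorted(folds[k])
        fit_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield fit_idx, val_idx


# Class counts from the corpus table: 500 non-offensive, 477 offensive.
labels = [0] * 500 + [1] * 477
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
for fit_idx, val_idx in stratified_folds(train_idx, labels):
    print(len(fit_idx), len(val_idx))
```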

---

## Model Architecture

| Component | Details |
|---|---|
| Base model | PuoBERTa |
| Architecture family | RoBERTa |
| Task | Sequence classification |
| Language | Setswana |
| ISO language code | `tn` |
| Number of labels | 2 |
| Framework | Hugging Face Transformers |
| Backend | PyTorch |

---

## Training Configuration

The model was fine-tuned using a transformer sequence-classification setup.

Typical configuration:

| Parameter | Value |
|---|---:|
| Maximum sequence length | 128 |
| Optimizer | AdamW |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Training batch size | 16 |
| Evaluation batch size | 64 |
| Loss function | Class-weighted cross-entropy |
| Class weights | `[1.0, 2.0]` |
| Model selection focus | Offensive-class recall |

The offensive class was assigned a higher loss weight to reduce the risk of missing harmful instances.
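To make the effect of the `[1.0, 2.0]` weights concrete, the following sketch computes a class-weighted cross-entropy by hand. Normalising by the sum of the weights mirrors the default mean reduction of PyTorch's `torch.nn.CrossEntropyLoss` when a `weight` tensor is supplied; whether the original training run used exactly that reduction is an assumption. The logits are hypothetical.

```python
import math

CLASS_WEIGHTS = [1.0, 2.0]  # [Non-offensive, Offensive], from the table above


def softmax(logits):
    """Numerically stable softmax over one example's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, targets, weights=CLASS_WEIGHTS):
    """Weighted mean of per-example negative log-likelihoods."""
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, target in zip(batch_logits, targets):
        prob_true = softmax(logits)[target]
        weighted_sum += weights[target] * -math.log(prob_true)
        weight_total += weights[target]
    return weighted_sum / weight_total


# Hypothetical logits: both examples are scored as non-offensive, so the
# second one (truly offensive) is a miss.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
weighted = weighted_cross_entropy(logits, targets)
unweighted = weighted_cross_entropy(logits, targets, weights=[1.0, 1.0])
print(weighted, unweighted)
```

The weighted loss is larger than the unweighted one because the missed offensive example counts twice, which is the intended pressure towards offensive-class recall.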

---

## MRP-Specific Training / Analysis Workflow

The MRP workflow may include the following steps:

1. Train or fine-tune the classifier on labelled Setswana text.
2. Use annotated semantic rationales to identify offensive spans.
3. Create masked-rationale variants of selected samples.
4. Evaluate prediction changes after masking.
5. Compare original and masked predictions.
6. Use LIME or S-LIME to inspect whether top-attributed tokens align with annotated rationales.
7. Analyse flip and non-flip cases to determine whether the model depends on explicit offensive tokens or broader contextual templates.

This workflow supports both predictive evaluation and forensic interpretability.
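Steps 4, 5, and 7 can be summarised numerically once the model has scored the original and masked variants. The sketch below assumes each input is a list of P(Offensive) values aligned by example; the function name, the 0.5 threshold, and the probabilities are hypothetical.

```python
def flip_report(original_probs, masked_probs, threshold=0.5):
    """Summarise how predictions change when rationale spans are masked."""
    considered = 0
    flips = 0
    drops = []
    for p_orig, p_mask in zip(original_probs, masked_probs):
        if p_orig < threshold:
            continue  # only examples predicted Offensive can flip
        considered += 1
        drops.append(p_orig - p_mask)
        if p_mask < threshold:
            flips += 1
    return {
        "n_offensive_predictions": considered,
        "flip_rate": flips / considered if considered else 0.0,
        "mean_confidence_drop": sum(drops) / len(drops) if drops else 0.0,
    }


# Hypothetical P(Offensive) before and after masking the rationale span.
original = [0.92, 0.81, 0.67, 0.30]
masked = [0.35, 0.74, 0.41, 0.28]
report = flip_report(original, masked)
print(report)
```

A high flip rate points towards reliance on the masked span, while a low flip rate with small confidence drops suggests that the surrounding context carries much of the signal.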

---

## Test Set Results

The values below are provisional until the final MRP run is confirmed. MCC and ROC-AUC have not yet been added.

| Metric | Value |
|---|---:|
| Accuracy | `0.74` |
| Macro F1-score | `0.74` |
| Recall (Offensive class) | `0.81` |
| MCC | `TO_BE_ADDED` |
| ROC-AUC | `TO_BE_ADDED` |
| Loss | `1.820457` |

Example format:

```text
accuracy = 0.xxxx
macro_f1 = 0.xxxx
recall_1 = 0.xxxx
mcc = 0.xxxx
roc_auc = 0.xxxx
```

Do not reuse metrics from the standard PuoBERTa or train-time trigger model unless they are from the exact MRP run.
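For reference, the threshold-based metrics in the format above can be derived from a binary confusion matrix as in this sketch. The labels are toy values, not model outputs, and ROC-AUC is omitted because it needs predicted probabilities rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, macro F1, offensive recall, and MCC for labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def f1(hits, false_pos, false_neg):
        denom = 2 * hits + false_pos + false_neg
        return 2 * hits / denom if denom else 0.0

    f1_offensive = f1(tp, fp, fn)
    f1_non_offensive = f1(tn, fn, fp)  # roles swap for the negative class
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "macro_f1": (f1_offensive + f1_non_offensive) / 2,
        "recall_1": tp / (tp + fn) if (tp + fn) else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }


# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name} = {value:.4f}")
```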

---

## Explainability

This model is designed to support explainability experiments, especially:

- LIME;
- S-LIME;
- token-level attribution;
- masked-rationale comparison;
- counterfactual trigger neutralisation;
- rationale-alignment analysis.

In rationale-alignment analysis, the main question is whether the model’s most influential tokens overlap with human-annotated offensive rationales.

For example, if a human-annotated rationale is:

```text
<TRIGGER>o sematla</TRIGGER>
```

then a faithful explanation should assign strong attribution to the same phrase or semantically related parts of the sentence.
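One simple way to quantify this overlap is to compare the top-k positively attributed tokens with the annotated rationale tokens. The sketch below is an assumed scoring scheme, not the card's official metric. The `(token, weight)` pairs follow the shape returned by LIME's `Explanation.as_list()`, and the weights shown are hypothetical values for the annotated example sentence used earlier in this card.

```python
def rationale_alignment(attributions, rationale_tokens, top_k=3):
    """Return (precision, recall) of top-k positive tokens vs. the rationale."""
    positive = [(tok, score) for tok, score in attributions if score > 0]
    ranked = sorted(positive, key=lambda pair: pair[1], reverse=True)[:top_k]
    top_tokens = {tok.lower() for tok, _ in ranked}
    gold = {tok.lower() for tok in rationale_tokens}
    hits = top_tokens & gold
    precision = len(hits) / len(top_tokens) if top_tokens else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical attribution weights for "O tshwanetse go tlogela boaka".
attributions = [
    ("boaka", 0.42),
    ("tlogela", 0.08),
    ("tshwanetse", -0.03),
    ("go", 0.01),
    ("O", -0.02),
]
precision, recall = rationale_alignment(attributions, ["boaka"], top_k=2)
print(precision, recall)
```

Recall here measures whether the annotated rationale was recovered at all; precision penalises explanations that spread attribution over non-rationale tokens.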

---

## Interpreting Attribution Scores

For LIME and S-LIME outputs:

- Positive attribution scores support the **Offensive** class, assuming the explanation is generated for label 1.
- Negative attribution scores support the **Non-offensive** class under the same assumption.
- Stable attributions across random seeds indicate more reliable explanations.
- Large changes after rationale masking may indicate strong dependence on the masked phrase.
- Non-flip cases may indicate that surrounding context still carries offensive meaning.

MRP is therefore useful for distinguishing between:

- lexical reliance,
- contextual reasoning,
- and potentially spurious shortcut learning.
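Stability across random seeds can be checked by comparing the top-k token sets from repeated explanation runs. The sketch below uses mean pairwise Jaccard similarity, which is one reasonable choice rather than a prescribed measure; the function names and attribution values are hypothetical.

```python
from itertools import combinations


def top_k_tokens(attributions, k=3):
    """Tokens with the k largest absolute attribution scores."""
    ranked = sorted(attributions, key=lambda pair: abs(pair[1]), reverse=True)
    return {tok for tok, _ in ranked[:k]}


def mean_jaccard(runs, k=3):
    """Average Jaccard similarity of top-k token sets over all run pairs."""
    sets = [top_k_tokens(run, k) for run in runs]
    scores = [len(a & b) / len(a | b) for a, b in combinations(sets, 2) if a | b]
    return sum(scores) / len(scores) if scores else 1.0


# Hypothetical attributions for one sentence under three random seeds.
runs = [
    [("boaka", 0.42), ("tlogela", 0.08), ("tshwanetse", -0.03), ("go", 0.01), ("O", -0.02)],
    [("boaka", 0.39), ("tlogela", 0.05), ("tshwanetse", -0.06), ("go", 0.02), ("O", -0.01)],
    [("boaka", 0.45), ("tlogela", 0.10), ("tshwanetse", -0.02), ("go", 0.01), ("O", -0.03)],
]
stability = mean_jaccard(runs, k=2)
print(round(stability, 4))
```

A score near 1.0 means the same tokens dominate every run; lower scores indicate that the explanation depends noticeably on the sampling seed.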

---

## Counterfactual and Masking Analysis

The MRP model can be evaluated using counterfactual edits such as:

| Original Type | Counterfactual Operation |
|---|---|
| Offensive rationale present | Mask offensive span |
| Offensive rationale present | Replace with neutral paraphrase |
| Offensive rationale present | Remove trigger span |
| Context preserved | Re-evaluate prediction |

A prediction flip from Offensive to Non-offensive may suggest that the model relied strongly on the rationale span.

A non-flip may suggest that offensive meaning is also encoded in the surrounding context, such as accusatory templates or threat-like phrasing.
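The operations in the table can be generated from a tag-free sentence and its rationale span as sketched below. The function name is hypothetical, `<mask>` is an assumed RoBERTa-style mask token, and `NEUTRAL_PHRASE` is a placeholder: a real neutral paraphrase would need to be supplied by a Setswana-speaking annotator.

```python
def counterfactual_variants(text, span, neutral=None, mask_token="<mask>"):
    """Build masked, removed, and (optionally) paraphrased variants of text."""
    if span not in text:
        raise ValueError("rationale span not found in text")
    variants = {
        "mask": text.replace(span, mask_token, 1),
        # Collapse the whitespace left behind after deleting the span.
        "remove": " ".join(text.replace(span, "", 1).split()),
    }
    if neutral is not None:
        variants["neutral_paraphrase"] = text.replace(span, neutral, 1)
    return variants


variants = counterfactual_variants(
    "O tshwanetse go tlogela boaka",
    span="boaka",
    neutral="NEUTRAL_PHRASE",  # placeholder, not a real paraphrase
)
for name, variant in variants.items():
    print(f"{name}: {variant}")
```

Each variant is then scored by the model alongside the original sentence, and the change in the predicted label is recorded as a flip or non-flip.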

---

## How to Use the Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YOUR-USERNAME/YOUR-PUOBERTA-MRP-MODEL"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Ke dumela gore re tshwanetse go bua sentle."

inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128
)

with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
pred = torch.argmax(probs, dim=-1).item()

label_map = {
    0: "Non-offensive",
    1: "Offensive"
}

print("Prediction:", label_map[pred])
print("Probabilities:", probs.tolist())
```

---

## Optional: Masked Rationale Diagnostic Example

The following is a diagnostic workflow for research use only.

```python
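# Reuses `tokenizer`, `model`, and `torch` from the usage example above.
original_text = "O tshwanetse go tlogela boaka"
# Use the tokenizer's own mask token instead of hard-coding "<mask>".
masked_text = f"O tshwanetse go tlogela {tokenizer.mask_token}"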

texts = [original_text, masked_text]

inputs = tokenizer(
    texts,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128
)

with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)

for text, prob in zip(texts, probs):
    print(text)
    print(prob.tolist())
```

Use this only if your tokenizer/model configuration supports the mask token appropriately.

---

## Limitations

The model has several limitations:

- The dataset is relatively small.
- The model is trained primarily for Setswana.
- It may be sensitive to spelling variation and informal orthography.
- It may struggle with sarcasm, irony, and implicit abuse.
- It may underperform on unseen slang or emerging online expressions.
- It performs binary classification only.
- It does not classify offensive subtypes such as hate speech, harassment, threat, or phishing separately.
- Rationale masking can help diagnosis, but it does not prove causal reasoning.

---

## Ethical Considerations

This model deals with offensive and potentially harmful language. It should be used carefully and only in appropriate research or forensic contexts.

Recommended safeguards:

- human-in-the-loop review;
- calibrated confidence thresholds;
- abstention for uncertain predictions;
- careful error analysis;
- avoidance of automated punitive action;
- compliance with data protection and cybercrime legislation;
- masking or sanitisation of offensive examples in public outputs.

The model should not be used as the sole basis for legal, disciplinary, or investigative conclusions.
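The threshold and abstention safeguards could take a form like the sketch below, which routes uncertain predictions to a human reviewer. The band limits `0.35` and `0.65` are hypothetical placeholders; suitable values would need to be calibrated on validation data for the target domain.

```python
def triage_decision(p_offensive, lower=0.35, upper=0.65):
    """Map P(Offensive) to a label, abstaining inside the uncertainty band."""
    if not 0.0 <= p_offensive <= 1.0:
        raise ValueError("p_offensive must be a probability in [0, 1]")
    if p_offensive >= upper:
        return "Offensive"
    if p_offensive <= lower:
        return "Non-offensive"
    return "Needs human review"


# Hypothetical probabilities from the classifier.
for p in (0.91, 0.50, 0.12):
    print(p, triage_decision(p))
```

Even confident predictions remain triage signals and should still be reviewed by a person before any action is taken.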

---

## Bias and Fairness Considerations

Potential sources of bias include:

- sampling bias from public social media content;
- underrepresentation of dialectal variants;
- limited coverage of emerging slang;
- ambiguity in culturally specific phrases;
- and label uncertainty in sarcastic or metaphorical cases.

Users should validate the model on their own target domain before applying it in practical settings.

---

## Reproducibility

Related reproducibility resources may include:

- training notebooks;
- MRP experiment notebooks;
- LIME/S-LIME explainability notebooks;
- scripts for generating tables and figures;
- sanitised output files;
- dataset card;
- model card;
- Zenodo release.

Associated GitHub repository:

```text
https://github.com/bkekgathetse/setswana-offensive-977
```

Associated Hugging Face dataset:

```text
ADD_DATASET_LINK_HERE
```

Associated Zenodo release:

```text
ADD_ZENODO_DOI_HERE
```

---

## Recommended Citation

```bibtex
@misc{kekgathetse2025puoberta_mrp,
  title={PuoBERTa-MRP for Setswana Offensive Content Detection},
  author={Kekgathetse, Bernerdict},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/YOUR-USERNAME/YOUR-PUOBERTA-MRP-MODEL}}
}
```

If this model is linked to a manuscript, cite the corresponding paper as well:

```bibtex
@article{kekgathetse2025setswanaoffensive,
  title={Developing Monolingual Setswana Datasets for Offensive Content Detection},
  author={Kekgathetse, Bernerdict},
  journal={To be updated},
  year={2025}
}
```

---

## License

Please refer to the license specified in this repository.

Recommended licensing structure:

- Code: MIT or Apache-2.0
- Documentation: CC-BY 4.0
- Dataset access: governed separately due to ethical considerations

---

## Contact

For academic queries, reproducibility questions, or collaboration requests, please refer to the associated GitHub repository or manuscript contact details.

---

## Model Card Notes

This model card describes the MRP version of the PuoBERTa offensive-content classifier. It should be updated with the exact final test metrics and repository links before public release.