PuoBERTa-MRP for Setswana Offensive Content Detection

Model Summary

This repository contains PuoBERTa-MRP, a rationale-aware fine-tuned version of PuoBERTa for binary offensive-content detection in Setswana.

The model classifies Setswana text into:

Label ID	Label
0	Non-offensive
1	Offensive

The model was developed for research on low-resource African language NLP, digital forensic investigation, and explainable offensive-language detection. The MRP version extends the standard PuoBERTa fine-tuning setup by incorporating Masked Rationale Prediction (MRP) as a rationale-aware training and evaluation strategy.

In this work, rationales refer to semantically important offensive spans or trigger expressions that contribute to the offensive classification decision. These spans are used during model development to study whether the classifier relies on linguistically meaningful cues rather than shallow lexical shortcuts.

What is MRP?

MRP stands for Masked Rationale Prediction.

The purpose of the MRP setup is to test and improve the relationship between:

sentence-level offensive classification,
annotated semantic trigger spans,
masked or neutralised rationale regions,
and explanation faithfulness.

In the MRP setting, annotated offensive rationales are used to create controlled training or diagnostic variants in which key offensive spans may be masked, removed, or neutralised. This allows the researcher to examine whether the model:

depends only on explicit offensive tokens;
uses broader contextual patterns;
remains robust when rationale-bearing terms are masked;
produces explanations aligned with annotated semantic triggers.

This makes the model useful not only for classification, but also for forensic explainability analysis.

Research Motivation

Offensive-language detection in Setswana presents challenges that are not fully addressed by ordinary sentence-level classification. Offensive meaning may be expressed through:

culturally specific insults,
idiomatic expressions,
indirect accusations,
threats,
phishing-related cues,
sarcasm,
dehumanising metaphors,
and code-switched or non-standard orthography.

In small low-resource datasets, a model may overfit to obvious abusive terms while failing to capture broader discourse structures. MRP is introduced to investigate whether rationale masking can reveal or reduce such dependency.

The central research question is:

Can rationale-aware masking improve the interpretability and robustness of Setswana offensive-language detection while preserving useful classification performance?

Intended Use

This model is intended for:

Setswana offensive-language detection research;
cyberbullying and harassment detection experiments;
digital forensic triage support;
explainable AI experiments;
LIME and S-LIME attribution analysis;
masked rationale and counterfactual evaluation;
benchmarking rationale-aware transformer models for low-resource languages.

It may be useful in research workflows where the goal is to analyse both:

what the model predicts, and
why the model predicts it.

Out-of-Scope Use

This model should not be used for:

fully automated legal decision-making;
disciplinary action without human review;
automated criminal attribution;
autonomous social media moderation;
profiling individuals or communities;
deployment on non-Setswana text without validation.

The model is intended to support research and forensic triage, not replace human interpretation.

Dataset Description

The model is based on a manually curated Setswana offensive-language corpus containing offensive and non-offensive examples.

The dataset follows a simple CSV structure compatible with common offensive-language NLP datasets such as OLID and HateCheck:

TEXT,TARGET

Where:

Column	Description
`TEXT`	Setswana sentence or comment
`TARGET`	Class label: `Offensive` or `Non-offensive`
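
As a minimal sketch of reading this layout, the snippet below parses `TEXT,TARGET` rows and maps the string labels onto the label IDs listed in the Model Summary. The helper name `load_rows` and the in-memory sample are assumptions made for illustration, and the labels attached to the two sample sentences are illustrative rather than taken from the corpus.

```python
import csv
import io

# Label mapping taken from the label table in this card.
LABEL_TO_ID = {"Non-offensive": 0, "Offensive": 1}


def load_rows(handle):
    """Read TEXT,TARGET rows and return (text, label_id) pairs."""
    reader = csv.DictReader(handle)
    rows = []
    for record in reader:
        text = record["TEXT"].strip()
        label = record["TARGET"].strip()
        if label not in LABEL_TO_ID:
            # Fail loudly on unexpected labels instead of guessing.
            raise ValueError(f"Unexpected TARGET value: {label!r}")
        rows.append((text, LABEL_TO_ID[label]))
    return rows


# Tiny in-memory sample so the sketch runs without the real corpus file.
sample_csv = io.StringIO(
    "TEXT,TARGET\n"
    "Ke dumela gore re tshwanetse go bua sentle.,Non-offensive\n"
    "O tshwanetse go tlogela boaka,Offensive\n"
)
rows = load_rows(sample_csv)
print(rows)
```

For a real file, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`, where `path` points at your local copy of the release.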

The broader corpus contains approximately:

Class	Count
Non-offensive	500
Offensive	477
Total	977

If using the public merged release, verify the exact row count in the dataset card and release notes, as sanitised or release-ready versions may differ slightly from the internal experimental corpus.

Rationale and Trigger Annotation

During dataset preparation, semantically important offensive spans were annotated as rationales or trigger regions.

These rationales may include:

direct insults;
vulgar expressions;
harassment phrases;
threat expressions;
phishing or scam cues;
dehumanising metaphors;
culturally grounded abusive expressions.

Example rationale-style annotation:

O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>

For MRP experiments, such spans can be converted into masked variants, for example:

O tshwanetse go tlogela <MASK>

or neutralised variants, depending on the experiment design.
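
The conversions described above can be sketched with a few regular-expression helpers. This is an illustration only: the function names are hypothetical, and the default `<mask>` string assumes a RoBERTa-style tokenizer, so in practice the value of `tokenizer.mask_token` should be passed in instead.

```python
import re

# Matches one annotated rationale span, e.g. <TRIGGER>boaka</TRIGGER>.
TRIGGER_RE = re.compile(r"<TRIGGER>(.*?)</TRIGGER>", flags=re.DOTALL)


def extract_rationales(tagged):
    """Return the annotated rationale spans in order of appearance."""
    return [match.group(1) for match in TRIGGER_RE.finditer(tagged)]


def strip_tags(tagged):
    """Return tag-free text, as used for final evaluation and inference."""
    return TRIGGER_RE.sub(lambda match: match.group(1), tagged)


def mask_rationales(tagged, mask_token="<mask>"):
    """Replace every rationale span with the mask token (assumed default)."""
    return TRIGGER_RE.sub(lambda match: mask_token, tagged)


def remove_rationales(tagged):
    """Delete every rationale span and tidy the leftover whitespace."""
    return " ".join(TRIGGER_RE.sub("", tagged).split())


tagged = "O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>"
print(extract_rationales(tagged))
print(strip_tags(tagged))
print(mask_rationales(tagged))
print(remove_rationales(tagged))
```

Neutralised variants need a human-written paraphrase for each span, so they are not generated automatically here.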

Evaluation Setting

A key principle of this work is that the model should be assessed under realistic conditions.

Therefore, final evaluation should be performed on:

tag-free text,
unmasked ordinary inputs,
and a held-out test set not used during training or tuning.

This avoids giving the model artificial markup during deployment-like testing.

The evaluation protocol consists of:

80/20 train-test split;
5-fold stratified cross-validation on the training partition;
final evaluation on the untouched holdout test set;
tag-free inference during final testing;
rationale-aware analysis through masking and counterfactual evaluation.
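
The split and cross-validation steps above can be sketched without external dependencies as follows. The seed, the rounding rule, and the function names are assumptions and not the settings used for the reported run; scikit-learn's `train_test_split(..., stratify=...)` and `StratifiedKFold` provide equivalent functionality. The class counts are the approximate corpus figures given in the Dataset Description.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(indices, labels, n_splits=5, seed=42):
    """Yield (fit_idx, val_idx) pairs; each index is validated exactly once."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        # Deal each class round-robin so every fold keeps the class balance.
        for position, idx in enumerate(members):
            folds[position % n_splits].append(idx)
    for k in range(n_splits):
        val_idx = sorted(folds[k])
        fit_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield fit_idx, val_idx


# Approximate class counts from the Dataset Description (0 = Non-offensive).
labels = [0] * 500 + [1] * 477
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))
for fit_idx, val_idx in stratified_folds(train_idx, labels):
    print(len(fit_idx), len(val_idx))
```

The holdout indices are set aside first and never passed to `stratified_folds`, which keeps the final test set untouched during tuning.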

Model Architecture

Component	Details
Base model	PuoBERTa
Architecture family	RoBERTa
Task	Sequence classification
Language	Setswana
ISO language code	`tn`
Number of labels	2
Framework	Hugging Face Transformers
Backend	PyTorch

Training Configuration

The model was fine-tuned using a transformer sequence-classification setup.

Typical configuration:

Parameter	Value
Maximum sequence length	128
Optimizer	AdamW
Learning rate	1e-5
Weight decay	0.01
Training batch size	16
Evaluation batch size	64
Loss function	Class-weighted cross-entropy
Class weights	`[1.0, 2.0]`
Model selection focus	Offensive-class recall

The offensive class was assigned a higher loss weight to reduce the risk of missing harmful instances.
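
To make the effect of the `[1.0, 2.0]` class weights concrete, the following dependency-free sketch computes a class-weighted cross-entropy by hand. The card does not state which reduction was used, so the weighted-mean normalisation below is an assumption; it matches what `torch.nn.CrossEntropyLoss` does when given a `weight` tensor with its default `mean` reduction. The logits are made up for illustration.

```python
import math

# [Non-offensive, Offensive], as listed in the configuration table above.
CLASS_WEIGHTS = [1.0, 2.0]


def log_softmax(logits):
    """Numerically stable log-softmax over one example's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


def weighted_cross_entropy(batch_logits, targets, weights=CLASS_WEIGHTS):
    """Weighted mean of per-example negative log-likelihoods."""
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, target in zip(batch_logits, targets):
        nll = -log_softmax(logits)[target]
        weighted_sum += weights[target] * nll
        weight_total += weights[target]
    return weighted_sum / weight_total


# Both examples are scored as confidently Non-offensive (made-up logits),
# but the second one is actually Offensive, i.e. a missed harmful instance.
batch_logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]

unweighted = weighted_cross_entropy(batch_logits, targets, weights=[1.0, 1.0])
weighted = weighted_cross_entropy(batch_logits, targets)
print(f"unweighted={unweighted:.4f} weighted={weighted:.4f}")
```

With these inputs the weighted loss is larger than the unweighted one, because the missed Offensive example contributes twice as much as the correctly classified Non-offensive example.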

MRP-Specific Training / Analysis Workflow

The MRP workflow may include the following steps:

Train or fine-tune the classifier on labelled Setswana text.
Use annotated semantic rationales to identify offensive spans.
Create masked-rationale variants of selected samples.
Evaluate prediction changes after masking.
Compare original and masked predictions.
Use LIME or S-LIME to inspect whether top-attributed tokens align with annotated rationales.
Analyse flip and non-flip cases to determine whether the model depends on explicit offensive tokens or broader contextual templates.

This workflow supports both predictive evaluation and forensic interpretability.

Test Set Results

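The table below reports the metrics recorded so far for the MRP test-set run. MCC and ROC-AUC are still to be added, and all values should be confirmed against the final run before release.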

Metric	Value
Accuracy	0.74
Macro F1-score	0.74
Recall (Offensive class)	0.81
MCC	`TO_BE_ADDED`
ROC-AUC	`TO_BE_ADDED`
Loss	1.820457

Example format:

accuracy = 0.xxxx
macro_f1 = 0.xxxx
recall_1 = 0.xxxx
mcc = 0.xxxx
roc_auc = 0.xxxx

Do not reuse metrics from the standard PuoBERTa or train-time trigger model unless they are from the exact MRP run.
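
For reference, the threshold-based metrics in the table can be derived from hard predictions as sketched below. The function name and the toy labels are assumptions for illustration and are unrelated to the reported results. ROC-AUC is left out because it needs predicted probabilities rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, macro F1, Offensive-class recall and MCC for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def f1(tp_c, fp_c, fn_c):
        denom = 2 * tp_c + fp_c + fn_c
        return 2 * tp_c / denom if denom else 0.0

    # Macro F1 averages the per-class F1 scores with equal weight.
    f1_offensive = f1(tp, fp, fn)
    f1_non_offensive = f1(tn, fn, fp)
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "macro_f1": (f1_offensive + f1_non_offensive) / 2,
        "recall_1": tp / (tp + fn) if (tp + fn) else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }


# Toy labels (1 = Offensive), not taken from the real test set.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name} = {value:.4f}")
```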

Explainability

This model is designed to support explainability experiments, especially:

LIME;
S-LIME;
token-level attribution;
masked-rationale comparison;
counterfactual trigger neutralisation;
rationale-alignment analysis.

In rationale-alignment analysis, the main question is whether the model’s most influential tokens overlap with human-annotated offensive rationales.

For example, if a human-annotated rationale is:

<TRIGGER>o sematla</TRIGGER>

then a faithful explanation should assign strong attribution to the same phrase or semantically related parts of the sentence.
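
One simple way to quantify this overlap is sketched below. It assumes attributions arrive as `(token, weight)` pairs, that positive weights support the Offensive class, and that whitespace tokenisation is adequate; the function names, the choice of `k`, and the weights are all invented for illustration and are not outputs of this model.

```python
def top_k_tokens(attributions, k=3):
    """Return the k tokens with the largest positive attribution weight."""
    positive = [(token, weight) for token, weight in attributions if weight > 0]
    positive.sort(key=lambda pair: pair[1], reverse=True)
    return [token for token, _ in positive[:k]]


def rationale_alignment(attributions, rationale, k=3):
    """Token-level precision, recall and IoU against an annotated rationale."""
    predicted = {token.lower() for token in top_k_tokens(attributions, k)}
    gold = {token.lower() for token in rationale.split()}
    overlap = predicted & gold
    union = predicted | gold
    return {
        "precision": len(overlap) / len(predicted) if predicted else 0.0,
        "recall": len(overlap) / len(gold) if gold else 0.0,
        "iou": len(overlap) / len(union) if union else 0.0,
    }


# Made-up weights for the annotated example sentence used earlier in this card.
attributions = [
    ("O", -0.01),
    ("tshwanetse", 0.05),
    ("go", -0.03),
    ("tlogela", 0.12),
    ("boaka", 0.41),
]
scores = rationale_alignment(attributions, rationale="boaka", k=2)
print(scores)
```

High recall with low precision would suggest that the explanation covers the rationale but also spreads weight over other tokens.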

Interpreting Attribution Scores

For LIME and S-LIME outputs:

Positive attribution scores support the Offensive class.
Negative attribution scores support the Non-offensive class.
Stable attributions across random seeds indicate more reliable explanations.
Large changes after rationale masking may indicate strong dependence on the masked phrase.
Non-flip cases may indicate that surrounding context still carries offensive meaning.
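
Stability across random seeds can be summarised, for example, as the mean pairwise Jaccard similarity of the top-k attributed tokens from each run. This is one possible measure rather than the one used in this work; the function names, `k`, and the weights below are assumptions for illustration.

```python
from itertools import combinations


def top_k_set(attributions, k=3):
    """Set of the k tokens with the largest absolute attribution weight."""
    ranked = sorted(attributions, key=lambda pair: abs(pair[1]), reverse=True)
    return {token for token, _ in ranked[:k]}


def mean_pairwise_jaccard(runs, k=3):
    """Average Jaccard similarity of top-k token sets over all pairs of runs."""
    token_sets = [top_k_set(run, k) for run in runs]
    scores = []
    for first, second in combinations(token_sets, 2):
        union = first | second
        scores.append(len(first & second) / len(union) if union else 1.0)
    return sum(scores) / len(scores) if scores else 1.0


# Three explanation runs of the same sentence with different seeds (made-up weights).
runs = [
    [("boaka", 0.41), ("tlogela", 0.12), ("tshwanetse", 0.05), ("go", -0.03)],
    [("boaka", 0.38), ("tshwanetse", 0.10), ("tlogela", 0.09), ("go", -0.02)],
    [("boaka", 0.40), ("tlogela", 0.11), ("tshwanetse", 0.06), ("go", -0.01)],
]
stability = mean_pairwise_jaccard(runs, k=2)
print(f"stability = {stability:.4f}")
```

A value of 1.0 means every seed selected the same top-k tokens; lower values indicate that the explanation changes from run to run.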

MRP is therefore useful for distinguishing between:

lexical reliance,
contextual reasoning,
and potentially spurious shortcut learning.

Counterfactual and Masking Analysis

The MRP model can be evaluated using counterfactual edits such as:

Original Type	Counterfactual Operation
Offensive rationale present	Mask offensive span
Offensive rationale present	Replace with neutral paraphrase
Offensive rationale present	Remove trigger span
Context preserved	Re-evaluate prediction

A prediction flip from Offensive to Non-offensive may suggest that the model relied strongly on the rationale span.

A non-flip may suggest that offensive meaning is also encoded in the surrounding context, such as accusatory templates or threat-like phrasing.
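
The flip and non-flip cases can be tallied as in the sketch below, which compares the Offensive-class probability before and after masking. The 0.5 decision threshold, the function names, and the probability pairs are assumptions for illustration, not results from this model.

```python
def masking_effect(p_original, p_masked, threshold=0.5):
    """Compare Offensive-class probabilities before and after rationale masking."""
    label_original = int(p_original >= threshold)
    label_masked = int(p_masked >= threshold)
    return {
        # Positive delta: masking lowered the Offensive probability.
        "delta": p_original - p_masked,
        # A flip is an Offensive prediction that becomes Non-offensive.
        "flipped": label_original == 1 and label_masked == 0,
    }


def flip_rate(pairs, threshold=0.5):
    """Share of originally Offensive predictions that flip after masking."""
    effects = [
        masking_effect(p_original, p_masked, threshold)
        for p_original, p_masked in pairs
        if p_original >= threshold
    ]
    if not effects:
        return 0.0
    return sum(effect["flipped"] for effect in effects) / len(effects)


# (P(Offensive) on original text, P(Offensive) on masked text); made-up values.
pairs = [(0.92, 0.18), (0.88, 0.71), (0.64, 0.40), (0.30, 0.25)]
print(masking_effect(*pairs[0]))
print(f"flip rate = {flip_rate(pairs):.4f}")
```

The second pair is a non-flip: the prediction stays Offensive after masking, which is the pattern that points to offensive meaning carried by the surrounding context.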

How to Use the Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YOUR-USERNAME/YOUR-PUOBERTA-MRP-MODEL"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Ke dumela gore re tshwanetse go bua sentle."

inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128
)

with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
pred = torch.argmax(probs, dim=-1).item()

label_map = {
    0: "Non-offensive",
    1: "Offensive"
}

print("Prediction:", label_map[pred])
print("Probabilities:", probs.tolist())

Optional: Masked Rationale Diagnostic Example

The following is a diagnostic workflow for research use only.

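# Reuses `tokenizer` and `model` loaded in the previous snippet.
original_text = "O tshwanetse go tlogela boaka"
# Use the tokenizer's own mask token rather than hard-coding the string.
masked_text = f"O tshwanetse go tlogela {tokenizer.mask_token}"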

texts = [original_text, masked_text]

inputs = tokenizer(
    texts,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128
)

with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)

for text, prob in zip(texts, probs):
    print(text)
    print(prob.tolist())

Use this only if your tokenizer/model configuration supports the mask token appropriately.

Limitations

The model has several limitations:

The dataset is relatively small.
The model is trained primarily for Setswana.
It may be sensitive to spelling variation and informal orthography.
It may struggle with sarcasm, irony, and implicit abuse.
It may underperform on unseen slang or emerging online expressions.
It performs binary classification only.
It does not classify offensive subtypes such as hate speech, harassment, threat, or phishing separately.
Rationale masking can help diagnosis, but it does not prove causal reasoning.

Ethical Considerations

This model deals with offensive and potentially harmful language. It should be used carefully and only in appropriate research or forensic contexts.

Recommended safeguards:

human-in-the-loop review;
calibrated confidence thresholds;
abstention for uncertain predictions;
careful error analysis;
avoidance of automated punitive action;
compliance with data protection and cybercrime legislation;
masking or sanitisation of offensive examples in public outputs.

The model should not be used as the sole basis for legal, disciplinary, or investigative conclusions.
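
As an illustration of the threshold and abstention safeguards, the sketch below routes a prediction into one of three triage outcomes. The cut-off values 0.35 and 0.65 and the outcome names are hypothetical; real thresholds should come from calibration on a validation set, and flagged items still go to a human reviewer.

```python
def triage(p_offensive, lower=0.35, upper=0.65):
    """Map an Offensive-class probability to a triage outcome.

    The default thresholds are placeholders, not calibrated values.
    """
    if not 0.0 <= p_offensive <= 1.0:
        raise ValueError("p_offensive must be a probability in [0, 1]")
    if p_offensive >= upper:
        return "flag_for_review"
    if p_offensive <= lower:
        return "no_flag"
    # Uncertain band: abstain and defer entirely to a human reviewer.
    return "abstain"


for probability in (0.90, 0.50, 0.10):
    print(probability, triage(probability))
```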

Bias and Fairness Considerations

Potential sources of bias include:

sampling bias from public social media content;
underrepresentation of dialectal variants;
limited coverage of emerging slang;
ambiguity in culturally specific phrases;
and label uncertainty in sarcastic or metaphorical cases.

Users should validate the model on their own target domain before applying it in practical settings.

Reproducibility

Related reproducibility resources may include:

training notebooks;
MRP experiment notebooks;
LIME/S-LIME explainability notebooks;
scripts for generating tables and figures;
sanitised output files;
dataset card;
model card;
Zenodo release.

Associated GitHub repository:

https://github.com/bkekgathetse/setswana-offensive-977

Associated Hugging Face dataset:

ADD_DATASET_LINK_HERE

Associated Zenodo release:

ADD_ZENODO_DOI_HERE

Recommended Citation

@misc{kekgathetse2025puoberta_mrp,
  title={PuoBERTa-MRP for Setswana Offensive Content Detection},
  author={Kekgathetse, Bernerdict},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/YOUR-USERNAME/YOUR-PUOBERTA-MRP-MODEL}}
}

If this model is linked to a manuscript, cite the corresponding paper as well:

@article{kekgathetse2025setswanaoffensive,
  title={Developing Monolingual Setswana Datasets for Offensive Content Detection},
  author={Kekgathetse, Bernerdict},
  journal={To be updated},
  year={2025}
}

License

Please refer to the license specified in this repository.

Recommended licensing structure:

Code: MIT or Apache-2.0
Documentation: CC-BY 4.0
Dataset access: governed separately due to ethical considerations

Contact

For academic queries, reproducibility questions, or collaboration requests, please refer to the associated GitHub repository or manuscript contact details.

Model Card Notes

This model card describes the MRP version of the PuoBERTa offensive-content classifier. It should be updated with the exact final test metrics and repository links before public release.
