- PuoBERTa with Train-Time Semantic Triggers for Setswana Offensive Content Detection
- Research Context
- Semantic Trigger Supervision
- Why Train-Time Triggers?
- Trigger Annotation Strategy
- Important Experimental Constraint
- Experimental Hypothesis
- Dataset Description
- Dataset Structure
- Annotation Procedure
- Inter-Annotator Agreement
- Model Architecture
- Training Configuration
- Evaluation Protocol
- Evaluation Metrics
- Explainability
- Counterfactual Analysis
- Intended Use Cases
- Out-of-Scope Use
- Ethical Considerations
- Limitations
- Reproducibility
- Example Usage
- Citation
- Related Resources
- License
- Tags
PuoBERTa with Train-Time Semantic Triggers for Setswana Offensive Content Detection
Model Overview
This repository hosts a fine-tuned version of PuoBERTa, a monolingual Setswana transformer language model based on the RoBERTa architecture. The fine-tuned model performs binary offensive-content detection in Setswana digital communication.
The model was developed as part of ongoing research in:
- low-resource African language NLP,
- digital forensic investigation,
- explainable artificial intelligence (XAI),
- cybercrime linguistic analysis,
- and offensive-language detection in Setswana.
The objective of the work is to investigate whether semantic trigger supervision during training can improve offensive-language detection while still allowing the model to generalise to realistic, unannotated inference conditions.
The classifier performs binary prediction:
| Label | Meaning |
|---|---|
| 0 | Non-offensive |
| 1 | Offensive |
The model is specifically designed for:
- offensive-language detection,
- cyberbullying analysis,
- phishing/scam linguistic analysis,
- online harassment detection,
- explainable NLP research,
- and digital forensic triage support.
Research Context
Offensive-language detection in low-resource African languages remains significantly underexplored compared to high-resource languages such as English, German, or Chinese. Existing datasets for African languages are often:
- small,
- weakly annotated,
- heavily imbalanced,
- culturally sparse,
- or lacking explainability-oriented supervision.
In Setswana specifically, offensive meaning is frequently conveyed through:
- implicit accusatory structures,
- metaphorical expressions,
- euphemisms,
- slang,
- code-switching,
- culturally contextual phrases,
- and morphologically variable abusive constructions.
Traditional sentence-level supervision alone may therefore fail to adequately highlight the semantically dominant portions of offensive expressions.
To address this challenge, this work introduces train-time semantic trigger supervision.
Semantic Trigger Supervision
Core Idea
During training, offensive spans are explicitly marked using lightweight XML-style trigger tags:
<TRIGGER> ... </TRIGGER>
These tags identify semantically important abusive or suspicious regions within a sentence.
Example:
O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>
In this example:
- the full sentence provides contextual information;
- the trigger span identifies the dominant offensive cue;
- the model learns both:
  - sentence-level classification,
  - and implicit attention toward offensive semantic regions.
This approach provides weak span-level supervision without requiring a dedicated token-classification architecture.
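As a minimal sketch of how a tagged training sentence could be produced from an annotated span, the helper below wraps the first occurrence of the span in trigger tags. The function name `tag_span` and the decision to tag only the first match are illustrative assumptions, not the annotation tooling used in this work.

```python
def tag_span(text: str, span: str, tag: str = "TRIGGER") -> str:
    """Wrap the first occurrence of `span` in <TAG>...</TAG> (hypothetical helper)."""
    start = text.find(span)
    if start == -1:
        # Span not present: return the sentence unchanged rather than guessing.
        return text
    end = start + len(span)
    return f"{text[:start]}<{tag}>{span}</{tag}>{text[end:]}"


tagged = tag_span("O tshwanetse go tlogela boaka", "boaka")
print(tagged)  # O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>
```

The sentence itself is left intact, so the classifier still sees the full context around the marked span.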
Why Train-Time Triggers?
The motivation for semantic triggers comes from several observations:
1. Offensive Meaning Is Often Contextual
In Setswana discourse, offensiveness may emerge from:
- sentence framing,
- pragmatic intent,
- metaphorical reference,
- or accusatory structure,
rather than from isolated keywords alone.
Example:
o tla ipona
may imply:
- threat,
- intimidation,
- or warning,
depending on surrounding context.
2. Offensive Keywords Alone Are Insufficient
Some offensive phrases contain:
- benign-looking words,
- culturally contextual insults,
- or implicit hostility.
Pure keyword-based learning may therefore produce:
- unstable generalisation,
- overfitting,
- or poor recall.
3. Low-Resource Datasets Need Additional Supervision Signals
Because the dataset is relatively small compared to large-scale English corpora, semantic trigger tags provide an additional learning signal that encourages the transformer to focus on linguistically meaningful abusive regions during fine-tuning.
Trigger Annotation Strategy
Trigger Span Definition
Trigger spans correspond to offensive or suspicious semantic regions such as:
- insults,
- threats,
- phishing cues,
- dehumanising phrases,
- harassment expressions,
- abusive metaphors,
- or cyberbullying constructions.
Examples:
<TRIGGER>o sematla</TRIGGER>
ke tla go fitlhela <TRIGGER>o tla ipona</TRIGGER>
<TRIGGER>re fe dinomoro tsa gago</TRIGGER>
<TRIGGER>basadi ga ba tshwanela go tsaya ditshwetso</TRIGGER>
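Inline tags like those in the examples above can be converted into span annotations over the tag-free sentence, which is convenient for audits or for comparing triggers against attribution outputs. The following is a sketch under the assumption that tags are never nested; `extract_triggers` is a hypothetical helper name.

```python
import re

TRIGGER_RE = re.compile(r"<TRIGGER>(.*?)</TRIGGER>", re.DOTALL)


def extract_triggers(tagged: str):
    """Return (clean_text, spans), where each span is (start, end, phrase)
    with character offsets into the tag-free text. Illustrative helper only."""
    clean_parts, spans = [], []
    cursor = 0      # position in the tagged string
    clean_len = 0   # length of the clean text built so far
    for m in TRIGGER_RE.finditer(tagged):
        before = tagged[cursor:m.start()]
        clean_parts.append(before)
        clean_len += len(before)
        phrase = m.group(1)
        spans.append((clean_len, clean_len + len(phrase), phrase))
        clean_parts.append(phrase)
        clean_len += len(phrase)
        cursor = m.end()
    clean_parts.append(tagged[cursor:])
    return "".join(clean_parts), spans


clean, spans = extract_triggers("ke tla go fitlhela <TRIGGER>o tla ipona</TRIGGER>")
print(clean)  # ke tla go fitlhela o tla ipona
print(spans)  # [(19, 30, 'o tla ipona')]
```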
Important Experimental Constraint
Trigger Tags Are Used ONLY During Training
A critical aspect of this work is that semantic trigger tags are used exclusively during training.
Validation, testing, and deployment are all performed under tag-free conditions.
This means:
- validation inputs contain NO trigger markers;
- holdout test inputs contain NO trigger markers;
- real-world inference assumes ordinary Setswana text.
Example inference input:
O tshwanetse go tlogela boaka
NOT:
O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>
This protocol prevents unrealistic dependence on artificial tags during deployment.
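One way to enforce the tag-free constraint in code is to strip any tags from evaluation inputs and then fail fast if one slips through. This is a sketch only; `strip_trigger_tags` and `assert_tag_free` are hypothetical names, and how the held-out data was actually prepared is not specified here.

```python
import re

# Matches opening and closing trigger tags only; the span text is kept.
TAG_RE = re.compile(r"</?TRIGGER>")


def strip_trigger_tags(text: str) -> str:
    """Remove trigger tags and collapse any doubled whitespace left behind."""
    return re.sub(r"\s+", " ", TAG_RE.sub("", text)).strip()


def assert_tag_free(texts):
    """Raise if any evaluation or inference input still carries a tag."""
    for t in texts:
        if TAG_RE.search(t):
            raise ValueError(f"trigger tag found in evaluation input: {t!r}")


eval_inputs = [strip_trigger_tags("O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>")]
assert_tag_free(eval_inputs)
print(eval_inputs[0])  # O tshwanetse go tlogela boaka
```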
Experimental Hypothesis
The central research hypothesis is:
Train-time semantic trigger supervision can guide transformer attention toward linguistically meaningful offensive cues while preserving realistic inference-time behaviour under tag-free conditions.
The expectation is that trigger supervision:
- improves semantic sensitivity,
- improves interpretability,
- improves offensive recall,
- and encourages the model to internalise offensive patterns rather than memorising explicit markers.
Dataset Description
The dataset consists of manually curated Setswana social media text samples collected from publicly accessible online discourse.
The corpus contains:
| Class | Count |
|---|---|
| Non-offensive | 500 |
| Offensive | 477 |
| Total | 977 |
The data includes examples involving:
- insults,
- harassment,
- cyberbullying,
- threats,
- phishing/scam language,
- discriminatory expressions,
- and vulgarity.
Dataset Structure
The corpus follows a CSV format inspired by established offensive-language datasets such as:
- OLID,
- HateCheck,
- and related NLP benchmarks.
Expected structure:
TEXT,TARGET
Where:
| Column | Description |
|---|---|
| TEXT | Setswana sentence or comment |
| TARGET | Offensive / Non-offensive label |
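A minimal loading sketch for this two-column layout, using only the standard library. The label strings in `LABEL_TO_ID` are an assumption; the actual encoding of the TARGET column in the corpus may differ, and the rows below are placeholders rather than corpus data.

```python
import csv
import io

# Assumed label strings; adjust to the actual TARGET encoding of the corpus.
LABEL_TO_ID = {"Non-offensive": 0, "Offensive": 1}


def load_corpus(handle):
    """Read TEXT,TARGET rows and return parallel lists of texts and label ids."""
    texts, labels = [], []
    for row in csv.DictReader(handle):
        texts.append(row["TEXT"].strip())
        labels.append(LABEL_TO_ID[row["TARGET"].strip()])
    return texts, labels


# Tiny in-memory stand-in for the real file (placeholder rows, not corpus data).
sample = io.StringIO("TEXT,TARGET\nsentence one,Non-offensive\nsentence two,Offensive\n")
texts, labels = load_corpus(sample)
print(labels)  # [0, 1]
```

For a file on disk, the same function can be called with `open(path, newline="", encoding="utf-8")`.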
Annotation Procedure
The dataset was manually annotated using predefined operational guidelines.
Annotation Categories
Offensive content includes:
- profanity,
- harassment,
- hate speech,
- cyberbullying,
- threats,
- phishing/scam expressions,
- dehumanising language,
- and targeted abuse.
Non-offensive content includes:
- neutral discourse,
- ordinary conversation,
- benign statements,
- and non-abusive contextual usage.
Inter-Annotator Agreement
Annotation quality was assessed using:
- Cohen’s kappa,
- double-coded subsets,
- adjudication procedures,
- and calibration rounds.
Substantial agreement was achieved during annotation.
Model Architecture
| Component | Details |
|---|---|
| Base model | PuoBERTa |
| Architecture family | RoBERTa |
| Task | Sequence Classification |
| Number of labels | 2 |
| Maximum sequence length | 128 |
| Framework | Hugging Face Transformers |
| Backend | PyTorch |
Training Configuration
Experimental Split
The dataset was partitioned using:
- 80% training split,
- 20% holdout test split.
Model selection was performed using:
- 5-fold stratified cross-validation on the training partition.
The final holdout set remained untouched during tuning.
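The split logic can be sketched in plain Python as follows. The seed, the rounding rule, and the round-robin fold assignment are assumptions made for illustration; in practice scikit-learn's `train_test_split(..., stratify=...)` and `StratifiedKFold` serve the same purpose and may produce slightly different partition sizes.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(indices, labels, k=5, seed=42):
    """Yield (fit_idx, val_idx) pairs; each class is dealt round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        fit = sorted(i for g in range(k) if g != f for i in folds[g])
        yield fit, val


# Class counts taken from the dataset table: 500 non-offensive, 477 offensive.
labels = [0] * 500 + [1] * 477
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 782 195

# Cross-validation touches only the training partition; test_idx is never used.
for fit_idx, val_idx in stratified_folds(train_idx, labels):
    print(len(fit_idx), len(val_idx))
```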
Hyperparameters
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Training batch size | 16 |
| Evaluation batch size | 64 |
| Maximum length | 128 |
| Training schedule | 5-fold CV for model selection, then final training |
| Selection metric | Recall for offensive class |
| Loss | Class-weighted cross-entropy |
Class weights of [1.0, 2.0] (non-offensive, offensive) were used to compensate for mild class imbalance and prioritise offensive recall.
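To make the effect of the class weights concrete, here is a framework-free sketch of class-weighted cross-entropy. It follows the weighted-mean convention that PyTorch's `CrossEntropyLoss` uses when given a `weight` tensor with mean reduction; that the training code used exactly this reduction is an assumption, as is the mapping of the second weight to the offensive class.

```python
import math

CLASS_WEIGHTS = [1.0, 2.0]  # assumed order: index 0 non-offensive, index 1 offensive


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """Weighted mean of per-example cross-entropy.

    Each example's loss is scaled by the weight of its true class, and the
    total is divided by the sum of those weights.
    """
    total, weight_sum = 0.0, 0.0
    for row, y in zip(logits, targets):
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]
        total += weights[y] * nll
        weight_sum += weights[y]
    return total / weight_sum


# Both examples are scored as non-offensive; the second is actually offensive.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
print("weighted:  ", weighted_cross_entropy(logits, targets))
print("unweighted:", weighted_cross_entropy(logits, targets, weights=[1.0, 1.0]))
```

The weighted loss is larger than the unweighted one here because the missed offensive example contributes twice as much, which is what pushes training toward higher offensive recall.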
Evaluation Protocol
A major contribution of this work is the use of:
Tag-Free Holdout Evaluation
Although semantic trigger tags are present during training:
- validation data is tag-free;
- holdout test data is tag-free;
- deployment assumes raw Setswana text only.
This design better reflects realistic forensic deployment conditions.
Evaluation Metrics
The following metrics were used:
- Accuracy,
- Macro F1-score,
- Matthews Correlation Coefficient (MCC),
- ROC-AUC,
- Recall for offensive class,
- confusion matrices,
- and explainability analyses.
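The threshold-based metrics in this list can all be derived from the four confusion-matrix counts, as sketched below. ROC-AUC is omitted because it needs predicted scores rather than hard labels. The counts in the example are made up purely for illustration and are not results from this work.

```python
import math


def binary_metrics(tn, fp, fn, tp):
    """Accuracy, macro F1, MCC and offensive-class recall from confusion counts
    (offensive = positive class, label 1)."""
    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    total = tn + fp + fn + tp
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        # Macro F1: unweighted mean of the per-class F1 scores. For class 0,
        # the roles of the counts are swapped (tn acts as its true positives).
        "macro_f1": (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2,
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
        "offensive_recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative counts only; not taken from the evaluation of this model.
for name, value in binary_metrics(tn=8, fp=2, fn=1, tp=9).items():
    print(f"{name}: {value:.4f}")
```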
Explainability
The model supports post-hoc explainability analysis using:
- LIME,
- S-LIME,
- token attribution analysis,
- and counterfactual evaluation.
Associated explainability resources include:
- sanitised attribution outputs,
- counterfactual flip analysis,
- token-level attribution tables,
- and LIME notebooks.
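As a rough illustration of token-level attribution, the sketch below uses leave-one-token-out occlusion: each whitespace token is removed in turn and the drop in the offensive score is recorded. This is a deliberately simpler stand-in, not LIME or S-LIME, and not necessarily the attribution method used in the notebooks. `toy_score` is a keyword-based placeholder that should be replaced by the model's offensive-class probability.

```python
def occlusion_attributions(text, score_fn):
    """Leave-one-token-out attribution: how much the offensive score drops
    when each whitespace token is removed. `score_fn` maps text -> float."""
    tokens = text.split()
    base = score_fn(text)
    out = []
    for i, tok in enumerate(tokens):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        out.append((tok, base - score_fn(reduced)))
    return out


def toy_score(text):
    """Stand-in scorer for demonstration only; not the fine-tuned model."""
    return 0.9 if "boaka" in text.split() else 0.1


for token, delta in occlusion_attributions("O tshwanetse go tlogela boaka", toy_score):
    print(f"{token}\t{delta:+.2f}")
```

With a real model, `score_fn` would wrap tokenisation, a forward pass, and a softmax over the logits, returning the probability of label 1.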
Counterfactual Analysis
Counterfactual testing was used to investigate whether the model relies solely on explicit trigger spans or also learns contextual offensive structure.
The protocol involved:
- neutralising offensive trigger phrases,
- preserving surrounding sentence context,
- and measuring whether predictions flipped.
Several non-flip cases demonstrated that contextual accusatory templates continued to influence predictions even after trigger neutralisation.
This suggests that the model internalised broader offensive semantic patterns beyond isolated trigger tokens.
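The protocol above can be sketched as a small function that neutralises a trigger phrase, keeps the surrounding context, and reports whether the predicted label flips. The placeholder string `[NEUTRAL]`, the function name `counterfactual_flip`, and the keyword-based `toy_predict` are all assumptions for illustration; the neutralisation strategy actually used in the experiments is not specified here.

```python
def counterfactual_flip(text, trigger, neutral, predict):
    """Replace `trigger` with `neutral`, keep the rest of the sentence, and
    report whether the predicted label changes. `predict` maps text -> 0 or 1."""
    if trigger not in text:
        raise ValueError("trigger phrase not found in text")
    edited = text.replace(trigger, neutral, 1)
    before, after = predict(text), predict(edited)
    return {"edited": edited, "before": before, "after": after, "flipped": before != after}


def toy_predict(text):
    """Keyword stand-in used only to make the sketch runnable; not the model."""
    return 1 if "boaka" in text else 0


result = counterfactual_flip(
    "O tshwanetse go tlogela boaka", "boaka", "[NEUTRAL]", toy_predict
)
print(result["flipped"])  # True for this keyword-only stand-in
```

A keyword-only predictor always flips once its keyword is removed. A non-flip from the real model therefore indicates that the remaining context still carries the offensive signal.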
Intended Use Cases
This model is intended for:
- low-resource NLP research,
- offensive-language detection,
- digital forensic investigation,
- cybercrime linguistic analysis,
- cyberbullying research,
- educational experimentation,
- explainability benchmarking,
- and forensic triage support.
Out-of-Scope Use
The model should NOT be used for:
- autonomous legal decision-making,
- punitive moderation without oversight,
- profiling individuals,
- automated criminal attribution,
- or unsupported forensic conclusions.
Human review remains essential.
Ethical Considerations
Because this work involves offensive-language analysis, careful safeguards were applied.
Measures Included
- removal of personally identifiable information,
- masking of sensitive outputs,
- sanitisation of explainability artefacts,
- and controlled release of examples.
No raw harmful dataset is redistributed in this repository.
Limitations
Several limitations remain:
- relatively small dataset size,
- evolving slang and cyberbullying terminology,
- sensitivity to sarcasm and irony,
- code-switching challenges,
- and cultural contextuality.
The model also performs binary classification only and does not explicitly distinguish between:
- phishing,
- harassment,
- threats,
- hate speech,
- or profanity subcategories.
Reproducibility
The associated repository includes:
- training notebooks,
- explainability notebooks,
- evaluation scripts,
- metrics tables,
- figure regeneration scripts,
- and environment configuration files.
Example Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Placeholder model ID; replace with the published repository name.
model_name = "YOUR-USERNAME/YOUR-MODEL"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

# Inference input is plain Setswana text with no <TRIGGER> tags.
text = "O tshwanetse go tlogela boaka"
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128,
)

# No gradients are needed for prediction.
with torch.no_grad():
    outputs = model(**inputs)

probabilities = torch.softmax(outputs.logits, dim=-1)
prediction = torch.argmax(probabilities, dim=-1).item()

label_map = {
    0: "Non-offensive",
    1: "Offensive",
}
print("Prediction:", label_map[prediction])
print("Probabilities:", probabilities)
Citation
@misc{kekgathetse2025puoberta_triggers,
title={PuoBERTa with Train-Time Semantic Triggers for Setswana Offensive Content Detection},
author={Kekgathetse, Bernerdict},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/YOUR-USERNAME/YOUR-MODEL}}
}
Related Resources
- GitHub repository: https://github.com/bkekgathetse/setswana-offensive-977
- Explainability notebooks: see repository notebooks
- Dataset documentation: see dataset repository
- Reproducibility package: Zenodo release
License
Please refer to the repository license.
Recommended:
- Code: MIT or Apache-2.0
- Documentation: CC-BY 4.0
Tags
text-classification,
offensive-language-detection,
Setswana,
tn,
low-resource-nlp,
digital-forensics,
cybersecurity,
semantic-triggers,
puoberta,
lime,
s-lime,
explainable-ai
Model tree for mopatik/PuoBERTa-offensive-language-detection-v2
Base model
dsfsi/PuoBERTa