PuoBERTa with Train-Time Semantic Triggers for Setswana Offensive Content Detection

Model Overview

This repository hosts a version of PuoBERTa, a monolingual Setswana language model based on the RoBERTa architecture, fine-tuned for binary offensive-content detection in Setswana digital communication.

The model was developed as part of ongoing research in:

low-resource African language NLP,
digital forensic investigation,
explainable artificial intelligence (XAI),
cybercrime linguistic analysis,
and offensive-language detection in Setswana.

The objective of the work is to investigate whether semantic trigger supervision during training can improve offensive-language detection while still allowing the model to generalise to realistic, unannotated inference conditions.

The classifier performs binary prediction:

Label	Meaning
0	Non-offensive
1	Offensive

The model is specifically designed for:

offensive-language detection,
cyberbullying analysis,
phishing/scam linguistic analysis,
online harassment detection,
explainable NLP research,
and digital forensic triage support.

Research Context

Offensive-language detection in low-resource African languages remains significantly underexplored compared to high-resource languages such as English, German, or Chinese. Existing datasets for African languages are often:

small,
weakly annotated,
heavily imbalanced,
culturally sparse,
or lacking explainability-oriented supervision.

In Setswana specifically, offensive meaning is frequently conveyed through:

implicit accusatory structures,
metaphorical expressions,
euphemisms,
slang,
code-switching,
culturally contextual phrases,
and morphologically variable abusive constructions.

Traditional sentence-level supervision alone may therefore fail to adequately highlight the semantically dominant portions of offensive expressions.

To address this challenge, this work introduces train-time semantic trigger supervision.

Semantic Trigger Supervision

Core Idea

During training, offensive spans are explicitly marked using lightweight XML-style trigger tags:

<TRIGGER> ... </TRIGGER>

These tags identify semantically important abusive or suspicious regions within a sentence.

Example:

O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>

In this example:

the full sentence provides contextual information;
the trigger span identifies the dominant offensive cue;
the model learns both:
- sentence-level classification,
- and implicit attention toward offensive semantic regions.

This approach provides weak span-level supervision without requiring a dedicated token-classification architecture.
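As a rough illustration of how a training sentence might be marked up, the sketch below wraps one annotated span in trigger tags. The helper name `tag_trigger` and the choice to tag only the first occurrence of the span are assumptions made for this example; the actual annotation tooling is not described in this card.

```python
TRIGGER_OPEN = "<TRIGGER>"
TRIGGER_CLOSE = "</TRIGGER>"


def tag_trigger(text: str, span: str) -> str:
    """Wrap the first occurrence of `span` in trigger tags (hypothetical helper)."""
    start = text.find(span)
    if start == -1:
        raise ValueError(f"span {span!r} not found in text")
    end = start + len(span)
    # Everything outside the span is left untouched so the sentence context is preserved.
    return text[:start] + TRIGGER_OPEN + span + TRIGGER_CLOSE + text[end:]


print(tag_trigger("O tshwanetse go tlogela boaka", "boaka"))
```

The sentence label stays at the sequence level; only the input string changes.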

Why Train-Time Triggers?

The motivation for semantic triggers comes from several observations:

1. Offensive Meaning Is Often Contextual

In Setswana discourse, offensiveness may emerge from:

sentence framing,
pragmatic intent,
metaphorical reference,
or accusatory structure,

rather than from isolated keywords alone.

Example:

o tla ipona

may imply:

threat,
intimidation,
or warning,

depending on surrounding context.

2. Offensive Keywords Alone Are Insufficient

Some offensive phrases contain:

benign-looking words,
culturally contextual insults,
or implicit hostility.

Pure keyword-based learning may therefore produce:

unstable generalisation,
overfitting,
or poor recall.

3. Low-Resource Datasets Need Additional Supervision Signals

Because the dataset is small (977 examples) compared to large-scale English corpora, semantic trigger tags provide an additional learning signal that encourages the transformer to focus on linguistically meaningful abusive regions during fine-tuning.

Trigger Annotation Strategy

Trigger Span Definition

Trigger spans correspond to offensive or suspicious semantic regions such as:

insults,
threats,
phishing cues,
dehumanising phrases,
harassment expressions,
abusive metaphors,
or cyberbullying constructions.

Examples:

<TRIGGER>o sematla</TRIGGER>

ke tla go fitlhela <TRIGGER>o tla ipona</TRIGGER>

<TRIGGER>re fe dinomoro tsa gago</TRIGGER>

<TRIGGER>basadi ga ba tshwanela go tsaya ditshwetso</TRIGGER>

Important Experimental Constraint

Trigger Tags Are Used ONLY During Training

A critical aspect of this work is that semantic trigger tags are used exclusively during training.

Validation, testing, and deployment are all performed under tag-free conditions.

This means:

validation inputs contain NO trigger markers;
holdout test inputs contain NO trigger markers;
real-world inference assumes ordinary Setswana text.

Example inference input:

O tshwanetse go tlogela boaka

NOT:

O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>

This protocol prevents unrealistic dependence on artificial tags during deployment.
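One way to enforce the tag-free condition is to strip any trigger markers before text reaches validation, testing, or inference. The following is a minimal sketch; the function name `strip_trigger_tags` and the whitespace normalisation are assumptions for illustration, not the author's preprocessing code.

```python
import re

# Matches both the opening and the closing trigger marker.
_TRIGGER_TAG = re.compile(r"</?TRIGGER>")


def strip_trigger_tags(text: str) -> str:
    """Remove trigger markers while keeping the words they enclosed."""
    untagged = _TRIGGER_TAG.sub("", text)
    # Collapse whitespace left behind when tags were padded with spaces.
    return re.sub(r"\s+", " ", untagged).strip()


tagged = "O tshwanetse go tlogela <TRIGGER>boaka</TRIGGER>"
print(strip_trigger_tags(tagged))
```

Text that contains no tags passes through unchanged apart from whitespace normalisation, so the same function can be applied to every evaluation input.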

Experimental Hypothesis

The central research hypothesis is:

Train-time semantic trigger supervision can guide transformer attention toward linguistically meaningful offensive cues while preserving realistic inference-time behaviour under tag-free conditions.

The expectation is that trigger supervision:

improves semantic sensitivity,
improves interpretability,
improves offensive recall,
and encourages the model to internalise offensive patterns rather than memorising explicit markers.

Dataset Description

The dataset consists of manually curated Setswana social media text samples collected from publicly accessible online discourse.

The corpus contains:

Class	Count
Non-offensive	500
Offensive	477
Total	977

The data includes examples involving:

insults,
harassment,
cyberbullying,
threats,
phishing/scam language,
discriminatory expressions,
and vulgarity.

Dataset Structure

The corpus follows a CSV format inspired by established offensive-language datasets such as:

OLID,
HateCheck,
and related NLP benchmarks.

Expected structure:

TEXT,TARGET

Where:

Column	Description
TEXT	Setswana sentence or comment
TARGET	Offensive / Non-offensive label
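A file in this layout could be read with the standard library as sketched below. The sample rows are placeholders invented for the example (the second row is not taken from the corpus), and storing `TARGET` as the integers 0 and 1 is an assumption based on the label table above.

```python
import csv
import io

# Hypothetical sample in the TEXT,TARGET layout; not real corpus data.
SAMPLE_CSV = """TEXT,TARGET
O tshwanetse go tlogela boaka,1
Dumela rra,0
"""


def load_rows(handle):
    """Return (text, label) pairs, checking that both expected columns exist."""
    reader = csv.DictReader(handle)
    missing = {"TEXT", "TARGET"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Assumes TARGET holds 0 (non-offensive) or 1 (offensive).
    return [(row["TEXT"], int(row["TARGET"])) for row in reader]


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(rows)
```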

Annotation Procedure

The dataset was manually annotated using predefined operational guidelines.

Annotation Categories

Offensive content includes:

profanity,
harassment,
hate speech,
cyberbullying,
threats,
phishing/scam expressions,
dehumanising language,
and targeted abuse.

Non-offensive content includes:

neutral discourse,
ordinary conversation,
benign statements,
and non-abusive contextual usage.

Inter-Annotator Agreement

Annotation quality was assessed using:

Cohen’s kappa,
double-coded subsets,
adjudication procedures,
and calibration rounds.

Substantial agreement was achieved during annotation.
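For readers unfamiliar with the statistic, Cohen's kappa on a double-coded subset can be computed as in the sketch below. The two label lists are made-up toy values for illustration only and do not reflect the agreement obtained for this dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Degenerate case (only one label ever used); kappa is undefined,
        # and this sketch reports it as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels for illustration only.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 1, 0, 0, 0]
print(round(cohens_kappa(annotator_a, annotator_b), 3))  # 0.75
```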

Model Architecture

Component	Details
Base model	PuoBERTa
Architecture family	RoBERTa
Task	Sequence Classification
Number of labels	2
Maximum sequence length	128
Framework	Hugging Face Transformers
Backend	PyTorch

Training Configuration

Experimental Split

The dataset was partitioned using:

80% training split,
20% holdout test split.

Model selection was performed using:

5-fold stratified cross-validation on the training partition.

The final holdout set remained untouched during tuning.
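The split protocol can be sketched with the standard library as follows. The seed, the helper names, and stratifying the holdout split by class are assumptions made for this example; scikit-learn's `train_test_split` (with `stratify`) and `StratifiedKFold` offer equivalent functionality.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.2, seed=42):
    """Split indices into train/test while keeping class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(indices, labels, k=5, seed=42):
    """Yield (fit, validation) index lists for k stratified folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        # Deal each class out round-robin so every fold gets a similar mix.
        for position, idx in enumerate(class_indices):
            folds[position % k].append(idx)
    for held_out in range(k):
        val = sorted(folds[held_out])
        fit = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
        yield fit, val


# Class counts from the dataset table: 500 non-offensive, 477 offensive.
labels = [0] * 500 + [1] * 477
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))
for fit_idx, val_idx in stratified_folds(train_idx, labels):
    print(len(fit_idx), len(val_idx))
```

The holdout indices are produced once and never passed to `stratified_folds`, which mirrors the requirement that the test set stays untouched during tuning.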

Hyperparameters

Parameter	Value
Optimizer	AdamW
Learning rate	1e-5
Weight decay	0.01
Training batch size	16
Evaluation batch size	64
Maximum length	128
Training schedule	5-fold CV + final training
Selection metric	Recall for offensive class
Loss	Class-weighted cross-entropy

Class weights of [1.0, 2.0] (non-offensive, offensive) were used to compensate for the mild class imbalance (500 vs. 477 examples) and to prioritise recall on the offensive class.
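To make the effect of the class weights concrete, the sketch below computes a class-weighted cross-entropy in plain Python. It assumes the weighted-mean normalisation that PyTorch's `CrossEntropyLoss(weight=...)` applies with its default reduction; whether the training code used exactly that setup is not stated here, and the logits are toy values.

```python
import math

# Weight per class index: 0 = non-offensive, 1 = offensive.
CLASS_WEIGHTS = [1.0, 2.0]


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """Weighted mean of per-example cross-entropy over raw logits."""
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normaliser.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        nll = log_norm - row[target]
        total += weights[target] * nll
        weight_sum += weights[target]
    return total / weight_sum


# Toy batch: both examples are scored as non-offensive; the second is actually offensive.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
print(weighted_cross_entropy(logits, targets))              # class-weighted
print(weighted_cross_entropy(logits, targets, [1.0, 1.0]))  # unweighted baseline
```

In the toy batch the missed offensive example contributes twice as much as it would under uniform weights, which is how the weighting pushes the model toward higher offensive recall.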

Evaluation Protocol

Tag-Free Holdout Evaluation

A major contribution of this work is tag-free holdout evaluation.

Although semantic trigger tags are present during training:

validation data is tag-free;
holdout test data is tag-free;
deployment assumes raw Setswana text only.

This design better reflects realistic forensic deployment conditions.

Evaluation Metrics

The following metrics were used:

Accuracy,
Macro F1-score,
Matthews Correlation Coefficient (MCC),
ROC-AUC,
Recall for offensive class,
confusion matrices,
and explainability analyses.
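The threshold-based metrics in this list can all be derived from the binary confusion matrix, as in the sketch below. The labels and predictions are toy values for illustration, not results from this model; ROC-AUC is omitted because it needs predicted scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, offensive recall, macro F1, and MCC from 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def safe_div(numerator, denominator):
        # Report 0.0 instead of failing when a class is absent.
        return numerator / denominator if denominator else 0.0

    f1_offensive = safe_div(2 * tp, 2 * tp + fp + fn)
    f1_non_offensive = safe_div(2 * tn, 2 * tn + fn + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": safe_div(tp + tn, len(y_true)),
        "recall_offensive": safe_div(tp, tp + fn),
        "macro_f1": (f1_offensive + f1_non_offensive) / 2,
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```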

Explainability

The model supports post-hoc explainability analysis using:

LIME,
S-LIME,
token attribution analysis,
and counterfactual evaluation.

Associated explainability resources include:

sanitised attribution outputs,
counterfactual flip analysis,
token-level attribution tables,
and LIME notebooks.

Counterfactual Analysis

Counterfactual testing was used to investigate whether the model relies solely on explicit trigger spans or also learns contextual offensive structure.

The protocol involved:

neutralising offensive trigger phrases,
preserving surrounding sentence context,
and measuring whether predictions flipped.

Several non-flip cases demonstrated that contextual accusatory templates continued to influence predictions even after trigger neutralisation.

This suggests that the model internalised broader offensive semantic patterns beyond isolated trigger tokens.
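The flip protocol can be sketched as a small harness that compares predictions before and after neutralisation. Everything here is illustrative: `toy_predict` is a keyword stand-in for the real classifier, and replacing the trigger with `___` is an assumed neutralisation scheme, since the exact substitution used in the study is not specified in this card.

```python
def flip_report(pairs, predict):
    """Compare predictions on (original, neutralised) text pairs.

    `predict` is any callable mapping a string to a label (0 or 1).
    """
    results = []
    for original, neutralised in pairs:
        before = predict(original)
        after = predict(neutralised)
        results.append({
            "original": original,
            "neutralised": neutralised,
            "before": before,
            "after": after,
            "flipped": before != after,
        })
    # Flip rate is measured only over inputs first predicted as offensive.
    offensive = [r for r in results if r["before"] == 1]
    flip_rate = sum(r["flipped"] for r in offensive) / len(offensive) if offensive else 0.0
    return results, flip_rate


def toy_predict(text):
    """Keyword stand-in for the classifier (hypothetical, for the demo only)."""
    return int("boaka" in text)


pairs = [("O tshwanetse go tlogela boaka", "O tshwanetse go tlogela ___")]
results, flip_rate = flip_report(pairs, toy_predict)
print(results[0]["flipped"], flip_rate)
```

With the real model substituted for `toy_predict`, the entries where `flipped` is false are the non-flip cases discussed above.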

Intended Use Cases

This model is intended for:

low-resource NLP research,
offensive-language detection,
digital forensic investigation,
cybercrime linguistic analysis,
cyberbullying research,
educational experimentation,
explainability benchmarking,
and forensic triage support.

Out-of-Scope Use

The model should NOT be used for:

autonomous legal decision-making,
punitive moderation without oversight,
profiling individuals,
automated criminal attribution,
or unsupported forensic conclusions.

Human review remains essential.

Ethical Considerations

Because this work involves offensive-language analysis, careful safeguards were applied.

Measures Included

removal of personally identifiable information,
masking of sensitive outputs,
sanitisation of explainability artefacts,
and controlled release of examples.

No raw harmful dataset is redistributed in this repository.

Limitations

Several limitations remain:

relatively small dataset size,
evolving slang and cyberbullying terminology,
sensitivity to sarcasm and irony,
code-switching challenges,
and cultural contextuality.

The model also performs binary classification only and does not explicitly distinguish between:

phishing,
harassment,
threats,
hate speech,
or profanity subcategories.

Reproducibility

The associated repository includes:

training notebooks,
explainability notebooks,
evaluation scripts,
metrics tables,
figure regeneration scripts,
and environment configuration files.

Example Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Placeholder repository ID; replace with the published model path.
model_name = "YOUR-USERNAME/YOUR-MODEL"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Inference input is ordinary Setswana text with no trigger tags.
text = "O tshwanetse go tlogela boaka"

# Tokenise with the same maximum length used during training (128).
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128
)

# Disable gradient tracking for inference.
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and pick the most likely label.
probabilities = torch.softmax(outputs.logits, dim=-1)
prediction = torch.argmax(probabilities, dim=-1).item()

label_map = {
    0: "Non-offensive",
    1: "Offensive"
}

print("Prediction:", label_map[prediction])
print("Probabilities:", probabilities)

Citation

@misc{kekgathetse2025puoberta_triggers,
  title={PuoBERTa with Train-Time Semantic Triggers for Setswana Offensive Content Detection},
  author={Kekgathetse, Bernerdict},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/YOUR-USERNAME/YOUR-MODEL}}
}

Related Resources

GitHub repository: https://github.com/bkekgathetse/setswana-offensive-977
Explainability notebooks: see repository notebooks
Dataset documentation: see dataset repository
Reproducibility package: Zenodo release

License

Please refer to the repository license.

Recommended:

Code: MIT or Apache-2.0
Documentation: CC-BY 4.0