mopatik/setswana-offensive-977
A fine-tuned transformer model based on PuoBERTa for binary classification of Setswana text into two classes: Non-offensive (0) and Offensive (1).
The model is intended for digital forensic investigations and cybercrime analysis involving Setswana-language social media text.
Not intended for:
roberta-base architecture (PuoBERTa variant)
[1.0, 2.0]
The model was evaluated on a held-out 20% test set that was kept completely unseen during training and cross-validation.
The following metrics were computed:
| Metric | Value |
|---|---|
| Accuracy | 0.8673 |
| Macro F1-score | 0.8662 |
| Recall (Offensive = 1) | 0.8444 |
| Matthews Correlation Coefficient (MCC) | 0.7326 |
| ROC-AUC | 0.9288 |
| Loss | 0.3381 |
| Runtime (seconds) | 0.5897 |
| Samples per second | 332.398 |
| Steps per second | 6.784 |
Overall, with a ROC-AUC of 0.9288 and an MCC of 0.7326 on the held-out test set, the model discriminates well between offensive and non-offensive Setswana text and reaches roughly 87% accuracy on unseen data.
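For readers who want to see how the headline figures in the table are defined, here is a minimal pure-Python sketch of accuracy, macro F1, recall for the Offensive class, MCC, and ROC-AUC. It is not the evaluation script used for this model: the function names are hypothetical, and the labels and scores are made-up toy values chosen only to exercise the formulas.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating class 1 (Offensive) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1(tp, fp, fn):
    """F1 for one class, given its true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def binary_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    f1_offensive = f1(tp, fp, fn)
    # For class 0 the roles swap: its TP is tn, its FP is fn, its FN is fp.
    f1_non_offensive = f1(tn, fn, fp)
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "macro_f1": (f1_offensive + f1_non_offensive) / 2,
        "recall_offensive": tp / (tp + fn) if (tp + fn) else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }

def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (NOT the model's test set): true labels and P(Offensive) scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = binary_metrics(y_true, y_pred)
metrics["roc_auc"] = roc_auc(y_true, scores)
print(metrics)
```

ROC-AUC is computed from the raw scores and so does not depend on the 0.5 cut-off, whereas the other four metrics do.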
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
tokenizer = RobertaTokenizer.from_pretrained("mopatik/PuoBERTa-offensive-detection-v1")
model = RobertaForSequenceClassification.from_pretrained("mopatik/PuoBERTa-offensive-detection-v1")

# Ensure model is in evaluation mode (disables dropout)
model.eval()

# Sample text (replace with your actual text)
# sample_text = "o seso tota"  # Example Setswana text
sample_text = "modimo a le segofatse"  # Example Setswana text

# Tokenize and prepare input
inputs = tokenizer(
    sample_text,
    padding='max_length',
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Make prediction without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    predicted_class = torch.argmax(probs, dim=1).item()

# Get class label and confidence (index 0 = Non-offensive, 1 = Offensive)
class_names = ["Non-offensive", "Offensive"]
confidence = probs[0][predicted_class].item()

print(f"Text: {sample_text}")
print(f"Predicted class: {class_names[predicted_class]} (confidence: {confidence:.2%})")
print(f"Class probabilities: {dict(zip(class_names, [f'{p:.2%}' for p in probs[0].tolist()]))}")
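The post-processing in the snippet above (softmax over the two logits, then a label lookup) can also be sketched without torch, which makes the arithmetic easy to inspect. This is an illustrative sketch only: `label_from_logits` and its `offensive_threshold` parameter are hypothetical names, not part of the model or of the transformers API, and the logits are made-up values.

```python
import math

# Same index order as class_names in the snippet above.
CLASS_NAMES = ["Non-offensive", "Offensive"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_from_logits(logits, offensive_threshold=0.5):
    """Map two logits to (label, probabilities).

    offensive_threshold is a hypothetical knob: the text is labelled
    Offensive only when P(Offensive) reaches it. At 0.5 this matches
    argmax for two classes (apart from exact ties).
    """
    probs = softmax(logits)
    label = CLASS_NAMES[1] if probs[1] >= offensive_threshold else CLASS_NAMES[0]
    return label, probs

# Made-up logits, standing in for outputs.logits[0].tolist()
label, probs = label_from_logits([1.2, -0.3])
print(label, [f"{p:.2%}" for p in probs])
```

Raising `offensive_threshold` above 0.5 flags fewer texts as Offensive, which generally trades recall for precision; whether that trade-off suits a given investigation is a judgment call outside the scope of this sketch.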
Base model: dsfsi/PuoBERTa