# XLM-RoBERTa for Roman Urdu Hate Speech Detection
A fine-tuned XLM-RoBERTa model for detecting hate speech and offensive content in Roman Urdu text.
## Model Description
This model is based on xlm-roberta-base and has been fine-tuned on the Hate Speech Roman Urdu (HS-RU-20) dataset for binary classification:
- Label 0: Safe/Neutral content
- Label 1: Toxic/Hate/Offensive content
## Model Performance
- F1-Score (Weighted): 84.15%
- Accuracy: 83.72%
- Precision: 84.69%
- Recall: 83.72%
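The "weighted" variants of these scores average the per-class metric, weighting each class by its support (number of true examples). A minimal sketch of how a weighted F1 falls out of a binary confusion matrix, using made-up counts rather than this model's actual evaluation data:

```python
# Hypothetical confusion-matrix counts for illustration only
# (NOT the model's actual evaluation numbers).
tp, fp, fn, tn = 80, 15, 12, 93  # class 1 = Toxic, class 0 = Safe

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Per-class F1: treat each class as the "positive" class in turn.
f1_toxic = f1(tp, fp, fn)   # class 1
f1_safe = f1(tn, fn, fp)    # class 0, with the error roles swapped

# Weighted F1 averages per-class F1 by class support (true counts).
support_toxic = tp + fn
support_safe = tn + fp
weighted_f1 = (f1_toxic * support_toxic + f1_safe * support_safe) / (
    support_toxic + support_safe
)
print(f"Weighted F1: {weighted_f1:.4f}")
```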
## Usage

### Using Transformers Pipeline
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="WishAshake/XLM-Roberta"
)

# Classify a single Roman Urdu sentence
result = classifier("your roman urdu text here")
print(result)
```
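The pipeline returns a list of dicts with a label string and a score. Unless the model config defines an `id2label` mapping, the labels come back as generic ids; a small post-processing step can map them to readable names. The `LABEL_0`/`LABEL_1` strings below are the Transformers defaults and an assumption here — check the actual pipeline output for this model:

```python
# Hypothetical pipeline output for illustration; the actual label strings
# depend on the model config's id2label mapping.
result = [{"label": "LABEL_1", "score": 0.97}]

# Assumed mapping from default label ids to the classes described above.
label_names = {"LABEL_0": "Safe/Neutral", "LABEL_1": "Toxic/Hate/Offensive"}

pred = result[0]
readable = label_names.get(pred["label"], pred["label"])
print(f"{readable} ({pred['score']:.2%})")
```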
### Using AutoModel
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("WishAshake/XLM-Roberta")
model = AutoModelForSequenceClassification.from_pretrained("WishAshake/XLM-Roberta")
model.eval()  # disable dropout for inference

# Tokenize and predict
text = "your roman urdu text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Class 1 is Toxic/Hate/Offensive; report the winning class's probability
label = "Toxic" if predictions[0][1] > 0.5 else "Safe"
confidence = predictions[0].max().item()
print(f"Label: {label}, Confidence: {confidence:.4f}")
```
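The softmax step above turns the model's two raw logits into probabilities that sum to 1, and the 0.5 threshold then picks the class. A self-contained sketch of that decision rule, using hypothetical logits rather than real model output:

```python
import math

# Hypothetical logits for [Safe, Toxic]; real values come from the model.
logits = [-1.2, 2.3]

# Softmax: exp(z_i) / sum_j exp(z_j)
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Same decision rule as above: Toxic if P(class 1) > 0.5
label = "Toxic" if probs[1] > 0.5 else "Safe"
confidence = max(probs)
print(f"Label: {label}, Confidence: {confidence:.4f}")
```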
## Training Details
- Base Model: xlm-roberta-base
- Training Framework: Hugging Face Transformers
- Learning Rate: 2e-5
- Batch Size: 16
- Max Sequence Length: 128
- Epochs: 5 (with early stopping)
- Optimizer: AdamW
- Mixed Precision: FP16 (when GPU available)
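The hyperparameters above could be expressed as a Hugging Face `TrainingArguments` configuration. This is a hedged sketch, not the author's published training script: the output directory, evaluation/save strategies, and best-model metric are assumptions added to make early stopping work.

```python
from transformers import TrainingArguments

# Sketch of a configuration matching the hyperparameters listed above.
# output_dir, the epoch-level strategies, and metric_for_best_model are
# assumptions, not the model author's confirmed settings.
training_args = TrainingArguments(
    output_dir="xlmr-roman-urdu-hate",    # hypothetical path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    fp16=True,                            # mixed precision when a GPU is available
    evaluation_strategy="epoch",          # "eval_strategy" in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,          # required by EarlyStoppingCallback
    metric_for_best_model="f1",
)
```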
## Dataset
The model was trained on the Hate Speech Roman Urdu (HS-RU-20) dataset, which contains:
- Text samples in Roman Urdu
- Binary labels: Safe/Neutral (0) or Toxic/Hate/Offensive (1)
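Preparing such a dataset for fine-tuning typically means mapping the class names to the integer ids above. A minimal sketch with made-up rows (the actual HS-RU-20 samples and its raw label column names are not reproduced here):

```python
# Made-up example rows for illustration; not actual HS-RU-20 data.
rows = [
    {"text": "example roman urdu sentence", "label": "Neutral"},
    {"text": "another example sentence", "label": "Offensive"},
]

# Assumed raw label names mapped to the ids the model expects.
label2id = {"Neutral": 0, "Offensive": 1}
encoded = [{"text": r["text"], "label": label2id[r["label"]]} for r in rows]
print(encoded)
```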
## Limitations
- The model is trained specifically on Roman Urdu text and may not perform well on other languages or scripts.
- Performance may vary across dialects and regional spelling variations of Roman Urdu.
- The model may reflect biases present in the training data.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{xlm-roberta-roman-urdu-hate-speech,
  title={XLM-RoBERTa for Roman Urdu Hate Speech Detection},
  author={Wisha Zahid},
  year={2024},
  howpublished={\url{https://huggingface.co/WishAshake/XLM-Roberta}}
}
```
## License
This model is released under the MIT License.
## Contact
For questions or issues, please open an issue on the GitHub repository.