---
license: mit
tags:
- text-classification
- hate-speech-detection
- xlm-roberta
- multilingual
language:
- ur
- multilingual
---

# XLM-RoBERTa for Roman Urdu Hate Speech Detection

A fine-tuned XLM-RoBERTa model for detecting hate speech and offensive content in Roman Urdu text.

## Model Description

This model is based on **xlm-roberta-base** and has been fine-tuned on the Hate Speech Roman Urdu (HS-RU-20) dataset for binary classification:

- **Label 0**: Safe/Neutral content
- **Label 1**: Toxic/Hate/Offensive content
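
The classifier exposes these classes as integer IDs; a minimal sketch of the mapping (the human-readable names here are illustrative — the model config may expose generic `LABEL_0`/`LABEL_1` names instead):

```python
# Illustrative id-to-label mapping for the binary classifier
id2label = {0: "Safe", 1: "Toxic"}
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[1])       # Toxic
print(label2id["Safe"])  # 0
```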

## Model Performance

- **F1-Score (Weighted)**: 84.15%
- **Accuracy**: 83.72%
- **Precision**: 84.69%
- **Recall**: 83.72%
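
For context, a weighted score averages each class's metric weighted by its support (number of true examples of that class). A minimal pure-Python sketch of weighted F1, using made-up per-class counts rather than the model's actual confusion matrix:

```python
def weighted_f1(per_class):
    """Support-weighted F1 from per-class (tp, fp, fn) counts."""
    total_support = sum(tp + fn for tp, fp, fn in per_class)
    score = 0.0
    for tp, fp, fn in per_class:
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (tp + fn) / total_support * f1  # weight by class support
    return score

# Hypothetical (tp, fp, fn) counts for the Safe and Toxic classes
print(round(weighted_f1([(80, 10, 10), (85, 10, 10)]), 4))  # 0.8919
```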

## Usage

### Using Transformers Pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="WishAshake/XLM-Roberta"
)

# Classify text
result = classifier("your roman urdu text here")
print(result)
```

### Using AutoModel

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("WishAshake/XLM-Roberta")
model = AutoModelForSequenceClassification.from_pretrained("WishAshake/XLM-Roberta")

# Tokenize and predict
text = "your roman urdu text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

label = "Toxic" if predictions[0][1] > 0.5 else "Safe"
confidence = predictions[0][1].item() if predictions[0][1] > 0.5 else predictions[0][0].item()
print(f"Label: {label}, Confidence: {confidence:.4f}")
```

## Training Details

- **Base Model**: xlm-roberta-base
- **Training Framework**: Hugging Face Transformers
- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Max Sequence Length**: 128
- **Epochs**: 5 (with early stopping)
- **Optimizer**: AdamW
- **Mixed Precision**: FP16 (when GPU available)
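
The hyperparameters above can be expressed as a Hugging Face `TrainingArguments` configuration; the sketch below mirrors the list, but the `output_dir` and the epoch-level evaluation/checkpoint settings are illustrative assumptions, not the exact training script:

```python
from transformers import TrainingArguments

# Values taken from the hyperparameter list above;
# output_dir and the eval/save strategies are assumptions.
training_args = TrainingArguments(
    output_dir="./xlmr-roman-urdu-hate-speech",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    fp16=True,                    # mixed precision when a GPU is available
    eval_strategy="epoch",        # evaluate each epoch...
    save_strategy="epoch",        # ...and checkpoint each epoch
    load_best_model_at_end=True,  # needed for early stopping
)
```

Early stopping itself would be attached via `transformers.EarlyStoppingCallback` when constructing the `Trainer`; note that `eval_strategy` was named `evaluation_strategy` in older `transformers` releases.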

## Dataset

The model was trained on the **Hate Speech Roman Urdu (HS-RU-20)** dataset, which contains:

- Text samples in Roman Urdu
- Binary labels: Safe/Neutral (0) or Toxic/Hate/Offensive (1)

## Limitations

- The model is trained specifically on Roman Urdu text and may not perform well on other languages or scripts.
- Performance may vary across dialects and regional variations of Roman Urdu.
- The model may reproduce biases present in the training data.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{xlm-roberta-roman-urdu-hate-speech,
  title={XLM-RoBERTa for Roman Urdu Hate Speech Detection},
  author={Wisha Zahid},
  year={2024},
  howpublished={\url{https://huggingface.co/WishAshake/XLM-Roberta}}
}
```

## License

This model is released under the MIT License.

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/WishaZahid/Roman-Urdu-Hate-Speech-using-XLM-Roberta).