---
license: mit
tags:
- text-classification
- hate-speech-detection
- xlm-roberta
- multilingual
language:
- ur
- multilingual
---
# XLM-RoBERTa for Roman Urdu Hate Speech Detection
A fine-tuned XLM-RoBERTa model for detecting hate speech and offensive content in Roman Urdu text.
## Model Description
This model is based on **xlm-roberta-base** and has been fine-tuned on the Hate Speech Roman Urdu (HS-RU-20) dataset for binary classification:
- **Label 0**: Safe/Neutral content
- **Label 1**: Toxic/Hate/Offensive content
## Model Performance
- **F1-Score (Weighted)**: 84.15%
- **Accuracy**: 83.72%
- **Precision**: 84.69%
- **Recall**: 83.72%
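
The scores above are support-weighted averages of per-class metrics. As a toy illustration of how a weighted F1 is computed from per-class precision and recall (hypothetical labels, not this model's actual predictions):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1 scores averaged by class support."""
    classes = sorted(set(y_true))
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (support[c] / total) * f1  # weight each class by its support
    return score

# Toy binary labels (Safe = 0, Toxic = 1)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(f"{weighted_f1(y_true, y_pred):.4f}")  # → 0.6667
```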
## Usage
### Using Transformers Pipeline
```python
from transformers import pipeline
classifier = pipeline(
    "text-classification",
    model="WishAshake/XLM-Roberta"
)
# Classify text
result = classifier("your roman urdu text here")
print(result)
```
### Using AutoModel
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("WishAshake/XLM-Roberta")
model = AutoModelForSequenceClassification.from_pretrained("WishAshake/XLM-Roberta")
# Tokenize and predict
text = "your roman urdu text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
label = "Toxic" if predictions[0][1] > 0.5 else "Safe"
confidence = predictions[0].max().item()
print(f"Label: {label}, Confidence: {confidence:.4f}")
```
## Training Details
- **Base Model**: xlm-roberta-base
- **Training Framework**: Hugging Face Transformers
- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Max Sequence Length**: 128
- **Epochs**: 5 (with early stopping)
- **Optimizer**: AdamW
- **Mixed Precision**: FP16 (when GPU available)
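
A fine-tuning setup matching these hyperparameters might look roughly like the sketch below. This is an illustration, not the published training script; dataset loading and the `compute_metrics` function are elided, and the output directory name is arbitrary.

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)

args = TrainingArguments(
    output_dir="xlmr-roman-urdu-hate",        # arbitrary name for this sketch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    fp16=torch.cuda.is_available(),           # FP16 only when a GPU is present
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,              # required for early stopping
)

trainer = Trainer(
    model=model,
    args=args,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    # train_dataset=..., eval_dataset=..., compute_metrics=... (elided)
)
```

`Trainer` uses AdamW by default, and truncating inputs to 128 tokens would be handled in the tokenization step, as in the usage example above.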
## Dataset
The model was trained on the **Hate Speech Roman Urdu (HS-RU-20)** dataset, which contains:
- Text samples in Roman Urdu
- Binary labels: Safe/Neutral (0) or Toxic/Hate/Offensive (1)
## Limitations
- The model is trained specifically on Roman Urdu text and may not perform well on other languages or scripts
- Performance may vary on different dialects or regional variations of Roman Urdu
- The model may have biases present in the training data
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{xlm-roberta-roman-urdu-hate-speech,
  title={XLM-RoBERTa for Roman Urdu Hate Speech Detection},
  author={Wisha Zahid},
  year={2024},
  howpublished={\url{https://huggingface.co/WishAshake/XLM-Roberta}}
}
```
## License
This model is released under the MIT License.
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/WishaZahid/Roman-Urdu-Hate-Speech-using-XLM-Roberta).