---
license: mit
tags:
- text-classification
- hate-speech-detection
- xlm-roberta
- multilingual
language:
- ur
- multilingual
---

# XLM-RoBERTa for Roman Urdu Hate Speech Detection

A fine-tuned XLM-RoBERTa model for detecting hate speech and offensive content in Roman Urdu text.

## Model Description

This model is based on **xlm-roberta-base** and has been fine-tuned on the Hate Speech Roman Urdu (HS-RU-20) dataset for binary classification:
- **Label 0**: Safe/Neutral content
- **Label 1**: Toxic/Hate/Offensive content

## Model Performance

- **F1-Score (Weighted)**: 84.15%
- **Accuracy**: 83.72%
- **Precision**: 84.69%
- **Recall**: 83.72%
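
These are standard classification metrics; for reference, here is a minimal sketch of how they can be computed with scikit-learn, assuming `y_true` and `y_pred` hold gold and predicted 0/1 labels (the values below are illustrative, not the actual evaluation data):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative labels only (0 = Safe, 1 = Toxic); not the actual test set
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision (weighted): {precision:.4f}")
print(f"Recall (weighted): {recall:.4f}")
print(f"F1 (weighted): {f1:.4f}")
```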

## Usage

### Using Transformers Pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="WishAshake/XLM-Roberta"
)

# Classify text
result = classifier("your roman urdu text here")
print(result)
```
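
Depending on the label names stored in the model config, the pipeline may report generic labels such as `LABEL_0`/`LABEL_1`. A small mapping (the readable names below are our assumption, mirroring the label scheme above) makes the output easier to consume; the pipeline also accepts a list of texts:

```python
# Assumed mapping from config label names to the scheme described above
label_names = {"LABEL_0": "Safe", "LABEL_1": "Toxic"}

results = classifier(["pehla jumla yahan", "doosra jumla yahan"])  # batched input
for item in results:
    print(label_names.get(item["label"], item["label"]), f"{item['score']:.4f}")
```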

### Using AutoModel

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("WishAshake/XLM-Roberta")
model = AutoModelForSequenceClassification.from_pretrained("WishAshake/XLM-Roberta")

# Tokenize and predict
text = "your roman urdu text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

label = "Toxic" if predictions[0][1] > 0.5 else "Safe"
confidence = predictions[0][1].item() if predictions[0][1] > 0.5 else predictions[0][0].item()
print(f"Label: {label}, Confidence: {confidence:.4f}")
```
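
The same tokenizer and model can score several texts in one forward pass; a minimal batched sketch (the texts are placeholders):

```python
texts = ["pehla jumla yahan", "doosra jumla yahan", "teesra jumla yahan"]
inputs = tokenizer(
    texts, return_tensors="pt", truncation=True, padding=True, max_length=128
)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

for text, p in zip(texts, probs):
    label = "Toxic" if p[1] > 0.5 else "Safe"
    print(f"{text!r}: {label} ({p.max().item():.4f})")
```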

## Training Details

- **Base Model**: xlm-roberta-base
- **Training Framework**: Hugging Face Transformers
- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Max Sequence Length**: 128
- **Epochs**: 5 (with early stopping)
- **Optimizer**: AdamW
- **Mixed Precision**: FP16 (when GPU available)
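
The card does not include the full training script; the following is a hedged sketch of a comparable setup with these hyperparameters using the `Trainer` API. `train_ds` and `eval_ds` are assumed to be pre-tokenized splits (see the dataset sketch below), and the output directory name is hypothetical:

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(
    output_dir="xlmr-roman-urdu-hate",  # hypothetical name
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    fp16=True,                 # mixed precision; requires a GPU
    eval_strategy="epoch",     # "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # assumed: pre-tokenized train/eval splits
    eval_dataset=eval_ds,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```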

## Dataset

The model was trained on the **Hate Speech Roman Urdu (HS-RU-20)** dataset, which contains:
- Text samples in Roman Urdu
- Binary labels: Safe/Neutral (0) or Toxic/Hate/Offensive (1)
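
The exact distribution format of HS-RU-20 is not pinned down here; assuming CSV files with `text` and `label` (0/1) columns, preprocessing that produces the `train_ds`/`eval_ds` splits assumed in the training sketch above could look like this (file names are placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Assumed layout: CSVs with "text" and "label" columns; the actual HS-RU-20
# file names and column names may differ.
ds = load_dataset(
    "csv",
    data_files={"train": "hs_ru_20_train.csv", "validation": "hs_ru_20_dev.csv"},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

ds = ds.map(tokenize, batched=True)
train_ds, eval_ds = ds["train"], ds["validation"]
```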

## Limitations

- The model is trained specifically on Roman Urdu text and may not perform well on other languages or scripts
- Performance may vary across dialects and the regional spelling variations common in Roman Urdu
- The model may reflect biases present in the training data

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{xlm-roberta-roman-urdu-hate-speech,
  title={XLM-RoBERTa for Roman Urdu Hate Speech Detection},
  author={Wisha Zahid},
  year={2024},
  howpublished={\url{https://huggingface.co/WishAshake/XLM-Roberta}}
}
```

## License

This model is released under the MIT License.

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/WishaZahid/Roman-Urdu-Hate-Speech-using-XLM-Roberta).