XLM-RoBERTa for Roman Urdu Hate Speech Detection

A fine-tuned XLM-RoBERTa model for detecting hate speech and offensive content in Roman Urdu text.

Model Description

This model is based on xlm-roberta-base and has been fine-tuned on the Hate Speech Roman Urdu (HS-RU-20) dataset for binary classification:

  • Label 0: Safe/Neutral content
  • Label 1: Toxic/Hate/Offensive content

Model Performance

  • F1-Score (Weighted): 84.15%
  • Accuracy: 83.72%
  • Precision: 84.69%
  • Recall: 83.72%

Usage

Using Transformers Pipeline

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="WishAshake/XLM-Roberta"
)

# Classify text
result = classifier("your roman urdu text here")
print(result)
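Unless `id2label` is set in the model config, the pipeline returns the generic names `LABEL_0`/`LABEL_1`. A small post-processing step (a sketch, assuming the default label names) can map them to the documented meanings:

```python
# Hypothetical raw output from the pipeline call above; the default label
# names are LABEL_0 / LABEL_1 unless id2label is set in the model config.
raw = [{"label": "LABEL_1", "score": 0.97}]

# Map generic label names to the documented meanings (0 = Safe, 1 = Toxic).
label_names = {"LABEL_0": "Safe", "LABEL_1": "Toxic"}
readable = [
    {"label": label_names.get(r["label"], r["label"]), "score": r["score"]}
    for r in raw
]
print(readable)  # [{'label': 'Toxic', 'score': 0.97}]
```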

Using AutoModel

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("WishAshake/XLM-Roberta")
model = AutoModelForSequenceClassification.from_pretrained("WishAshake/XLM-Roberta")

# Tokenize and predict
text = "your roman urdu text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Take the higher-probability class and report its probability as confidence
pred = torch.argmax(predictions, dim=-1).item()
label = "Toxic" if pred == 1 else "Safe"
confidence = predictions[0][pred].item()
print(f"Label: {label}, Confidence: {confidence:.4f}")
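The softmax-and-threshold step above can be expressed framework-free for clarity. This is an illustrative sketch; `classify_from_logits` is a hypothetical helper, not part of the model's API:

```python
import math

def classify_from_logits(logits, threshold=0.5):
    """Framework-free sketch of the softmax + threshold logic above.

    `logits` is a pair [safe_logit, toxic_logit] as produced by the
    two-class classification head.
    """
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]        # softmax over the two classes
    label = "Toxic" if probs[1] > threshold else "Safe"
    return label, max(probs)                 # winning class and its probability

print(classify_from_logits([2.0, -1.0]))     # labels this example "Safe"
```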

Training Details

  • Base Model: xlm-roberta-base
  • Training Framework: Hugging Face Transformers
  • Learning Rate: 2e-5
  • Batch Size: 16
  • Max Sequence Length: 128
  • Epochs: 5 (with early stopping)
  • Optimizer: AdamW
  • Mixed Precision: FP16 (when GPU available)
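The hyperparameters above can be sketched as a Hugging Face `TrainingArguments` configuration. This is a reconstruction under stated assumptions, not the published training script; argument names follow the Trainer API and may differ slightly across Transformers versions:

```python
from transformers import TrainingArguments

# Sketch of the reported fine-tuning setup; output_dir is hypothetical.
training_args = TrainingArguments(
    output_dir="./xlmr-roman-urdu-hate",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    fp16=True,                      # mixed precision when a GPU is available
    evaluation_strategy="epoch",    # evaluate each epoch for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,    # required by EarlyStoppingCallback
    metric_for_best_model="f1",     # assumed; matches the reported F1 metric
)
```

Early stopping would then be added via `transformers.EarlyStoppingCallback` when constructing the `Trainer`.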

Dataset

The model was trained on the Hate Speech Roman Urdu (HS-RU-20) dataset, which contains:

  • Text samples in Roman Urdu
  • Binary labels: Safe/Neutral (0) or Toxic/Hate/Offensive (1)

Limitations

  • The model is trained specifically on Roman Urdu text and may not perform well on other languages or scripts
  • Performance may vary on different dialects or regional variations of Roman Urdu
  • The model may have biases present in the training data

Citation

If you use this model in your research, please cite:

@misc{xlm-roberta-roman-urdu-hate-speech,
  title={XLM-RoBERTa for Roman Urdu Hate Speech Detection},
  author={Wisha Zahid},
  year={2024},
  howpublished={\url{https://huggingface.co/WishAshake/XLM-Roberta}}
}

License

This model is released under the MIT License.

Contact

For questions or issues, please open an issue on the GitHub repository.
