XLM-RoBERTa for Roman Urdu Hate Speech Detection

A fine-tuned XLM-RoBERTa model for detecting hate speech and offensive content in Roman Urdu text.

Model Description

This model is based on xlm-roberta-base and has been fine-tuned on the Hate Speech Roman Urdu (HS-RU-20) dataset for binary classification:

  • Label 0: Safe/Neutral content
  • Label 1: Toxic/Hate/Offensive content

Model Performance

  • F1-Score (Weighted): 84.15%
  • Accuracy: 83.72%
  • Precision: 84.69%
  • Recall: 83.72%

Usage

Using Transformers Pipeline

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="WishAshake/XLM-Roberta"
)

# Classify text
result = classifier("your roman urdu text here")
print(result)
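Unless `id2label` is set in the model config, the pipeline returns the generic names `LABEL_0`/`LABEL_1`. A small post-processing step (a sketch, assuming the default label names) can map them to the documented meanings:

```python
# Hypothetical raw output from the pipeline call above; the default label
# names are LABEL_0 / LABEL_1 unless id2label is set in the model config.
raw = [{"label": "LABEL_1", "score": 0.97}]

# Map generic label names to the documented meanings (0 = Safe, 1 = Toxic).
label_names = {"LABEL_0": "Safe", "LABEL_1": "Toxic"}
readable = [
    {"label": label_names.get(r["label"], r["label"]), "score": r["score"]}
    for r in raw
]
print(readable)  # [{'label': 'Toxic', 'score': 0.97}]
```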

Using AutoModel

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("WishAshake/XLM-Roberta")
model = AutoModelForSequenceClassification.from_pretrained("WishAshake/XLM-Roberta")

# Tokenize and predict
text = "your roman urdu text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Take the higher-probability class and report its probability as confidence
pred = torch.argmax(predictions, dim=-1).item()
label = "Toxic" if pred == 1 else "Safe"
confidence = predictions[0][pred].item()
print(f"Label: {label}, Confidence: {confidence:.4f}")
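The softmax-and-threshold step above can be expressed framework-free for clarity. This is an illustrative sketch; `classify_from_logits` is a hypothetical helper, not part of the model's API:

```python
import math

def classify_from_logits(logits, threshold=0.5):
    """Framework-free sketch of the softmax + threshold logic above.

    `logits` is a pair [safe_logit, toxic_logit] as produced by the
    two-class classification head.
    """
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]        # softmax over the two classes
    label = "Toxic" if probs[1] > threshold else "Safe"
    return label, max(probs)                 # winning class and its probability

print(classify_from_logits([2.0, -1.0]))     # labels this example "Safe"
```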

Training Details

  • Base Model: xlm-roberta-base
  • Training Framework: Hugging Face Transformers
  • Learning Rate: 2e-5
  • Batch Size: 16
  • Max Sequence Length: 128
  • Epochs: 5 (with early stopping)
  • Optimizer: AdamW
  • Mixed Precision: FP16 (when GPU available)
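The hyperparameters above can be sketched as a Hugging Face `TrainingArguments` configuration. This is a reconstruction under stated assumptions, not the published training script; argument names follow the Trainer API and may differ slightly across Transformers versions:

```python
from transformers import TrainingArguments

# Sketch of the reported fine-tuning setup; output_dir is hypothetical.
training_args = TrainingArguments(
    output_dir="./xlmr-roman-urdu-hate",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    fp16=True,                      # mixed precision when a GPU is available
    evaluation_strategy="epoch",    # evaluate each epoch for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,    # required by EarlyStoppingCallback
    metric_for_best_model="f1",     # assumed; matches the reported F1 metric
)
```

Early stopping would then be added via `transformers.EarlyStoppingCallback` when constructing the `Trainer`.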

Dataset

The model was trained on the Hate Speech Roman Urdu (HS-RU-20) dataset, which contains:

  • Text samples in Roman Urdu
  • Binary labels: Safe/Neutral (0) or Toxic/Hate/Offensive (1)

Limitations

  • The model is trained specifically on Roman Urdu text and may not perform well on other languages or scripts
  • Performance may vary on different dialects or regional variations of Roman Urdu
  • The model may have biases present in the training data

Citation

If you use this model in your research, please cite:

@misc{xlm-roberta-roman-urdu-hate-speech,
  title={XLM-RoBERTa for Roman Urdu Hate Speech Detection},
  author={Wisha Zahid},
  year={2024},
  howpublished={\url{https://huggingface.co/WishAshake/XLM-Roberta}}
}

License

This model is released under the MIT License.

Contact

For questions or issues, please open an issue on the GitHub repository.
