---
license: mit
tags:
- text-classification
- hate-speech-detection
- xlm-roberta
- multilingual
language:
- ur
- multilingual
---

# XLM-RoBERTa for Roman Urdu Hate Speech Detection

A fine-tuned XLM-RoBERTa model for detecting hate speech and offensive content in Roman Urdu text.

## Model Description

This model is based on **xlm-roberta-base** and has been fine-tuned on the Hate Speech Roman Urdu (HS-RU-20) dataset for binary classification:

- **Label 0**: Safe/Neutral content
- **Label 1**: Toxic/Hate/Offensive content
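
The classifier exposes these classes as integer IDs; a minimal sketch of the mapping (the human-readable names here are illustrative — the model config may expose generic `LABEL_0`/`LABEL_1` names instead):

```python
# Illustrative id-to-label mapping for the binary classifier
id2label = {0: "Safe", 1: "Toxic"}
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[1])       # Toxic
print(label2id["Safe"])  # 0
```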

## Model Performance

- **F1-Score (Weighted)**: 84.15%
- **Accuracy**: 83.72%
- **Precision**: 84.69%
- **Recall**: 83.72%
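
For context, a weighted score averages each class's metric weighted by its support (number of true examples of that class). A minimal pure-Python sketch of weighted F1, using made-up per-class counts rather than the model's actual confusion matrix:

```python
def weighted_f1(per_class):
    """Support-weighted F1 from per-class (tp, fp, fn) counts."""
    total_support = sum(tp + fn for tp, fp, fn in per_class)
    score = 0.0
    for tp, fp, fn in per_class:
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (tp + fn) / total_support * f1  # weight by class support
    return score

# Hypothetical (tp, fp, fn) counts for the Safe and Toxic classes
print(round(weighted_f1([(80, 10, 10), (85, 10, 10)]), 4))  # 0.8919
```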

## Usage

### Using Transformers Pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="WishAshake/XLM-Roberta"
)

# Classify text
result = classifier("your roman urdu text here")
print(result)
```

### Using AutoModel

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("WishAshake/XLM-Roberta")
model = AutoModelForSequenceClassification.from_pretrained("WishAshake/XLM-Roberta")

# Tokenize and predict
text = "your roman urdu text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

label = "Toxic" if predictions[0][1] > 0.5 else "Safe"
confidence = predictions[0][1].item() if predictions[0][1] > 0.5 else predictions[0][0].item()
print(f"Label: {label}, Confidence: {confidence:.4f}")
```

## Training Details

- **Base Model**: xlm-roberta-base
- **Training Framework**: Hugging Face Transformers
- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Max Sequence Length**: 128
- **Epochs**: 5 (with early stopping)
- **Optimizer**: AdamW
- **Mixed Precision**: FP16 (when GPU available)
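
The hyperparameters above can be expressed as a Hugging Face `TrainingArguments` configuration; the sketch below mirrors the list, but the `output_dir` and the epoch-level evaluation/checkpoint settings are illustrative assumptions, not the exact training script:

```python
from transformers import TrainingArguments

# Values taken from the hyperparameter list above;
# output_dir and the eval/save strategies are assumptions.
training_args = TrainingArguments(
    output_dir="./xlmr-roman-urdu-hate-speech",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    fp16=True,                    # mixed precision when a GPU is available
    eval_strategy="epoch",        # evaluate each epoch...
    save_strategy="epoch",        # ...and checkpoint each epoch
    load_best_model_at_end=True,  # needed for early stopping
)
```

Early stopping itself would be attached via `transformers.EarlyStoppingCallback` when constructing the `Trainer`; note that `eval_strategy` was named `evaluation_strategy` in older `transformers` releases.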

## Dataset

The model was trained on the **Hate Speech Roman Urdu (HS-RU-20)** dataset, which contains:

- Text samples in Roman Urdu
- Binary labels: Safe/Neutral (0) or Toxic/Hate/Offensive (1)

## Limitations

- The model is trained specifically on Roman Urdu text and may not perform well on other languages or scripts.
- Performance may vary across dialects and regional variations of Roman Urdu.
- The model may reproduce biases present in the training data.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{xlm-roberta-roman-urdu-hate-speech,
  title={XLM-RoBERTa for Roman Urdu Hate Speech Detection},
  author={Wisha Zahid},
  year={2024},
  howpublished={\url{https://huggingface.co/WishAshake/XLM-Roberta}}
}
```

## License

This model is released under the MIT License.

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/WishaZahid/Roman-Urdu-Hate-Speech-using-XLM-Roberta).