	---
	language:
	- ur
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- roberta
	- urdu
	- hate-speech
	- sequence-classification
	- pytorch
	- smote
	- tf-idf
	license: other # inherit/align with base model's license
	datasets:
	- Adnan855570/urdu-hate-speech
	---

	## Urdu RoBERTa Hate Speech Classifier (Balanced)

	- Base model: `urduhack/roberta-urdu-small`
	- Task: Binary text classification (hate vs. not_hate)
	- Language: Urdu (ur)
	- Labels:
	  - 0 → `not_hate`
	  - 1 → `hate`

	This model fine-tunes `urduhack/roberta-urdu-small` for Urdu hate-speech detection. Class imbalance was addressed by applying SMOTE oversampling to TF–IDF features before the tokenizer-based fine-tuning.

	### Training data and preprocessing
	- Source dataset: `Adnan855570/urdu-hate-speech` (Excel files: `preprocessed_combined_file (1).xlsx`, `Urdu_Hate_Speech.xlsx`)
	- Columns used in notebook: `Tweet` (text), `Tag` (label in {0,1})
	- Steps:
	  - TF–IDF featurization (`max_features=10000`)
	  - SMOTE oversampling (`random_state=42`) to balance classes
	  - Train/test split: 80/20 (`random_state=42`)
	  - Tokenization: `AutoTokenizer.from_pretrained("urduhack/roberta-urdu-small")` with `truncation=True`, `padding=True`
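
The oversampling step can be illustrated with a small, standard-library-only sketch of the interpolation idea behind SMOTE. This is a simplified illustration, not the `imblearn` implementation used in the notebook: the toy vectors stand in for TF–IDF rows, and `smote_like_oversample`, `k`, and `seed` are hypothetical names chosen for this example.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a sampled
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy feature vectors standing in for TF-IDF rows (hypothetical data).
majority = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.9, 0.3], [0.8, 0.0]]
minority = [[0.1, 0.9], [0.2, 0.8], [0.0, 0.7]]

# Add just enough synthetic minority rows to match the majority count.
new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Each synthetic row produced this way is a numeric feature vector that lies between two existing minority rows; it is not a new Urdu sentence.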

	### Training setup
	- Model: `AutoModelForSequenceClassification` with `num_labels=2`
	- Device: GPU if available
	- Hyperparameters:
	  - epochs: 3
	  - per_device_train_batch_size: 8
	  - per_device_eval_batch_size: 8
	  - warmup_steps: 500
	  - weight_decay: 0.01
	  - evaluation_strategy: epoch
	  - save_strategy: epoch
	  - load_best_model_at_end: true
	- Metrics:
	  - Accuracy, Precision, Recall, F1 (binary)
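
As a rough illustration of the four binary metrics listed above, the following pure-Python sketch computes them from label lists. The notebook's own metric function is not shown in this card, so `binary_metrics` is a hypothetical helper, and treating label 1 (`hate`) as the positive class is an assumption.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```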

	### Evaluation results (test split)
	- accuracy: 0.7891
	- f1: 0.7854
	- precision: 0.8208
	- recall: 0.7529

	Note: These results were computed on the 20% test portion of the SMOTE-balanced data used in the notebook, so they describe performance on the balanced class distribution.
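
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two reported values. The snippet uses only the numbers listed above.

```python
precision = 0.8208
recall = 0.7529

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7854, matching the reported f1
```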

	### How to use (Transformers)

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	MODEL_ID = "Adnan855570/urdu-roberta-hate" # replace if different
	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
	model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

	# Transformers stores id2label with integer keys once the config is loaded.
	id2label = model.config.id2label or {0: "not_hate", 1: "hate"}

	def predict(text: str):
	    enc = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
	    with torch.no_grad():
	        logits = model(**enc).logits
	    probs = logits.softmax(dim=-1).squeeze(0).tolist()
	    pred = int(logits.argmax(dim=-1).item())
	    # Scores assume index 0 = not_hate and index 1 = hate.
	    return {"label_id": pred, "label": id2label.get(pred, str(pred)),
	            "scores": {"not_hate": probs[0], "hate": probs[1]}}

	print(predict("یہ نفرت انگیز ہے یا نہیں؟"))
	```

	Or with a pipeline:

	```python
	from transformers import pipeline
	clf = pipeline("text-classification", model="Adnan855570/urdu-roberta-hate", top_k=None)
	print(clf("یہ نفرت انگیز ہے یا نہیں؟"))
	```
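
With `top_k=None` the pipeline returns a score for every label. A small helper can turn that output into a plain `{label: score}` dictionary. This is a sketch under assumptions: `scores_by_label` is a hypothetical name, the example scores are invented, and the helper accepts both a flat list and a nested list because the exact nesting for a single input may differ between Transformers versions.

```python
def scores_by_label(result):
    """Turn text-classification pipeline output for ONE input into {label: score}."""
    # Unwrap the extra list level that may wrap the result for a single input.
    if result and isinstance(result[0], list):
        result = result[0]
    return {item["label"]: item["score"] for item in result}

# Hypothetical pipeline output; the scores are invented for illustration.
example = [[{"label": "hate", "score": 0.81}, {"label": "not_hate", "score": 0.19}]]
scores = scores_by_label(example)
top_label = max(scores, key=scores.get)
print(top_label, scores)
```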

	### Inference API

	- cURL
	```bash
	curl -X POST -H "Authorization: Bearer $HF_TOKEN" -H "Content-Type: application/json" \
	-d '{"inputs":"یہ نفرت انگیز ہے یا نہیں؟"}' \
	https://api-inference.huggingface.co/models/Adnan855570/urdu-roberta-hate
	```

	- Python
	```python
	import os, requests
	API_URL = "https://api-inference.huggingface.co/models/Adnan855570/urdu-roberta-hate"
	HEADERS = {"Authorization": f"Bearer {os.environ.get('HF_TOKEN','')}"}
	print(requests.post(API_URL, headers=HEADERS, json={"inputs":"..."}, timeout=30).json())
	```

	### Intended uses and limitations
	- Intended:
	  - Flagging potentially hateful Urdu content
	  - Assisting human moderation and research
	- Limitations:
	  - May misclassify satire, reclaimed slurs, or code-mixed content
	  - Domain shift sensitivity (platform/community/topic)
	- Risks:
	  - False positives/negatives; do not use as the sole basis for punitive actions
	- Recommendation:
	  - Use with human-in-the-loop; periodically audit outcomes and bias
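
One way to read the human-in-the-loop recommendation is that the model only decides what a moderator looks at, never what action is taken. The sketch below assumes a `hate` probability as input; `route_for_review` and the `review_threshold` default of 0.5 are hypothetical and would need tuning on real moderation data.

```python
def route_for_review(hate_score, review_threshold=0.5):
    """Send anything at or above the threshold to a human moderator.

    The model never triggers a punitive action on its own in this sketch.
    """
    if not 0.0 <= hate_score <= 1.0:
        raise ValueError("hate_score must be a probability in [0, 1]")
    return "human_review" if hate_score >= review_threshold else "no_action"

print(route_for_review(0.81), route_for_review(0.12))
```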

	### Label mapping
	Ensure the config includes:
	- `id2label = {"0":"not_hate","1":"hate"}`
	- `label2id = {"not_hate":0,"hate":1}`
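
A minimal sketch of writing this mapping into a checkpoint's `config.json` with the standard library is shown below. It assumes a local model directory containing `config.json`; `set_label_mapping` is a hypothetical helper, and the demo runs on a throwaway file rather than the real checkpoint.

```python
import json
import tempfile
from pathlib import Path

def set_label_mapping(config_path):
    """Write the id2label / label2id mapping into a config.json file."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["id2label"] = {"0": "not_hate", "1": "hate"}
    config["label2id"] = {"not_hate": 0, "hate": 1}
    path.write_text(json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8")
    return config

# Demo on a temporary file so the sketch runs without the real checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"num_labels": 2}), encoding="utf-8")
    updated = set_label_mapping(demo_path)
    print(updated["id2label"], updated["label2id"])
```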

	### Reproducibility notes
	- SMOTE and split seeds: `random_state=42`
	- Tokenization: truncation and padding enabled (no explicit max_length set in notebook)
	- Hardware: single GPU (e.g., Colab)

	### License
	- This model is derived from `urduhack/roberta-urdu-small` and should comply with that model's license. Set a compatible license here once confirmed.

	### Citation
	```bibtex
	@misc{urdu_roberta_hate_balanced_2025,
	title = {Urdu RoBERTa Hate Speech Classifier (Balanced)},
	author = {Adnan},
	year = {2025},
	howpublished = {\url{https://huggingface.co/Adnan855570/urdu-roberta-hate}}
	}
	```

	### Acknowledgements
	- Base: `urduhack/roberta-urdu-small`
	- Libraries: 🤗 Transformers, Datasets, PyTorch
	- Oversampling: SMOTE (imblearn)