	---
	metrics:
	- accuracy
	base_model:
	- unitary/toxic-bert
	---
	Use Model
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
identity_model = AutoModelForSequenceClassification.from_pretrained("Mridul2003/identity-hate-detector").to(device)
identity_tokenizer = AutoTokenizer.from_pretrained("Mridul2003/identity-hate-detector")

final_text = "Text to classify"  # placeholder input; replace with your own string
results = {}  # collects the scores computed below

identity_inputs = identity_tokenizer(final_text, return_tensors="pt", padding=True, truncation=True)
# Drop token_type_ids if the tokenizer produced them
if "token_type_ids" in identity_inputs:
    del identity_inputs["token_type_ids"]
identity_inputs = {k: v.to(device) for k, v in identity_inputs.items()}

with torch.no_grad():
    identity_outputs = identity_model(**identity_inputs)

# Index 0 = Not Identity Hate, index 1 = Identity Hate.
# Sigmoid is applied to each logit independently, so the two scores need not sum to 1.
identity_probs = torch.sigmoid(identity_outputs.logits)
identity_prob = identity_probs[0][1].item()
not_identity_prob = identity_probs[0][0].item()

results["identity_hate_custom"] = identity_prob
results["not_identity_hate_custom"] = not_identity_prob
```
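The snippet above scores each logit with a sigmoid. Because the model card lists CrossEntropyLoss as the training loss, a softmax over the two logits is another reasonable way to read the output. The following is a minimal pure-Python sketch of the difference, not the author's method; the logit values and function names are hypothetical and chosen only for illustration.

```python
import math

def sigmoid_scores(logits):
    """Apply a sigmoid to each logit independently."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax_scores(logits):
    """Normalize the logits jointly so the scores sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [Not Identity Hate, Identity Hate]
example_logits = [-1.2, 2.3]

sig = sigmoid_scores(example_logits)
soft = softmax_scores(example_logits)

print("sigmoid:", sig, "sum =", sum(sig))
print("softmax:", soft, "sum =", sum(soft))
```

Both readings rank the two classes the same way here; they differ in whether the two scores can be interpreted as complementary probabilities.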

	# Offensive Language Classifier (Fine-Tuned on Custom Dataset)

This repository contains a fine-tuned version of the [`unitary/toxic-bert`](https://huggingface.co/unitary/toxic-bert) model for binary classification of identity-based offensive language (labels: `Identity Hate` vs `Not Identity Hate`). The model was fine-tuned on a custom dataset because of limitations observed in the base model's performance, particularly on `identity_hate`-related content.

	---

	## 🔍 Problem with Base Model (`unitary/toxic-bert`)

	The original `unitary/toxic-bert` model is trained for multi-label toxicity detection with 6 categories:
	- toxic
	- severe_toxic
	- obscene
	- threat
	- insult
	- identity_hate

While it performs reasonably well on generic toxicity, it struggles with edge cases involving identity-based hate speech. In particular, it often:
- Misclassifies subtle or sarcastic identity attacks
- Underestimates offensive content that contains identity-specific slurs

	---

	## ✅ Why Fine-Tune?

	We fine-tuned the model on a custom annotated dataset with two clear labels:
	- `0`: Not Identity Hate
	- `1`: Identity Hate

	The new model simplifies the task into a binary classification problem, allowing more focused training for real-world moderation scenarios.
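As an illustration of how the two labels might be used in a moderation flow, here is a minimal sketch that maps the model's identity-hate score to a label name. The `ID2LABEL` dictionary mirrors the two labels listed above; the `to_label` helper and the 0.5 threshold are assumptions for this example, not part of the released model.

```python
# Label ids as listed in this model card
ID2LABEL = {0: "Not Identity Hate", 1: "Identity Hate"}

def to_label(identity_prob, threshold=0.5):
    """Map an identity-hate score to a label name.

    `threshold` is a hypothetical cut-off; tune it on your own validation data.
    """
    label_id = 1 if identity_prob >= threshold else 0
    return ID2LABEL[label_id]

print(to_label(0.91))
print(to_label(0.12))
```

A stricter moderation policy could lower the threshold to catch more borderline content at the cost of more false positives.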

	---

	## 📊 Dataset Overview

	- Total examples: 4,000+
	- Balanced between offensive and non-offensive labels
	- Contains high proportions of `identity_hate`, `obscene`, `insult`, and more nuanced samples
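A quick way to check the kind of balance described above is to count the share of each label. The sketch below uses a small hypothetical label list, not the actual training data, and the helper name `class_balance` is made up for this example.

```python
from collections import Counter

def class_balance(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels (0 = Not Identity Hate, 1 = Identity Hate)
toy_labels = [0, 1, 1, 0, 1, 0, 0, 1]

balance = class_balance(toy_labels)
print(balance)
```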

	---

	## 🧠 Model Details

	- Base model: [`unitary/toxic-bert`](https://huggingface.co/unitary/toxic-bert)
	- Fine-tuned using: Hugging Face 🤗 `Trainer` API
	- Loss function: CrossEntropyLoss (via `num_labels=2`)
	- Batch size: 8
	- Epochs: 3
	- Learning rate: 2e-5
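To make the loss function concrete, the following sketch computes softmax cross-entropy for a single example with two logits, which is the quantity CrossEntropyLoss evaluates per example before averaging over a batch. It is written in pure Python for clarity; the logit values are hypothetical and this is not the training code used for the model.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical logits in the order [Not Identity Hate, Identity Hate]
hypothetical_logits = [-1.2, 2.3]

loss_if_identity_hate = cross_entropy(hypothetical_logits, target=1)
loss_if_not_identity_hate = cross_entropy(hypothetical_logits, target=0)

print("loss when the true label is 1:", loss_if_identity_hate)
print("loss when the true label is 0:", loss_if_not_identity_hate)
```

The loss is small when the larger logit belongs to the true class and grows with the gap between the logits when it does not.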

	---

	## 🔬 Performance (Binary Classification)

| Metric             | Value    |
|--------------------|----------|
| Accuracy           | ~92%     |
| Precision / Recall | Balanced |
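For readers who want to reproduce these kinds of metrics on their own evaluation set, here is a minimal sketch that computes accuracy, precision, and recall from binary predictions. The toy labels and predictions are hypothetical and unrelated to the figures reported above; `binary_metrics` is an illustrative helper, not part of this repository.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical toy data (0 = Not Identity Hate, 1 = Identity Hate)
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

Reporting precision and recall as separate numbers, rather than only as "balanced", makes it easier to compare models at different decision thresholds.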

	---