Update README.md
README.md CHANGED

@@ -10,4 +10,65 @@ from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
identity_model = AutoModelForSequenceClassification.from_pretrained("Mridul2003/identity-hate-detector").to(device)
identity_tokenizer = AutoTokenizer.from_pretrained("Mridul2003/identity-hate-detector")
```
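
With the model and tokenizer loaded as above, a minimal inference sketch might look like this (the 0/1 label meaning is an assumption here; confirm it via `identity_model.config.id2label`):

```python
# Classify one piece of text with the fine-tuned binary model.
text = "example input text"
inputs = identity_tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)

with torch.no_grad():
    logits = identity_model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
pred = int(probs.argmax())  # assumed mapping: 0 = Not Identity Hate, 1 = Identity Hate
print(pred, probs.tolist())
```
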
# Offensive Language Classifier (Fine-Tuned on Custom Dataset)

This repository contains a fine-tuned version of the [`unitary/toxic-bert`](https://huggingface.co/unitary/toxic-bert) model for binary classification of offensive language (labels: `Offensive` vs `Not Offensive`). The model was fine-tuned on a custom dataset because of limitations observed in the base model's performance, particularly on `identity_hate`-related content.

---

## 🔍 Problem with Base Model (`unitary/toxic-bert`)

The original `unitary/toxic-bert` model is trained for multi-label toxicity detection with six categories:

- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate

While it performs reasonably well on generic toxicity, **it struggles with edge cases involving identity-based hate speech**, often:

- misclassifying subtle or sarcastic identity attacks
- underestimating offensive content that relies on identity-specific slurs
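
To see this concretely, the base model's six per-category scores can be inspected directly. Here is a minimal sketch (the model is multi-label, so each category receives an independent sigmoid score rather than a softmax):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the base multi-label model and score one input text.
base_model = AutoModelForSequenceClassification.from_pretrained("unitary/toxic-bert")
base_tokenizer = AutoTokenizer.from_pretrained("unitary/toxic-bert")

inputs = base_tokenizer("example input text", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = base_model(**inputs).logits

# One independent sigmoid score per category (toxic, severe_toxic, ...).
scores = torch.sigmoid(logits)[0]
for idx, score in enumerate(scores):
    print(base_model.config.id2label[idx], round(float(score), 3))
```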

---

## ✅ Why Fine-Tune?

We fine-tuned the model on a custom annotated dataset with two clear labels:

- `0`: Not Identity Hate
- `1`: Identity Hate

The new model simplifies the task into a **binary classification problem**, allowing more focused training for real-world moderation scenarios.
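
Because the head now has only two classes, one optional pattern (an assumption on our part, not necessarily what was used here) is to bake the label mapping into the config at load time so predictions are self-describing:

```python
from transformers import AutoModelForSequenceClassification

# Attach a human-readable label mapping to the loaded config
# (the 0/1 meaning follows the dataset labels described above).
model = AutoModelForSequenceClassification.from_pretrained(
    "Mridul2003/identity-hate-detector",
    id2label={0: "Not Identity Hate", 1: "Identity Hate"},
    label2id={"Not Identity Hate": 0, "Identity Hate": 1},
)
```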

---

## 📊 Dataset Overview

- Total examples: ~4,000+
- Balanced between offensive and non-offensive labels
- Contains a high proportion of `identity_hate`, `obscene`, and `insult` content, along with more nuanced samples

---

## 🧠 Model Details

- **Base model**: [`unitary/toxic-bert`](https://huggingface.co/unitary/toxic-bert)
- **Fine-tuned using**: Hugging Face 🤗 `Trainer` API
- **Loss function**: CrossEntropyLoss (via `num_labels=2`)
- **Batch size**: 8
- **Epochs**: 3
- **Learning rate**: 2e-5
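
A minimal fine-tuning sketch with these hyperparameters (the toy in-memory dataset is a stand-in; the actual custom dataset and training script are not shown in this repository):

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# The base checkpoint has six output labels, so the classification head
# is re-initialized for the new two-label task; CrossEntropyLoss is applied
# automatically when integer labels are provided with num_labels=2.
model = AutoModelForSequenceClassification.from_pretrained(
    "unitary/toxic-bert", num_labels=2, ignore_mismatched_sizes=True
)
tokenizer = AutoTokenizer.from_pretrained("unitary/toxic-bert")

# Toy stand-in data: two texts with binary labels.
raw = Dataset.from_dict({"text": ["example a", "example b"], "label": [0, 1]})
dataset = raw.map(lambda b: tokenizer(b["text"], truncation=True, padding=True), batched=True)

args = TrainingArguments(
    output_dir="identity-hate-detector",
    per_device_train_batch_size=8,  # batch size 8
    num_train_epochs=3,             # 3 epochs
    learning_rate=2e-5,             # learning rate 2e-5
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```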

---

## 🔬 Performance (Binary Classification)

| Metric | Value |
|----------|---------|
| Accuracy | ~92% |
| Precision / Recall | Balanced |
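
A sketch of how such metrics can be computed on a held-out set (the evaluation texts and gold labels below are hypothetical placeholders), reusing `identity_model`, `identity_tokenizer`, and `device` from the loading snippet at the top:

```python
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

texts = ["example a", "example b"]  # hypothetical held-out texts
gold = [0, 1]                       # hypothetical gold labels

inputs = identity_tokenizer(texts, return_tensors="pt", truncation=True, padding=True).to(device)
with torch.no_grad():
    preds = identity_model(**inputs).logits.argmax(dim=-1).cpu().tolist()

precision, recall, f1, _ = precision_recall_fscore_support(gold, preds, average="binary")
print("accuracy:", accuracy_score(gold, preds))
print("precision:", precision, "recall:", recall, "f1:", f1)
```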

---