---
metrics:
- accuracy
base_model:
- unitary/toxic-bert
---
## Use Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

identity_model = AutoModelForSequenceClassification.from_pretrained(
    "Mridul2003/identity-hate-detector"
).to(device)
identity_tokenizer = AutoTokenizer.from_pretrained("Mridul2003/identity-hate-detector")

text = "your input text here"
identity_inputs = identity_tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# The tokenizer may emit token_type_ids that this model does not accept
identity_inputs.pop("token_type_ids", None)
identity_inputs = {k: v.to(device) for k, v in identity_inputs.items()}

with torch.no_grad():
    identity_outputs = identity_model(**identity_inputs)

identity_probs = torch.sigmoid(identity_outputs.logits)

results = {
    "identity_hate_custom": identity_probs[0][1].item(),
    "not_identity_hate_custom": identity_probs[0][0].item(),
}
```

# Offensive Language Classifier (Fine-Tuned on Custom Dataset)

This repository contains a fine-tuned version of the [`unitary/toxic-bert`](https://huggingface.co/unitary/toxic-bert) model for binary classification of offensive language (labels: `Offensive` vs `Not Offensive`). The model was fine-tuned on a custom dataset because of limitations observed in the base model's performance, particularly on `identity_hate`-related content.

---

## 🔍 Problem with Base Model (`unitary/toxic-bert`)

The original `unitary/toxic-bert` model is trained for multi-label toxicity detection with 6 categories:
- `toxic`
- `severe_toxic`
- `obscene`
- `threat`
- `insult`
- `identity_hate`

While it performs reasonably well on generic toxicity, **it struggles with edge cases involving identity-based hate speech**, often:
- Misclassifying subtle or sarcastic identity attacks
- Underestimating offensive content with identity-specific slurs

---

## ✅ Why Fine-Tune?

We fine-tuned the model on a custom annotated dataset with two clear labels:
- `0`: Not Identity Hate 
- `1`: Identity Hate

The new model simplifies the task into a **binary classification problem**, allowing more focused training for real-world moderation scenarios.
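
With a two-label head trained this way, the model's two raw logits map to class probabilities and a predicted label. The sketch below shows that mapping with a plain softmax and no framework dependencies; the logit values are made-up illustrations, not model outputs:

```python
import math

def binary_probs(logits):
    """Convert two raw logits into probabilities with a numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits only; in practice these come from the model's output head.
p_not_hate, p_hate = binary_probs([-1.2, 2.3])
label = "Identity Hate" if p_hate >= 0.5 else "Not Identity Hate"
```

Softmax is the natural inverse of a CrossEntropyLoss-trained head: the two probabilities sum to 1, so a single 0.5 threshold on `p_hate` decides the label.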

---

## 📊 Dataset Overview

- Total examples: roughly 4,000
- Balanced between offensive and non-offensive labels
- Contains a high proportion of `identity_hate`, `obscene`, and `insult` samples, plus more nuanced cases

---

## 🧠 Model Details

- **Base model**: [`unitary/toxic-bert`](https://huggingface.co/unitary/toxic-bert)
- **Fine-tuned using**: Hugging Face 🤗 `Trainer` API
- **Loss function**: CrossEntropyLoss (via `num_labels=2`)
- **Batch size**: 8
- **Epochs**: 3
- **Learning rate**: 2e-5
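
The setup above can be sketched as follows. This is an illustrative reconstruction, not the author's actual training script; `output_dir` and the dataset wiring are assumptions:

```python
from transformers import AutoModelForSequenceClassification, TrainingArguments

# Swapping toxic-bert's 6-label head for a fresh 2-label head; the size
# mismatch on the classifier weights must be explicitly allowed.
model = AutoModelForSequenceClassification.from_pretrained(
    "unitary/toxic-bert", num_labels=2, ignore_mismatched_sizes=True
)

# Hyperparameters from the list above; output_dir is an assumption.
training_args = TrainingArguments(
    output_dir="identity-hate-detector",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
)

# Wire in tokenized train/eval datasets with a "labels" column, e.g.:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```

With `num_labels=2`, the model's default loss is CrossEntropyLoss, matching the loss function listed above.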

---

## 🔬 Performance (Binary Classification)

| Metric   | Value   |
|----------|---------|
| Accuracy | ~92%    |
| Precision / Recall | Balanced |

---