---
title: Toxic Comment Classifier & Explainer
emoji: πŸ§ͺ
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 4.44.1
python_version: "3.10"
app_file: app.py
pinned: true
license: mit

description: >
  A multi-label, transformer-based toxic comment classifier fine-tuned on the Jigsaw dataset.
  It includes an explainability module (Captum Integrated Gradients) that visualizes
  which words contribute most to each toxicity label, all served through a Gradio UI.
tags:
  - text-classification
  - multi-label
  - explainable-ai
  - transformers
  - gradio
  - distilbert
  - nlp
  - toxicity-detection
  - huggingface-space
---

# 🧠 Toxic Comment Classification β€” Explainable Multi-Label NLP Model

<p align="center">
  <img src="banner.png" alt="Toxic Comment Classification Banner" width="100%">
</p>

<p align="center">
  <b>DistilBERT-based multi-label classifier for detecting toxic online comments with explainability powered by Captum Integrated Gradients (IG).</b>
</p>

---

## πŸš€ Overview

This project presents an **explainable AI system** for identifying toxic comments in text, built using a fine-tuned Transformer model (DistilBERT).  
It performs **multi-label classification** across six toxicity categories while offering **token-level explanations** for each prediction.

### 🧩 Labels
- toxic  
- severe_toxic  
- obscene  
- threat  
- insult  
- identity_hate  
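
The six labels are not mutually exclusive, so inference applies an independent sigmoid to each of the six logits rather than a softmax over the set. The sketch below illustrates that path; the checkpoint location (`artifacts/best`), the maximum length of 256, and the label order come from this README, while the function name is illustrative.

```python
# Minimal multi-label inference sketch (assumes the fine-tuned checkpoint in artifacts/best).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

tokenizer = AutoTokenizer.from_pretrained("artifacts/best")
model = AutoModelForSequenceClassification.from_pretrained("artifacts/best")
model.eval()

def predict_probs(text: str) -> dict:
    # Tokenize with the same maximum length used during training (256).
    enc = tokenizer(text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits            # shape: (1, 6)
    # Multi-label setting: independent sigmoid per label, not a softmax.
    probs = torch.sigmoid(logits).squeeze(0).tolist()
    return dict(zip(LABELS, probs))

print(predict_probs("You are a complete idiot."))
```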

### 🎯 Objectives
- Fine-tune DistilBERT for robust multi-label toxicity detection  
- Enhance interpretability using **Captum Integrated Gradients**  
- Deploy a real-time, user-friendly **Gradio interface**  

---

## πŸ§ͺ How to Use the Demo

1. Type or paste any comment in the text box  
2. Click **β€œClassify”** to view per-label probabilities and predictions  
3. Open the **β€œExplain”** tab β†’ select a target label  
4. Generate a heatmap showing which words **support (red)** or **oppose (blue)** the decision (a programmatic sketch of this step follows below)  
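
The "Explain" step is based on Captum's Layer Integrated Gradients applied to the DistilBERT embedding layer. The sketch below shows one common way to wire this up; it reuses `model`, `tokenizer`, and `LABELS` from the inference sketch above, uses an all-[PAD] baseline for simplicity, and is not necessarily identical to the implementation in `app.py`.

```python
# Token-attribution sketch with Captum Layer Integrated Gradients.
# Reuses `model`, `tokenizer`, and `LABELS` from the inference sketch above.
import torch
from captum.attr import LayerIntegratedGradients

def explain(text: str, target_label: str):
    target_idx = LABELS.index(target_label)
    enc = tokenizer(text, truncation=True, max_length=256, return_tensors="pt")
    input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]

    # Forward function returns the sigmoid score of the selected label only.
    def forward_func(ids, mask):
        return torch.sigmoid(model(input_ids=ids, attention_mask=mask).logits)[:, target_idx]

    # Attribute against the embedding layer; the baseline replaces every token with [PAD]
    # (a simplification; keeping [CLS]/[SEP] in the baseline is also common).
    lig = LayerIntegratedGradients(forward_func, model.distilbert.embeddings)
    baseline_ids = torch.full_like(input_ids, tokenizer.pad_token_id)
    attributions = lig.attribute(inputs=input_ids,
                                 baselines=baseline_ids,
                                 additional_forward_args=(attention_mask,))

    # Collapse the hidden dimension to one score per token and pair scores with tokens.
    scores = attributions.sum(dim=-1).squeeze(0)
    scores = scores / torch.norm(scores)
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
    return list(zip(tokens, scores.tolist()))

print(explain("I will kill you tomorrow.", "threat"))
```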

---

## 🧠 Example Inputs

| Example | Expected Labels |
|----------|------------------|
| β€œYou are a complete idiot.” | toxic / insult |
| β€œI will kill you tomorrow.” | threat / toxic |
| β€œThanks for your help today!” | non-toxic |
| β€œGo away, you people don’t belong here.” | identity_hate / insult |

---

## βš™οΈ Technical Stack

| Component | Technology |
|------------|-------------|
| **Language Model** | DistilBERT (`distilbert-base-uncased`) |
| **Frameworks** | PyTorch β€’ Transformers β€’ Gradio |
| **Explainability** | Captum (Integrated Gradients) |
| **Training** | Stratified splits β€’ Early Stopping β€’ Regularization |
| **Visualization** | Gradio UI + Captum HTML heatmaps |
| **Deployment** | Hugging Face Spaces |
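
These pieces come together in `app.py` as a Gradio interface. Below is a skeleton of how the classification path could be exposed in the UI (the actual `app.py` also includes an Explain tab and may be organized differently); `predict_probs` is the illustrative function from the inference sketch earlier.

```python
# Skeleton Gradio front end for the classification path (illustrative, not the actual app.py).
import gradio as gr

def classify_ui(text: str) -> dict:
    # gr.Label renders a {label: confidence} dict as a bar chart.
    return predict_probs(text)

demo = gr.Interface(
    fn=classify_ui,
    inputs=gr.Textbox(lines=4, label="Comment"),
    outputs=gr.Label(num_top_classes=6, label="Toxicity probabilities"),
    title="Toxic Comment Classifier & Explainer",
)

if __name__ == "__main__":
    demo.launch()
```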

---

## πŸ“‚ Project Structure

```
.
β”œβ”€β”€ app.py                # Gradio app entry point
β”œβ”€β”€ requirements.txt      # Runtime dependencies
β”œβ”€β”€ artifacts/
β”‚   β”œβ”€β”€ best/             # Fine-tuned model weights + tokenizer
β”‚   └── thresholds.json   # Tuned thresholds for each label
└── README.md             # (this file)
```
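
The tuned per-label thresholds are stored separately from the model weights so they can be adjusted without retraining. A minimal sketch of applying them is shown below; the exact schema of `thresholds.json` is an assumption here (treated as a flat label-to-threshold mapping), and `predict_probs` is the illustrative function from the inference sketch above.

```python
# Turn sigmoid probabilities into binary predictions using tuned per-label thresholds.
# Assumes thresholds.json is a flat {"label": threshold, ...} mapping.
import json

with open("artifacts/thresholds.json") as f:
    thresholds = json.load(f)

def predict_labels(text: str) -> list:
    probs = predict_probs(text)  # from the inference sketch above
    return [label for label, p in probs.items() if p >= thresholds.get(label, 0.5)]

print(predict_labels("You are a complete idiot."))  # e.g. ["toxic", "insult"]
```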

---

## πŸ“Š Model Training Summary

- Dataset: [Jigsaw Toxic Comment Classification Challenge](https://www.kaggle.com/datasets/julian3833/jigsaw-toxic-comment-classification-challenge)  
- Tokenization: DistilBERT (max length = 256)  
- Loss: Binary Cross-Entropy with Logits (BCEWithLogitsLoss)  
- Optimizer: AdamW (learning rate = 2e-5, weight decay = 0.02)  
- Regularization: Dropout (head=0.5, encoder=0.2)  
- Evaluation Metrics: Macro F1 β€’ Precision β€’ Recall β€’ AUC  
- Explainability: Captum Layer Integrated Gradients (LIG)
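
For reference, the sketch below condenses one training step with the hyperparameters listed above (BCE-with-logits loss, AdamW at lr 2e-5 and weight decay 0.02, head dropout 0.5, encoder dropout 0.2). It omits the stratified splits, early stopping, and metric tracking, and the config keys shown are the standard DistilBERT ones, not necessarily those of the exact training script used here.

```python
# Condensed training-step sketch using the hyperparameters listed above.
# Stratified splitting, early stopping, and metric tracking are omitted.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=6,
    problem_type="multi_label_classification",  # marks this as a multi-label task
    seq_classif_dropout=0.5,                    # classification-head dropout
    dropout=0.2,                                # encoder dropout
)
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.02)
criterion = torch.nn.BCEWithLogitsLoss()

def training_step(batch: dict) -> float:
    # batch["labels"] is a float tensor of shape (batch_size, 6) with 0/1 targets.
    logits = model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"]).logits
    loss = criterion(logits, batch["labels"].float())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```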

---

## πŸ–₯️ Live Demo

> πŸš€ Try the interactive demo on Hugging Face Spaces:  
> πŸ”— **[yaekobB / Toxic-Comment-Classification](https://huggingface.co/spaces/yaekobB/Toxic-Comment-Classification)**

---

## 🧰 Dependencies

```txt
transformers>=4.41.0
torch>=2.2.0
safetensors>=0.4.2
gradio>=4.20.0
captum>=0.7.0
pandas>=2.0.0
numpy>=1.24.0
```

---

## πŸͺͺ License

This project is licensed under the **MIT License**.  
You are free to use, modify, and distribute this work with attribution.

---

<p align="center">
  <i>β€œBuilding safer and explainable AI for online interactions.”</i>
</p>