Add v2 SHAP report — isolated, full methodology
**SHAP_REPORT.md** (+91 −170)
# SHAP Explainability Report — Multilingual Hate Speech Detection (v2)

**Model:** Hinglish → Hindi → English → Full (GloVe + BiLSTM, 50 epochs/phase, 200 total)
**Explainer:** `shap.GradientExplainer` on the BiLSTM sub-model
**Background:** 200 randomly sampled training sequences
**Test samples:** 500 per language (random subset); full test set = 8,852

---

## Table of Contents

1. [What SHAP Is and How It Works Here](#1-what-shap-is-and-how-it-works-here)
2. [How the Top-20 Words Are Selected](#2-how-the-top-20-words-are-selected)
3. [Results by Evaluation Language](#3-results-by-evaluation-language)
4. [Key Findings](#4-key-findings)

---

## 1. What SHAP Is and How It Works Here

**SHAP (SHapley Additive exPlanations)** answers the question: *"For this specific prediction, how much did each word contribute to the output?"*

It does this using game theory: it treats each word as a "player" and fairly distributes the model's final prediction score among all the words in the input.

### The Technical Challenge

Our model takes **integer token IDs** as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients (mathematical slopes) through the model to measure each word's influence, but a gradient cannot flow through an integer operation.

**Solution — split the model into two steps:**

```
Step 1 (outside SHAP):
Token IDs (integers) ──→ GloVe embedding lookup ──→ Float vectors (n × 100 × 300)

Step 2 (SHAP runs here):
Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)
```

SHAP's `GradientExplainer` runs on Step 2. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score; we then **sum across all 300 dimensions** at each position to get one number per token — its overall contribution.
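The report describes this split in prose only; below is a minimal sketch of how it can be wired up with Keras and `shap`. The names `model`, `X_train_ids`, and `X_test_ids` are assumptions for the trained tokens→sigmoid model and the padded integer token matrices, not identifiers from this repository.

```python
import numpy as np
import shap
import tensorflow as tf

# Assumed inputs: `model` is the trained Keras model
# (Embedding -> BiLSTM -> Dropout -> Dense -> Sigmoid);
# `X_train_ids` / `X_test_ids` are padded (n, 100) integer matrices.

# Step 1 (outside SHAP): manual embedding lookup, integers -> floats.
emb_matrix = model.layers[0].get_weights()[0]              # (vocab_size, 300)
rng = np.random.default_rng(42)
bg_ids = X_train_ids[rng.choice(len(X_train_ids), size=200, replace=False)]
background = emb_matrix[bg_ids]                            # (200, 100, 300)
test_emb = emb_matrix[X_test_ids[:500]]                    # (500, 100, 300)

# Step 2: rebuild the network from the layer *after* the embedding, so every
# operation SHAP differentiates through is float-valued.
emb_in = tf.keras.Input(shape=(100, 300))
x = emb_in
for layer in model.layers[1:]:                             # BiLSTM ... Sigmoid
    x = layer(x)
sub_model = tf.keras.Model(emb_in, x)

explainer = shap.GradientExplainer(sub_model, background)
vals = explainer.shap_values(test_emb)                     # per-dimension values
vals = vals[0] if isinstance(vals, list) else vals         # single-output model

# One value per token position: sum over the 300 embedding dimensions.
token_shap = np.asarray(vals).reshape(len(test_emb), 100, 300).sum(axis=-1)
```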

### What the Numbers Mean

- **Positive SHAP value** → the word pushes the prediction **toward hate speech** (output closer to 1.0)
- **Negative SHAP value** → the word pushes the prediction **toward non-hate** (output closer to 0.0)
- **Magnitude** → how strongly that word influences the decision

### Background Dataset

The explainer needs a baseline: what does the model output with no meaningful input? We use 200 randomly sampled training sequences as this baseline, and SHAP measures each word's contribution *relative to this average*.

---

## 2. How the Top-20 Words Are Selected

This is **not random**. Here is the exact process (a code sketch of Steps 2–4 follows the list):

**Step 1 — Run SHAP on the test set**
For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example; a 100-token sequence yields 100 SHAP values.

**Step 2 — Aggregate by word across all examples**
For each unique word appearing in those 500 examples, compute the **mean SHAP value** across all its occurrences:

```
word "blamed" appears 31 times across 500 examples
SHAP values: [+0.061, +0.058, +0.072, ..., +0.055]
Mean SHAP = sum / 31 = +0.064
```

**Step 3 — Rank by absolute mean SHAP**
Words are sorted by `|mean SHAP|`, which captures both strong hate predictors (large positive values) and strong non-hate predictors (large negative values).

**Step 4 — Show top 20**
The 20 words with the highest `|mean SHAP|` are plotted:
- 🔴 Red bars extending right = pushes toward **hate**
- 🔵 Blue bars extending left = pushes toward **non-hate**

Padding tokens (ID=0) are always excluded; only real words contribute.
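Steps 2–4 reduce to a few lines of Python. A sketch, reusing the assumed `token_shap` and `X_test_ids` names from the earlier snippet, plus a hypothetical `index_word` dict (token ID → word) from the fitted tokenizer:

```python
from collections import defaultdict

import numpy as np

# Assumed inputs: token_shap (500, 100) from the earlier snippet,
# X_test_ids (500, 100) integer tokens, index_word: {token_id: word}.
per_word = defaultdict(list)
for seq_ids, seq_shap in zip(X_test_ids[:500], token_shap):
    for tok_id, val in zip(seq_ids, seq_shap):
        if tok_id == 0:                          # padding is always excluded
            continue
        per_word[index_word[tok_id]].append(float(val))

# Step 2: mean SHAP per word; Step 3: rank by absolute mean.
mean_shap = {word: np.mean(vals) for word, vals in per_word.items()}
top20 = sorted(mean_shap, key=lambda w: abs(mean_shap[w]), reverse=True)[:20]

# Step 4: positive -> hate (red, right); negative -> non-hate (blue, left).
for word in top20:
    print(f"{word:>20s}  {mean_shap[word]:+.4f}")
```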

---

## 3. Results by Evaluation Language

### 3.1 English Test Set

| Top Hate Words | Top Non-Hate Words |
|---|---|
| nas, fags, sicko, sabotage, advocating | grow, barrel, homosexual, pak, join |

*(plot: top-20 SHAP words, English test set)*

The English hate markers skew toward **explicit aggression** (`sicko`, `fags`) and **intentional framing** (`sabotage`, `advocating`); the 50-epoch English phase produced sharper English feature learning than the shorter-training models. The appearance of `homosexual` as a non-hate word reflects that the term appears more often in informational/news text in the dataset than in targeted slurs.

---

### 3.2 Hindi Test Set

| Top Hate Words | Top Non-Hate Words |
|---|---|
| वादा, वैज्ञानिकों, ऐ, उतारा, गला | जीतेगा, घोंटने, जिहादी, आपत्तिजनक, चमचो |

*(plot: top-20 SHAP words, Hindi test set)*

**Note on bar size:** Hindi SHAP values are approximately 10× smaller than English (range ~0.003–0.007 vs ~0.02–0.06). This is not a model failure; it reflects that GloVe was trained on English text and has near-zero vectors for Devanagari words, so the model makes Hindi predictions from positional/sequential patterns rather than word semantics. The words `गला` (throat, as in "गला घोंटना", to strangle) and `ऐ` (hey, an aggressive form of address) showing up as hate markers suggest the model did capture some Hindi-specific abuse patterns through the 50-epoch Hinglish-first training.

Interestingly, `जिहादी` (jihadist) appears as a **non-hate** predictor. This likely reflects news/reporting usage in the dataset rather than targeted hate, and shows how context-free embeddings can produce counterintuitive results.

---

### 3.3 Hinglish Test Set

| Top Hate Words | Top Non-Hate Words |
|---|---|
| arey, bahir, punish, papa, interior | online, member, mam, messages, asha |

*(plot: top-20 SHAP words, Hinglish test set)*

The Hinglish markers are notably **semantically coherent** compared to the v1 (8-epoch) models. `arey` is a Hindi-derived exclamation almost always used in a confrontational tone on social media; `bahir` (outside, "get out") reflects exclusionary language; `punish` as a verb signals threat. The non-hate side shows platform/conversation context (`online`, `messages`, `mam`). Starting on Hinglish for 50 epochs produced the strongest Hinglish-specific feature learning of any model trained in this project.

---

### 3.4 Full Test Set

| Top Hate Words | Top Non-Hate Words |
|---|---|
| blamed, criticized, syntax, grown, sine | underneath, smack, online, hole, clue |

*(plot: top-20 SHAP words, full test set)*

The full-dataset view shows **accusatory framing** (`blamed`, `criticized`) as the dominant hate signal, more so than in any v1 model. The 50-epoch Full phase appears to have learned that hate speech in this multilingual corpus characteristically targets victims through blame and accusation, regardless of language. The spurious markers `syntax` and `sine` (very rare technical words) are outliers that co-occurred with hate content in specific training examples.

---

## 4. Key Findings

### 4.1 SHAP Magnitude Reflects Language Confidence

| Language | Typical top \|SHAP\| | Why |
|---|---|---|
| English | 0.02 – 0.06 | GloVe covers English comprehensively — strong, directional gradients |
| Hinglish | 0.04 – 0.46 | Roman-script words partially covered; model learned strong code-mixed patterns from Phase 1 |
| Hindi | 0.003 – 0.007 | Devanagari script is OOV in GloVe — near-zero vectors produce tiny gradients |

Hindi confidence is roughly 10× lower than English. This is a fundamental limitation of using GloVe (English-trained) for multilingual text; a multilingual model (e.g. mBERT or MuRIL) would show much more balanced confidence across languages.
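As a sanity check on this explanation (not part of the original analysis), GloVe coverage can be measured directly. A sketch, reusing the assumed `emb_matrix` name from the first snippet and assuming out-of-vocabulary rows were left as zero vectors, which the near-zero-vector claim above implies:

```python
import numpy as np

def oov_rate(token_ids, emb_matrix):
    """Fraction of non-padding tokens whose embedding row is all zeros,
    i.e. words missing from GloVe and left at their zero initialisation."""
    ids = token_ids[token_ids != 0]                 # drop padding tokens
    return float(np.mean(~emb_matrix[ids].any(axis=-1)))

# Expectation under the table above (hypothetical variable names):
# oov_rate(X_test_hindi_ids, emb_matrix)   -> close to 1.0 (Devanagari is OOV)
# oov_rate(X_test_english_ids, emb_matrix) -> close to 0.0
```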

### 4.2 Effect of 50-Epoch Training vs 8-Epoch

Compared to the equivalent v1 strategy (Hinglish → Hindi → English, 8 epochs), the v2 model shows:

- **More directional Hinglish markers** — `arey`, `bahir`, `punish` are contextually coherent; v1 had noisier Hinglish top words
- **Accusatory framing as the primary hate signal** — `blamed`, `criticized` in the full eval rather than random rare words; deeper training on the Full phase produced this generalisation
- **Similar English markers** — both models converge on English hate vocabulary after the English phase; the difference is mainly in the non-English languages

### 4.3 Consistent Non-Hate Signal

**"online"** is the most stable non-hate predictor across all four evaluation sets. It appears in informational, conversational, and reporting contexts in all three languages; the model correctly identifies it as a non-toxic context word.

### 4.4 Spurious Correlations

`syntax` and `sine` appearing as hate markers in the full evaluation are clear spurious correlations: rare technical words that happened to co-occur with hateful content in a small number of training examples. The BiLSTM cannot distinguish this from meaningful signal because GloVe has no semantic grounding for these words in a hate/non-hate context.

This is the primary motivation for contextual models: BERT-based approaches represent the same word differently depending on the surrounding context, eliminating most spurious co-occurrence patterns.