Add SHAP summary to README and full SHAP_REPORT.md
- README.md +34 -0
- SHAP_REPORT.md +238 -0
README.md
CHANGED
|
@@ -441,6 +441,40 @@ for text, prob in zip(texts, probs):
|
|
| 441 |
|
| 442 |
---
|
| 443 |
| 444 |
## Citation
|
| 445 |
|
| 446 |
```
|
|
|
|
| 441 |
|
| 442 |
---
|
| 443 |
|
| 444 |
+
## Explainability — SHAP Analysis
|
| 445 |
+
|
| 446 |
+
We applied **SHAP (SHapley Additive exPlanations)** to all 6 trained models to understand which words drive hate speech predictions. A `GradientExplainer` runs on the BiLSTM sub-model, bypassing the embedding layer by feeding it pre-computed float embeddings, with 200 random training samples as background. Each model is evaluated on all 4 test sets (English, Hindi, Hinglish, Full).
|
| 447 |
+
|
| 448 |
+
> Full methodology, all plots, and detailed word tables: **[SHAP_REPORT.md](SHAP_REPORT.md)**
|
| 449 |
+
|
| 450 |
+
### Best Model (Hindi → English → Hinglish) — Top SHAP Words
|
| 451 |
+
|
| 452 |
+
| Eval | Top Hate Words | Top Non-Hate Words |
|
| 453 |
+
|---|---|---|
|
| 454 |
+
| English | credence, bj, ghazi, eni | plain, stranger, sarcasm, rubbish |
|
| 455 |
+
| Hindi | कॉल, भूमिपूजन, मूर्ख | मैसेज, पुलिसकर्मी, जाएगी |
|
| 456 |
+
| Hinglish | bacchi, bull, srk, behan | madrassa, gdp, bech |
|
| 457 |
+
| Full | skua, brut, cleansing, baar | taraf, directory, quran |
|
| 458 |
+
|
| 459 |
+

|
| 460 |
+
|
| 461 |
+

|
| 462 |
+
|
| 463 |
+
### Cross-Model Comparison (Full Test Set)
|
| 464 |
+
|
| 465 |
+
Words appearing in the top-10 of at least 3 models — shows which signals are consistent vs strategy-specific:
|
| 466 |
+
|
| 467 |
+

|
| 468 |
+
|
| 469 |
+
### Key Takeaways
|
| 470 |
+
|
| 471 |
+
- **Hindi SHAP values are roughly 10× smaller** than English/Hinglish, which is consistent with GloVe having near-zero Hindi coverage; the model likely relies on positional patterns rather than semantics
|
| 472 |
+
- **"online" and "rajya"** recur as non-hate signals in most models (4 of the 6 v1 models and v2, per the full report), reflecting informational/political discussion context
|
| 473 |
+
- **Accusatory verbs** (`blame`, `blaming`, `criticized`) and **violence language** (`massacres`, `cleansing`) are the most coherent English hate markers
|
| 474 |
+
- **Spurious correlations visible** (`syntax`, `skua`, `ahh`) — expected limitation of non-contextual GloVe embeddings
|
| 475 |
+
|
| 476 |
+
---
|
| 477 |
+
|
| 478 |
## Citation
|
| 479 |
|
| 480 |
```
|
SHAP_REPORT.md
ADDED
|
@@ -0,0 +1,238 @@
| 1 |
+
# SHAP Explainability Report — Multilingual Hate Speech Detection
|
| 2 |
+
|
| 3 |
+
**Models:** v1 (6 strategies, 8 epochs/phase) + v2 (Hinglish→Hindi→English→Full, 50 epochs/phase)
|
| 4 |
+
**Explainer:** `shap.GradientExplainer` on embedding sub-model (BiLSTM → output)
|
| 5 |
+
**Background:** 200 random training samples
|
| 6 |
+
**Test samples per eval:** 500 per language / full set
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## Table of Contents
|
| 11 |
+
1. [Methodology](#1-methodology)
|
| 12 |
+
2. [v1 — All 6 Strategies](#2-v1--all-6-strategies)
|
| 13 |
+
3. [v2 — Hinglish → Hindi → English → Full](#3-v2--hinglish--hindi--english--full)
|
| 14 |
+
4. [Cross-Model Comparison](#4-cross-model-comparison)
|
| 15 |
+
5. [Key Findings](#5-key-findings)
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## 1. Methodology
|
| 20 |
+
|
| 21 |
+
### How SHAP Works Here
|
| 22 |
+
|
| 23 |
+
Gradient-based SHAP explainers cannot be applied directly to a Keras `Embedding` layer with integer token inputs, because gradients cannot be taken with respect to integer token IDs. We work around this by splitting the model into two parts:
|
| 24 |
+
|
| 25 |
+
```
|
| 26 |
+
Original model:
|
| 27 |
+
Integer tokens (n, 100) → Embedding → (n, 100, 300) → BiLSTM → Dense → Sigmoid
|
| 28 |
+
|
| 29 |
+
SHAP approach:
|
| 30 |
+
Step 1: Manually look up embeddings: tokens → float embeddings (n, 100, 300)
|
| 31 |
+
Step 2: Run GradientExplainer on sub-model: embeddings → BiLSTM → Dense → Sigmoid
|
| 32 |
+
Step 3: SHAP values shape = (n, 100, 300) — one value per token per embedding dim
|
| 33 |
+
Step 4: Sum across 300 embedding dims → (n, 100) — one value per token position
|
| 34 |
+
Step 5: Map token IDs → words and aggregate mean SHAP per word
|
| 35 |
+
```
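Steps 1, 4 and 5 can be sketched in plain Python as below. This is a minimal illustration under assumptions, not the project's code: `lookup_embeddings`, `token_level_shap` and `mean_shap_per_word` are hypothetical helper names, padding is assumed to use token ID 0, and the toy values stand in for real `(n, 100, 300)` arrays. Step 2 (running `shap.GradientExplainer` on the sub-model) is left out because it needs the trained model.

```python
from collections import defaultdict


def lookup_embeddings(token_ids, embedding_matrix):
    """Step 1: replace each integer token ID with its float embedding vector."""
    return [[embedding_matrix[tok] for tok in seq] for seq in token_ids]


def token_level_shap(shap_values):
    """Step 4: sum SHAP values over the embedding dimension.

    Input shape is (n, seq_len, emb_dim); output shape is (n, seq_len).
    """
    return [[sum(dims) for dims in seq] for seq in shap_values]


def mean_shap_per_word(token_ids, token_shap, id_to_word, pad_id=0):
    """Step 5: average the per-position SHAP values over all occurrences of a word."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seq_ids, seq_shap in zip(token_ids, token_shap):
        for tok, value in zip(seq_ids, seq_shap):
            if tok == pad_id:  # assumption: ID 0 is padding and is skipped
                continue
            word = id_to_word[tok]
            totals[word] += value
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


# Toy data: 2 sequences, 3 positions, 2 embedding dims (all values made up).
embedding_matrix = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
id_to_word = {1: "blame", 2: "online"}
token_ids = [[1, 2, 0], [2, 1, 0]]
embeddings = lookup_embeddings(token_ids, embedding_matrix)  # input to Step 2

# Stand-in for the GradientExplainer output of Step 2.
shap_values = [
    [[0.25, 0.25], [-0.125, -0.125], [0.0, 0.0]],
    [[-0.25, -0.25], [0.5, 0.25], [0.0, 0.0]],
]
word_scores = mean_shap_per_word(token_ids, token_level_shap(shap_values), id_to_word)
```

With real data the same three operations would normally be done on NumPy arrays; nested lists are used here only to keep the sketch dependency-free.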
|
| 36 |
+
|
| 37 |
+
A **positive SHAP value** means the word pushes the prediction toward **hate speech**.
|
| 38 |
+
A **negative SHAP value** means the word pushes the prediction toward **non-hate**.
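Given this sign convention, the "Top Hate Words" and "Top Non-Hate Words" columns in the tables of this report amount to a split by sign followed by a ranking. The sketch below shows one way to do that; `top_hate_and_non_hate` is a hypothetical name and the scores are made-up illustration values, not results from the models.

```python
def top_hate_and_non_hate(word_scores, k=5):
    """Split words by SHAP sign and return the k strongest in each direction."""
    hate = sorted(
        (word for word, score in word_scores.items() if score > 0),
        key=lambda word: word_scores[word],
        reverse=True,  # most positive first
    )
    non_hate = sorted(
        (word for word, score in word_scores.items() if score < 0),
        key=lambda word: word_scores[word],  # most negative first
    )
    return hate[:k], non_hate[:k]


# Illustrative mean SHAP values per word (not taken from the report).
scores = {"blame": 0.06, "cleansing": 0.04, "online": -0.05, "rajya": -0.03, "the": 0.0}
hate_words, non_hate_words = top_hate_and_non_hate(scores, k=2)
```

Words with a score of exactly zero fall into neither list.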
|
| 39 |
+
|
| 40 |
+
### Output per Model
|
| 41 |
+
|
| 42 |
+
For each of the 7 models (6 v1 + 1 v2), evaluated on 4 test sets (English, Hindi, Hinglish, Full):
|
| 43 |
+
- **Top-20 word importance bar chart** — `shap_topwords_{lang}.png`
|
| 44 |
+
- **Top-50 words CSV** — `shap_topwords_{lang}.csv`
|
| 45 |
+
- **Summary CSV** — top-5 hate/non-hate words per eval language
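As a rough sketch of how a per-language CSV like `shap_topwords_{lang}.csv` could be produced from per-word scores: the function name, the column names `word` and `mean_shap`, and the ranking by absolute value are all assumptions made for illustration; the report does not specify them.

```python
import csv
import tempfile
from pathlib import Path


def write_topwords_csv(word_scores, lang, out_dir, k=50):
    """Write the k words with the largest |mean SHAP| to shap_topwords_{lang}.csv."""
    ranked = sorted(word_scores.items(), key=lambda item: abs(item[1]), reverse=True)[:k]
    out_path = Path(out_dir) / f"shap_topwords_{lang}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["word", "mean_shap"])  # hypothetical column names
        writer.writerows(ranked)
    return out_path


# Usage with made-up scores, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = write_topwords_csv({"blame": 0.06, "online": -0.05, "the": 0.001}, "english", tmp, k=2)
    file_name = path.name
    rows = path.read_text(encoding="utf-8").splitlines()
```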
|
| 46 |
+
|
| 47 |
+
---
|
| 48 |
+
|
| 49 |
+
## 2. v1 — All 6 Strategies
|
| 50 |
+
|
| 51 |
+
### Strategy 1: English → Hindi → Hinglish
|
| 52 |
+
|
| 53 |
+
| Eval | Top Hate Words | Top Non-Hate Words |
|
| 54 |
+
|---|---|---|
|
| 55 |
+
| English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
|
| 56 |
+
| Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
|
| 57 |
+
| Hinglish | nawaz, dhawan, bashing, shareef, scn | gau, age, rajya, chori, channels |
|
| 58 |
+
| Full | molvi, chalo, molana, scn, elitist | rajya, coding, meat, haan, maine |
|
| 59 |
+
|
| 60 |
+
| Eval | Top Words Plot |
|
| 61 |
+
|---|---|
|
| 62 |
+
| English |  |
|
| 63 |
+
| Hindi |  |
|
| 64 |
+
| Hinglish |  |
|
| 65 |
+
| Full |  |
|
| 66 |
+
|
| 67 |
+
---
|
| 68 |
+
|
| 69 |
+
### Strategy 2: English → Hinglish → Hindi
|
| 70 |
+
|
| 71 |
+
| Eval | Top Hate Words | Top Non-Hate Words |
|
| 72 |
+
|---|---|---|
|
| 73 |
+
| English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
|
| 74 |
+
| Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
|
| 75 |
+
| Hinglish | khi, dada, kiske, srk, chalo | gau, tk, online, liberty, taraf |
|
| 76 |
+
| Full | dada, roj, shamelessness, tujhe, epic | akash, sapna, proud, buddy, episcopal |
|
| 77 |
+
|
| 78 |
+
| Eval | Top Words Plot |
|
| 79 |
+
|---|---|
|
| 80 |
+
| English |  |
|
| 81 |
+
| Hindi |  |
|
| 82 |
+
| Hinglish |  |
|
| 83 |
+
| Full |  |
|
| 84 |
+
|
| 85 |
+
---
|
| 86 |
+
|
| 87 |
+
### Strategy 3: Hindi → English → Hinglish ⭐ Best Model (v1)
|
| 88 |
+
|
| 89 |
+
| Eval | Top Hate Words | Top Non-Hate Words |
|
| 90 |
+
|---|---|---|
|
| 91 |
+
| English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
|
| 92 |
+
| Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
|
| 93 |
+
| Hinglish | bacchi, bull, srk, bahana, behan | madrassa, zaida, gdp, bech, nd |
|
| 94 |
+
| Full | skua, brut, cleansing, captaincy, baar | taraf, pussy, directory, quran, kaha |
|
| 95 |
+
|
| 96 |
+
| Eval | Top Words Plot |
|
| 97 |
+
|---|---|
|
| 98 |
+
| English |  |
|
| 99 |
+
| Hindi |  |
|
| 100 |
+
| Hinglish |  |
|
| 101 |
+
| Full |  |
|
| 102 |
+
|
| 103 |
+
---
|
| 104 |
+
|
| 105 |
+
### Strategy 4: Hindi → Hinglish → English
|
| 106 |
+
|
| 107 |
+
| Eval | Top Hate Words | Top Non-Hate Words |
|
| 108 |
+
|---|---|---|
|
| 109 |
+
| English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
|
| 110 |
+
| Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
|
| 111 |
+
| Hinglish | tatti, sham, dino, roko, krk | lac, online, ancestor, zaida, target |
|
| 112 |
+
| Full | ahh, moi, bj, fault, pan | asperger, hundred, database, wicked, nam |
|
| 113 |
+
|
| 114 |
+
| Eval | Top Words Plot |
|
| 115 |
+
|---|---|
|
| 116 |
+
| English |  |
|
| 117 |
+
| Hindi |  |
|
| 118 |
+
| Hinglish |  |
|
| 119 |
+
| Full |  |
|
| 120 |
+
|
| 121 |
+
---
|
| 122 |
+
|
| 123 |
+
### Strategy 5: Hinglish → English → Hindi
|
| 124 |
+
|
| 125 |
+
| Eval | Top Hate Words | Top Non-Hate Words |
|
| 126 |
+
|---|---|---|
|
| 127 |
+
| English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
|
| 128 |
+
| Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
|
| 129 |
+
| Hinglish | huye, dada, abb, arey, abduction | rajya, bahu, parliament, code, music |
|
| 130 |
+
| Full | skua, praised, spic, sabse, plz | liberty, languages, speaks, bache, maine |
|
| 131 |
+
|
| 132 |
+
| Eval | Top Words Plot |
|
| 133 |
+
|---|---|
|
| 134 |
+
| English |  |
|
| 135 |
+
| Hindi |  |
|
| 136 |
+
| Hinglish |  |
|
| 137 |
+
| Full |  |
|
| 138 |
+
|
| 139 |
+
---
|
| 140 |
+
|
| 141 |
+
### Strategy 6: Hinglish → Hindi → English
|
| 142 |
+
|
| 143 |
+
| Eval | Top Hate Words | Top Non-Hate Words |
|
| 144 |
+
|---|---|---|
|
| 145 |
+
| English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
|
| 146 |
+
| Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
|
| 147 |
+
| Hinglish | fav, janab, chori, cum, ruk | online, gau, dehli, 2017, rajya |
|
| 148 |
+
| Full | srk, roj, rhi, purana, aapke | nest, maine, hone, haired, barrel |
|
| 149 |
+
|
| 150 |
+
| Eval | Top Words Plot |
|
| 151 |
+
|---|---|
|
| 152 |
+
| English |  |
|
| 153 |
+
| Hindi |  |
|
| 154 |
+
| Hinglish |  |
|
| 155 |
+
| Full |  |
|
| 156 |
+
|
| 157 |
+
---
|
| 158 |
+
|
| 159 |
+
## 3. v2 — Hinglish → Hindi → English → Full (50 epochs)
|
| 160 |
+
|
| 161 |
+
| Eval | Top Hate Words | Top Non-Hate Words |
|
| 162 |
+
|---|---|---|
|
| 163 |
+
| English | nas, fags, sicko, sabotage, advocating | grow, barrel, homosexual, pak, join |
|
| 164 |
+
| Hindi | वादा, वैज्ञानिकों, ऐ, उतारा, गला | जीतेगा, घोंटने, जिहादी, आपत्तिजनक, चमचो |
|
| 165 |
+
| Hinglish | arey, bahir, punish, papa, interior | online, member, mam, messages, asha |
|
| 166 |
+
| Full | blamed, criticized, syntax, grown, sine | underneath, smack, online, hole, clue |
|
| 167 |
+
|
| 168 |
+
| Eval | Top Words Plot |
|
| 169 |
+
|---|---|
|
| 170 |
+
| English |  |
|
| 171 |
+
| Hindi |  |
|
| 172 |
+
| Hinglish |  |
|
| 173 |
+
| Full |  |
|
| 174 |
+
|
| 175 |
+
---
|
| 176 |
+
|
| 177 |
+
## 4. Cross-Model Comparison
|
| 178 |
+
|
| 179 |
+
Words that appear in the top-10 SHAP list of at least 3 models, side-by-side across all strategies:
|
| 180 |
+
|
| 181 |
+
**English test set:**
|
| 182 |
+

|
| 183 |
+
|
| 184 |
+
**Hindi test set:**
|
| 185 |
+

|
| 186 |
+
|
| 187 |
+
**Hinglish test set:**
|
| 188 |
+

|
| 189 |
+
|
| 190 |
+
**Full test set:**
|
| 191 |
+

|
| 192 |
+
|
| 193 |
+
---
|
| 194 |
+
|
| 195 |
+
## 5. Key Findings
|
| 196 |
+
|
| 197 |
+
### 5.1 SHAP Magnitude Reveals Language Confidence
|
| 198 |
+
|
| 199 |
+
Hindi SHAP values are consistently **one order of magnitude smaller** than English and Hinglish:
|
| 200 |
+
|
| 201 |
+
| Language | Typical Top SHAP | Interpretation |
|
| 202 |
+
|---|---|---|
|
| 203 |
+
| English | 0.03 – 0.08 | Model is confident — GloVe has rich English vectors |
|
| 204 |
+
| Hinglish | 0.03 – 0.07 | Model learned strong patterns despite OOV words |
|
| 205 |
+
| Hindi | 0.002 – 0.005 | Model is uncertain — most Hindi tokens have zero GloVe vectors |
|
| 206 |
+
|
| 207 |
+
This is consistent with the lower accuracy and F1 on Hindi across all models.
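The "one order of magnitude" gap can be checked with simple arithmetic on the ranges in the table above. The snippet only divides the reported range bounds; `magnitude_ratio` is a hypothetical helper name.

```python
# Typical top SHAP ranges (lower, upper) as reported in the table above.
ranges = {
    "English": (0.03, 0.08),
    "Hinglish": (0.03, 0.07),
    "Hindi": (0.002, 0.005),
}


def magnitude_ratio(reference, baseline):
    """Ratio of the lower and upper range bounds, rounded to one decimal."""
    return tuple(round(ref / base, 1) for ref, base in zip(reference, baseline))


english_vs_hindi = magnitude_ratio(ranges["English"], ranges["Hindi"])    # (15.0, 16.0)
hinglish_vs_hindi = magnitude_ratio(ranges["Hinglish"], ranges["Hindi"])  # (15.0, 14.0)
```

Both comparisons come out at roughly 14 to 16 times, which matches the order-of-magnitude description.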
|
| 208 |
+
|
| 209 |
+
### 5.2 Consistent Non-Hate Signals Across Models
|
| 210 |
+
|
| 211 |
+
The words **"online"** and **"rajya"** (state/parliament), both with negative SHAP values, appear as top non-hate predictors in 4 out of 6 v1 models and in v2. They appear to represent informational/political discussion context that the model separates from targeted hate.
|
| 212 |
+
|
| 213 |
+
### 5.3 Hate Speech Markers Are Linguistically Coherent
|
| 214 |
+
|
| 215 |
+
- **English:** Direct slurs (`spic`, `coon`), violence language (`massacres`, `cleansing`), accusatory verbs (`blame`, `blaming`, `blamed`, `criticized`) — consistent with how hate speech presents in English social media
|
| 216 |
+
- **Hinglish:** Relationship insults (`behan` — sister, used in abusive context), aggressive interjections (`arey`, `abb`, `ruk`), names in hate context (`srk`, `dada`) — reflects code-mixed abuse patterns
|
| 217 |
+
- **Hindi:** Body/violence metaphors (`गला`: throat, as in strangle), direct insults (`मूर्ख`: fool), and political provocations (`भूमिपूजन`: ground-breaking ceremony, a polarising event)
|
| 218 |
+
|
| 219 |
+
### 5.4 Spurious Correlations Are Visible
|
| 220 |
+
|
| 221 |
+
Several high-SHAP words are clearly spurious:
|
| 222 |
+
- **"syntax" and "sine"** as hate markers in the v2 full eval, and **"skua"** in the v1 Strategy 3 and Strategy 5 full evals: rare words the model likely overfits to in specific hateful contexts rather than learning their meaning
|
| 223 |
+
- **"homosexual"** as non-hate in v2: it likely appears in informational/news text in the dataset rather than in targeted slurs
|
| 224 |
+
- **"ahh"** appearing as hate in multiple models — likely a noise/exclamation pattern co-occurring with aggressive text
|
| 225 |
+
|
| 226 |
+
These spurious correlations are expected limitations of GloVe + BiLSTM — without contextual embeddings (e.g. BERT), the model cannot distinguish word meaning from co-occurrence patterns.
|
| 227 |
+
|
| 228 |
+
### 5.5 v1 vs v2 Comparison
|
| 229 |
+
|
| 230 |
+
| Aspect | v1 (8 epochs) | v2 (50 epochs) |
|
| 231 |
+
|---|---|---|
|
| 232 |
+
| English SHAP range | 0.03–0.07 | 0.02–0.06 |
|
| 233 |
+
| Hinglish SHAP range | 0.03–0.57 | 0.04–0.46 |
|
| 234 |
+
| Hindi SHAP range | 0.001–0.005 | 0.003–0.007 |
|
| 235 |
+
| English hate markers | Varied, some spurious | More direct: sicko, fags, sabotage, advocating |
|
| 236 |
+
| Full eval hate markers | Mixed language words | Accusatory framing: blamed, criticized |
|
| 237 |
+
|
| 238 |
+
v2's longer training produces slightly more semantically coherent English hate markers. The full-dataset phase in v2 notably produces **accusatory framing words** (`blamed`, `criticized`) as hate predictors, suggesting that hate speech in the combined corpus often frames targets through blame/accusation rather than direct slurs.
|