# SHAP Explainability Report — Multilingual Hate Speech Detection (v1)
**Models:** All 6 sequential transfer learning strategies (GloVe + BiLSTM, 8 epochs/phase)
**Explainer:** `shap.GradientExplainer` on the BiLSTM sub-model
**Background:** 200 randomly sampled training sequences
**Test samples:** 500 per language (random subset), full test set = 8,852
---
## Table of Contents
1. [What SHAP Is and How It Works Here](#1-what-shap-is-and-how-it-works-here)
2. [How the Top-20 Words Are Selected](#2-how-the-top-20-words-are-selected)
3. [Strategy 1 — English → Hindi → Hinglish](#3-strategy-1--english--hindi--hinglish)
4. [Strategy 2 — English → Hinglish → Hindi](#4-strategy-2--english--hinglish--hindi)
5. [Strategy 3 — Hindi → English → Hinglish ⭐ Best](#5-strategy-3--hindi--english--hinglish--best)
6. [Strategy 4 — Hindi → Hinglish → English](#6-strategy-4--hindi--hinglish--english)
7. [Strategy 5 — Hinglish → English → Hindi](#7-strategy-5--hinglish--english--hindi)
8. [Strategy 6 — Hinglish → Hindi → English](#8-strategy-6--hinglish--hindi--english)
9. [Cross-Strategy Comparison](#9-cross-strategy-comparison)
10. [Key Findings](#10-key-findings)
---
## 1. What SHAP Is and How It Works Here
**SHAP (SHapley Additive exPlanations)** answers the question: *"For this specific prediction, how much did each word contribute to the output?"*
It does this using game theory — it treats each word as a "player" and fairly distributes the difference between the model's prediction and a baseline expectation among all the words in the input.
### The Technical Challenge
Our model takes **integer token IDs** as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each input's influence. But you cannot take a gradient through an integer operation.
**Solution — split the model into two steps:**
```
Step 1 (outside SHAP):
Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300)
Step 2 (SHAP runs here):
Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)
```
SHAP's GradientExplainer runs on Step 2 only. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then **sum across all 300 dimensions** for each position to get one number per word.
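The dimension-summing step above can be sketched in numpy. The shapes match the report (100 positions × 300 GloVe dimensions), but the embedding table and per-dimension attributions are random stand-ins so the aggregation itself is concrete:

```python
import numpy as np

# Shapes from the report: 100 token positions, 300-dim GloVe vectors.
MAX_LEN, EMB_DIM = 100, 300
rng = np.random.default_rng(0)

# Step 1 (outside SHAP): embedding lookup turns integer token IDs into
# float vectors. A random table stands in for GloVe here.
embedding_matrix = rng.normal(size=(5000, EMB_DIM)).astype(np.float32)
token_ids = rng.integers(1, 5000, size=(4, MAX_LEN))   # 4 example sequences
embedded = embedding_matrix[token_ids]                 # (4, 100, 300)

# Step 2 (where GradientExplainer runs): it returns one attribution per
# embedding dimension per position, shape (n, 100, 300). Faked here.
shap_per_dim = rng.normal(scale=0.01, size=embedded.shape)

# Collapse the 300 embedding dimensions into one SHAP value per word.
shap_per_word = shap_per_dim.sum(axis=-1)              # (4, 100)
```

The result is one signed number per token position, which is what the top-word rankings in the sections below are built from.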
### What the Numbers Mean
- **Positive SHAP value** → word pushes the prediction **toward hate speech** (output closer to 1.0)
- **Negative SHAP value** → word pushes the prediction **toward non-hate** (output closer to 0.0)
- **Magnitude** (how large the number is) → how strongly that word influences the decision
### Background Dataset
The explainer needs a "baseline" — what does the model output when there is no meaningful input? We use 200 randomly sampled training sequences as this baseline. SHAP measures each word's contribution *relative to this baseline*.
---
## 2. How the Top-20 Words Are Selected
This is **not random**. Here is the exact process:
**Step 1 — Run SHAP on the test set**
For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence produces 100 SHAP values.
**Step 2 — Aggregate by word across all examples**
For each unique word that appears in those 500 examples, we compute the **mean SHAP value** across every occurrence of that word:
```
word "blame" appears 47 times across 500 examples
SHAP values: [+0.031, +0.028, +0.019, ..., +0.034]
Mean SHAP = sum of all values / 47 = +0.026
```
**Step 3 — Rank by absolute mean SHAP**
Words are sorted by `|mean SHAP|` — this captures both strong hate predictors (large positive) and strong non-hate predictors (large negative).
**Step 4 — Show top 20**
The 20 words with the highest `|mean SHAP|` are plotted. The colour tells you the direction:
- 🔴 Red bars extending right = pushes toward **hate**
- 🔵 Blue bars extending left = pushes toward **non-hate**
**Important:** padding tokens (ID=0) are excluded. Only real words contribute.
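Steps 2-4 above can be sketched end to end. The token IDs, SHAP values, and `index_word` vocabulary below are hypothetical stand-ins for the real tokenizer output; the masking, per-word averaging, and ranking are the actual procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: per-position SHAP values and the matching token IDs
# for a batch of 500 test sequences (ID 0 = padding, as in the report).
token_ids = rng.integers(0, 50, size=(500, 100))
shap_vals = rng.normal(scale=0.02, size=(500, 100))
index_word = {i: f"word{i}" for i in range(1, 50)}     # hypothetical vocab

# Step 2: mean SHAP per unique word, padding (ID 0) excluded.
mask = token_ids != 0
ids, flat_shap = token_ids[mask], shap_vals[mask]
mean_shap = {index_word[i]: flat_shap[ids == i].mean()
             for i in np.unique(ids)}

# Steps 3-4: rank by |mean SHAP| and keep the top 20.
top20 = sorted(mean_shap.items(), key=lambda kv: abs(kv[1]), reverse=True)[:20]
```

Sorting by absolute value is what lets a strongly *negative* word (a non-hate predictor) rank alongside a strongly positive one.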
---
## 3. Strategy 1 — English → Hindi → Hinglish
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
| Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
| Hinglish | nawaz, dhawan, bashing, shareef, scn | gau, age, rajya, chori, channels |
| Full | molvi, chalo, molana, scn, elitist | rajya, coding, meat, haan, maine |
| Eval | Top Words Plot |
|---|---|
| English |  |
| Hindi |  |
| Hinglish |  |
| Full |  |
**Note on Hindi:** The bar lengths for Hindi are much shorter than English/Hinglish. This is expected — GloVe was trained on English text, so Hindi words have near-zero embedding vectors. The model's gradients through those zero vectors are tiny, producing very small SHAP values. The model is making Hindi predictions mostly from position/sequence patterns rather than word meaning.
---
## 4. Strategy 2 — English → Hinglish → Hindi
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
| Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
| Hinglish | khi, dada, kiske, srk, chalo | gau, tk, online, liberty, taraf |
| Full | dada, roj, shamelessness, tujhe, epic | akash, sapna, proud, buddy, episcopal |
| Eval | Top Words Plot |
|---|---|
| English |  |
| Hindi |  |
| Hinglish |  |
| Full |  |
---
## 5. Strategy 3 — Hindi → English → Hinglish ⭐ Best
This is the best performing strategy (F1=0.6419, AUC=0.7528 on full test set).
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
| Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
| Hinglish | bacchi, bull, srk, bahana, behan | madrassa, zaida, gdp, bech, nd |
| Full | skua, brut, cleansing, captaincy, baar | taraf, pussy, directory, quran, kaha |
| Eval | Top Words Plot |
|---|---|
| English |  |
| Hindi |  |
| Hinglish |  |
| Full |  |
Starting with Hindi forced the model to develop pattern-based hate detection before seeing GloVe-rich English vocabulary. The Hinglish hate markers (`bacchi`, `behan` — familial insults common in South Asian abuse; `srk` — a celebrity name frequently targeted in communal hate) are more semantically coherent than in English-first strategies.
---
## 6. Strategy 4 — Hindi → Hinglish → English
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
| Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
| Hinglish | tatti, sham, dino, roko, krk | lac, online, ancestor, zaida, target |
| Full | ahh, moi, bj, fault, pan | asperger, hundred, database, wicked, nam |
| Eval | Top Words Plot |
|---|---|
| English |  |
| Hindi |  |
| Hinglish |  |
| Full |  |
---
## 7. Strategy 5 — Hinglish → English → Hindi
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
| Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
| Hinglish | huye, dada, abb, arey, abduction | rajya, bahu, parliament, code, music |
| Full | skua, praised, spic, sabse, plz | liberty, languages, speaks, bache, maine |
| Eval | Top Words Plot |
|---|---|
| English |  |
| Hindi |  |
| Hinglish |  |
| Full |  |
**Interesting:** Hindi non-hate words in this strategy include `चूतिए` and `मुल्ले` — which are slurs — yet the model assigns them negative SHAP. This reflects the model's confusion after ending on Hindi (last phase): it cannot consistently assign direction to Hindi vocabulary even when the words are inherently offensive.
---
## 8. Strategy 6 — Hinglish → Hindi → English
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
| Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
| Hinglish | fav, janab, chori, cum, ruk | online, gau, dehli, 2017, rajya |
| Full | srk, roj, rhi, purana, aapke | nest, maine, hone, haired, barrel |
| Eval | Top Words Plot |
|---|---|
| English |  |
| Hindi |  |
| Hinglish |  |
| Full |  |
---
## 9. Cross-Strategy Comparison
The heatmap below shows the top-5 words from each strategy, evaluated across all 6 models simultaneously. Each cell is the mean SHAP value for that word in that model. Red = pushes toward hate, blue = pushes toward non-hate, white = word not seen or neutral.
**English test set:**

**Hindi test set:**

**Hinglish test set:**

**Full test set:**

---
## 10. Key Findings
### SHAP Magnitude Reflects Language Confidence
| Language | Typical Top \|SHAP\| | Why |
|---|---|---|
| English | 0.03 – 0.08 | GloVe has 6B token English coverage — rich, meaningful vectors |
| Hinglish | 0.03 – 0.07 | Roman-script words partially covered by GloVe; model learns strong patterns |
| Hindi | 0.002 – 0.005 | Devanagari words are mostly OOV in GloVe — near-zero vectors, tiny gradients |
This directly explains the Hindi accuracy gap across all 6 strategies. The model is essentially guessing from context and position for Hindi, not from word meaning.
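The zero-vector mechanism can be shown with a minimal sketch. It uses gradient × input as a simplified proxy for GradientExplainer's expected-gradients computation, and a toy linear scorer in place of the BiLSTM — both are assumptions for illustration, but the intuition carries over:

```python
import numpy as np

rng = np.random.default_rng(2)
EMB_DIM = 300

# Toy linear scorer over a single token's embedding: score = w . e
w = rng.normal(size=EMB_DIM)

in_vocab = rng.normal(size=EMB_DIM)   # GloVe-covered English word
oov = np.zeros(EMB_DIM)               # OOV Devanagari word: all-zero row

# The gradient of the score w.r.t. the embedding is just w, so a
# gradient x input attribution is (w * e) summed over dimensions.
attr_in_vocab = float((w * in_vocab).sum())
attr_oov = float((w * oov).sum())     # zero vector -> exactly zero
```

However large the model's weights, an all-zero embedding row contributes nothing to a gradient × input attribution — which is why Hindi bars are an order of magnitude shorter.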
### Consistent Non-Hate Signals Across All 6 Models
- **"online"** — appears as a top non-hate predictor in 4/6 strategies. Informational/conversational context: people discussing online behaviour, platforms, reporting rather than targeting.
- **"rajya"** (state/parliament) — consistent non-hate in Hinglish eval. Political discussion about government is distinguishable from targeted hate.
- **"maine"** (I/me in Hindi-Urdu) — first-person perspective associated with personal narrative rather than targeted attacks.
### Hate Speech Markers Are Linguistically Coherent
**English:** Direct accusation verbs (`blame`, `blaming`, `criticized`) and violence vocabulary (`massacres`, `cleansing`, `violence`) are the most consistent. These reflect how hate speech in this dataset frames targets through accusation and dehumanisation.
**Hinglish:** Familial insults (`behan` = sister, `bacchi` = girl/child — used in gendered abuse), aggressive interjections (`arey`, `abb`, `ruk`), and celebrity/political names in abusive context (`srk`, `krk`, `dada`) — reflects South Asian social media abuse patterns.
**Hindi:** Body/violence metaphors (`गला` = throat, as in strangle; `मूर्ख` = fool) and politically charged proper nouns (`भूमिपूजन` = Ram Mandir ground-breaking ceremony — a highly polarising event in the dataset period).
### Visible Spurious Correlations
Several high-SHAP words are clearly not meaningful hate indicators:
- **`skua`** (a seabird) — appears as top hate word in 2 strategies. Rare word co-occurring with hateful text in training; model memorised the co-occurrence.
- **`ahh`** — exclamation appearing in multiple models as a hate signal. Aggressive tone marker rather than a meaningful word.
- **`coon`** appearing as non-hate in Strategy 2 — this is a racial slur, but in this model it was learned as a feature in a specific context (likely news/reporting usage in the dataset).
These spurious patterns are an inherent limitation of non-contextual GloVe embeddings: the model sees a word as a fixed vector regardless of the sentence it appears in.