Update SHAP report — v1 isolated, full methodology
SHAP_REPORT.md

# SHAP Explainability Report — Multilingual Hate Speech Detection (v1)

**Models:** All 6 sequential transfer learning strategies (GloVe + BiLSTM, 8 epochs/phase)
**Explainer:** `shap.GradientExplainer` on the BiLSTM sub-model
**Background:** 200 randomly sampled training sequences
**Test samples:** 500 per language (random subset), full test set = 8,852

---

## Table of Contents

1. [What SHAP Is and How It Works Here](#1-what-shap-is-and-how-it-works-here)
2. [How the Top-20 Words Are Selected](#2-how-the-top-20-words-are-selected)
3. [Strategy 1 — English → Hindi → Hinglish](#3-strategy-1--english--hindi--hinglish)
4. [Strategy 2 — English → Hinglish → Hindi](#4-strategy-2--english--hinglish--hindi)
5. [Strategy 3 — Hindi → English → Hinglish ⭐ Best](#5-strategy-3--hindi--english--hinglish--best)
6. [Strategy 4 — Hindi → Hinglish → English](#6-strategy-4--hindi--hinglish--english)
7. [Strategy 5 — Hinglish → English → Hindi](#7-strategy-5--hinglish--english--hindi)
8. [Strategy 6 — Hinglish → Hindi → English](#8-strategy-6--hinglish--hindi--english)
9. [Cross-Strategy Comparison](#9-cross-strategy-comparison)
10. [Key Findings](#10-key-findings)

---

## 1. What SHAP Is and How It Works Here

**SHAP (SHapley Additive exPlanations)** answers the question: *"For this specific prediction, how much did each word contribute to the output?"*

It does this using game theory — it treats each word as a "player" and fairly distributes the model's final prediction score among all the words in the input.

### The Technical Challenge

Our model takes **integer token IDs** as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each input's influence. But you cannot take a gradient through an integer operation.

**Solution — split the model into two steps:**

```
Step 1 (outside SHAP):
Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300)

Step 2 (SHAP runs here):
Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)
```

SHAP's GradientExplainer runs on Step 2 only. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then **sum across all 300 dimensions** for each position to get one number per word.
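
The dimension-summing step can be sketched with NumPy alone — a toy random array stands in for the explainer's output here, so nothing below calls `shap` itself:

```python
import numpy as np

# Toy stand-in for GradientExplainer's output on the Step-2 sub-model:
# one attribution per (sample, token position, embedding dimension).
n, seq_len, embed_dim = 4, 100, 300
rng = np.random.default_rng(0)
shap_values = rng.normal(scale=0.001, size=(n, seq_len, embed_dim))

# Collapse the 300 embedding dimensions -> one SHAP value per token position.
token_shap = shap_values.sum(axis=-1)
print(token_shap.shape)  # (4, 100)
```

With the real explainer, `shap_values` would instead come from something like `shap.GradientExplainer(sub_model, background).shap_values(embedded_test)`; the summing step is identical.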

### What the Numbers Mean

- **Positive SHAP value** → word pushes the prediction **toward hate speech** (output closer to 1.0)
- **Negative SHAP value** → word pushes the prediction **toward non-hate** (output closer to 0.0)
- **Magnitude** (how large the number is) → how strongly that word influences the decision

### Background Dataset

The explainer needs a "baseline" — what does the model output when there is no meaningful input? We use 200 randomly sampled training sequences as this baseline. SHAP measures each word's contribution *relative to this baseline*.
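
A toy illustration of how the baseline enters the numbers (invented values, not taken from the report): SHAP values are additive, so the baseline output plus all per-word contributions reconstructs the model's own prediction for that sentence.

```python
# Hypothetical numbers for one sentence (illustration only).
base_value = 0.31  # mean model output over the 200 background sequences
word_shap = {"you": 0.02, "people": 0.05, "are": 0.01, "vermin": 0.24}

# SHAP's additivity property: baseline + contributions = model output.
prediction = base_value + sum(word_shap.values())
print(round(prediction, 2))  # 0.63
```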

---

## 2. How the Top-20 Words Are Selected

This is **not random**. Here is the exact process:

**Step 1 — Run SHAP on the test set**
For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence produces 100 SHAP values.

**Step 2 — Aggregate by word across all examples**
For each unique word that appears in those 500 examples, we compute the **mean SHAP value** across every occurrence of that word:

```
word "blame" appears 47 times across 500 examples
SHAP values: [+0.031, +0.028, +0.019, ..., +0.034]
Mean SHAP = sum of all values / 47 = +0.026
```

**Step 3 — Rank by absolute mean SHAP**
Words are sorted by `|mean SHAP|` — this captures both strong hate predictors (large positive) and strong non-hate predictors (large negative).

**Step 4 — Show top 20**
The 20 words with the highest `|mean SHAP|` are plotted. The colour tells you the direction:
- 🔴 Red bars extending right = pushes toward **hate**
- 🔵 Blue bars extending left = pushes toward **non-hate**

**Important:** padding tokens (ID=0) are excluded. Only real words contribute.
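
The whole selection pipeline can be sketched from scratch (toy token IDs, SHAP values, and vocabulary — `top_words` is an illustrative helper, not the report's actual code):

```python
import numpy as np

def top_words(token_ids, token_shap, id_to_word, k=3, pad_id=0):
    """Mean SHAP per unique word, ranked by |mean SHAP|; padding excluded."""
    sums, counts = {}, {}
    for id_row, shap_row in zip(token_ids, token_shap):
        for tok, val in zip(id_row, shap_row):
            if tok == pad_id:                     # exclude padding tokens
                continue
            sums[tok] = sums.get(tok, 0.0) + val
            counts[tok] = counts.get(tok, 0) + 1
    means = {t: sums[t] / counts[t] for t in sums}           # Step 2: mean per word
    ranked = sorted(means, key=lambda t: abs(means[t]), reverse=True)  # Step 3
    return [(id_to_word[t], means[t]) for t in ranked[:k]]             # Step 4

# Toy data: 2 sequences, 4 token positions each (0 = padding).
ids   = np.array([[5, 7, 5, 0], [7, 9, 0, 0]])
vals  = np.array([[0.04, -0.05, 0.02, 0.0], [-0.03, 0.001, 0.0, 0.0]])
vocab = {5: "blame", 7: "online", 9: "ahh"}
print(top_words(ids, vals, vocab))
```

Here "online" ranks first because its mean contribution has the largest magnitude, even though it is negative (a non-hate predictor).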

---

## 3. Strategy 1 — English → Hindi → Hinglish

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
| Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |

| Eval | Top Words Plot |
|---|---|
| Hinglish | ![](plots/shap/v1/shap_top_words_strategy1_hinglish.png) |
| Full | ![](plots/shap/v1/shap_top_words_strategy1_full.png) |

**Note on Hindi:** The bar lengths for Hindi are much shorter than English/Hinglish. This is expected — GloVe was trained on English text, so Hindi words have near-zero embedding vectors. The model's gradients through those zero vectors are tiny, producing very small SHAP values. The model is making Hindi predictions mostly from position/sequence patterns rather than word meaning.
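
The zero-vector effect can be checked without SHAP at all. A minimal sketch, assuming a hypothetical 4-dimensional embedding table and a linear scoring head standing in for the BiLSTM:

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim = 4                              # toy size; the real model uses 300
embedding = np.zeros((3, embed_dim))       # row 0 = padding
embedding[1] = rng.normal(size=embed_dim)  # in-vocabulary word, e.g. "hate"
# row 2 stays all zeros — a Devanagari word that is OOV for GloVe

# For a linear head score = w . e, a token's contribution (gradient w times
# its embedding e) vanishes exactly when e is the zero vector.
w = rng.normal(size=embed_dim)
contrib_in_vocab = float(w @ embedding[1])
contrib_oov = float(w @ embedding[2])
print(contrib_oov)  # 0.0
```

The real BiLSTM is nonlinear, but the same mechanism applies: near-zero input vectors yield near-zero gradients and therefore tiny attributions.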

---

## 4. Strategy 2 — English → Hinglish → Hindi

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
| Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |

---

## 5. Strategy 3 — Hindi → English → Hinglish ⭐ Best

This is the best performing strategy (F1=0.6419, AUC=0.7528 on the full test set).

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
| Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |

| Eval | Top Words Plot |
|---|---|
| Hinglish | ![](plots/shap/v1/shap_top_words_strategy3_hinglish.png) |
| Full | ![](plots/shap/v1/shap_top_words_strategy3_full.png) |

Starting with Hindi forced the model to develop pattern-based hate detection before seeing GloVe-rich English vocabulary. The Hinglish hate markers (`bacchi`, `behan` — familial insults common in South Asian abuse; `srk` — a celebrity name frequently targeted in communal hate) are more semantically coherent than in English-first strategies.

---

## 6. Strategy 4 — Hindi → Hinglish → English

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
| Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |

---

## 7. Strategy 5 — Hinglish → English → Hindi

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
| Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |

| Eval | Top Words Plot |
|---|---|
| Hinglish | ![](plots/shap/v1/shap_top_words_strategy5_hinglish.png) |
| Full | ![](plots/shap/v1/shap_top_words_strategy5_full.png) |

**Interesting:** Hindi non-hate words in this strategy include `चूतिए` and `मुल्ले` — which are slurs — yet the model assigns them negative SHAP. This reflects the model's confusion after ending on Hindi (the last phase): it cannot consistently assign direction to Hindi vocabulary even when the words are inherently offensive.

---

## 8. Strategy 6 — Hinglish → Hindi → English

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
| Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |

---

## 9. Cross-Strategy Comparison

The heatmap below shows the top-5 words from each strategy, evaluated across all 6 models simultaneously. Each cell is the mean SHAP value for that word in that model. Red = pushes toward hate, blue = pushes toward non-hate, white = word not seen or neutral.

**English test set:**
![](plots/shap/v1/shap_comparison_english.png)

---

## 10. Key Findings

### SHAP Magnitude Reflects Language Confidence

| Language | Typical Top \|SHAP\| | Why |
|---|---|---|
| English | 0.03 – 0.08 | GloVe has 6B-token English coverage — rich, meaningful vectors |
| Hinglish | 0.03 – 0.07 | Roman-script words partially covered by GloVe; model learns strong patterns |
| Hindi | 0.002 – 0.005 | Devanagari words are mostly OOV in GloVe — near-zero vectors, tiny gradients |

This directly explains the Hindi accuracy gap across all 6 strategies. The model is essentially guessing from context and position for Hindi, not from word meaning.

### Consistent Non-Hate Signals Across All 6 Models

- **"online"** — appears as a top non-hate predictor in 4/6 strategies. Informational/conversational context: people discussing online behaviour, platforms, and reporting rather than targeting anyone.
- **"rajya"** (state/parliament) — consistent non-hate in the Hinglish eval. Political discussion about government is distinguishable from targeted hate.
- **"maine"** (I/me in Hindi-Urdu) — first-person perspective associated with personal narrative rather than targeted attacks.

### Hate Speech Markers Are Linguistically Coherent

**English:** Direct accusation verbs (`blame`, `blaming`, `criticized`) and violence vocabulary (`massacres`, `cleansing`, `violence`) are the most consistent. These reflect how hate speech in this dataset frames targets through accusation and dehumanisation.

**Hinglish:** Familial insults (`behan` = sister, `bacchi` = girl/child — used in gendered abuse), aggressive interjections (`arey`, `abb`, `ruk`), and celebrity/political names in abusive context (`srk`, `krk`, `dada`) — reflecting South Asian social media abuse patterns.

**Hindi:** Body/violence metaphors (`गला` = throat, as in strangle; `मूर्ख` = fool) and politically charged proper nouns (`भूमिपूजन` = Ram Mandir ground-breaking ceremony — a highly polarising event in the dataset period).

### Visible Spurious Correlations

Several high-SHAP words are clearly not meaningful hate indicators:
- **`skua`** (a seabird) — appears as a top hate word in 2 strategies: a rare word that co-occurred with hateful text in training, so the model memorised the co-occurrence.
- **`ahh`** — an exclamation appearing in multiple models as a hate signal: an aggressive tone marker rather than a meaningful word.
- **`coon`** — appears as non-hate in Strategy 2. This is a racial slur, but the model learned it as a feature of one specific context (likely news/reporting usage in the dataset).

These spurious patterns are an inherent limitation of non-contextual GloVe embeddings: the model sees a word as a fixed vector regardless of the sentence it appears in.