Commit dfa11db by tuklu (verified) · 1 Parent(s): f3c157a

Add SHAP summary to README and full SHAP_REPORT.md

Files changed (2)
  1. README.md +33 -0
  2. SHAP_REPORT.md +238 -0
README.md CHANGED
@@ -301,6 +301,39 @@ for text, prob in zip(texts, probs):
301
 
302
  ---
303
 
304
+ ## Explainability — SHAP Analysis
305
+
306
+ We applied **SHAP (SHapley Additive exPlanations)** to the final trained model to see which words drive its hate speech predictions. A `GradientExplainer` runs on the BiLSTM sub-model: the embedding layer is bypassed and token embeddings are pre-computed as floats. It uses 200 background training samples and is evaluated on all 4 test sets.
307
+
308
+ > Full methodology, all strategy comparisons, and detailed word tables: **[SHAP_REPORT.md](SHAP_REPORT.md)**
309
+
310
+ ### Top SHAP Words — Final Model
311
+
312
+ | Eval | Top Hate Words | Top Non-Hate Words |
313
+ |---|---|---|
314
+ | English | nas, fags, sicko, sabotage, advocating | grow, barrel, homosexual, pak, join |
315
+ | Hindi | वादा, वैज्ञानिकों, ऐ, उतारा, गला | जीतेगा, घोंटने, जिहादी, आपत्तिजनक, चमचो |
316
+ | Hinglish | arey, bahir, punish, papa, interior | online, member, mam, messages, asha |
317
+ | Full | blamed, criticized, syntax, grown, sine | underneath, smack, online, hole, clue |
318
+
319
+ ![SHAP — English](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_english.png)
320
+
321
+ ![SHAP — Hinglish](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hinglish.png)
322
+
323
+ ![SHAP — Hindi](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hindi.png)
324
+
325
+ ![SHAP — Full](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_full.png)
326
+
327
+ ### Key Takeaways
328
+
329
+ - **Hindi SHAP values are roughly 10× smaller** than English/Hinglish: GloVe has near-zero Hindi coverage, so the model relies on positional patterns rather than word semantics
330
+ - **Accusatory framing dominates full-dataset hate markers** (`blamed`, `criticized`, plus `advocating` in the English eval): the 50-epoch Full phase learns that hate speech in this corpus often targets victims through blame/accusation rather than direct slurs
331
+ - **"online"** is the most consistent non-hate signal: it ranks among the top non-hate words in both the Hinglish and Full evals, reflecting informational/conversational context
332
+ - **Hinglish markers are semantically coherent** (`arey` = hey/exclamation in abusive context, `punish`, `interior`) despite code-mixing — v2's 50 epochs on Hinglish-first produced stronger Hinglish feature learning than v1
333
+ - **Spurious correlations remain** (`syntax`, `sine`): an inherent limitation of non-contextual GloVe; a contextual model such as BERT would likely reduce these
334
+
335
+ ---
336
+
337
  ## Related
338
 
339
  - **v1 (all 6 strategies, 8 epochs each):** [tuklu/SASC](https://huggingface.co/tuklu/SASC)
SHAP_REPORT.md ADDED
@@ -0,0 +1,238 @@
1
+ # SHAP Explainability Report — Multilingual Hate Speech Detection
2
+
3
+ **Models:** v1 (6 strategies, 8 epochs/phase) + v2 (Hinglish→Hindi→English→Full, 50 epochs/phase)
4
+ **Explainer:** `shap.GradientExplainer` on the post-embedding sub-model (embeddings → BiLSTM → output)
5
+ **Background:** 200 random training samples
6
+ **Test samples per eval:** 500 per language / full set
7
+
8
+ ---
9
+
10
+ ## Table of Contents
11
+ 1. [Methodology](#1-methodology)
12
+ 2. [v1 — All 6 Strategies](#2-v1--all-6-strategies)
13
+ 3. [v2 — Hinglish → Hindi → English → Full (50 epochs)](#3-v2--hinglish--hindi--english--full-50-epochs)
14
+ 4. [Cross-Model Comparison](#4-cross-model-comparison)
15
+ 5. [Key Findings](#5-key-findings)
16
+
17
+ ---
18
+
19
+ ## 1. Methodology
20
+
21
+ ### How SHAP Works Here
22
+
23
+ Gradient-based SHAP explainers need differentiable, floating-point inputs, but a Keras `Embedding` layer takes integer token IDs, and a lookup by integer index has no gradient with respect to its input. We work around this by splitting the model into two parts:
24
+
25
+ ```
26
+ Original model:
27
+ Integer tokens (n, 100) → Embedding → (n, 100, 300) → BiLSTM → Dense → Sigmoid
28
+
29
+ SHAP approach:
30
+ Step 1: Manually look up embeddings: tokens → float embeddings (n, 100, 300)
31
+ Step 2: Run GradientExplainer on sub-model: embeddings → BiLSTM → Dense → Sigmoid
32
+ Step 3: SHAP values shape = (n, 100, 300) — one value per token per embedding dim
33
+ Step 4: Sum across 300 embedding dims → (n, 100) — one value per token position
34
+ Step 5: Map token IDs → words and aggregate mean SHAP per word
35
+ ```
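
Steps 4 and 5 above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used for this report. The helper name `aggregate_word_shap`, the `PAD_ID = 0` convention, and the toy inputs are all assumptions made for the example. With NumPy arrays, step 4 would be the equivalent of summing over the last axis.

```python
from collections import defaultdict

PAD_ID = 0  # assumption: index 0 is the padding token


def aggregate_word_shap(shap_values, token_ids, index_to_word):
    """Collapse per-dimension SHAP values into a mean SHAP value per word.

    shap_values:   nested sequence shaped (n, seq_len, emb_dim)
    token_ids:     nested sequence shaped (n, seq_len) of integer token IDs
    index_to_word: dict mapping token ID -> word
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sample_shap, sample_ids in zip(shap_values, token_ids):
        for position_shap, token_id in zip(sample_shap, sample_ids):
            if token_id == PAD_ID or token_id not in index_to_word:
                continue  # skip padding and IDs with no vocabulary entry
            # Step 4: sum over the embedding dimensions for this position.
            token_score = sum(position_shap)
            # Step 5: accumulate per word so the mean can be taken afterwards.
            word = index_to_word[token_id]
            totals[word] += token_score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


# Toy input (made-up numbers): 2 samples, 3 positions, 2 embedding dims.
toy_shap = [
    [[0.25, 0.25], [-0.5, 0.0], [0.0, 0.0]],
    [[0.5, 0.5], [0.0, 0.0], [0.0, 0.0]],
]
toy_ids = [[1, 2, 0], [1, 0, 0]]
toy_vocab = {1: "blame", 2: "online"}
word_shap = aggregate_word_shap(toy_shap, toy_ids, toy_vocab)
```

In the toy input, `blame` occurs twice with position scores 0.5 and 1.0, so its mean is 0.75, while `online` occurs once with a score of -0.5.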
36
+
37
+ A **positive SHAP value** means the word pushes the prediction toward **hate speech**.
38
+ A **negative SHAP value** means the word pushes the prediction toward **non-hate**.
39
+
40
+ ### Output per Model
41
+
42
+ For each of the 7 models (6 v1 + 1 v2), evaluated on 4 test sets (English, Hindi, Hinglish, Full):
43
+ - **Top-20 word importance bar chart** — `shap_topwords_{lang}.png`
44
+ - **Top-50 words CSV** — `shap_topwords_{lang}.csv`
45
+ - **Summary CSV** — top-5 hate/non-hate words per eval language
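
One plausible way to derive the top hate and non-hate words for the summary CSV is sketched below. It assumes words are ranked by signed mean SHAP value. The report does not state the exact ranking rule, so treat `top_words` and the toy scores as hypothetical.

```python
def top_words(word_shap, k=5):
    """Return (hate_words, non_hate_words), each at most k words long.

    word_shap maps word -> mean SHAP value. Positive values push toward
    hate speech and negative values push toward non-hate.
    """
    ranked = sorted(word_shap.items(), key=lambda item: item[1], reverse=True)
    # Most positive first.
    hate = [word for word, value in ranked if value > 0][:k]
    # Most negative first.
    non_hate = [word for word, value in reversed(ranked) if value < 0][:k]
    return hate, non_hate


# Toy scores (made up for illustration, not taken from the report).
toy_scores = {"blamed": 0.06, "syntax": 0.04, "online": -0.05, "clue": -0.02, "the": 0.0}
hate, non_hate = top_words(toy_scores, k=2)
```

Words with a mean SHAP of exactly zero fall into neither list, which keeps neutral tokens out of both columns.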
46
+
47
+ ---
48
+
49
+ ## 2. v1 — All 6 Strategies
50
+
51
+ ### Strategy 1: English → Hindi → Hinglish
52
+
53
+ | Eval | Top Hate Words | Top Non-Hate Words |
54
+ |---|---|---|
55
+ | English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
56
+ | Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
57
+ | Hinglish | nawaz, dhawan, bashing, shareef, scn | gau, age, rajya, chori, channels |
58
+ | Full | molvi, chalo, molana, scn, elitist | rajya, coding, meat, haan, maine |
59
+
60
+ | Eval | Top Words Plot |
61
+ |---|---|
62
+ | English | ![](shap/english_to_hindi_to_hinglish/shap_topwords_english.png) |
63
+ | Hindi | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hindi.png) |
64
+ | Hinglish | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hinglish.png) |
65
+ | Full | ![](shap/english_to_hindi_to_hinglish/shap_topwords_full.png) |
66
+
67
+ ---
68
+
69
+ ### Strategy 2: English → Hinglish → Hindi
70
+
71
+ | Eval | Top Hate Words | Top Non-Hate Words |
72
+ |---|---|---|
73
+ | English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
74
+ | Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
75
+ | Hinglish | khi, dada, kiske, srk, chalo | gau, tk, online, liberty, taraf |
76
+ | Full | dada, roj, shamelessness, tujhe, epic | akash, sapna, proud, buddy, episcopal |
77
+
78
+ | Eval | Top Words Plot |
79
+ |---|---|
80
+ | English | ![](shap/english_to_hinglish_to_hindi/shap_topwords_english.png) |
81
+ | Hindi | ![](shap/english_to_hinglish_to_hindi/shap_topwords_hindi.png) |
82
+ | Hinglish | ![](shap/english_to_hinglish_to_hindi/shap_topwords_hinglish.png) |
83
+ | Full | ![](shap/english_to_hinglish_to_hindi/shap_topwords_full.png) |
84
+
85
+ ---
86
+
87
+ ### Strategy 3: Hindi → English → Hinglish ⭐ Best Model (v1)
88
+
89
+ | Eval | Top Hate Words | Top Non-Hate Words |
90
+ |---|---|---|
91
+ | English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
92
+ | Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
93
+ | Hinglish | bacchi, bull, srk, bahana, behan | madrassa, zaida, gdp, bech, nd |
94
+ | Full | skua, brut, cleansing, captaincy, baar | taraf, pussy, directory, quran, kaha |
95
+
96
+ | Eval | Top Words Plot |
97
+ |---|---|
98
+ | English | ![](shap/hindi_to_english_to_hinglish/shap_topwords_english.png) |
99
+ | Hindi | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hindi.png) |
100
+ | Hinglish | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hinglish.png) |
101
+ | Full | ![](shap/hindi_to_english_to_hinglish/shap_topwords_full.png) |
102
+
103
+ ---
104
+
105
+ ### Strategy 4: Hindi → Hinglish → English
106
+
107
+ | Eval | Top Hate Words | Top Non-Hate Words |
108
+ |---|---|---|
109
+ | English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
110
+ | Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
111
+ | Hinglish | tatti, sham, dino, roko, krk | lac, online, ancestor, zaida, target |
112
+ | Full | ahh, moi, bj, fault, pan | asperger, hundred, database, wicked, nam |
113
+
114
+ | Eval | Top Words Plot |
115
+ |---|---|
116
+ | English | ![](shap/hindi_to_hinglish_to_english/shap_topwords_english.png) |
117
+ | Hindi | ![](shap/hindi_to_hinglish_to_english/shap_topwords_hindi.png) |
118
+ | Hinglish | ![](shap/hindi_to_hinglish_to_english/shap_topwords_hinglish.png) |
119
+ | Full | ![](shap/hindi_to_hinglish_to_english/shap_topwords_full.png) |
120
+
121
+ ---
122
+
123
+ ### Strategy 5: Hinglish → English → Hindi
124
+
125
+ | Eval | Top Hate Words | Top Non-Hate Words |
126
+ |---|---|---|
127
+ | English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
128
+ | Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
129
+ | Hinglish | huye, dada, abb, arey, abduction | rajya, bahu, parliament, code, music |
130
+ | Full | skua, praised, spic, sabse, plz | liberty, languages, speaks, bache, maine |
131
+
132
+ | Eval | Top Words Plot |
133
+ |---|---|
134
+ | English | ![](shap/hinglish_to_english_to_hindi/shap_topwords_english.png) |
135
+ | Hindi | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hindi.png) |
136
+ | Hinglish | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hinglish.png) |
137
+ | Full | ![](shap/hinglish_to_english_to_hindi/shap_topwords_full.png) |
138
+
139
+ ---
140
+
141
+ ### Strategy 6: Hinglish → Hindi → English
142
+
143
+ | Eval | Top Hate Words | Top Non-Hate Words |
144
+ |---|---|---|
145
+ | English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
146
+ | Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
147
+ | Hinglish | fav, janab, chori, cum, ruk | online, gau, dehli, 2017, rajya |
148
+ | Full | srk, roj, rhi, purana, aapke | nest, maine, hone, haired, barrel |
149
+
150
+ | Eval | Top Words Plot |
151
+ |---|---|
152
+ | English | ![](shap/hinglish_to_hindi_to_english/shap_topwords_english.png) |
153
+ | Hindi | ![](shap/hinglish_to_hindi_to_english/shap_topwords_hindi.png) |
154
+ | Hinglish | ![](shap/hinglish_to_hindi_to_english/shap_topwords_hinglish.png) |
155
+ | Full | ![](shap/hinglish_to_hindi_to_english/shap_topwords_full.png) |
156
+
157
+ ---
158
+
159
+ ## 3. v2 — Hinglish → Hindi → English → Full (50 epochs)
160
+
161
+ | Eval | Top Hate Words | Top Non-Hate Words |
162
+ |---|---|---|
163
+ | English | nas, fags, sicko, sabotage, advocating | grow, barrel, homosexual, pak, join |
164
+ | Hindi | वादा, वैज्ञानिकों, ऐ, उतारा, गला | जीतेगा, घोंटने, जिहादी, आपत्तिजनक, चमचो |
165
+ | Hinglish | arey, bahir, punish, papa, interior | online, member, mam, messages, asha |
166
+ | Full | blamed, criticized, syntax, grown, sine | underneath, smack, online, hole, clue |
167
+
168
+ | Eval | Top Words Plot |
169
+ |---|---|
170
+ | English | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_english.png) |
171
+ | Hindi | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hindi.png) |
172
+ | Hinglish | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hinglish.png) |
173
+ | Full | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_full.png) |
174
+
175
+ ---
176
+
177
+ ## 4. Cross-Model Comparison
178
+
179
+ Words that appear in the top-10 SHAP list of at least 3 models, side-by-side across all strategies:
180
+
181
+ **English test set:**
182
+ ![](shap/cross_model_comparison_english.png)
183
+
184
+ **Hindi test set:**
185
+ ![](shap/cross_model_comparison_hindi.png)
186
+
187
+ **Hinglish test set:**
188
+ ![](shap/cross_model_comparison_hinglish.png)
189
+
190
+ **Full test set:**
191
+ ![](shap/cross_model_comparison_full.png)
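
The "at least 3 models" filter can be sketched as a simple count over per-model word lists. The function name `shared_top_words` is hypothetical. The input below uses the top-5 Hinglish non-hate words from the tables in sections 2 and 3, not the top-10 lists behind the plots, so the result is only indicative.

```python
from collections import Counter


def shared_top_words(top_lists, min_models=3):
    """Return the words that appear in the top list of at least min_models models."""
    counts = Counter()
    for words in top_lists.values():
        counts.update(set(words))  # count each word once per model
    return sorted(word for word, n in counts.items() if n >= min_models)


# Top-5 Hinglish non-hate words per model, copied from the tables above.
hinglish_non_hate = {
    "v1_strategy_1": ["gau", "age", "rajya", "chori", "channels"],
    "v1_strategy_2": ["gau", "tk", "online", "liberty", "taraf"],
    "v1_strategy_3": ["madrassa", "zaida", "gdp", "bech", "nd"],
    "v1_strategy_4": ["lac", "online", "ancestor", "zaida", "target"],
    "v1_strategy_5": ["rajya", "bahu", "parliament", "code", "music"],
    "v1_strategy_6": ["online", "gau", "dehli", "2017", "rajya"],
    "v2": ["online", "member", "mam", "messages", "asha"],
}
shared = shared_top_words(hinglish_non_hate)
```

On these top-5 lists the shared words are `gau`, `online`, and `rajya`.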
192
+
193
+ ---
194
+
195
+ ## 5. Key Findings
196
+
197
+ ### 5.1 SHAP Magnitude Reveals Language Confidence
198
+
199
+ Hindi SHAP values are consistently **one order of magnitude smaller** than English and Hinglish:
200
+
201
+ | Language | Typical Top SHAP | Interpretation |
202
+ |---|---|---|
203
+ | English | 0.03 – 0.08 | Model is confident — GloVe has rich English vectors |
204
+ | Hinglish | 0.03 – 0.07 | Model learned strong patterns despite OOV words |
205
+ | Hindi | 0.002 – 0.005 | Model is uncertain — most Hindi tokens have zero GloVe vectors |
206
+
207
+ This is consistent with the lower accuracy and F1 on Hindi across all models.
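
As a rough check on the order-of-magnitude claim, the midpoints of the ranges in the table can be compared directly. The midpoint is an assumption chosen for illustration; the report does not define "typical" more precisely.

```python
# Typical top-SHAP ranges, copied from the table above.
ranges = {
    "English": (0.03, 0.08),
    "Hinglish": (0.03, 0.07),
    "Hindi": (0.002, 0.005),
}


def midpoint(bounds):
    low, high = bounds
    return (low + high) / 2


hindi_mid = midpoint(ranges["Hindi"])
# Ratio of each language's midpoint to the Hindi midpoint.
ratios = {
    lang: midpoint(bounds) / hindi_mid
    for lang, bounds in ranges.items()
    if lang != "Hindi"
}
```

Both ratios come out at roughly 14 to 16, in line with the order-of-magnitude gap described above.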
208
+
209
+ ### 5.2 Consistent Non-Hate Signals Across Models
210
+
211
+ The word **"online"** (negative SHAP) and **"rajya"** (state/parliament, negative SHAP) appear as top non-hate predictors in 4 out of 6 v1 models and v2. These represent informational/political discussion context that the model correctly distinguishes from targeted hate.
212
+
213
+ ### 5.3 Hate Speech Markers Are Linguistically Coherent
214
+
215
+ - **English:** Direct slurs (`spic`, `coon`), violence language (`massacres`, `cleansing`), accusatory verbs (`blame`, `blaming`, `blamed`, `criticized`) — consistent with how hate speech presents in English social media
216
+ - **Hinglish:** Relationship insults (`behan` — sister, used in abusive context), aggressive interjections (`arey`, `abb`, `ruk`), names in hate context (`srk`, `dada`) — reflects code-mixed abuse patterns
217
+ - **Hindi:** Body/violence metaphors (`गला` — throat, as in strangle; `मूर्ख` — fool) and political provocations (`भूमिपूजन` — ground-breaking ceremony, polarising event)
218
+
219
+ ### 5.4 Spurious Correlations Are Visible
220
+
221
+ Several high-SHAP words are clearly spurious:
222
+ - **"syntax"** and **"sine"** as hate markers in the v2 full eval, and **"skua"** in the v1 full eval (Strategies 3 and 5): rare words the model overfits to in specific hateful contexts rather than learning the word's meaning
223
+ - **"homosexual"** as non-hate in v2 — appears in informational/news articles in the dataset rather than targeted slurs
224
+ - **"ahh"** appearing as hate in multiple models — likely a noise/exclamation pattern co-occurring with aggressive text
225
+
226
+ These spurious correlations are expected limitations of GloVe + BiLSTM — without contextual embeddings (e.g. BERT), the model cannot distinguish word meaning from co-occurrence patterns.
227
+
228
+ ### 5.5 v1 vs v2 Comparison
229
+
230
+ | Aspect | v1 (8 epochs) | v2 (50 epochs) |
231
+ |---|---|---|
232
+ | English SHAP range | 0.03–0.07 | 0.02–0.06 |
233
+ | Hinglish SHAP range | 0.03–0.57 | 0.04–0.46 |
234
+ | Hindi SHAP range | 0.001–0.005 | 0.003–0.007 |
235
+ | English hate markers | Varied, some spurious | More direct: sicko, fags, sabotage, advocating |
236
+ | Full eval hate markers | Mixed language words | Accusatory framing: blamed, criticized |
237
+
238
+ v2's longer training produces slightly more semantically coherent English hate markers. In the Full eval, v2 notably surfaces **accusatory framing words** (`blamed`, `criticized`) as hate predictors, with `advocating` playing a similar role in the English eval. This suggests that hate speech in the combined corpus often frames targets through blame or accusation rather than direct slurs.