tuklu committed on
Commit 1f323e6 · verified · 1 Parent(s): d07fd6f

Add v2 SHAP report — isolated, full methodology

Files changed (1)
  1. SHAP_REPORT.md +91 -170
SHAP_REPORT.md CHANGED
@@ -1,238 +1,159 @@
1
- # SHAP Explainability Report — Multilingual Hate Speech Detection
2
 
3
- **Models:** v1 (6 strategies, 8 epochs/phase) + v2 (Hinglish→Hindi→English→Full, 50 epochs/phase)
4
- **Explainer:** `shap.GradientExplainer` on embedding sub-model (BiLSTM → output)
5
- **Background:** 200 random training samples
6
- **Test samples per eval:** 500 per language / full set
7
 
8
  ---
9
 
10
  ## Table of Contents
11
- 1. [Methodology](#1-methodology)
12
- 2. [v1 — All 6 Strategies](#2-v1--all-6-strategies)
13
- 3. [v2 — Hinglish → Hindi → English → Full](#3-v2--hinglish--hindi--english--full)
14
- 4. [Cross-Model Comparison](#4-cross-model-comparison)
15
- 5. [Key Findings](#5-key-findings)
16
 
17
  ---
18
 
19
- ## 1. Methodology
20
 
21
- ### How SHAP Works Here
22
 
23
- Standard SHAP cannot directly handle a Keras `Embedding` layer with integer token inputs because gradients cannot flow through integer operations. We solve this by splitting the model into two parts:
24
 
25
- ```
26
- Original model:
27
- Integer tokens (n, 100) → Embedding → (n, 100, 300) → BiLSTM → Dense → Sigmoid
28
-
29
- SHAP approach:
30
- Step 1: Manually look up embeddings: tokens → float embeddings (n, 100, 300)
31
- Step 2: Run GradientExplainer on sub-model: embeddings → BiLSTM → Dense → Sigmoid
32
- Step 3: SHAP values shape = (n, 100, 300) — one value per token per embedding dim
33
- Step 4: Sum across 300 embedding dims → (n, 100) — one value per token position
34
- Step 5: Map token IDs → words and aggregate mean SHAP per word
35
- ```
36
 
37
- A **positive SHAP value** means the word pushes the prediction toward **hate speech**.
38
- A **negative SHAP value** means the word pushes the prediction toward **non-hate**.
39
 
40
- ### Output per Model
41
 
42
- For each of the 7 models (6 v1 + 1 v2), evaluated on 4 test sets (English, Hindi, Hinglish, Full):
43
- - **Top-20 word importance bar chart** — `shap_topwords_{lang}.png`
44
- - **Top-50 words CSV** — `shap_topwords_{lang}.csv`
45
- - **Summary CSV** — top-5 hate/non-hate words per eval language
46
 
47
- ---
 
 
48
 
49
- ## 2. v1 — All 6 Strategies
50
 
51
- ### Strategy 1: English → Hindi → Hinglish
52
 
53
- | Eval | Top Hate Words | Top Non-Hate Words |
54
- |---|---|---|
55
- | English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
56
- | Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
57
- | Hinglish | nawaz, dhawan, bashing, shareef, scn | gau, age, rajya, chori, channels |
58
- | Full | molvi, chalo, molana, scn, elitist | rajya, coding, meat, haan, maine |
59
 
60
- | Eval | Top Words Plot |
61
- |---|---|
62
- | English | ![](shap/english_to_hindi_to_hinglish/shap_topwords_english.png) |
63
- | Hindi | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hindi.png) |
64
- | Hinglish | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hinglish.png) |
65
- | Full | ![](shap/english_to_hindi_to_hinglish/shap_topwords_full.png) |
66
 
67
  ---
68
 
69
- ### Strategy 2: English → Hinglish → Hindi
70
 
71
- | Eval | Top Hate Words | Top Non-Hate Words |
72
- |---|---|---|
73
- | English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
74
- | Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
75
- | Hinglish | khi, dada, kiske, srk, chalo | gau, tk, online, liberty, taraf |
76
- | Full | dada, roj, shamelessness, tujhe, epic | akash, sapna, proud, buddy, episcopal |
77
 
78
- | Eval | Top Words Plot |
79
- |---|---|
80
- | English | ![](shap/english_to_hinglish_to_hindi/shap_topwords_english.png) |
81
- | Hindi | ![](shap/english_to_hinglish_to_hindi/shap_topwords_hindi.png) |
82
- | Hinglish | ![](shap/english_to_hinglish_to_hindi/shap_topwords_hinglish.png) |
83
- | Full | ![](shap/english_to_hinglish_to_hindi/shap_topwords_full.png) |
84
 
85
- ---
 
86
 
87
- ### Strategy 3: Hindi → English → Hinglish ⭐ Best Model (v1)
 
 
 
 
88
 
89
- | Eval | Top Hate Words | Top Non-Hate Words |
90
- |---|---|---|
91
- | English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
92
- | Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
93
- | Hinglish | bacchi, bull, srk, bahana, behan | madrassa, zaida, gdp, bech, nd |
94
- | Full | skua, brut, cleansing, captaincy, baar | taraf, pussy, directory, quran, kaha |
95
 
96
- | Eval | Top Words Plot |
97
- |---|---|
98
- | English | ![](shap/hindi_to_english_to_hinglish/shap_topwords_english.png) |
99
- | Hindi | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hindi.png) |
100
- | Hinglish | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hinglish.png) |
101
- | Full | ![](shap/hindi_to_english_to_hinglish/shap_topwords_full.png) |
102
 
103
  ---
104
 
105
- ### Strategy 4: Hindi → Hinglish → English
106
 
107
- | Eval | Top Hate Words | Top Non-Hate Words |
108
- |---|---|---|
109
- | English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
110
- | Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
111
- | Hinglish | tatti, sham, dino, roko, krk | lac, online, ancestor, zaida, target |
112
- | Full | ahh, moi, bj, fault, pan | asperger, hundred, database, wicked, nam |
113
 
114
- | Eval | Top Words Plot |
115
  |---|---|
116
- | English | ![](shap/hindi_to_hinglish_to_english/shap_topwords_english.png) |
117
- | Hindi | ![](shap/hindi_to_hinglish_to_english/shap_topwords_hindi.png) |
118
- | Hinglish | ![](shap/hindi_to_hinglish_to_english/shap_topwords_hinglish.png) |
119
- | Full | ![](shap/hindi_to_hinglish_to_english/shap_topwords_full.png) |
120
 
121
- ---
122
 
123
- ### Strategy 5: Hinglish → English → Hindi
124
 
125
- | Eval | Top Hate Words | Top Non-Hate Words |
126
- |---|---|---|
127
- | English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
128
- | Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
129
- | Hinglish | huye, dada, abb, arey, abduction | rajya, bahu, parliament, code, music |
130
- | Full | skua, praised, spic, sabse, plz | liberty, languages, speaks, bache, maine |
131
 
132
- | Eval | Top Words Plot |
133
  |---|---|
134
- | English | ![](shap/hinglish_to_english_to_hindi/shap_topwords_english.png) |
135
- | Hindi | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hindi.png) |
136
- | Hinglish | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hinglish.png) |
137
- | Full | ![](shap/hinglish_to_english_to_hindi/shap_topwords_full.png) |
138
 
139
- ---
140
 
141
- ### Strategy 6: Hinglish → Hindi → English
142
 
143
- | Eval | Top Hate Words | Top Non-Hate Words |
144
- |---|---|---|
145
- | English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
146
- | Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
147
- | Hinglish | fav, janab, chori, cum, ruk | online, gau, dehli, 2017, rajya |
148
- | Full | srk, roj, rhi, purana, aapke | nest, maine, hone, haired, barrel |
149
-
150
- | Eval | Top Words Plot |
151
- |---|---|
152
- | English | ![](shap/hinglish_to_hindi_to_english/shap_topwords_english.png) |
153
- | Hindi | ![](shap/hinglish_to_hindi_to_english/shap_topwords_hindi.png) |
154
- | Hinglish | ![](shap/hinglish_to_hindi_to_english/shap_topwords_hinglish.png) |
155
- | Full | ![](shap/hinglish_to_hindi_to_english/shap_topwords_full.png) |
156
 
157
  ---
158
 
159
- ## 3. v2 — Hinglish → Hindi → English → Full (50 epochs)
160
-
161
- | Eval | Top Hate Words | Top Non-Hate Words |
162
- |---|---|---|
163
- | English | nas, fags, sicko, sabotage, advocating | grow, barrel, homosexual, pak, join |
164
- | Hindi | वादा, वैज्ञानिकों, ऐ, उतारा, गला | जीतेगा, घोंटने, जिहादी, आपत्तिजनक, चमचो |
165
- | Hinglish | arey, bahir, punish, papa, interior | online, member, mam, messages, asha |
166
- | Full | blamed, criticized, syntax, grown, sine | underneath, smack, online, hole, clue |
167
 
168
- | Eval | Top Words Plot |
169
  |---|---|
170
- | English | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_english.png) |
171
- | Hindi | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hindi.png) |
172
- | Hinglish | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hinglish.png) |
173
- | Full | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_full.png) |
174
 
175
- ---
176
 
177
- ## 4. Cross-Model Comparison
178
 
179
- Words that appear in the top-10 SHAP list of at least 3 models, side-by-side across all strategies:
180
 
181
- **English test set:**
182
- ![](shap/cross_model_comparison_english.png)
183
 
184
- **Hindi test set:**
185
- ![](shap/cross_model_comparison_hindi.png)
 
186
 
187
- **Hinglish test set:**
188
- ![](shap/cross_model_comparison_hinglish.png)
189
 
190
- **Full test set:**
191
- ![](shap/cross_model_comparison_full.png)
192
 
193
  ---
194
 
195
- ## 5. Key Findings
196
-
197
- ### 5.1 SHAP Magnitude Reveals Language Confidence
198
 
199
- Hindi SHAP values are consistently **one order of magnitude smaller** than English and Hinglish:
200
 
201
- | Language | Typical Top SHAP | Interpretation |
202
  |---|---|---|
203
- | English | 0.03 – 0.08 | Model is confident — GloVe has rich English vectors |
204
- | Hinglish | 0.03 – 0.07 | Model learned strong patterns despite OOV words |
205
- | Hindi | 0.002 – 0.005 | Model is uncertain — most Hindi tokens have zero GloVe vectors |
206
 
207
- This directly explains the lower accuracy and F1 on Hindi across all models.
208
 
209
- ### 5.2 Consistent Non-Hate Signals Across Models
210
 
211
- The word **"online"** (negative SHAP) and **"rajya"** (state/parliament, negative SHAP) appear as top non-hate predictors in 4 out of 6 v1 models and v2. These represent informational/political discussion context that the model correctly distinguishes from targeted hate.
 
 
 
212
 
213
- ### 5.3 Hate Speech Markers Are Linguistically Coherent
214
 
215
- - **English:** Direct slurs (`spic`, `coon`), violence language (`massacres`, `cleansing`), accusatory verbs (`blame`, `blaming`, `blamed`, `criticized`) — consistent with how hate speech presents in English social media
216
- - **Hinglish:** Relationship insults (`behan` — sister, used in abusive context), aggressive interjections (`arey`, `abb`, `ruk`), names in hate context (`srk`, `dada`) — reflects code-mixed abuse patterns
217
- - **Hindi:** Body/violence metaphors (`गला` — throat, as in strangle; `मूर्ख` — fool) and political provocations (`भूमिपूजन` — ground-breaking ceremony, polarising event)
218
 
219
- ### 5.4 Spurious Correlations Are Visible
220
 
221
- Several high-SHAP words are clearly spurious:
222
- - **"syntax", "sine", "skua"** as hate markers in v2 full eval — rare words the model overfits to in specific hateful contexts rather than learning the word's meaning
223
- - **"homosexual"** as non-hate in v2 — appears in informational/news articles in the dataset rather than targeted slurs
224
- - **"ahh"** appearing as hate in multiple models — likely a noise/exclamation pattern co-occurring with aggressive text
225
-
226
- These spurious correlations are expected limitations of GloVe + BiLSTM — without contextual embeddings (e.g. BERT), the model cannot distinguish word meaning from co-occurrence patterns.
227
-
228
- ### 5.5 v1 vs v2 Comparison
229
-
230
- | Aspect | v1 (8 epochs) | v2 (50 epochs) |
231
- |---|---|---|
232
- | English SHAP range | 0.03–0.07 | 0.02–0.06 |
233
- | Hinglish SHAP range | 0.03–0.57 | 0.04–0.46 |
234
- | Hindi SHAP range | 0.001–0.005 | 0.003–0.007 |
235
- | English hate markers | Varied, some spurious | More direct: sicko, fags, sabotage, advocating |
236
- | Full eval hate markers | Mixed language words | Accusatory framing: blamed, criticized |
237
 
238
- v2's longer training produces slightly more semantically coherent English hate markers. The full-dataset phase in v2 notably produces **accusatory framing words** (`blamed`, `criticized`, `grown`, `advocating`) as hate predictors — reflecting that hate speech in the combined corpus often frames targets through blame/accusation rather than direct slurs.
 
1
+ # SHAP Explainability Report — Multilingual Hate Speech Detection (v2)
2
 
3
+ **Model:** Hinglish → Hindi → English → Full (GloVe + BiLSTM, 50 epochs/phase, 200 epochs total)
4
+ **Explainer:** `shap.GradientExplainer` on the BiLSTM sub-model
5
+ **Background:** 200 randomly sampled training sequences
6
+ **Test samples:** 500 per language (random subset), full test set = 8,852
7
 
8
  ---
9
 
10
  ## Table of Contents
11
+ 1. [What SHAP Is and How It Works Here](#1-what-shap-is-and-how-it-works-here)
12
+ 2. [How the Top-20 Words Are Selected](#2-how-the-top-20-words-are-selected)
13
+ 3. [Results by Evaluation Language](#3-results-by-evaluation-language)
14
+ 4. [Key Findings](#4-key-findings)
 
15
 
16
  ---
17
 
18
+ ## 1. What SHAP Is and How It Works Here
19
 
20
+ **SHAP (SHapley Additive exPlanations)** answers the question: *"For this specific prediction, how much did each word contribute to the output?"*
21
 
22
+ It does this using game theory — it treats each word as a "player" and fairly distributes the model's final prediction score among all the words in the input.
23
 
24
+ ### The Technical Challenge
 
 
 
 
 
 
 
 
 
 
25
 
26
+ Our model takes **integer token IDs** as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each word's influence. But you cannot take a gradient through an integer operation.
 
27
 
28
+ **Solution — split the model into two steps:**
29
 
30
+ ```
31
+ Step 1 (outside SHAP):
32
+ Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300)
 
33
 
34
+ Step 2 (SHAP runs here):
35
+ Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)
36
+ ```
37
 
38
+ SHAP's `GradientExplainer` runs on Step 2. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then **sum across all 300 dimensions** for each position to get one number per word — its overall contribution.
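The lookup-and-sum plumbing above can be sketched in NumPy. This is a shape-level illustration only: the sizes mirror the report (100 tokens, 300 GloVe dimensions), but the embedding matrix is random, and a random array stands in for what `shap.GradientExplainer(sub_model, background).shap_values(embedded)` would return for the trained sub-model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, seq_len, emb_dim, vocab = 4, 100, 300, 5000  # illustrative sizes

# Step 1 (outside SHAP): manual embedding lookup, integer IDs -> float vectors.
embedding_matrix = rng.normal(size=(vocab, emb_dim)).astype("float32")
tokens = rng.integers(1, vocab, size=(n, seq_len))   # (n, 100) token IDs
embedded = embedding_matrix[tokens]                  # (n, 100, 300) floats

# Step 2 (SHAP): GradientExplainer would return values shaped like `embedded`;
# a random stand-in is used here so the snippet runs without a trained model.
shap_values = rng.normal(scale=0.01, size=embedded.shape)

# Sum over the 300 embedding dimensions: one contribution per token position.
per_token = shap_values.sum(axis=-1)
print(per_token.shape)   # (4, 100)
```

The key point is the advanced-indexing line `embedding_matrix[tokens]`: it reproduces exactly what the Keras `Embedding` layer does, but as a float operation that gradients can flow through.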
39
 
40
+ ### What the Numbers Mean
41
 
42
+ - **Positive SHAP value** → word pushes prediction **toward hate speech** (output closer to 1.0)
43
+ - **Negative SHAP value** → word pushes prediction **toward non-hate** (output closer to 0.0)
44
+ - **Magnitude** → how strongly that word influences the decision
 
 
 
45
 
46
+ ### Background Dataset
47
+
48
+ The explainer needs a "baseline" — what does the model output with no meaningful input? We use 200 randomly sampled training sequences as this baseline. SHAP measures each word's contribution *relative to this average*.
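Drawing such a baseline is a one-liner. The array below is an illustrative stand-in for the tokenized training set (names are not from the project code); in the actual pipeline these sequences would then go through the Step 1 embedding lookup before being handed to the explainer.

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for the tokenized training set: 10,000 sequences of 100 token IDs.
train_tokens = rng.integers(0, 5000, size=(10_000, 100))

# 200 distinct, randomly chosen training sequences as the SHAP baseline.
idx = rng.choice(len(train_tokens), size=200, replace=False)
background = train_tokens[idx]
print(background.shape)   # (200, 100)
```

`replace=False` matters: it guarantees 200 distinct sequences, so no single training example is over-weighted in the baseline average.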
 
 
 
49
 
50
  ---
51
 
52
+ ## 2. How the Top-20 Words Are Selected
53
 
54
+ This is **not random**. Here is the exact process:
 
 
 
 
 
55
 
56
+ **Step 1 — Run SHAP on the test set**
57
+ For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence → 100 SHAP values.
 
 
 
 
58
 
59
+ **Step 2 — Aggregate by word across all examples**
60
+ For each unique word appearing in those 500 examples, compute the **mean SHAP value** across all its occurrences:
61
 
62
+ ```
63
+ word "blamed" appears 31 times across 500 examples
64
+ SHAP values: [+0.061, +0.058, +0.072, ..., +0.055]
65
+ Mean SHAP = sum / 31 = +0.064
66
+ ```
67
 
68
+ **Step 3 — Rank by absolute mean SHAP**
69
+ Words sorted by `|mean SHAP|` — captures both strong hate predictors (large positive) and strong non-hate predictors (large negative).
 
 
 
 
70
 
71
+ **Step 4 — Show top 20**
72
+ The 20 words with the highest `|mean SHAP|` are plotted:
73
+ - 🔴 Red bars extending right = pushes toward **hate**
74
+ - 🔵 Blue bars extending left = pushes toward **non-hate**
75
+
76
+ Padding tokens (ID=0) are always excluded. Only real words contribute.
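Steps 2–4 above can be sketched as a small helper. The function name `top_words`, the `id_to_word` mapping, and the toy inputs are illustrative, not project code; `per_token_shap` is assumed to be the per-position values produced in Step 1.

```python
from collections import defaultdict

def top_words(token_ids, per_token_shap, id_to_word, k=20):
    """Mean SHAP per unique word, ranked by |mean|; padding (ID 0) excluded."""
    sums, counts = defaultdict(float), defaultdict(int)
    for seq, vals in zip(token_ids, per_token_shap):
        for tok, v in zip(seq, vals):
            if tok == 0:          # padding token, never a real word
                continue
            sums[tok] += v
            counts[tok] += 1
    means = {id_to_word[t]: sums[t] / counts[t] for t in sums}
    # Rank by absolute mean: strong hate AND strong non-hate words both surface.
    return sorted(means.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

# Toy check: "blamed" (ID 2) pushes toward hate, "online" (ID 3) away from it.
vocab = {1: "the", 2: "blamed", 3: "online"}
seqs = [[2, 3, 1, 0], [2, 1, 3, 0]]
shap_vals = [[0.06, -0.04, 0.001, 0.0], [0.07, 0.002, -0.05, 0.0]]
ranked = top_words(seqs, shap_vals, vocab, k=2)
print([w for w, _ in ranked])   # ['blamed', 'online']
```

Sorting by `abs(mean)` rather than `mean` is what lets the plot show red (positive) and blue (negative) bars in the same top-20 list.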
77
 
78
  ---
79
 
80
+ ## 3. Results by Evaluation Language
81
 
82
+ ### 3.1 English Test Set
 
 
 
 
 
83
 
84
+ | Top Hate Words | Top Non-Hate Words |
85
  |---|---|
86
+ | nas, fags, sicko, sabotage, advocating | grow, barrel, homosexual, pak, join |
 
 
 
87
 
88
+ ![SHAP Top Words — English](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_english.png)
89
 
90
+ The English hate markers skew toward **explicit aggression** (`sicko`, `fags`) and **intentional framing** (`sabotage`, `advocating`) — the 50-epoch English phase produced sharper English feature learning than shorter-training models. The appearance of `homosexual` as a non-hate word reflects that this term appears more in informational/news text in the dataset than in targeted slurs.
91
 
92
+ ---
93
+
94
+ ### 3.2 Hindi Test Set
 
 
 
95
 
96
+ | Top Hate Words | Top Non-Hate Words |
97
  |---|---|
98
+ | वादा, वैज्ञानिकों, ऐ, उतारा, गला | जीतेगा, घोंटने, जिहादी, आपत्तिजनक, चमचो |
 
 
 
99
 
100
+ ![SHAP Top Words — Hindi](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hindi.png)
101
 
102
+ **Note on bar size:** Hindi SHAP values are approximately 10× smaller than English (range ~0.003–0.007 vs ~0.02–0.06). This is not a model failure — it reflects that GloVe was trained on English text and has near-zero vectors for Devanagari words. The model makes Hindi predictions from positional/sequential patterns rather than word semantics. The words `गला` (throat — as in "गला घोंटना", strangle) and `ऐ` (hey — aggressive address) showing as hate markers suggest the model did capture some Hindi-specific abuse patterns through the 50-epoch Hinglish-first training.
103
 
104
+ Interestingly, `जिहादी` (jihadist) appears as a **non-hate** predictor — this likely reflects news/reporting usage in the dataset rather than targeted hate, and shows how context-free embeddings can produce counterintuitive results.
 
 
 
 
 
 
 
 
 
 
 
 
105
 
106
  ---
107
 
108
+ ### 3.3 Hinglish Test Set
 
 
 
 
 
 
 
109
 
110
+ | Top Hate Words | Top Non-Hate Words |
111
  |---|---|
112
+ | arey, bahir, punish, papa, interior | online, member, mam, messages, asha |
 
 
 
113
 
114
+ ![SHAP Top Words — Hinglish](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hinglish.png)
115
 
116
+ The Hinglish markers are notably **semantically coherent** compared to v1 (8-epoch) models. `arey` is a Hindi-derived exclamation almost always used in confrontational tone in social media; `bahir` (outside/get out) reflects exclusionary language; `punish` as a verb signals threat. The non-hate side shows platform/conversation context (`online`, `messages`, `mam`). Starting on Hinglish for 50 epochs produced the strongest Hinglish-specific feature learning of any model trained in this project.
117
 
118
+ ---
119
 
120
+ ### 3.4 Full Test Set
 
121
 
122
+ | Top Hate Words | Top Non-Hate Words |
123
+ |---|---|
124
+ | blamed, criticized, syntax, grown, sine | underneath, smack, online, hole, clue |
125
 
126
+ ![SHAP Top Words — Full](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_full.png)
 
127
 
128
+ The full-dataset view shows **accusatory framing** (`blamed`, `criticized`) as the dominant hate signal — more so than in any v1 model. The 50-epoch Full phase appears to have learned that hate speech in this multilingual corpus characteristically targets victims through blame and accusation, regardless of language. The spurious markers `syntax` and `sine` (very rare technical words) are outliers that co-occurred with hate content in specific training examples.
 
129
 
130
  ---
131
 
132
+ ## 4. Key Findings
 
 
133
 
134
+ ### SHAP Magnitude Reflects Language Confidence
135
 
136
+ | Language | Typical Top \|SHAP\| | Why |
137
  |---|---|---|
138
+ | English | 0.02 – 0.06 | GloVe covers English comprehensively — strong, directional gradients |
139
+ | Hinglish | 0.04 – 0.46 | Roman-script words partially covered; model learned strong code-mixed patterns from Phase 1 |
140
+ | Hindi | 0.003 – 0.007 | Devanagari script is OOV in GloVe — near-zero vectors produce tiny gradients |
141
 
142
+ Hindi confidence is 10× lower than English. This is a fundamental limitation of using GloVe (English-trained) for multilingual text. A multilingual model (e.g. mBERT, MuRIL) would show much more balanced confidence across languages.
143
 
144
+ ### Effect of 50-Epoch Training vs 8-Epoch
145
 
146
+ Compared to the equivalent v1 strategy (Hinglish → Hindi → English, 8 epochs), the v2 model shows:
147
+ - **More directional Hinglish markers** — `arey`, `bahir`, `punish` are contextually coherent; v1 had noisier Hinglish top words
148
+ - **Accusatory framing as primary hate signal** — `blamed`, `criticized` in full eval rather than random rare words; deeper training on the Full phase produced this generalisation
149
+ - **Similar English markers** — both models converge on English hate vocabulary after the English phase; the difference is mainly in non-English languages
150
 
151
+ ### Consistent Non-Hate Signal
152
 
153
+ **"online"** is the most stable non-hate predictor across all four evaluation sets. It appears in informational, conversational, and reporting contexts in all three languages — the model correctly identifies this as a non-toxic context word.
 
 
154
 
155
+ ### Spurious Correlations
156
 
157
+ `syntax` and `sine` appearing as hate markers in the full evaluation are clear spurious correlations — rare technical words that happened to co-occur with hateful content in a small number of training examples. The BiLSTM cannot distinguish this from meaningful signal because GloVe has no semantic grounding for these words in a hate/non-hate context.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
158
 
159
+ This is the primary motivation for contextual models: BERT-based approaches would represent the same word differently depending on surrounding context, eliminating most spurious co-occurrence patterns.