tuklu committed
Commit 5f3cea2 · verified · 1 parent: 3c5740e

Update SHAP report — v1 isolated, full methodology

Files changed (1): SHAP_REPORT.md (+108 -93)
SHAP_REPORT.md CHANGED
@@ -1,56 +1,91 @@
1
- # SHAP Explainability Report — Multilingual Hate Speech Detection
2
 
3
- **Models:** v1 (6 strategies, 8 epochs/phase) + v2 (Hinglish→Hindi→English→Full, 50 epochs/phase)
4
- **Explainer:** `shap.GradientExplainer` on embedding sub-model (BiLSTM → output)
5
- **Background:** 200 random training samples
6
- **Test samples per eval:** 500 per language / full set
7
 
8
  ---
9
 
10
  ## Table of Contents
11
- 1. [Methodology](#1-methodology)
12
- 2. [v1 — All 6 Strategies](#2-v1--all-6-strategies)
13
- 3. [v2 — Hinglish → Hindi → English → Full](#3-v2--hinglish--hindi--english--full)
14
- 4. [Cross-Model Comparison](#4-cross-model-comparison)
15
- 5. [Key Findings](#5-key-findings)
16
 
17
  ---
18
 
19
- ## 1. Methodology
20
 
21
- ### How SHAP Works Here
22
 
23
- Standard SHAP cannot directly handle a Keras `Embedding` layer with integer token inputs because gradients cannot flow through integer operations. We solve this by splitting the model into two parts:
24
 
25
  ```
26
- Original model:
27
- Integer tokens (n, 100) → Embedding → (n, 100, 300) → BiLSTM → Dense → Sigmoid
28
-
29
- SHAP approach:
30
- Step 1: Manually look up embeddings: tokens → float embeddings (n, 100, 300)
31
- Step 2: Run GradientExplainer on sub-model: embeddings → BiLSTM → Dense → Sigmoid
32
- Step 3: SHAP values shape = (n, 100, 300) — one value per token per embedding dim
33
- Step 4: Sum across 300 embedding dims → (n, 100) — one value per token position
34
- Step 5: Map token IDs → words and aggregate mean SHAP per word
35
  ```
36
 
37
- A **positive SHAP value** means the word pushes the prediction toward **hate speech**.
38
- A **negative SHAP value** means the word pushes the prediction toward **non-hate**.
39
 
40
- ### Output per Model
41
 
42
- For each of the 7 models (6 v1 + 1 v2), evaluated on 4 test sets (English, Hindi, Hinglish, Full):
43
- - **Top-20 word importance bar chart** `shap_topwords_{lang}.png`
44
- - **Top-50 words CSV** `shap_topwords_{lang}.csv`
45
- - **Summary CSV** — top-5 hate/non-hate words per eval language
 
 
 
46
 
47
  ---
48
 
49
- ## 2. v1 — All 6 Strategies
50
 
51
- ### Strategy 1: English → Hindi → Hinglish
52
 
53
- | Eval | Top Hate Words | Top Non-Hate Words |
54
  |---|---|---|
55
  | English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
56
  | Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
@@ -64,11 +99,13 @@ For each of the 7 models (6 v1 + 1 v2), evaluated on 4 test sets (English, Hindi
64
  | Hinglish | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hinglish.png) |
65
  | Full | ![](shap/english_to_hindi_to_hinglish/shap_topwords_full.png) |
66
 
 
 
67
  ---
68
 
69
- ### Strategy 2: English → Hinglish → Hindi
70
 
71
- | Eval | Top Hate Words | Top Non-Hate Words |
72
  |---|---|---|
73
  | English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
74
  | Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
@@ -84,9 +121,11 @@ For each of the 7 models (6 v1 + 1 v2), evaluated on 4 test sets (English, Hindi
84
 
85
  ---
86
 
87
- ### Strategy 3: Hindi → English → Hinglish ⭐ Best Model (v1)
88
 
89
- | Eval | Top Hate Words | Top Non-Hate Words |
 
 
90
  |---|---|---|
91
  | English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
92
  | Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
@@ -100,11 +139,13 @@ For each of the 7 models (6 v1 + 1 v2), evaluated on 4 test sets (English, Hindi
100
  | Hinglish | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hinglish.png) |
101
  | Full | ![](shap/hindi_to_english_to_hinglish/shap_topwords_full.png) |
102
 
 
 
103
  ---
104
 
105
- ### Strategy 4: Hindi → Hinglish → English
106
 
107
- | Eval | Top Hate Words | Top Non-Hate Words |
108
  |---|---|---|
109
  | English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
110
  | Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
@@ -120,9 +161,9 @@ For each of the 7 models (6 v1 + 1 v2), evaluated on 4 test sets (English, Hindi
120
 
121
  ---
122
 
123
- ### Strategy 5: Hinglish → English → Hindi
124
 
125
- | Eval | Top Hate Words | Top Non-Hate Words |
126
  |---|---|---|
127
  | English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
128
  | Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
@@ -136,11 +177,13 @@ For each of the 7 models (6 v1 + 1 v2), evaluated on 4 test sets (English, Hindi
136
  | Hinglish | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hinglish.png) |
137
  | Full | ![](shap/hinglish_to_english_to_hindi/shap_topwords_full.png) |
138
 
 
 
139
  ---
140
 
141
- ### Strategy 6: Hinglish → Hindi → English
142
 
143
- | Eval | Top Hate Words | Top Non-Hate Words |
144
  |---|---|---|
145
  | English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
146
  | Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
@@ -156,27 +199,9 @@ For each of the 7 models (6 v1 + 1 v2), evaluated on 4 test sets (English, Hindi
156
 
157
  ---
158
 
159
- ## 3. v2 — Hinglish → Hindi → English → Full (50 epochs)
160
-
161
- | Eval | Top Hate Words | Top Non-Hate Words |
162
- |---|---|---|
163
- | English | nas, fags, sicko, sabotage, advocating | grow, barrel, homosexual, pak, join |
164
- | Hindi | वादा, वैज्ञानिकों, ऐ, उतारा, गला | जीतेगा, घोंटने, जिहादी, आपत्तिजनक, चमचो |
165
- | Hinglish | arey, bahir, punish, papa, interior | online, member, mam, messages, asha |
166
- | Full | blamed, criticized, syntax, grown, sine | underneath, smack, online, hole, clue |
167
-
168
- | Eval | Top Words Plot |
169
- |---|---|
170
- | English | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_english.png) |
171
- | Hindi | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hindi.png) |
172
- | Hinglish | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hinglish.png) |
173
- | Full | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_full.png) |
174
-
175
- ---
176
-
177
- ## 4. Cross-Model Comparison
178
 
179
- Words that appear in the top-10 SHAP list of at least 3 models, side-by-side across all strategies:
180
 
181
  **English test set:**
182
  ![](shap/cross_model_comparison_english.png)
@@ -192,47 +217,37 @@ Words that appear in the top-10 SHAP list of at least 3 models, side-by-side acr
192
 
193
  ---
194
 
195
- ## 5. Key Findings
196
-
197
- ### 5.1 SHAP Magnitude Reveals Language Confidence
198
 
199
- Hindi SHAP values are consistently **one order of magnitude smaller** than English and Hinglish:
200
 
201
- | Language | Typical Top SHAP | Interpretation |
202
  |---|---|---|
203
- | English | 0.03 – 0.08 | Model is confident: GloVe has rich English vectors |
204
- | Hinglish | 0.03 – 0.07 | Model learned strong patterns despite OOV words |
205
- | Hindi | 0.002 – 0.005 | Model is uncertain: most Hindi tokens have zero GloVe vectors |
206
 
207
- This directly explains the lower accuracy and F1 on Hindi across all models.
208
 
209
- ### 5.2 Consistent Non-Hate Signals Across Models
210
 
211
- The word **"online"** (negative SHAP) and **"rajya"** (state/parliament, negative SHAP) appear as top non-hate predictors in 4 out of 6 v1 models and v2. These represent informational/political discussion context that the model correctly distinguishes from targeted hate.
 
 
212
 
213
- ### 5.3 Hate Speech Markers Are Linguistically Coherent
214
 
215
- - **English:** Direct slurs (`spic`, `coon`), violence language (`massacres`, `cleansing`), accusatory verbs (`blame`, `blaming`, `blamed`, `criticized`) consistent with how hate speech presents in English social media
216
- - **Hinglish:** Relationship insults (`behan` — sister, used in abusive context), aggressive interjections (`arey`, `abb`, `ruk`), names in hate context (`srk`, `dada`) — reflects code-mixed abuse patterns
217
- - **Hindi:** Body/violence metaphors (`गला` — throat, as in strangle; `मूर्ख` — fool) and political provocations (`भूमिपूजन` — ground-breaking ceremony, polarising event)
218
 
219
- ### 5.4 Spurious Correlations Are Visible
220
 
221
- Several high-SHAP words are clearly spurious:
222
- - **"syntax", "sine", "skua"** as hate markers in v2 full eval — rare words the model overfits to in specific hateful contexts rather than learning the word's meaning
223
- - **"homosexual"** as non-hate in v2 — appears in informational/news articles in the dataset rather than targeted slurs
224
- - **"ahh"** appearing as hate in multiple models — likely a noise/exclamation pattern co-occurring with aggressive text
225
 
226
- These spurious correlations are expected limitations of GloVe + BiLSTM — without contextual embeddings (e.g. BERT), the model cannot distinguish word meaning from co-occurrence patterns.
227
 
228
- ### 5.5 v1 vs v2 Comparison
229
-
230
- | Aspect | v1 (8 epochs) | v2 (50 epochs) |
231
- |---|---|---|
232
- | English SHAP range | 0.03–0.07 | 0.02–0.06 |
233
- | Hinglish SHAP range | 0.03–0.57 | 0.04–0.46 |
234
- | Hindi SHAP range | 0.001–0.005 | 0.003–0.007 |
235
- | English hate markers | Varied, some spurious | More direct: sicko, fags, sabotage, advocating |
236
- | Full eval hate markers | Mixed language words | Accusatory framing: blamed, criticized |
237
 
238
- v2's longer training produces slightly more semantically coherent English hate markers. The full-dataset phase in v2 notably surfaces **accusatory framing words** (`blamed`, `criticized`, `grown`, `advocating`) as hate predictors, reflecting that hate speech in the combined corpus often frames targets through blame and accusation rather than direct slurs.
 
1
+ # SHAP Explainability Report — Multilingual Hate Speech Detection (v1)
2
 
3
+ **Models:** All 6 sequential transfer learning strategies (GloVe + BiLSTM, 8 epochs/phase)
4
+ **Explainer:** `shap.GradientExplainer` on the BiLSTM sub-model
5
+ **Background:** 200 randomly sampled training sequences
6
+ **Test samples:** 500 per language (random subset), full test set = 8,852
7
 
8
  ---
9
 
10
  ## Table of Contents
11
+ 1. [What SHAP Is and How It Works Here](#1-what-shap-is-and-how-it-works-here)
12
+ 2. [How the Top-20 Words Are Selected](#2-how-the-top-20-words-are-selected)
13
+ 3. [Strategy 1 — English → Hindi → Hinglish](#3-strategy-1--english--hindi--hinglish)
14
+ 4. [Strategy 2 — English → Hinglish → Hindi](#4-strategy-2--english--hinglish--hindi)
15
+ 5. [Strategy 3 — Hindi → English → Hinglish ⭐ Best](#5-strategy-3--hindi--english--hinglish--best)
16
+ 6. [Strategy 4 — Hindi → Hinglish → English](#6-strategy-4--hindi--hinglish--english)
17
+ 7. [Strategy 5 — Hinglish → English → Hindi](#7-strategy-5--hinglish--english--hindi)
18
+ 8. [Strategy 6 — Hinglish → Hindi → English](#8-strategy-6--hinglish--hindi--english)
19
+ 9. [Cross-Strategy Comparison](#9-cross-strategy-comparison)
20
+ 10. [Key Findings](#10-key-findings)
21
 
22
  ---
23
 
24
+ ## 1. What SHAP Is and How It Works Here
25
 
26
+ **SHAP (SHapley Additive exPlanations)** answers the question: *"For this specific prediction, how much did each word contribute to the output?"*
27
 
28
+ It does this using game theory: it treats each word as a "player" and fairly distributes the model's final prediction score among all the words in the input.
29
+
30
+ ### The Technical Challenge
31
+
32
+ Our model takes **integer token IDs** as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each input's influence. But you cannot take a gradient through an integer operation.
33
+
34
+ **Solution — split the model into two steps:**
35
 
36
  ```
37
+ Step 1 (outside SHAP):
38
+ Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300)
39
+
40
+ Step 2 (SHAP runs here):
41
+ Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)
 
 
 
 
42
  ```
43
 
44
+ SHAP's GradientExplainer runs on Step 2 only. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then **sum across all 300 dimensions** for each position to get one number per word.
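The shapes involved in the two-step split can be sketched with plain NumPy (toy sizes standing in for the real Keras model; the random array is only a stand-in for what `shap.GradientExplainer` would return):

```python
import numpy as np

# Toy sizes mirroring the report: 100-token sequences, 300-dim GloVe vectors
vocab_size, seq_len, embed_dim = 10_000, 100, 300
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

# Step 1 (outside SHAP): manual embedding lookup turns integers into floats
token_ids = rng.integers(1, vocab_size, size=(4, seq_len))  # batch of 4
embedded = embedding_matrix[token_ids]                      # (4, 100, 300)

# Step 2 would feed `embedded` to shap.GradientExplainer over the BiLSTM
# sub-model; the returned SHAP values share this shape. Random stand-in:
shap_values = rng.normal(size=embedded.shape)

# Collapse the 300 embedding dims into one signed value per token position
per_token = shap_values.sum(axis=-1)                        # (4, 100)
print(embedded.shape, per_token.shape)
```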
 
45
 
46
+ ### What the Numbers Mean
47
 
48
+ - **Positive SHAP value** → word pushes the prediction **toward hate speech** (output closer to 1.0)
49
+ - **Negative SHAP value** → word pushes the prediction **toward non-hate** (output closer to 0.0)
50
+ - **Magnitude** (how large the number is) → how strongly that word influences the decision
51
+
52
+ ### Background Dataset
53
+
54
+ The explainer needs a "baseline" — what does the model output when there is no meaningful input? We use 200 randomly sampled training sequences as this baseline. SHAP measures each word's contribution *relative to this baseline*.
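The baseline idea can be verified exactly on a toy linear model (not the report's BiLSTM): each feature's SHAP value is its weight times its deviation from the background mean, and the contributions sum to the prediction minus the baseline:

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.array([0.5, -1.0, 2.0])           # toy linear model: f(x) = w @ x
background = rng.normal(size=(200, 3))   # 200 background samples, as in the report
x = np.array([1.0, 2.0, -0.5])           # one input to explain

baseline = (background @ w).mean()       # E[f(background)]
# For a linear model the exact SHAP value of feature i is
# w_i * (x_i - mean of background feature i)
shap_vals = w * (x - background.mean(axis=0))

# Additivity: baseline plus all contributions recovers the prediction f(x)
assert np.isclose(baseline + shap_vals.sum(), w @ x)
```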
55
 
56
  ---
57
 
58
+ ## 2. How the Top-20 Words Are Selected
59
+
60
+ This is **not random**. Here is the exact process:
61
+
62
+ **Step 1 — Run SHAP on the test set**
63
+ For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence produces 100 SHAP values.
64
+
65
+ **Step 2 — Aggregate by word across all examples**
66
+ For each unique word that appears in those 500 examples, we compute the **mean SHAP value** across every occurrence of that word:
67
+
68
+ ```
69
+ word "blame" appears 47 times across 500 examples
70
+ SHAP values: [+0.031, +0.028, +0.019, ..., +0.034]
71
+ Mean SHAP = sum of all values / 47 = +0.026
72
+ ```
73
+
74
+ **Step 3 — Rank by absolute mean SHAP**
75
+ Words are sorted by `|mean SHAP|` — this captures both strong hate predictors (large positive) and strong non-hate predictors (large negative).
76
+
77
+ **Step 4 — Show top 20**
78
+ The 20 words with the highest `|mean SHAP|` are plotted. The colour tells you the direction:
79
+ - 🔴 Red bars extending right = pushes toward **hate**
80
+ - 🔵 Blue bars extending left = pushes toward **non-hate**
81
+
82
+ **Important:** padding tokens (ID=0) are excluded. Only real words contribute.
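Steps 2–4 above can be sketched directly (toy token IDs and SHAP values, not the report's outputs):

```python
import numpy as np
from collections import defaultdict

# Toy data: token IDs and per-position SHAP values for three short examples
id_to_word = {0: "<pad>", 1: "blame", 2: "online", 3: "ahh"}
token_ids = np.array([[1, 3, 0], [1, 2, 0], [2, 3, 1]])
shap_vals = np.array([[0.03, 0.01, 0.0], [0.02, -0.04, 0.0], [-0.05, 0.02, 0.04]])

# Step 2: mean SHAP per unique word, skipping padding (ID = 0)
sums, counts = defaultdict(float), defaultdict(int)
for id_row, sv_row in zip(token_ids, shap_vals):
    for tid, sv in zip(id_row, sv_row):
        if tid == 0:
            continue
        sums[tid] += sv
        counts[tid] += 1
mean_shap = {id_to_word[int(t)]: sums[t] / counts[t] for t in sums}

# Steps 3-4: rank by |mean SHAP|, keep the strongest words (top 20 in the report)
top = sorted(mean_shap.items(), key=lambda kv: abs(kv[1]), reverse=True)[:20]
print(top)  # strongest word first; the sign gives the direction
```

In the real pipeline the same aggregation runs over the 500 sampled examples per language before the bar charts are plotted.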
83
+
84
+ ---
85
 
86
+ ## 3. Strategy 1 — English → Hindi → Hinglish
87
 
88
+ | Eval On | Top Hate Words | Top Non-Hate Words |
89
  |---|---|---|
90
  | English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
91
  | Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
 
99
  | Hinglish | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hinglish.png) |
100
  | Full | ![](shap/english_to_hindi_to_hinglish/shap_topwords_full.png) |
101
 
102
+ **Note on Hindi:** The bar lengths for Hindi are much shorter than English/Hinglish. This is expected — GloVe was trained on English text, so Hindi words have near-zero embedding vectors. The model's gradients through those zero vectors are tiny, producing very small SHAP values. The model is making Hindi predictions mostly from position/sequence patterns rather than word meaning.
103
+
104
  ---
105
 
106
+ ## 4. Strategy 2 — English → Hinglish → Hindi
107
 
108
+ | Eval On | Top Hate Words | Top Non-Hate Words |
109
  |---|---|---|
110
  | English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
111
  | Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
 
121
 
122
  ---
123
 
124
+ ## 5. Strategy 3 — Hindi → English → Hinglish ⭐ Best
125
 
126
+ This is the best performing strategy (F1=0.6419, AUC=0.7528 on full test set).
127
+
128
+ | Eval On | Top Hate Words | Top Non-Hate Words |
129
  |---|---|---|
130
  | English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
131
  | Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
 
139
  | Hinglish | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hinglish.png) |
140
  | Full | ![](shap/hindi_to_english_to_hinglish/shap_topwords_full.png) |
141
 
142
+ Starting with Hindi forced the model to develop pattern-based hate detection before seeing GloVe-rich English vocabulary. The Hinglish hate markers (`bacchi`, `behan` — familial insults common in South Asian abuse; `srk` — a celebrity name frequently targeted in communal hate) are more semantically coherent than in English-first strategies.
143
+
144
  ---
145
 
146
+ ## 6. Strategy 4 — Hindi → Hinglish → English
147
 
148
+ | Eval On | Top Hate Words | Top Non-Hate Words |
149
  |---|---|---|
150
  | English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
151
  | Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
 
161
 
162
  ---
163
 
164
+ ## 7. Strategy 5 — Hinglish → English → Hindi
165
 
166
+ | Eval On | Top Hate Words | Top Non-Hate Words |
167
  |---|---|---|
168
  | English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
169
  | Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
 
177
  | Hinglish | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hinglish.png) |
178
  | Full | ![](shap/hinglish_to_english_to_hindi/shap_topwords_full.png) |
179
 
180
+ **Interesting:** Hindi non-hate words in this strategy include `चूतिए` and `मुल्ले` — which are slurs — yet the model assigns them negative SHAP. This reflects the model's confusion after ending on Hindi (last phase): it cannot consistently assign direction to Hindi vocabulary even when the words are inherently offensive.
181
+
182
  ---
183
 
184
+ ## 8. Strategy 6 — Hinglish → Hindi → English
185
 
186
+ | Eval On | Top Hate Words | Top Non-Hate Words |
187
  |---|---|---|
188
  | English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
189
  | Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
 
199
 
200
  ---
201
 
202
+ ## 9. Cross-Strategy Comparison
 
203
 
204
+ The heatmap below shows the top-5 words from each strategy, evaluated across all 6 models simultaneously. Each cell is the mean SHAP value for that word in that model. Red = pushes toward hate, blue = pushes toward non-hate, white = word not seen or neutral.
205
 
206
  **English test set:**
207
  ![](shap/cross_model_comparison_english.png)
 
217
 
218
  ---
219
 
220
+ ## 10. Key Findings
 
 
221
 
222
+ ### SHAP Magnitude Reflects Language Confidence
223
 
224
+ | Language | Typical Top \|SHAP\| | Why |
225
  |---|---|---|
226
+ | English | 0.03 – 0.08 | GloVe has 6B-token English coverage → rich, meaningful vectors |
227
+ | Hinglish | 0.03 – 0.07 | Roman-script words partially covered by GloVe; model learns strong patterns |
228
+ | Hindi | 0.002 – 0.005 | Devanagari words are mostly OOV in GloVe → near-zero vectors, tiny gradients |
229
 
230
+ This directly explains the Hindi accuracy gap across all 6 strategies. The model is essentially guessing from context and position for Hindi, not from word meaning.
231
 
232
+ ### Consistent Non-Hate Signals Across All 6 Models
233
 
234
+ - **"online"** appears as a top non-hate predictor in 4/6 strategies. Informational/conversational context: people discussing online behaviour, platforms, reporting rather than targeting.
235
+ - **"rajya"** (state/parliament) — consistent non-hate in Hinglish eval. Political discussion about government is distinguishable from targeted hate.
236
+ - **"maine"** (I/me in Hindi-Urdu) — first-person perspective associated with personal narrative rather than targeted attacks.
237
 
238
+ ### Hate Speech Markers Are Linguistically Coherent
239
 
240
+ **English:** Direct accusation verbs (`blame`, `blaming`, `criticized`) and violence vocabulary (`massacres`, `cleansing`, `violence`) are the most consistent. These reflect how hate speech in this dataset frames targets through accusation and dehumanisation.
 
 
241
 
242
+ **Hinglish:** Familial insults (`behan` = sister, `bacchi` = girl/child — used in gendered abuse), aggressive interjections (`arey`, `abb`, `ruk`), and celebrity/political names in abusive context (`srk`, `krk`, `dada`) — reflects South Asian social media abuse patterns.
243
 
244
+ **Hindi:** Body/violence metaphors (`गला` = throat, as in strangle; `मूर्ख` = fool) and politically charged proper nouns (`भूमिपूजन` = Ram Mandir ground-breaking ceremony, a highly polarising event in the dataset period).
 
 
 
245
 
246
+ ### Visible Spurious Correlations
247
 
248
+ Several high-SHAP words are clearly not meaningful hate indicators:
249
+ - **`skua`** (a seabird) — appears as top hate word in 2 strategies. Rare word co-occurring with hateful text in training; model memorised the co-occurrence.
250
+ - **`ahh`** — an exclamation appearing in multiple models as a hate signal; an aggressive tone marker rather than a meaningful word.
251
+ - **`coon`** appearing as non-hate in Strategy 2 — this is a racial slur, but in this model it was learned as a feature in a specific context (likely news/reporting usage in the dataset).
252
 
253
+ These spurious patterns are an inherent limitation of non-contextual GloVe embeddings: the model sees a word as a fixed vector regardless of the sentence it appears in.