Update SHAP report — v1 isolated, full methodology
SHAP_REPORT.md

# SHAP Explainability Report — Multilingual Hate Speech Detection (v1)

**Models:** All 6 sequential transfer learning strategies (GloVe + BiLSTM, 8 epochs/phase)
**Explainer:** `shap.GradientExplainer` on the BiLSTM sub-model
**Background:** 200 randomly sampled training sequences
**Test samples:** 500 per language (random subset), full test set = 8,852

---

## Table of Contents

1. [What SHAP Is and How It Works Here](#1-what-shap-is-and-how-it-works-here)
2. [How the Top-20 Words Are Selected](#2-how-the-top-20-words-are-selected)
3. [Strategy 1 — English → Hindi → Hinglish](#3-strategy-1--english--hindi--hinglish)
4. [Strategy 2 — English → Hinglish → Hindi](#4-strategy-2--english--hinglish--hindi)
5. [Strategy 3 — Hindi → English → Hinglish ⭐ Best](#5-strategy-3--hindi--english--hinglish--best)
6. [Strategy 4 — Hindi → Hinglish → English](#6-strategy-4--hindi--hinglish--english)
7. [Strategy 5 — Hinglish → English → Hindi](#7-strategy-5--hinglish--english--hindi)
8. [Strategy 6 — Hinglish → Hindi → English](#8-strategy-6--hinglish--hindi--english)
9. [Cross-Strategy Comparison](#9-cross-strategy-comparison)
10. [Key Findings](#10-key-findings)

---

## 1. What SHAP Is and How It Works Here

**SHAP (SHapley Additive exPlanations)** answers the question: *"For this specific prediction, how much did each word contribute to the output?"*

It does this using game theory — it treats each word as a "player" and fairly distributes the model's final prediction score among all the words in the input.

### The Technical Challenge

Our model takes **integer token IDs** as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each input's influence. But you cannot take a gradient through an integer operation.

**Solution — split the model into two steps:**

```
Step 1 (outside SHAP):
Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300)

Step 2 (SHAP runs here):
Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)
```

SHAP's GradientExplainer runs on Step 2 only. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then **sum across all 300 dimensions** for each position to get one number per word.
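
The dimension-summing step can be sketched with NumPy alone — a toy random array stands in for the explainer's output here, so nothing below calls `shap` itself:

```python
import numpy as np

# Toy stand-in for GradientExplainer's output on the Step-2 sub-model:
# one attribution per (sample, token position, embedding dimension).
n, seq_len, embed_dim = 4, 100, 300
rng = np.random.default_rng(0)
shap_values = rng.normal(scale=0.001, size=(n, seq_len, embed_dim))

# Collapse the 300 embedding dimensions -> one SHAP value per token position.
token_shap = shap_values.sum(axis=-1)
print(token_shap.shape)  # (4, 100)
```

With the real explainer, `shap_values` would instead come from something like `shap.GradientExplainer(sub_model, background).shap_values(embedded_test)`; the summing step is identical.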

### What the Numbers Mean

- **Positive SHAP value** → word pushes the prediction **toward hate speech** (output closer to 1.0)
- **Negative SHAP value** → word pushes the prediction **toward non-hate** (output closer to 0.0)
- **Magnitude** (how large the number is) → how strongly that word influences the decision

### Background Dataset

The explainer needs a "baseline" — what does the model output when there is no meaningful input? We use 200 randomly sampled training sequences as this baseline. SHAP measures each word's contribution *relative to this baseline*.
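
A toy illustration of how the baseline enters the numbers (invented values, not taken from the report): SHAP values are additive, so the baseline output plus all per-word contributions reconstructs the model's own prediction for that sentence.

```python
# Hypothetical numbers for one sentence (illustration only).
base_value = 0.31  # mean model output over the 200 background sequences
word_shap = {"you": 0.02, "people": 0.05, "are": 0.01, "vermin": 0.24}

# SHAP's additivity property: baseline + contributions = model output.
prediction = base_value + sum(word_shap.values())
print(round(prediction, 2))  # 0.63
```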

---

## 2. How the Top-20 Words Are Selected

This is **not random**. Here is the exact process:

**Step 1 — Run SHAP on the test set**
For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence produces 100 SHAP values.

**Step 2 — Aggregate by word across all examples**
For each unique word that appears in those 500 examples, we compute the **mean SHAP value** across every occurrence of that word:

```
word "blame" appears 47 times across 500 examples
SHAP values: [+0.031, +0.028, +0.019, ..., +0.034]
Mean SHAP = sum of all values / 47 = +0.026
```

**Step 3 — Rank by absolute mean SHAP**
Words are sorted by `|mean SHAP|` — this captures both strong hate predictors (large positive) and strong non-hate predictors (large negative).

**Step 4 — Show top 20**
The 20 words with the highest `|mean SHAP|` are plotted. The colour tells you the direction:
- 🔴 Red bars extending right = pushes toward **hate**
- 🔵 Blue bars extending left = pushes toward **non-hate**

**Important:** padding tokens (ID=0) are excluded. Only real words contribute.
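
The whole selection pipeline can be sketched from scratch (toy token IDs, SHAP values, and vocabulary — `top_words` is an illustrative helper, not the report's actual code):

```python
import numpy as np

def top_words(token_ids, token_shap, id_to_word, k=3, pad_id=0):
    """Mean SHAP per unique word, ranked by |mean SHAP|; padding excluded."""
    sums, counts = {}, {}
    for id_row, shap_row in zip(token_ids, token_shap):
        for tok, val in zip(id_row, shap_row):
            if tok == pad_id:                     # exclude padding tokens
                continue
            sums[tok] = sums.get(tok, 0.0) + val
            counts[tok] = counts.get(tok, 0) + 1
    means = {t: sums[t] / counts[t] for t in sums}           # Step 2: mean per word
    ranked = sorted(means, key=lambda t: abs(means[t]), reverse=True)  # Step 3
    return [(id_to_word[t], means[t]) for t in ranked[:k]]             # Step 4

# Toy data: 2 sequences, 4 token positions each (0 = padding).
ids   = np.array([[5, 7, 5, 0], [7, 9, 0, 0]])
vals  = np.array([[0.04, -0.05, 0.02, 0.0], [-0.03, 0.001, 0.0, 0.0]])
vocab = {5: "blame", 7: "online", 9: "ahh"}
print(top_words(ids, vals, vocab))
```

Here "online" ranks first because its mean contribution has the largest magnitude, even though it is negative (a non-hate predictor).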

---

## 3. Strategy 1 — English → Hindi → Hinglish

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
| Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |

| Eval | Top Words Plot |
|---|---|
| Hinglish | ![](plots/shap/v1/shap_top_words_strategy1_hinglish.png) |
| Full | ![](plots/shap/v1/shap_top_words_strategy1_full.png) |

**Note on Hindi:** The bar lengths for Hindi are much shorter than English/Hinglish. This is expected — GloVe was trained on English text, so Hindi words have near-zero embedding vectors. The model's gradients through those zero vectors are tiny, producing very small SHAP values. The model is making Hindi predictions mostly from position/sequence patterns rather than word meaning.
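
The zero-vector effect can be checked without SHAP at all. A minimal sketch, assuming a hypothetical 4-dimensional embedding table and a linear scoring head standing in for the BiLSTM:

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim = 4                              # toy size; the real model uses 300
embedding = np.zeros((3, embed_dim))       # row 0 = padding
embedding[1] = rng.normal(size=embed_dim)  # in-vocabulary word, e.g. "hate"
# row 2 stays all zeros — a Devanagari word that is OOV for GloVe

# For a linear head score = w . e, a token's contribution (gradient w times
# its embedding e) vanishes exactly when e is the zero vector.
w = rng.normal(size=embed_dim)
contrib_in_vocab = float(w @ embedding[1])
contrib_oov = float(w @ embedding[2])
print(contrib_oov)  # 0.0
```

The real BiLSTM is nonlinear, but the same mechanism applies: near-zero input vectors yield near-zero gradients and therefore tiny attributions.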

---

## 4. Strategy 2 — English → Hinglish → Hindi

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
| Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |

---

## 5. Strategy 3 — Hindi → English → Hinglish ⭐ Best

This is the best performing strategy (F1=0.6419, AUC=0.7528 on the full test set).

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
| Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |

| Eval | Top Words Plot |
|---|---|
| Hinglish | ![](plots/shap/v1/shap_top_words_strategy3_hinglish.png) |
| Full | ![](plots/shap/v1/shap_top_words_strategy3_full.png) |

Starting with Hindi forced the model to develop pattern-based hate detection before seeing GloVe-rich English vocabulary. The Hinglish hate markers (`bacchi`, `behan` — familial insults common in South Asian abuse; `srk` — a celebrity name frequently targeted in communal hate) are more semantically coherent than in English-first strategies.

---

## 6. Strategy 4 — Hindi → Hinglish → English

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
| Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |

---

## 7. Strategy 5 — Hinglish → English → Hindi

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
| Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |

| Eval | Top Words Plot |
|---|---|
| Hinglish | ![](plots/shap/v1/shap_top_words_strategy5_hinglish.png) |
| Full | ![](plots/shap/v1/shap_top_words_strategy5_full.png) |

**Interesting:** Hindi non-hate words in this strategy include `चूतिए` and `मुल्ले` — which are slurs — yet the model assigns them negative SHAP. This reflects the model's confusion after ending on Hindi (the last phase): it cannot consistently assign direction to Hindi vocabulary even when the words are inherently offensive.

---

## 8. Strategy 6 — Hinglish → Hindi → English

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
| Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |

---

## 9. Cross-Strategy Comparison

The heatmap below shows the top-5 words from each strategy, evaluated across all 6 models simultaneously. Each cell is the mean SHAP value for that word in that model. Red = pushes toward hate, blue = pushes toward non-hate, white = word not seen or neutral.

**English test set:**
![](plots/shap/v1/shap_comparison_english.png)

---

## 10. Key Findings

### SHAP Magnitude Reflects Language Confidence

| Language | Typical Top \|SHAP\| | Why |
|---|---|---|
| English | 0.03 – 0.08 | GloVe has 6B-token English coverage — rich, meaningful vectors |
| Hinglish | 0.03 – 0.07 | Roman-script words partially covered by GloVe; model learns strong patterns |
| Hindi | 0.002 – 0.005 | Devanagari words are mostly OOV in GloVe — near-zero vectors, tiny gradients |

This directly explains the Hindi accuracy gap across all 6 strategies. The model is essentially guessing from context and position for Hindi, not from word meaning.

### Consistent Non-Hate Signals Across All 6 Models

- **"online"** — appears as a top non-hate predictor in 4/6 strategies. Informational/conversational context: people discussing online behaviour, platforms, and reporting rather than targeting anyone.
- **"rajya"** (state/parliament) — consistent non-hate in the Hinglish eval. Political discussion about government is distinguishable from targeted hate.
- **"maine"** (I/me in Hindi-Urdu) — first-person perspective associated with personal narrative rather than targeted attacks.

### Hate Speech Markers Are Linguistically Coherent

**English:** Direct accusation verbs (`blame`, `blaming`, `criticized`) and violence vocabulary (`massacres`, `cleansing`, `violence`) are the most consistent. These reflect how hate speech in this dataset frames targets through accusation and dehumanisation.

**Hinglish:** Familial insults (`behan` = sister, `bacchi` = girl/child — used in gendered abuse), aggressive interjections (`arey`, `abb`, `ruk`), and celebrity/political names in abusive context (`srk`, `krk`, `dada`) — reflecting South Asian social media abuse patterns.

**Hindi:** Body/violence metaphors (`गला` = throat, as in strangle; `मूर्ख` = fool) and politically charged proper nouns (`भूमिपूजन` = Ram Mandir ground-breaking ceremony — a highly polarising event in the dataset period).

### Visible Spurious Correlations

Several high-SHAP words are clearly not meaningful hate indicators:
- **`skua`** (a seabird) — appears as a top hate word in 2 strategies: a rare word that co-occurred with hateful text in training, so the model memorised the co-occurrence.
- **`ahh`** — an exclamation appearing in multiple models as a hate signal: an aggressive tone marker rather than a meaningful word.
- **`coon`** — appears as non-hate in Strategy 2. This is a racial slur, but the model learned it as a feature of one specific context (likely news/reporting usage in the dataset).

These spurious patterns are an inherent limitation of non-contextual GloVe embeddings: the model sees a word as a fixed vector regardless of the sentence it appears in.