tuklu/SASCv2

@@ -301,6 +301,39 @@ for text, prob in zip(texts, probs):
 ---
 ## Related
 - **v1 (all 6 strategies, 8 epochs each):** [tuklu/SASC](https://huggingface.co/tuklu/SASC)

 ---
+## Explainability — SHAP Analysis
+We applied **SHAP (SHapley Additive exPlanations)** to the final trained model to understand which words drive hate speech predictions. A `GradientExplainer` runs on the BiLSTM sub-model: the embedding layer is bypassed and embeddings are pre-computed as floats. It uses 200 background training samples and is evaluated on all 4 test sets.
+> Full methodology, all strategy comparisons, and detailed word tables: **[SHAP_REPORT.md](SHAP_REPORT.md)**
+### Top SHAP Words — Final Model
+| Eval | Top Hate Words | Top Non-Hate Words |
+|---|---|---|
+| English | nas, fags, sicko, sabotage, advocating | grow, barrel, homosexual, pak, join |
+| Hindi | वादा, वैज्ञानिकों, ऐ, उतारा, गला | जीतेगा, घोंटने, जिहादी, आपत्तिजनक, चमचो |
+| Hinglish | arey, bahir, punish, papa, interior | online, member, mam, messages, asha |
+| Full | blamed, criticized, syntax, grown, sine | underneath, smack, online, hole, clue |
+![SHAP — English](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_english.png)
+![SHAP — Hinglish](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hinglish.png)
+![SHAP — Hindi](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hindi.png)
+![SHAP — Full](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_full.png)
+### Key Takeaways
+- **Hindi SHAP values are 10× smaller** than English/Hinglish — GloVe has near-zero Hindi coverage; model relies on positional patterns, not word semantics
+- **Accusatory framing dominates full-dataset hate markers** (`blamed`, `criticized`; compare `advocating` in the English eval): the 50-epoch Full phase appears to learn that hate speech in this corpus often targets victims through blame or accusation rather than direct slurs
+- **"online"** is the most consistent non-hate signal: it ranks among the top non-hate words in both the Hinglish and Full evals, suggesting an informational/conversational context
+- **Hinglish markers are semantically coherent** (`arey` = hey/exclamation in abusive context, `punish`, `interior`) despite code-mixing — v2's 50 epochs on Hinglish-first produced stronger Hinglish feature learning than v1
+- **Spurious correlations remain** (`syntax`, `sine`), a known limitation of non-contextual GloVe embeddings; a contextual model such as BERT would likely reduce these
+---
 ## Related
 - **v1 (all 6 strategies, 8 epochs each):** [tuklu/SASC](https://huggingface.co/tuklu/SASC)

SHAP_REPORT.md ADDED

	@@ -0,0 +1,238 @@

+# SHAP Explainability Report — Multilingual Hate Speech Detection
+**Models:** v1 (6 strategies, 8 epochs/phase) + v2 (Hinglish→Hindi→English→Full, 50 epochs/phase)
+**Explainer:** `shap.GradientExplainer` on the post-embedding sub-model (embeddings → BiLSTM → output)
+**Background:** 200 random training samples
+**Test samples per eval:** 500 per language / full set
+---
+## Table of Contents
+1. [Methodology](#1-methodology)
+2. [v1 — All 6 Strategies](#2-v1--all-6-strategies)
+3. [v2 — Hinglish → Hindi → English → Full (50 epochs)](#3-v2--hinglish--hindi--english--full-50-epochs)
+4. [Cross-Model Comparison](#4-cross-model-comparison)
+5. [Key Findings](#5-key-findings)
+---
+## 1. Methodology
+### How SHAP Works Here
+Gradient-based SHAP explainers cannot be applied directly to a Keras `Embedding` layer with integer token inputs, because the embedding lookup is not differentiable with respect to the integer token IDs. We solve this by splitting the model into two parts:
+```
+Original model:
+  Integer tokens (n, 100) → Embedding → (n, 100, 300) → BiLSTM → Dense → Sigmoid
+SHAP approach:
+  Step 1: Manually look up embeddings: tokens → float embeddings (n, 100, 300)
+  Step 2: Run GradientExplainer on sub-model: embeddings → BiLSTM → Dense → Sigmoid
+  Step 3: SHAP values shape = (n, 100, 300) — one value per token per embedding dim
+  Step 4: Sum across 300 embedding dims → (n, 100) — one value per token position
+  Step 5: Map token IDs → words and aggregate mean SHAP per word
+```
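Steps 4 and 5 can be sketched in plain NumPy as follows. This is a minimal illustration rather than the project's actual script: the function name `mean_shap_per_word`, the `index_to_word` mapping, and the assumption that token ID 0 is padding are all hypothetical, and the toy arrays stand in for the real `(n, 100, 300)` SHAP output.

```python
from collections import defaultdict

import numpy as np


def mean_shap_per_word(shap_values, token_ids, index_to_word, pad_id=0):
    """Collapse (n, seq_len, emb_dim) SHAP values to one mean score per word."""
    # Step 4: sum over the embedding dimension -> (n, seq_len)
    per_token = np.asarray(shap_values).sum(axis=-1)
    token_ids = np.asarray(token_ids)

    totals = defaultdict(float)
    counts = defaultdict(int)
    # Step 5: accumulate each position's score under its word
    for tok, score in zip(token_ids.ravel(), per_token.ravel()):
        if tok == pad_id:  # assumption: ID 0 marks padding
            continue
        word = index_to_word.get(int(tok))
        if word is None:  # skip IDs with no vocabulary entry
            continue
        totals[word] += float(score)
        counts[word] += 1
    return {w: totals[w] / counts[w] for w in totals}


# Toy data: 2 texts, 3 positions, 2 embedding dims
shap_values = np.array([
    [[0.5, 0.5], [-0.25, -0.25], [0.0, 0.0]],
    [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]],
])
token_ids = np.array([[1, 2, 0], [1, 0, 0]])
index_to_word = {1: "blamed", 2: "online"}

word_scores = mean_shap_per_word(shap_values, token_ids, index_to_word)
```

Averaging rather than summing per word keeps frequent words from dominating the ranking simply because they occur more often.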
+A **positive SHAP value** means the word pushes the prediction toward **hate speech**.
+A **negative SHAP value** means the word pushes the prediction toward **non-hate**.
+### Output per Model
+Each of the 7 models (6 v1 + 1 v2) is evaluated on 4 test sets (English, Hindi, Hinglish, Full), producing:
+- **Top-20 word importance bar chart** — `shap_topwords_{lang}.png`
+- **Top-50 words CSV** — `shap_topwords_{lang}.csv`
+- **Summary CSV** — top-5 hate/non-hate words per eval language
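One way the summary row could be derived from per-word mean SHAP scores is sketched below. The helper `top_words`, the column names, and the toy scores are assumptions made for illustration; the real CSV layout is not specified in this report.

```python
import csv
import io


def top_words(word_scores, k=5):
    """Return (hate, non_hate): the k most positive and k most negative words."""
    ranked = sorted(word_scores.items(), key=lambda kv: kv[1])
    non_hate = [w for w, s in ranked if s < 0][:k]        # most negative first
    hate = [w for w, s in reversed(ranked) if s > 0][:k]  # most positive first
    return hate, non_hate


# Toy scores, invented for illustration only
toy_scores = {"blamed": 0.06, "criticized": 0.05, "the": 0.0,
              "member": -0.02, "online": -0.04}
hate, non_hate = top_words(toy_scores, k=2)

# Write one summary row per eval set (an in-memory buffer stands in for a file)
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["eval", "top_hate", "top_non_hate"])
writer.writerow(["toy", " ".join(hate), " ".join(non_hate)])
summary_csv = buf.getvalue()
```

Words with a score of exactly zero are left out of both lists, since they push the prediction in neither direction.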
+---
+## 2. v1 — All 6 Strategies
+### Strategy 1: English → Hindi → Hinglish
+| Eval | Top Hate Words | Top Non-Hate Words |
+|---|---|---|
+| English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
+| Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
+| Hinglish | nawaz, dhawan, bashing, shareef, scn | gau, age, rajya, chori, channels |
+| Full | molvi, chalo, molana, scn, elitist | rajya, coding, meat, haan, maine |
+| Eval | Top Words Plot |
+|---|---|
+| English | ![](shap/english_to_hindi_to_hinglish/shap_topwords_english.png) |
+| Hindi | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hindi.png) |
+| Hinglish | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hinglish.png) |
+| Full | ![](shap/english_to_hindi_to_hinglish/shap_topwords_full.png) |
+---
+### Strategy 2: English → Hinglish → Hindi
+| Eval | Top Hate Words | Top Non-Hate Words |
+|---|---|---|
+| English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
+| Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
+| Hinglish | khi, dada, kiske, srk, chalo | gau, tk, online, liberty, taraf |
+| Full | dada, roj, shamelessness, tujhe, epic | akash, sapna, proud, buddy, episcopal |
+| Eval | Top Words Plot |
+|---|---|
+| English | ![](shap/english_to_hinglish_to_hindi/shap_topwords_english.png) |
+| Hindi | ![](shap/english_to_hinglish_to_hindi/shap_topwords_hindi.png) |
+| Hinglish | ![](shap/english_to_hinglish_to_hindi/shap_topwords_hinglish.png) |
+| Full | ![](shap/english_to_hinglish_to_hindi/shap_topwords_full.png) |
+---
+### Strategy 3: Hindi → English → Hinglish ⭐ Best Model (v1)
+| Eval | Top Hate Words | Top Non-Hate Words |
+|---|---|---|
+| English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
+| Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
+| Hinglish | bacchi, bull, srk, bahana, behan | madrassa, zaida, gdp, bech, nd |
+| Full | skua, brut, cleansing, captaincy, baar | taraf, pussy, directory, quran, kaha |
+| Eval | Top Words Plot |
+|---|---|
+| English | ![](shap/hindi_to_english_to_hinglish/shap_topwords_english.png) |
+| Hindi | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hindi.png) |
+| Hinglish | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hinglish.png) |
+| Full | ![](shap/hindi_to_english_to_hinglish/shap_topwords_full.png) |
+---
+### Strategy 4: Hindi → Hinglish → English
+| Eval | Top Hate Words | Top Non-Hate Words |
+|---|---|---|
+| English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
+| Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
+| Hinglish | tatti, sham, dino, roko, krk | lac, online, ancestor, zaida, target |
+| Full | ahh, moi, bj, fault, pan | asperger, hundred, database, wicked, nam |
+| Eval | Top Words Plot |
+|---|---|
+| English | ![](shap/hindi_to_hinglish_to_english/shap_topwords_english.png) |
+| Hindi | ![](shap/hindi_to_hinglish_to_english/shap_topwords_hindi.png) |
+| Hinglish | ![](shap/hindi_to_hinglish_to_english/shap_topwords_hinglish.png) |
+| Full | ![](shap/hindi_to_hinglish_to_english/shap_topwords_full.png) |
+---
+### Strategy 5: Hinglish → English → Hindi
+| Eval | Top Hate Words | Top Non-Hate Words |
+|---|---|---|
+| English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
+| Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
+| Hinglish | huye, dada, abb, arey, abduction | rajya, bahu, parliament, code, music |
+| Full | skua, praised, spic, sabse, plz | liberty, languages, speaks, bache, maine |
+| Eval | Top Words Plot |
+|---|---|
+| English | ![](shap/hinglish_to_english_to_hindi/shap_topwords_english.png) |
+| Hindi | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hindi.png) |
+| Hinglish | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hinglish.png) |
+| Full | ![](shap/hinglish_to_english_to_hindi/shap_topwords_full.png) |
+---
+### Strategy 6: Hinglish → Hindi → English
+| Eval | Top Hate Words | Top Non-Hate Words |
+|---|---|---|
+| English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
+| Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
+| Hinglish | fav, janab, chori, cum, ruk | online, gau, dehli, 2017, rajya |
+| Full | srk, roj, rhi, purana, aapke | nest, maine, hone, haired, barrel |
+| Eval | Top Words Plot |
+|---|---|
+| English | ![](shap/hinglish_to_hindi_to_english/shap_topwords_english.png) |
+| Hindi | ![](shap/hinglish_to_hindi_to_english/shap_topwords_hindi.png) |
+| Hinglish | ![](shap/hinglish_to_hindi_to_english/shap_topwords_hinglish.png) |
+| Full | ![](shap/hinglish_to_hindi_to_english/shap_topwords_full.png) |
+---
+## 3. v2 — Hinglish → Hindi → English → Full (50 epochs)
+| Eval | Top Hate Words | Top Non-Hate Words |
+|---|---|---|
+| English | nas, fags, sicko, sabotage, advocating | grow, barrel, homosexual, pak, join |
+| Hindi | वादा, वैज्ञानिकों, ऐ, उतारा, गला | जीतेगा, घोंटने, जिहादी, आपत्तिजनक, चमचो |
+| Hinglish | arey, bahir, punish, papa, interior | online, member, mam, messages, asha |
+| Full | blamed, criticized, syntax, grown, sine | underneath, smack, online, hole, clue |
+| Eval | Top Words Plot |
+|---|---|
+| English | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_english.png) |
+| Hindi | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hindi.png) |
+| Hinglish | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_hinglish.png) |
+| Full | ![](output_v2/shap/v2/hinglish_to_hindi_to_english_v2/shap_topwords_full.png) |
+---
+## 4. Cross-Model Comparison
+Words that appear in the top-10 SHAP list of at least 3 models, side-by-side across all strategies:
+**English test set:**
+![](shap/cross_model_comparison_english.png)
+**Hindi test set:**
+![](shap/cross_model_comparison_hindi.png)
+**Hinglish test set:**
+![](shap/cross_model_comparison_hinglish.png)
+**Full test set:**
+![](shap/cross_model_comparison_full.png)
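The "at least 3 models" filter behind these plots can be sketched as a simple count over per-model word lists. The function `shared_top_words` is hypothetical, and the input below uses only the top-5 Hinglish non-hate words from the tables in Sections 2 and 3 rather than the full top-10 lists, so it will not necessarily match the plotted words.

```python
from collections import Counter


def shared_top_words(top_lists, min_models=3):
    """Words that appear in the top list of at least `min_models` models."""
    counts = Counter()
    for words in top_lists.values():
        counts.update(set(words))  # count each word once per model
    return sorted(w for w, c in counts.items() if c >= min_models)


# Top-5 Hinglish non-hate words, copied from the tables in this report
hinglish_non_hate = {
    "v1_s1": ["gau", "age", "rajya", "chori", "channels"],
    "v1_s2": ["gau", "tk", "online", "liberty", "taraf"],
    "v1_s3": ["madrassa", "zaida", "gdp", "bech", "nd"],
    "v1_s4": ["lac", "online", "ancestor", "zaida", "target"],
    "v1_s5": ["rajya", "bahu", "parliament", "code", "music"],
    "v1_s6": ["online", "gau", "dehli", "2017", "rajya"],
    "v2": ["online", "member", "mam", "messages", "asha"],
}

shared = shared_top_words(hinglish_non_hate, min_models=3)
```

On this input the filter keeps `gau`, `online`, and `rajya`, while `zaida` (two models) is dropped.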
+---
+## 5. Key Findings
+### 5.1 SHAP Magnitude Reveals Language Confidence
+Hindi SHAP values are consistently **one order of magnitude smaller** than English and Hinglish:
+| Language | Typical Top SHAP | Interpretation |
+|---|---|---|
+| English | 0.03 – 0.08 | Model is confident — GloVe has rich English vectors |
+| Hinglish | 0.03 – 0.07 | Model learned strong patterns despite OOV words |
+| Hindi | 0.002 – 0.005 | Model is uncertain — most Hindi tokens have zero GloVe vectors |
+This is consistent with the lower accuracy and F1 on Hindi across all models.
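As a quick arithmetic check on the "one order of magnitude" claim, the ranges in the table above can be compared on a log scale. Taking the geometric midpoint of each range is a choice made here for illustration, not something the report specifies.

```python
import math

# "Typical Top SHAP" ranges copied from the table above
ranges = {
    "English": (0.03, 0.08),
    "Hinglish": (0.03, 0.07),
    "Hindi": (0.002, 0.005),
}


def geometric_mid(lo, hi):
    """Midpoint of a range on a log scale."""
    return math.sqrt(lo * hi)


hindi_mid = geometric_mid(*ranges["Hindi"])

# log10 of the ratio to Hindi: 1.0 corresponds to exactly one order of magnitude
gaps = {
    lang: math.log10(geometric_mid(*bounds) / hindi_mid)
    for lang, bounds in ranges.items()
    if lang != "Hindi"
}
```

Both gaps come out slightly above 1, corresponding to a ratio of roughly 15× between English or Hinglish and Hindi.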
+### 5.2 Consistent Non-Hate Signals Across Models
+The words **"online"** and **"rajya"** (state/parliament) recur as top non-hate predictors with negative SHAP. In the top-5 tables above, "online" appears in v1 Strategies 2, 4, and 6 and in v2, and "rajya" in v1 Strategies 1, 5, and 6. These represent informational/political discussion context that the model appears to distinguish from targeted hate.
+### 5.3 Hate Speech Markers Are Linguistically Coherent
+- **English:** Direct slurs (`spic`, `coon`), violence language (`massacres`, `cleansing`), accusatory verbs (`blame`, `blaming`, `blamed`, `criticized`) — consistent with how hate speech presents in English social media
+- **Hinglish:** Relationship insults (`behan` — sister, used in abusive context), aggressive interjections (`arey`, `abb`, `ruk`), names in hate context (`srk`, `dada`) — reflects code-mixed abuse patterns
+- **Hindi:** Body/violence metaphors (`गला` — throat, as in strangle; `मूर्ख` — fool) and political provocations (`भूमिपूजन` — ground-breaking ceremony, polarising event)
+### 5.4 Spurious Correlations Are Visible
+Several high-SHAP words are clearly spurious:
+- **"syntax"** and **"sine"** as hate markers in the v2 Full eval, and **"skua"** in the v1 Strategy 3 and 5 Full evals: rare words that the model overfits to in specific hateful contexts rather than learning their meaning
+- **"homosexual"** as non-hate in v2 — appears in informational/news articles in the dataset rather than targeted slurs
+- **"ahh"** appearing as hate in multiple models — likely a noise/exclamation pattern co-occurring with aggressive text
+These spurious correlations are expected limitations of GloVe + BiLSTM — without contextual embeddings (e.g. BERT), the model cannot distinguish word meaning from co-occurrence patterns.
+### 5.5 v1 vs v2 Comparison
+| Aspect | v1 (8 epochs) | v2 (50 epochs) |
+|---|---|---|
+| English SHAP range | 0.03–0.07 | 0.02–0.06 |
+| Hinglish SHAP range | 0.03–0.57 | 0.04–0.46 |
+| Hindi SHAP range | 0.001–0.005 | 0.003–0.007 |
+| English hate markers | Varied, some spurious | More direct: sicko, fags, sabotage, advocating |
+| Full eval hate markers | Mixed language words | Accusatory framing: blamed, criticized |
+v2's longer training produces slightly more semantically coherent English hate markers. The full-dataset phase in v2 notably produces **accusatory framing words** (`blamed`, `criticized`) as hate predictors, alongside `advocating` in the English eval. This suggests that hate speech in the combined corpus often frames targets through blame or accusation rather than direct slurs.