# SHAP Explainability Report — Multilingual Hate Speech Detection (v1)
**Models:** All 6 sequential transfer learning strategies (GloVe + BiLSTM, 8 epochs/phase)
**Explainer:** `shap.GradientExplainer` on the BiLSTM sub-model
**Background:** 200 randomly sampled training sequences
**Test samples:** 500 per language (random subset), full test set = 8,852
---
## Table of Contents
1. [What SHAP Is and How It Works Here](#1-what-shap-is-and-how-it-works-here)
2. [How the Top-20 Words Are Selected](#2-how-the-top-20-words-are-selected)
3. [Strategy 1 — English → Hindi → Hinglish](#3-strategy-1--english--hindi--hinglish)
4. [Strategy 2 — English → Hinglish → Hindi](#4-strategy-2--english--hinglish--hindi)
5. [Strategy 3 — Hindi → English → Hinglish ⭐ Best](#5-strategy-3--hindi--english--hinglish--best)
6. [Strategy 4 — Hindi → Hinglish → English](#6-strategy-4--hindi--hinglish--english)
7. [Strategy 5 — Hinglish → English → Hindi](#7-strategy-5--hinglish--english--hindi)
8. [Strategy 6 — Hinglish → Hindi → English](#8-strategy-6--hinglish--hindi--english)
9. [Cross-Strategy Comparison](#9-cross-strategy-comparison)
10. [Key Findings](#10-key-findings)
---
## 1. What SHAP Is and How It Works Here
**SHAP (SHapley Additive exPlanations)** answers the question: *"For this specific prediction, how much did each word contribute to the output?"*
It does this using game theory — it treats each word as a "player" and fairly distributes the model's prediction score (measured relative to a baseline) among all the words in the input.
### The Technical Challenge
Our model takes **integer token IDs** as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each input's influence. But you cannot take a gradient through an integer operation.
**Solution — split the model into two steps:**
```
Step 1 (outside SHAP):
Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300)
Step 2 (SHAP runs here):
Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)
```
SHAP's GradientExplainer runs on Step 2 only. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then **sum across all 300 dimensions** for each position to get one number per word.
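The dimension-summing step can be sketched in plain Python (a toy illustration only — the shape `(n_samples, seq_len, emb_dim)` mirrors the real `(n, 100, 300)` attribution tensor, but the values here are fabricated):

```python
# Toy SHAP attribution tensor of shape (n_samples, seq_len, emb_dim).
# GradientExplainer yields one value per embedding dimension per position;
# we collapse the embedding axis to get one attribution per token.
n_samples, seq_len, emb_dim = 2, 4, 3

# Fabricated attributions purely for illustration.
shap_values = [
    [[0.01 * (p + 1)] * emb_dim for p in range(seq_len)]
    for _ in range(n_samples)
]

def per_token_attribution(sample):
    """Sum the embedding-dimension attributions at each token position."""
    return [round(sum(dims), 6) for dims in sample]

token_scores = [per_token_attribution(s) for s in shap_values]
print(token_scores[0])  # → [0.03, 0.06, 0.09, 0.12], one value per token
```

In the real pipeline the same collapse is a sum over the last axis of the explainer's output array; only the shapes differ.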
### What the Numbers Mean
- **Positive SHAP value** → word pushes the prediction **toward hate speech** (output closer to 1.0)
- **Negative SHAP value** → word pushes the prediction **toward non-hate** (output closer to 0.0)
- **Magnitude** (how large the number is) → how strongly that word influences the decision
### Background Dataset
The explainer needs a baseline: the expected model output over a reference distribution of inputs. We use 200 randomly sampled training sequences as this background set. SHAP measures each word's contribution *relative to this expected output*.
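As a rough intuition, the baseline behaves like the model's average output over the background set (a toy sketch — `model` here is a hypothetical stand-in scorer, not the real BiLSTM, and `GradientExplainer` actually integrates gradients over the background rather than merely averaging outputs):

```python
# Hypothetical stand-in for the model's scoring function (illustration only).
def model(seq):
    return sum(seq) / (10 * len(seq))

# Toy "background" sequences standing in for the 200 training samples.
background = [[1, 2, 3], [2, 2, 2], [0, 1, 5]]

# The baseline is (roughly) the expected model output over the background;
# each word's SHAP value is its push relative to this number.
baseline = sum(model(s) for s in background) / len(background)
print(baseline)
```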
---
## 2. How the Top-20 Words Are Selected
This is **not random**. Here is the exact process:
**Step 1 — Run SHAP on the test set**
For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence produces 100 SHAP values.
**Step 2 — Aggregate by word across all examples**
For each unique word that appears in those 500 examples, we compute the **mean SHAP value** across every occurrence of that word:
```
word "blame" appears 47 times across 500 examples
SHAP values: [+0.031, +0.028, +0.019, ..., +0.034]
Mean SHAP = sum of all values / 47 = +0.026
```
**Step 3 — Rank by absolute mean SHAP**
Words are sorted by `|mean SHAP|` — this captures both strong hate predictors (large positive) and strong non-hate predictors (large negative).
**Step 4 — Show top 20**
The 20 words with the highest `|mean SHAP|` are plotted. The colour tells you the direction:
- 🔴 Red bars extending right = pushes toward **hate**
- 🔵 Blue bars extending left = pushes toward **non-hate**
**Important:** padding tokens (ID=0) are excluded. Only real words contribute.
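Steps 1–4 can be sketched in pure Python (toy numbers; names such as `occurrences` are illustrative, not from the project code):

```python
from collections import defaultdict

# Toy (word, shap_value) occurrences pooled across all sampled examples;
# padding tokens (ID 0 → "<pad>") are filtered out before aggregation.
occurrences = [
    ("blame", 0.031), ("blame", 0.028), ("blame", 0.019),
    ("facebook", -0.020), ("facebook", -0.024),
    ("<pad>", 0.0), ("ahh", -0.015),
]

# Step 2: mean SHAP per unique word (padding excluded).
sums, counts = defaultdict(float), defaultdict(int)
for word, value in occurrences:
    if word == "<pad>":
        continue
    sums[word] += value
    counts[word] += 1
mean_shap = {w: sums[w] / counts[w] for w in sums}

# Steps 3-4: rank by |mean SHAP| (captures both directions), keep top 20.
top_k = sorted(mean_shap.items(), key=lambda kv: abs(kv[1]), reverse=True)[:20]
print(top_k)  # "blame" ranks first: |+0.026| beats |-0.022| and |-0.015|
```

Note that ranking by `abs()` is what lets a strongly *non-hate* word (large negative mean) outrank a weakly hate-leaning one.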
---
## 3. Strategy 1 — English → Hindi → Hinglish
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
| Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
| Hinglish | nawaz, dhawan, bashing, shareef, scn | gau, age, rajya, chori, channels |
| Full | molvi, chalo, molana, scn, elitist | rajya, coding, meat, haan, maine |
| Eval | Top Words Plot |
|---|---|
| English | ![](shap/english_to_hindi_to_hinglish/shap_topwords_english.png) |
| Hindi | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hindi.png) |
| Hinglish | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hinglish.png) |
| Full | ![](shap/english_to_hindi_to_hinglish/shap_topwords_full.png) |
**Note on Hindi:** The bar lengths for Hindi are much shorter than for English/Hinglish. This is expected — GloVe was trained on English text, so Devanagari Hindi words are mostly out-of-vocabulary and map to near-zero embedding vectors. The model's gradients through those near-zero vectors are tiny, producing very small SHAP values: for Hindi, the model predicts mostly from position and sequence patterns rather than word meaning.
---
## 4. Strategy 2 — English → Hinglish → Hindi
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
| Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
| Hinglish | khi, dada, kiske, srk, chalo | gau, tk, online, liberty, taraf |
| Full | dada, roj, shamelessness, tujhe, epic | akash, sapna, proud, buddy, episcopal |
| Eval | Top Words Plot |
|---|---|
| English | ![](shap/english_to_hinglish_to_hindi/shap_topwords_english.png) |
| Hindi | ![](shap/english_to_hinglish_to_hindi/shap_topwords_hindi.png) |
| Hinglish | ![](shap/english_to_hinglish_to_hindi/shap_topwords_hinglish.png) |
| Full | ![](shap/english_to_hinglish_to_hindi/shap_topwords_full.png) |
---
## 5. Strategy 3 — Hindi → English → Hinglish ⭐ Best
This is the best performing strategy (F1=0.6419, AUC=0.7528 on full test set).
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
| Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
| Hinglish | bacchi, bull, srk, bahana, behan | madrassa, zaida, gdp, bech, nd |
| Full | skua, brut, cleansing, captaincy, baar | taraf, pussy, directory, quran, kaha |
| Eval | Top Words Plot |
|---|---|
| English | ![](shap/hindi_to_english_to_hinglish/shap_topwords_english.png) |
| Hindi | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hindi.png) |
| Hinglish | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hinglish.png) |
| Full | ![](shap/hindi_to_english_to_hinglish/shap_topwords_full.png) |
Starting with Hindi forced the model to develop pattern-based hate detection before seeing GloVe-rich English vocabulary. The Hinglish hate markers (`bacchi`, `behan` — familial insults common in South Asian abuse; `srk` — a celebrity name frequently targeted in communal hate) are more semantically coherent than in English-first strategies.
---
## 6. Strategy 4 — Hindi → Hinglish → English
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
| Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
| Hinglish | tatti, sham, dino, roko, krk | lac, online, ancestor, zaida, target |
| Full | ahh, moi, bj, fault, pan | asperger, hundred, database, wicked, nam |
| Eval | Top Words Plot |
|---|---|
| English | ![](shap/hindi_to_hinglish_to_english/shap_topwords_english.png) |
| Hindi | ![](shap/hindi_to_hinglish_to_english/shap_topwords_hindi.png) |
| Hinglish | ![](shap/hindi_to_hinglish_to_english/shap_topwords_hinglish.png) |
| Full | ![](shap/hindi_to_hinglish_to_english/shap_topwords_full.png) |
---
## 7. Strategy 5 — Hinglish → English → Hindi
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
| Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
| Hinglish | huye, dada, abb, arey, abduction | rajya, bahu, parliament, code, music |
| Full | skua, praised, spic, sabse, plz | liberty, languages, speaks, bache, maine |
| Eval | Top Words Plot |
|---|---|
| English | ![](shap/hinglish_to_english_to_hindi/shap_topwords_english.png) |
| Hindi | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hindi.png) |
| Hinglish | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hinglish.png) |
| Full | ![](shap/hinglish_to_english_to_hindi/shap_topwords_full.png) |
**Interesting:** Hindi non-hate words in this strategy include `चूतिए` and `मुल्ले` — both slurs — yet the model assigns them negative SHAP. Even though Hindi was the final training phase here, the model cannot consistently assign direction to Hindi vocabulary, offensive or not.
---
## 8. Strategy 6 — Hinglish → Hindi → English
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
| Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
| Hinglish | fav, janab, chori, cum, ruk | online, gau, dehli, 2017, rajya |
| Full | srk, roj, rhi, purana, aapke | nest, maine, hone, haired, barrel |
| Eval | Top Words Plot |
|---|---|
| English | ![](shap/hinglish_to_hindi_to_english/shap_topwords_english.png) |
| Hindi | ![](shap/hinglish_to_hindi_to_english/shap_topwords_hindi.png) |
| Hinglish | ![](shap/hinglish_to_hindi_to_english/shap_topwords_hinglish.png) |
| Full | ![](shap/hinglish_to_hindi_to_english/shap_topwords_full.png) |
---
## 9. Cross-Strategy Comparison
The heatmap below shows the top-5 words from each strategy, evaluated across all 6 models simultaneously. Each cell is the mean SHAP value for that word in that model. Red = pushes toward hate, blue = pushes toward non-hate, white = word not seen or neutral.
**English test set:**
![](shap/cross_model_comparison_english.png)
**Hindi test set:**
![](shap/cross_model_comparison_hindi.png)
**Hinglish test set:**
![](shap/cross_model_comparison_hinglish.png)
**Full test set:**
![](shap/cross_model_comparison_full.png)
---
## 10. Key Findings
### SHAP Magnitude Reflects Language Confidence
| Language | Typical top \|SHAP\| | Why |
|---|---|---|
| English | 0.03 – 0.08 | GloVe's 6B-token English corpus gives rich, meaningful vectors |
| Hinglish | 0.03 – 0.07 | Roman-script words are partially covered by GloVe; the model learns strong patterns |
| Hindi | 0.002 – 0.005 | Devanagari words are mostly OOV in GloVe — near-zero vectors, tiny gradients |
This directly explains the Hindi accuracy gap across all 6 strategies. The model is essentially guessing from context and position for Hindi, not from word meaning.
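The OOV mechanism can be illustrated with a toy lookup (hypothetical 3-d vectors and a linear scorer standing in for the 300-d GloVe table and the BiLSTM):

```python
# Toy embedding table: one known English word; everything else is OOV.
glove = {"hate": [0.8, -0.3, 0.5]}
dim = 3

def embed(word):
    # OOV words fall back to the zero vector, mirroring the report's setup.
    return glove.get(word, [0.0] * dim)

# A zero input vector contributes nothing through a linear map, so its
# attribution (gradient × input deviation) collapses toward zero too.
weights = [0.4, 0.2, -0.1]
def linear_score(vec):
    return sum(w * x for w, x in zip(weights, vec))

print(linear_score(embed("hate")))   # nonzero word-level signal
print(linear_score(embed("मूर्ख")))  # zero — no word-level signal at all
```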
### Consistent Non-Hate Signals Across All 6 Models
- **"online"** — appears as a top non-hate predictor in 4/6 strategies. Informational/conversational context: people discussing online behaviour, platforms, reporting rather than targeting.
- **"rajya"** (state/parliament) — consistent non-hate in Hinglish eval. Political discussion about government is distinguishable from targeted hate.
- **"maine"** (I/me in Hindi-Urdu) — first-person perspective associated with personal narrative rather than targeted attacks.
### Hate Speech Markers Are Linguistically Coherent
**English:** Direct accusation verbs (`blame`, `blaming`, `criticized`) and violence vocabulary (`massacres`, `cleansing`, `violence`) are the most consistent. These reflect how hate speech in this dataset frames targets through accusation and dehumanisation.
**Hinglish:** Familial insults (`behan` = sister, `bacchi` = girl/child — used in gendered abuse), aggressive interjections (`arey`, `abb`, `ruk`), and celebrity/political names in abusive context (`srk`, `krk`, `dada`) — reflects South Asian social media abuse patterns.
**Hindi:** Body/violence metaphors (`गला` = throat, as in strangle; `मूर्ख` = fool) and politically charged proper nouns (`भूमिपूजन` = Ram Mandir ground-breaking ceremony — a highly polarising event in the dataset period).
### Visible Spurious Correlations
Several high-SHAP words are clearly not meaningful hate indicators:
- **`skua`** (a seabird) — appears as top hate word in 2 strategies. Rare word co-occurring with hateful text in training; model memorised the co-occurrence.
- **`ahh`** — exclamation appearing in multiple models as a hate signal. Aggressive tone marker rather than a meaningful word.
- **`coon`** appearing as non-hate in Strategy 2 — this is a racial slur, but in this model it was learned as a feature in a specific context (likely news/reporting usage in the dataset).
These spurious patterns are an inherent limitation of non-contextual GloVe embeddings: the model sees a word as a fixed vector regardless of the sentence it appears in.