| # SHAP Explainability Report — Multilingual Hate Speech Detection (v2) |
|
|
| **Model:** Hinglish → Hindi → English → Full (GloVe + BiLSTM, 50 epochs/phase, 200 total) |
| **Explainer:** `shap.GradientExplainer` on the BiLSTM sub-model |
| **Background:** 200 randomly sampled training sequences |
| **Test samples:** 500 per language (random subset), full test set = 8,852 |
|
|
| --- |
|
|
| ## Table of Contents |
| 1. [What SHAP Is and How It Works Here](#1-what-shap-is-and-how-it-works-here) |
| 2. [How the Top-20 Words Are Selected](#2-how-the-top-20-words-are-selected) |
| 3. [Results by Evaluation Language](#3-results-by-evaluation-language) |
| 4. [Key Findings](#4-key-findings) |
|
|
| --- |
|
|
| ## 1. What SHAP Is and How It Works Here |
|
|
| **SHAP (SHapley Additive exPlanations)** answers the question: *"For this specific prediction, how much did each word contribute to the output?"* |
|
|
| It does this using game theory — it treats each word as a "player" and fairly distributes the model's final prediction score among all the words in the input. |
|
|
| ### The Technical Challenge |
|
|
| Our model takes **integer token IDs** as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each word's influence. But you cannot take a gradient through an integer operation. |
|
|
| **Solution — split the model into two steps:** |
|
|
| ``` |
| Step 1 (outside SHAP): |
| Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300) |
| |
| Step 2 (SHAP runs here): |
| Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1) |
| ``` |
|
|
| SHAP's `GradientExplainer` runs on Step 2. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then **sum across all 300 dimensions** for each position to get one number per word — its overall contribution. |
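The dimension-summing step reduces to a single axis reduction. The array below stands in for `GradientExplainer` output (its values are random placeholders, not real attributions), but it has the shape described above: one SHAP value per embedding dimension at each token position.

```python
import numpy as np

# Placeholder for GradientExplainer output on 2 sequences: one SHAP value
# per embedding dimension at each token position, shape (2, 100, 300).
rng = np.random.default_rng(0)
shap_per_dim = rng.normal(scale=0.001, size=(2, 100, 300))

# Collapse the 300 embedding dimensions into one value per word position.
shap_per_token = shap_per_dim.sum(axis=-1)   # shape (2, 100)
```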
|
|
| ### What the Numbers Mean |
|
|
| - **Positive SHAP value** → word pushes prediction **toward hate speech** (output closer to 1.0) |
| - **Negative SHAP value** → word pushes prediction **toward non-hate** (output closer to 0.0) |
| - **Magnitude** → how strongly that word influences the decision |
|
|
| ### Background Dataset |
|
|
| The explainer needs a "baseline" — what does the model output with no meaningful input? We use 200 randomly sampled training sequences as this baseline. SHAP measures each word's contribution *relative to this average*. |
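The "relative to this average" relationship is SHAP's additivity property: the baseline (mean model output over the background set) plus the per-word contributions recovers the model's output for the example. The numbers below are made up purely to illustrate the arithmetic.

```python
import numpy as np

# Made-up numbers illustrating SHAP's additivity: the average output over
# the background sequences plus the per-word contributions recovers this
# example's prediction.
base_value = 0.31                               # mean output on background
word_shap = np.array([+0.42, -0.05, +0.12, -0.02])

prediction = base_value + word_shap.sum()       # 0.78 -> leans toward hate
```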
|
|
| --- |
|
|
| ## 2. How the Top-20 Words Are Selected |
|
|
| This is **not random**. Here is the exact process: |
|
|
| **Step 1 — Run SHAP on the test set** |
| For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence → 100 SHAP values. |
|
|
| **Step 2 — Aggregate by word across all examples** |
| For each unique word appearing in those 500 examples, compute the **mean SHAP value** across all its occurrences: |
|
|
| ``` |
| word "blamed" appears 31 times across 500 examples |
| SHAP values: [+0.061, +0.058, +0.072, ..., +0.055] |
| Mean SHAP = sum / 31 = +0.064 |
| ``` |
|
|
| **Step 3 — Rank by absolute mean SHAP** |
| Words sorted by `|mean SHAP|` — captures both strong hate predictors (large positive) and strong non-hate predictors (large negative). |
|
|
| **Step 4 — Show top 20** |
| The 20 words with the highest `|mean SHAP|` are plotted: |
| - 🔴 Red bars extending right = pushes toward **hate** |
| - 🔵 Blue bars extending left = pushes toward **non-hate** |
|
|
| Padding tokens (ID=0) are always excluded. Only real words contribute. |
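Steps 2–4, including the padding exclusion, can be sketched as a short aggregation function. The helper name, token IDs, and SHAP values below are illustrative, not the project's actual code.

```python
from collections import defaultdict

import numpy as np

def top_words_by_mean_shap(token_ids, shap_vals, k=20, pad_id=0):
    """Rank words by |mean SHAP| across all of their occurrences."""
    per_word = defaultdict(list)
    for seq, vals in zip(token_ids, shap_vals):
        for tok, v in zip(seq, vals):
            if tok != pad_id:                 # padding never contributes
                per_word[int(tok)].append(float(v))
    means = {tok: sum(v) / len(v) for tok, v in per_word.items()}
    return sorted(means.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

# Tiny worked example: word 1 averages +0.07, word 3 averages +0.02,
# word 2 averages -0.01, and the pad token (ID 0) is ignored.
tokens = np.array([[1, 2, 0], [1, 3, 0]])
values = np.array([[0.06, -0.01, 0.0], [0.08, 0.02, 0.0]])
top = top_words_by_mean_shap(tokens, values, k=2)
```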
|
|
| --- |
|
|
| ## 3. Results by Evaluation Language |
|
|
| ### 3.1 English Test Set |
|
|
| | Top Hate Words | Top Non-Hate Words | |
| |---|---| |
| | nas, fags, sicko, sabotage, advocating | grow, barrel, homosexual, pak, join | |
|
|
|  |
|
|
The English hate markers skew toward **explicit aggression** (`sicko`, `fags`) and **intentional framing** (`sabotage`, `advocating`) — the 50-epoch English phase produced sharper English feature learning than the shorter-trained v1 models. The appearance of `homosexual` as a non-hate word reflects that this term appears more in informational/news text in the dataset than in targeted slurs.
|
|
| --- |
|
|
| ### 3.2 Hindi Test Set |
|
|
| | Top Hate Words | Top Non-Hate Words | |
| |---|---| |
| वादा (promise), वैज्ञानिकों (scientists), ऐ (hey), उतारा (took down), गला (throat) | जीतेगा (will win), घोंटने (strangling), जिहादी (jihadist), आपत्तिजनक (objectionable), चमचो (sycophants) |
|
|
|  |
|
|
| **Note on bar size:** Hindi SHAP values are approximately 10× smaller than English (range ~0.003–0.007 vs ~0.02–0.06). This is not a model failure — it reflects that GloVe was trained on English text and has near-zero vectors for Devanagari words. The model makes Hindi predictions from positional/sequential patterns rather than word semantics. The words `गला` (throat — as in "गला घोंटना", strangle) and `ऐ` (hey — aggressive address) showing as hate markers suggest the model did capture some Hindi-specific abuse patterns through the 50-epoch Hinglish-first training. |
|
|
| Interestingly, `जिहादी` (jihadist) appears as a **non-hate** predictor — this likely reflects news/reporting usage in the dataset rather than targeted hate, and shows how context-free embeddings can produce counterintuitive results. |
|
|
| --- |
|
|
| ### 3.3 Hinglish Test Set |
|
|
| | Top Hate Words | Top Non-Hate Words | |
| |---|---| |
| | arey, bahir, punish, papa, interior | online, member, mam, messages, asha | |
|
|
|  |
|
|
| The Hinglish markers are notably **semantically coherent** compared to v1 (8-epoch) models. `arey` is a Hindi-derived exclamation almost always used in confrontational tone in social media; `bahir` (outside/get out) reflects exclusionary language; `punish` as a verb signals threat. The non-hate side shows platform/conversation context (`online`, `messages`, `mam`). Starting on Hinglish for 50 epochs produced the strongest Hinglish-specific feature learning of any model trained in this project. |
|
|
| --- |
|
|
| ### 3.4 Full Test Set |
|
|
| | Top Hate Words | Top Non-Hate Words | |
| |---|---| |
| | blamed, criticized, syntax, grown, sine | underneath, smack, online, hole, clue | |
|
|
|  |
|
|
| The full-dataset view shows **accusatory framing** (`blamed`, `criticized`) as the dominant hate signal — more so than in any v1 model. The 50-epoch Full phase appears to have learned that hate speech in this multilingual corpus characteristically targets victims through blame and accusation, regardless of language. The spurious markers `syntax` and `sine` (very rare technical words) are outliers that co-occurred with hate content in specific training examples. |
|
|
| --- |
|
|
| ## 4. Key Findings |
|
|
| ### SHAP Magnitude Reflects Language Confidence |
|
|
| Language | Typical Top \|SHAP\| | Why |
|---|---|---|
| English | 0.02 – 0.06 | GloVe covers English comprehensively — strong, directional gradients |
| Hinglish | 0.04 – 0.46 | Roman-script words partially covered; model learned strong code-mixed patterns from Phase 1 |
| Hindi | 0.003 – 0.007 | Devanagari script is OOV in GloVe — near-zero vectors produce tiny gradients |
|
|
Hindi confidence is 10× lower than English. This is a fundamental limitation of using GloVe (English-trained) for multilingual text. A multilingual model (e.g. mBERT, MuRIL) would likely show much more balanced confidence across languages.
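Why near-zero vectors produce tiny attributions can be seen with a toy linear scorer (the weights and vectors below are hypothetical): for f(x) = w · x, an input-times-gradient attribution is just w * x elementwise, so shrinking the embedding toward zero shrinks its attribution proportionally.

```python
import numpy as np

# Toy linear scorer f(x) = w . x. For such a model, a gradient-based
# attribution (input * gradient) is just w * x elementwise, so an
# embedding scaled toward zero drags its attribution down with it.
w = np.array([0.5, -0.3, 0.8])              # hypothetical learned weights

in_vocab_vec = np.array([0.9, -0.7, 0.6])   # typical GloVe-covered word
oov_vec = in_vocab_vec * 0.01               # near-zero Devanagari-style OOV

attr_in = float((w * in_vocab_vec).sum())   # strong attribution
attr_oov = float((w * oov_vec).sum())       # 100x smaller attribution
```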
|
|
| ### Effect of 50-Epoch Training vs 8-Epoch |
|
|
| Compared to the equivalent v1 strategy (Hinglish → Hindi → English, 8 epochs), the v2 model shows: |
| - **More directional Hinglish markers** — `arey`, `bahir`, `punish` are contextually coherent; v1 had noisier Hinglish top words |
| - **Accusatory framing as primary hate signal** — `blamed`, `criticized` in full eval rather than random rare words; deeper training on the Full phase produced this generalisation |
| - **Similar English markers** — both models converge on English hate vocabulary after the English phase; the difference is mainly in non-English languages |
|
|
| ### Consistent Non-Hate Signal |
|
|
| **"online"** is the most stable non-hate predictor across all four evaluation sets. It appears in informational, conversational, and reporting contexts in all three languages — the model correctly identifies this as a non-toxic context word. |
|
|
| ### Spurious Correlations |
|
|
| `syntax` and `sine` appearing as hate markers in the full evaluation are clear spurious correlations — rare technical words that happened to co-occur with hateful content in a small number of training examples. The BiLSTM cannot distinguish this from meaningful signal because GloVe has no semantic grounding for these words in a hate/non-hate context. |
|
|
| This is the primary motivation for contextual models: BERT-based approaches would represent the same word differently depending on surrounding context, eliminating most spurious co-occurrence patterns. |
|
|