# SHAP Explainability Report — Multilingual Hate Speech Detection (v1)
**Models:** All 6 sequential transfer learning strategies (GloVe + BiLSTM, 8 epochs/phase)
**Explainer:** `shap.GradientExplainer` on the BiLSTM sub-model
**Background:** 200 randomly sampled training sequences
**Test samples:** 500 per language (random subset), full test set = 8,852
---
## Table of Contents
1. [What SHAP Is and How It Works Here](#1-what-shap-is-and-how-it-works-here)
2. [How the Top-20 Words Are Selected](#2-how-the-top-20-words-are-selected)
3. [Strategy 1 — English → Hindi → Hinglish](#3-strategy-1--english--hindi--hinglish)
4. [Strategy 2 — English → Hinglish → Hindi](#4-strategy-2--english--hinglish--hindi)
5. [Strategy 3 — Hindi → English → Hinglish ⭐ Best](#5-strategy-3--hindi--english--hinglish--best)
6. [Strategy 4 — Hindi → Hinglish → English](#6-strategy-4--hindi--hinglish--english)
7. [Strategy 5 — Hinglish → English → Hindi](#7-strategy-5--hinglish--english--hindi)
8. [Strategy 6 — Hinglish → Hindi → English](#8-strategy-6--hinglish--hindi--english)
9. [Cross-Strategy Comparison](#9-cross-strategy-comparison)
10. [Key Findings](#10-key-findings)
---
## 1. What SHAP Is and How It Works Here
**SHAP (SHapley Additive exPlanations)** answers the question: *"For this specific prediction, how much did each word contribute to the output?"*
It does this using game theory — it treats each word as a "player" and fairly distributes the model's prediction score (measured relative to a baseline) among all the words in the input.
### The Technical Challenge
Our model takes **integer token IDs** as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each input's influence. But you cannot take a gradient through an integer operation.
**Solution — split the model into two steps:**
```
Step 1 (outside SHAP):
Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300)
Step 2 (SHAP runs here):
Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)
```
SHAP's GradientExplainer runs on Step 2 only. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then **sum across all 300 dimensions** for each position to get one number per word.
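The dimension-summing step can be sketched in plain Python (a toy illustration only — the shape `(n_samples, seq_len, emb_dim)` mirrors the real `(n, 100, 300)` attribution tensor, but the values here are fabricated):

```python
# Toy SHAP attribution tensor of shape (n_samples, seq_len, emb_dim).
# GradientExplainer yields one value per embedding dimension per position;
# we collapse the embedding axis to get one attribution per token.
n_samples, seq_len, emb_dim = 2, 4, 3

# Fabricated attributions purely for illustration.
shap_values = [
    [[0.01 * (p + 1)] * emb_dim for p in range(seq_len)]
    for _ in range(n_samples)
]

def per_token_attribution(sample):
    """Sum the embedding-dimension attributions at each token position."""
    return [round(sum(dims), 6) for dims in sample]

token_scores = [per_token_attribution(s) for s in shap_values]
print(token_scores[0])  # → [0.03, 0.06, 0.09, 0.12], one value per token
```

In the real pipeline the same collapse is a sum over the last axis of the explainer's output array; only the shapes differ.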
### What the Numbers Mean
- **Positive SHAP value** → word pushes the prediction **toward hate speech** (output closer to 1.0)
- **Negative SHAP value** → word pushes the prediction **toward non-hate** (output closer to 0.0)
- **Magnitude** (how large the number is) → how strongly that word influences the decision
### Background Dataset
The explainer needs a baseline: the expected model output over a reference distribution of inputs. We use 200 randomly sampled training sequences as this background set. SHAP measures each word's contribution *relative to this expected output*.
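As a rough intuition, the baseline behaves like the model's average output over the background set (a toy sketch — `model` here is a hypothetical stand-in scorer, not the real BiLSTM, and `GradientExplainer` actually integrates gradients over the background rather than merely averaging outputs):

```python
# Hypothetical stand-in for the model's scoring function (illustration only).
def model(seq):
    return sum(seq) / (10 * len(seq))

# Toy "background" sequences standing in for the 200 training samples.
background = [[1, 2, 3], [2, 2, 2], [0, 1, 5]]

# The baseline is (roughly) the expected model output over the background;
# each word's SHAP value is its push relative to this number.
baseline = sum(model(s) for s in background) / len(background)
print(baseline)
```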
---
## 2. How the Top-20 Words Are Selected
This is **not random**. Here is the exact process:
**Step 1 — Run SHAP on the test set**
For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence produces 100 SHAP values.
**Step 2 — Aggregate by word across all examples**
For each unique word that appears in those 500 examples, we compute the **mean SHAP value** across every occurrence of that word:
```
word "blame" appears 47 times across 500 examples
SHAP values: [+0.031, +0.028, +0.019, ..., +0.034]
Mean SHAP = sum of all values / 47 = +0.026
```
**Step 3 — Rank by absolute mean SHAP**
Words are sorted by `|mean SHAP|` — this captures both strong hate predictors (large positive) and strong non-hate predictors (large negative).
**Step 4 — Show top 20**
The 20 words with the highest `|mean SHAP|` are plotted. The colour tells you the direction:
- 🔴 Red bars extending right = pushes toward **hate**
- 🔵 Blue bars extending left = pushes toward **non-hate**
**Important:** padding tokens (ID=0) are excluded. Only real words contribute.
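Steps 1–4 can be sketched in pure Python (toy numbers; names such as `occurrences` are illustrative, not from the project code):

```python
from collections import defaultdict

# Toy (word, shap_value) occurrences pooled across all sampled examples;
# padding tokens (ID 0 → "<pad>") are filtered out before aggregation.
occurrences = [
    ("blame", 0.031), ("blame", 0.028), ("blame", 0.019),
    ("facebook", -0.020), ("facebook", -0.024),
    ("<pad>", 0.0), ("ahh", -0.015),
]

# Step 2: mean SHAP per unique word (padding excluded).
sums, counts = defaultdict(float), defaultdict(int)
for word, value in occurrences:
    if word == "<pad>":
        continue
    sums[word] += value
    counts[word] += 1
mean_shap = {w: sums[w] / counts[w] for w in sums}

# Steps 3-4: rank by |mean SHAP| (captures both directions), keep top 20.
top_k = sorted(mean_shap.items(), key=lambda kv: abs(kv[1]), reverse=True)[:20]
print(top_k)  # "blame" ranks first: |+0.026| beats |-0.022| and |-0.015|
```

Note that ranking by `abs()` is what lets a strongly *non-hate* word (large negative mean) outrank a weakly hate-leaning one.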
---
## 3. Strategy 1 — English → Hindi → Hinglish
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
| Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
| Hinglish | nawaz, dhawan, bashing, shareef, scn | gau, age, rajya, chori, channels |
| Full | molvi, chalo, molana, scn, elitist | rajya, coding, meat, haan, maine |
| Eval | Top Words Plot |
|---|---|
| English | ![](shap/english_to_hindi_to_hinglish/shap_topwords_english.png) |
| Hindi | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hindi.png) |
| Hinglish | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hinglish.png) |
| Full | ![](shap/english_to_hindi_to_hinglish/shap_topwords_full.png) |
**Note on Hindi:** The bar lengths for Hindi are much shorter than for English/Hinglish. This is expected — GloVe was trained on English text, so Devanagari Hindi words are mostly out-of-vocabulary and map to near-zero embedding vectors. The model's gradients through those near-zero vectors are tiny, producing very small SHAP values: for Hindi, the model predicts mostly from position and sequence patterns rather than word meaning.
---
## 4. Strategy 2 — English → Hinglish → Hindi
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
| Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
| Hinglish | khi, dada, kiske, srk, chalo | gau, tk, online, liberty, taraf |
| Full | dada, roj, shamelessness, tujhe, epic | akash, sapna, proud, buddy, episcopal |
| Eval | Top Words Plot |
|---|---|
| English | ![](shap/english_to_hinglish_to_hindi/shap_topwords_english.png) |
| Hindi | ![](shap/english_to_hinglish_to_hindi/shap_topwords_hindi.png) |
| Hinglish | ![](shap/english_to_hinglish_to_hindi/shap_topwords_hinglish.png) |
| Full | ![](shap/english_to_hinglish_to_hindi/shap_topwords_full.png) |
---
## 5. Strategy 3 — Hindi → English → Hinglish ⭐ Best
This is the best performing strategy (F1=0.6419, AUC=0.7528 on full test set).
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
| Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
| Hinglish | bacchi, bull, srk, bahana, behan | madrassa, zaida, gdp, bech, nd |
| Full | skua, brut, cleansing, captaincy, baar | taraf, pussy, directory, quran, kaha |
| Eval | Top Words Plot |
|---|---|
| English | ![](shap/hindi_to_english_to_hinglish/shap_topwords_english.png) |
| Hindi | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hindi.png) |
| Hinglish | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hinglish.png) |
| Full | ![](shap/hindi_to_english_to_hinglish/shap_topwords_full.png) |
Starting with Hindi forced the model to develop pattern-based hate detection before seeing GloVe-rich English vocabulary. The Hinglish hate markers (`bacchi`, `behan` — familial insults common in South Asian abuse; `srk` — a celebrity name frequently targeted in communal hate) are more semantically coherent than in English-first strategies.
---
## 6. Strategy 4 — Hindi → Hinglish → English
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
| Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
| Hinglish | tatti, sham, dino, roko, krk | lac, online, ancestor, zaida, target |
| Full | ahh, moi, bj, fault, pan | asperger, hundred, database, wicked, nam |
| Eval | Top Words Plot |
|---|---|
| English | ![](shap/hindi_to_hinglish_to_english/shap_topwords_english.png) |
| Hindi | ![](shap/hindi_to_hinglish_to_english/shap_topwords_hindi.png) |
| Hinglish | ![](shap/hindi_to_hinglish_to_english/shap_topwords_hinglish.png) |
| Full | ![](shap/hindi_to_hinglish_to_english/shap_topwords_full.png) |
---
## 7. Strategy 5 — Hinglish → English → Hindi
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
| Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
| Hinglish | huye, dada, abb, arey, abduction | rajya, bahu, parliament, code, music |
| Full | skua, praised, spic, sabse, plz | liberty, languages, speaks, bache, maine |
| Eval | Top Words Plot |
|---|---|
| English | ![](shap/hinglish_to_english_to_hindi/shap_topwords_english.png) |
| Hindi | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hindi.png) |
| Hinglish | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hinglish.png) |
| Full | ![](shap/hinglish_to_english_to_hindi/shap_topwords_full.png) |
**Interesting:** Hindi non-hate words in this strategy include `चूतिए` and `मुल्ले` — both slurs — yet the model assigns them negative SHAP. Even though Hindi was the final training phase here, the model cannot consistently assign direction to Hindi vocabulary, offensive or not.
---
## 8. Strategy 6 — Hinglish → Hindi → English
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
| Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
| Hinglish | fav, janab, chori, cum, ruk | online, gau, dehli, 2017, rajya |
| Full | srk, roj, rhi, purana, aapke | nest, maine, hone, haired, barrel |
| Eval | Top Words Plot |
|---|---|
| English | ![](shap/hinglish_to_hindi_to_english/shap_topwords_english.png) |
| Hindi | ![](shap/hinglish_to_hindi_to_english/shap_topwords_hindi.png) |
| Hinglish | ![](shap/hinglish_to_hindi_to_english/shap_topwords_hinglish.png) |
| Full | ![](shap/hinglish_to_hindi_to_english/shap_topwords_full.png) |
---
## 9. Cross-Strategy Comparison
The heatmap below shows the top-5 words from each strategy, evaluated across all 6 models simultaneously. Each cell is the mean SHAP value for that word in that model. Red = pushes toward hate, blue = pushes toward non-hate, white = word not seen or neutral.
**English test set:**
![](shap/cross_model_comparison_english.png)
**Hindi test set:**
![](shap/cross_model_comparison_hindi.png)
**Hinglish test set:**
![](shap/cross_model_comparison_hinglish.png)
**Full test set:**
![](shap/cross_model_comparison_full.png)
---
## 10. Key Findings
### SHAP Magnitude Reflects Language Confidence
| Language | Typical top \|SHAP\| | Why |
|---|---|---|
| English | 0.03 – 0.08 | GloVe's 6B-token English corpus gives rich, meaningful vectors |
| Hinglish | 0.03 – 0.07 | Roman-script words are partially covered by GloVe; the model learns strong patterns |
| Hindi | 0.002 – 0.005 | Devanagari words are mostly OOV in GloVe — near-zero vectors, tiny gradients |
This directly explains the Hindi accuracy gap across all 6 strategies. The model is essentially guessing from context and position for Hindi, not from word meaning.
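The OOV mechanism can be illustrated with a toy lookup (hypothetical 3-d vectors and a linear scorer standing in for the 300-d GloVe table and the BiLSTM):

```python
# Toy embedding table: one known English word; everything else is OOV.
glove = {"hate": [0.8, -0.3, 0.5]}
dim = 3

def embed(word):
    # OOV words fall back to the zero vector, mirroring the report's setup.
    return glove.get(word, [0.0] * dim)

# A zero input vector contributes nothing through a linear map, so its
# attribution (gradient × input deviation) collapses toward zero too.
weights = [0.4, 0.2, -0.1]
def linear_score(vec):
    return sum(w * x for w, x in zip(weights, vec))

print(linear_score(embed("hate")))   # nonzero word-level signal
print(linear_score(embed("मूर्ख")))  # zero — no word-level signal at all
```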
### Consistent Non-Hate Signals Across All 6 Models
- **"online"** — appears as a top non-hate predictor in 4/6 strategies. Informational/conversational context: people discussing online behaviour, platforms, reporting rather than targeting.
- **"rajya"** (state/parliament) — consistent non-hate in Hinglish eval. Political discussion about government is distinguishable from targeted hate.
- **"maine"** (I/me in Hindi-Urdu) — first-person perspective associated with personal narrative rather than targeted attacks.
### Hate Speech Markers Are Linguistically Coherent
**English:** Direct accusation verbs (`blame`, `blaming`, `criticized`) and violence vocabulary (`massacres`, `cleansing`, `violence`) are the most consistent. These reflect how hate speech in this dataset frames targets through accusation and dehumanisation.
**Hinglish:** Familial insults (`behan` = sister, `bacchi` = girl/child — used in gendered abuse), aggressive interjections (`arey`, `abb`, `ruk`), and celebrity/political names in abusive context (`srk`, `krk`, `dada`) — reflects South Asian social media abuse patterns.
**Hindi:** Body/violence metaphors (`गला` = throat, as in strangle; `मूर्ख` = fool) and politically charged proper nouns (`भूमिपूजन` = Ram Mandir ground-breaking ceremony — a highly polarising event in the dataset period).
### Visible Spurious Correlations
Several high-SHAP words are clearly not meaningful hate indicators:
- **`skua`** (a seabird) — appears as top hate word in 2 strategies. Rare word co-occurring with hateful text in training; model memorised the co-occurrence.
- **`ahh`** — exclamation appearing in multiple models as a hate signal. Aggressive tone marker rather than a meaningful word.
- **`coon`** appearing as non-hate in Strategy 2 — this is a racial slur, but in this model it was learned as a feature in a specific context (likely news/reporting usage in the dataset).
These spurious patterns are an inherent limitation of non-contextual GloVe embeddings: the model sees a word as a fixed vector regardless of the sentence it appears in.