# SHAP Explainability Report — Multilingual Hate Speech Detection (v1)

**Models:** All 6 sequential transfer learning strategies (GloVe + BiLSTM, 8 epochs/phase)
**Explainer:** `shap.GradientExplainer` on the BiLSTM sub-model
**Background:** 200 randomly sampled training sequences
**Test samples:** 500 per language (random subset); full test set = 8,852

---

## Table of Contents

1. [What SHAP Is and How It Works Here](#1-what-shap-is-and-how-it-works-here)
2. [How the Top-20 Words Are Selected](#2-how-the-top-20-words-are-selected)
3. [Strategy 1 — English → Hindi → Hinglish](#3-strategy-1--english--hindi--hinglish)
4. [Strategy 2 — English → Hinglish → Hindi](#4-strategy-2--english--hinglish--hindi)
5. [Strategy 3 — Hindi → English → Hinglish ⭐ Best](#5-strategy-3--hindi--english--hinglish--best)
6. [Strategy 4 — Hindi → Hinglish → English](#6-strategy-4--hindi--hinglish--english)
7. [Strategy 5 — Hinglish → English → Hindi](#7-strategy-5--hinglish--english--hindi)
8. [Strategy 6 — Hinglish → Hindi → English](#8-strategy-6--hinglish--hindi--english)
9. [Cross-Strategy Comparison](#9-cross-strategy-comparison)
10. [Key Findings](#10-key-findings)

---

## 1. What SHAP Is and How It Works Here

**SHAP (SHapley Additive exPlanations)** answers the question: *"For this specific prediction, how much did each word contribute to the output?"* It does this using game theory — it treats each word as a "player" and fairly distributes the model's final prediction score among all the words in the input.

### The Technical Challenge

Our model takes **integer token IDs** as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each input's influence. But you cannot take a gradient through an integer operation.
**Solution — split the model into two steps:**

```
Step 1 (outside SHAP):
  Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300)

Step 2 (SHAP runs here):
  Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)
```

SHAP's `GradientExplainer` runs on Step 2 only. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then **sum across all 300 dimensions** for each position to get one number per word.

### What the Numbers Mean

- **Positive SHAP value** → word pushes the prediction **toward hate speech** (output closer to 1.0)
- **Negative SHAP value** → word pushes the prediction **toward non-hate** (output closer to 0.0)
- **Magnitude** (how large the number is) → how strongly that word influences the decision

### Background Dataset

The explainer needs a "baseline" — what does the model output when there is no meaningful input? We use 200 randomly sampled training sequences as this baseline. SHAP measures each word's contribution *relative to this baseline*.

---

## 2. How the Top-20 Words Are Selected

This is **not random**. Here is the exact process:

**Step 1 — Run SHAP on the test set.** For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence produces 100 SHAP values.

**Step 2 — Aggregate by word across all examples.** For each unique word that appears in those 500 examples, we compute the **mean SHAP value** across every occurrence of that word:

```
word "blame" appears 47 times across 500 examples
SHAP values: [+0.031, +0.028, +0.019, ..., +0.034]
Mean SHAP = sum of all values / 47 = +0.026
```

**Step 3 — Rank by absolute mean SHAP.** Words are sorted by `|mean SHAP|` — this captures both strong hate predictors (large positive) and strong non-hate predictors (large negative).

**Step 4 — Show the top 20.** The 20 words with the highest `|mean SHAP|` are plotted.
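The aggregation in Steps 2–4 can be sketched in NumPy. This sketch assumes `shap_values` has already been produced by `shap.GradientExplainer` and summed over the 300 embedding dimensions, giving one value per token position; `top_words_by_mean_shap` and all variable names are illustrative, not the project's actual code:

```python
import numpy as np

def top_words_by_mean_shap(shap_values, token_ids, id_to_word, k=20):
    """Rank words by |mean SHAP| across all occurrences; padding (ID 0) excluded.

    shap_values: (n_examples, seq_len) per-position SHAP values
    token_ids:   (n_examples, seq_len) integer token IDs (0 = padding)
    """
    sums, counts = {}, {}
    for row_shap, row_ids in zip(shap_values, token_ids):
        for s, tid in zip(row_shap, row_ids):
            tid = int(tid)
            if tid == 0:                      # skip padding tokens
                continue
            sums[tid] = sums.get(tid, 0.0) + float(s)
            counts[tid] = counts.get(tid, 0) + 1
    # mean SHAP per unique word, then rank by absolute value
    mean_shap = {tid: sums[tid] / counts[tid] for tid in sums}
    ranked = sorted(mean_shap, key=lambda t: abs(mean_shap[t]), reverse=True)
    return [(id_to_word[tid], mean_shap[tid]) for tid in ranked[:k]]
```

Ranking by `abs()` while keeping the signed mean is what lets a single list carry both red (positive) and blue (negative) bars.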
The colour tells you the direction:

- 🔴 Red bars extending right = pushes toward **hate**
- 🔵 Blue bars extending left = pushes toward **non-hate**

**Important:** padding tokens (ID=0) are excluded. Only real words contribute.

---

## 3. Strategy 1 — English → Hindi → Hinglish

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
| Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
| Hinglish | nawaz, dhawan, bashing, shareef, scn | gau, age, rajya, chori, channels |
| Full | molvi, chalo, molana, scn, elitist | rajya, coding, meat, haan, maine |

| Eval | Top Words Plot |
|---|---|
| English | ![](shap/english_to_hindi_to_hinglish/shap_topwords_english.png) |
| Hindi | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hindi.png) |
| Hinglish | ![](shap/english_to_hindi_to_hinglish/shap_topwords_hinglish.png) |
| Full | ![](shap/english_to_hindi_to_hinglish/shap_topwords_full.png) |

**Note on Hindi:** The bar lengths for Hindi are much shorter than for English/Hinglish. This is expected — GloVe was trained on English text, so Hindi words have near-zero embedding vectors. The model's gradients through those zero vectors are tiny, producing very small SHAP values. The model makes Hindi predictions mostly from position/sequence patterns rather than word meaning.

---

## 4. Strategy 2 — English → Hinglish → Hindi

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
| Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
| Hinglish | khi, dada, kiske, srk, chalo | gau, tk, online, liberty, taraf |
| Full | dada, roj, shamelessness, tujhe, epic | akash, sapna, proud, buddy, episcopal |

| Eval | Top Words Plot |
|---|---|
| English | ![](shap/english_to_hinglish_to_hindi/shap_topwords_english.png) |
| Hindi | ![](shap/english_to_hinglish_to_hindi/shap_topwords_hindi.png) |
| Hinglish | ![](shap/english_to_hinglish_to_hindi/shap_topwords_hinglish.png) |
| Full | ![](shap/english_to_hinglish_to_hindi/shap_topwords_full.png) |

---

## 5. Strategy 3 — Hindi → English → Hinglish ⭐ Best

This is the best-performing strategy (F1 = 0.6419, AUC = 0.7528 on the full test set).

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
| Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
| Hinglish | bacchi, bull, srk, bahana, behan | madrassa, zaida, gdp, bech, nd |
| Full | skua, brut, cleansing, captaincy, baar | taraf, pussy, directory, quran, kaha |

| Eval | Top Words Plot |
|---|---|
| English | ![](shap/hindi_to_english_to_hinglish/shap_topwords_english.png) |
| Hindi | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hindi.png) |
| Hinglish | ![](shap/hindi_to_english_to_hinglish/shap_topwords_hinglish.png) |
| Full | ![](shap/hindi_to_english_to_hinglish/shap_topwords_full.png) |

Starting with Hindi forced the model to develop pattern-based hate detection before seeing GloVe-rich English vocabulary.
The Hinglish hate markers (`bacchi`, `behan` — familial insults common in South Asian abuse; `srk` — a celebrity name frequently targeted in communal hate) are more semantically coherent than in the English-first strategies.

---

## 6. Strategy 4 — Hindi → Hinglish → English

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
| Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
| Hinglish | tatti, sham, dino, roko, krk | lac, online, ancestor, zaida, target |
| Full | ahh, moi, bj, fault, pan | asperger, hundred, database, wicked, nam |

| Eval | Top Words Plot |
|---|---|
| English | ![](shap/hindi_to_hinglish_to_english/shap_topwords_english.png) |
| Hindi | ![](shap/hindi_to_hinglish_to_english/shap_topwords_hindi.png) |
| Hinglish | ![](shap/hindi_to_hinglish_to_english/shap_topwords_hinglish.png) |
| Full | ![](shap/hindi_to_hinglish_to_english/shap_topwords_full.png) |

---

## 7. Strategy 5 — Hinglish → English → Hindi

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
| Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
| Hinglish | huye, dada, abb, arey, abduction | rajya, bahu, parliament, code, music |
| Full | skua, praised, spic, sabse, plz | liberty, languages, speaks, bache, maine |

| Eval | Top Words Plot |
|---|---|
| English | ![](shap/hinglish_to_english_to_hindi/shap_topwords_english.png) |
| Hindi | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hindi.png) |
| Hinglish | ![](shap/hinglish_to_english_to_hindi/shap_topwords_hinglish.png) |
| Full | ![](shap/hinglish_to_english_to_hindi/shap_topwords_full.png) |

**Interesting:** the Hindi non-hate words in this strategy include `चूतिए` and `मुल्ले` — both slurs — yet the model assigns them negative SHAP.
This reflects the model's confusion after ending on Hindi (the last phase): it cannot consistently assign direction to Hindi vocabulary, even when the words are inherently offensive.

---

## 8. Strategy 6 — Hinglish → Hindi → English

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
| Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
| Hinglish | fav, janab, chori, cum, ruk | online, gau, dehli, 2017, rajya |
| Full | srk, roj, rhi, purana, aapke | nest, maine, hone, haired, barrel |

| Eval | Top Words Plot |
|---|---|
| English | ![](shap/hinglish_to_hindi_to_english/shap_topwords_english.png) |
| Hindi | ![](shap/hinglish_to_hindi_to_english/shap_topwords_hindi.png) |
| Hinglish | ![](shap/hinglish_to_hindi_to_english/shap_topwords_hinglish.png) |
| Full | ![](shap/hinglish_to_hindi_to_english/shap_topwords_full.png) |

---

## 9. Cross-Strategy Comparison

The heatmap below shows the top-5 words from each strategy, evaluated across all 6 models simultaneously. Each cell is the mean SHAP value for that word in that model. Red = pushes toward hate, blue = pushes toward non-hate, white = word not seen or neutral.

**English test set:**

![](shap/cross_model_comparison_english.png)

**Hindi test set:**

![](shap/cross_model_comparison_hindi.png)

**Hinglish test set:**

![](shap/cross_model_comparison_hinglish.png)

**Full test set:**

![](shap/cross_model_comparison_full.png)

---

## 10. Key Findings

### SHAP Magnitude Reflects Language Confidence

| Language | Typical Top \|SHAP\| | Why |
|---|---|---|
| English | 0.03 – 0.08 | GloVe (trained on 6B tokens of English) gives rich, meaningful vectors |
| Hinglish | 0.03 – 0.07 | Roman-script words partially covered by GloVe; model learns strong patterns |
| Hindi | 0.002 – 0.005 | Devanagari words are mostly OOV in GloVe — near-zero vectors, tiny gradients |

This directly explains the Hindi accuracy gap across all 6 strategies. For Hindi, the model is essentially guessing from context and position, not from word meaning.

### Consistent Non-Hate Signals Across All 6 Models

- **"online"** — a top non-hate predictor in 4/6 strategies. Informational/conversational context: people discussing online behaviour, platforms, and reporting rather than targeting.
- **"rajya"** (state/parliament) — consistently non-hate in the Hinglish eval. Political discussion about government is distinguishable from targeted hate.
- **"maine"** (I/me in Hindi-Urdu) — first-person perspective associated with personal narrative rather than targeted attacks.

### Hate Speech Markers Are Linguistically Coherent

**English:** Direct accusation verbs (`blame`, `blaming`, `criticized`) and violence vocabulary (`massacres`, `cleansing`, `violence`) are the most consistent. These reflect how hate speech in this dataset frames targets through accusation and dehumanisation.

**Hinglish:** Familial insults (`behan` = sister, `bacchi` = girl/child — used in gendered abuse), aggressive interjections (`arey`, `abb`, `ruk`), and celebrity/political names in abusive context (`srk`, `krk`, `dada`) reflect South Asian social-media abuse patterns.

**Hindi:** Body/violence metaphors (`गला` = throat, as in strangle; `मूर्ख` = fool) and politically charged proper nouns (`भूमिपूजन` = the Ram Mandir ground-breaking ceremony — a highly polarising event in the dataset period).
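The near-zero-vector explanation is directly measurable: if the embedding matrix is initialised to zeros and only GloVe hits are filled in, OOV rows stay at zero, so per-language coverage and vector norms quantify it. A minimal sketch, assuming that matrix layout (the function and variable names are illustrative, not the project's actual code):

```python
import numpy as np

def embedding_coverage(embedding_matrix, token_ids):
    """Return (fraction of tokens with a non-zero embedding row,
    mean L2 norm of the covered rows).

    embedding_matrix: (vocab_size, dim), OOV rows assumed all-zero
    token_ids: token IDs occurring in one language's test split
    """
    rows = embedding_matrix[np.asarray(token_ids)]
    norms = np.linalg.norm(rows, axis=1)      # L2 norm per token
    covered = norms > 0
    frac = float(covered.mean())
    mean_norm = float(norms[covered].mean()) if covered.any() else 0.0
    return frac, mean_norm
```

Run per language, a much lower coverage fraction for Devanagari tokens would confirm that the small Hindi SHAP magnitudes come from GloVe gaps rather than from the model itself.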
### Visible Spurious Correlations

Several high-SHAP words are clearly not meaningful hate indicators:

- **`skua`** (a seabird) — a top hate word in 2 strategies. A rare word that co-occurred with hateful text in training; the model memorised the co-occurrence.
- **`ahh`** — an exclamation appearing in multiple models as a hate signal. An aggressive tone marker rather than a meaningful word.
- **`coon`** appearing as non-hate in Strategy 2 — this is a racial slur, but in this model it was learned as a non-hate feature from a specific context (likely news/reporting usage in the dataset).

These spurious patterns are an inherent limitation of non-contextual GloVe embeddings: the model sees a word as a fixed vector regardless of the sentence it appears in.
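One cheap diagnostic for such spurious tokens is to check how skewed a word's occurrences are across training labels: a rare word that appears almost exclusively in hate examples is a memorisation candidate rather than a semantic signal. A hypothetical sketch (the `class_skew` function and whitespace tokenisation are illustrative assumptions):

```python
from collections import Counter

def class_skew(texts, labels, word):
    """Count examples containing `word` in hate (label 1) vs non-hate (label 0),
    plus the share of its occurrences that fall in the hate class."""
    counts = Counter()
    for text, label in zip(texts, labels):
        if word in text.split():              # naive whitespace tokenisation
            counts[label] += 1
    total = counts[0] + counts[1]
    hate_share = counts[1] / total if total else None
    return counts[1], counts[0], hate_share
```

A low total count combined with a hate share near 1.0 (or 0.0) flags words like `skua` for manual review before trusting their SHAP rank.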