| # SHAP Explainability Report — Multilingual Hate Speech Detection (v2) |
|
|
| **Model:** Hinglish → Hindi → English → Full (GloVe + BiLSTM, 50 epochs/phase, 200 total) |
| **Explainer:** `shap.GradientExplainer` on the BiLSTM sub-model |
| **Background:** 200 randomly sampled training sequences |
| **Test samples:** 500 per language (random subset), full test set = 8,852 |
|
|
| --- |
|
|
| ## Table of Contents |
| 1. [What SHAP Is and How It Works Here](#1-what-shap-is-and-how-it-works-here) |
| 2. [How the Top-20 Words Are Selected](#2-how-the-top-20-words-are-selected) |
| 3. [Results by Evaluation Language](#3-results-by-evaluation-language) |
| 4. [Key Findings](#4-key-findings) |
|
|
| --- |
|
|
| ## 1. What SHAP Is and How It Works Here |
|
|
| **SHAP (SHapley Additive exPlanations)** answers the question: *"For this specific prediction, how much did each word contribute to the output?"* |
|
|
| It does this using game theory — it treats each word as a "player" and fairly distributes the model's final prediction score among all the words in the input. |
|
|
| ### The Technical Challenge |
|
|
| Our model takes **integer token IDs** as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each word's influence. But you cannot take a gradient through an integer operation. |
|
|
| **Solution — split the model into two steps:** |
|
|
| ``` |
| Step 1 (outside SHAP): |
| Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300) |
| |
| Step 2 (SHAP runs here): |
| Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1) |
| ``` |
|
|
| SHAP's `GradientExplainer` runs on Step 2. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then **sum across all 300 dimensions** for each position to get one number per word — its overall contribution. |
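The dimension-summing step reduces to a single axis reduction. The array below stands in for `GradientExplainer` output (its values are random placeholders, not real attributions), but it has the shape described above: one SHAP value per embedding dimension at each token position.

```python
import numpy as np

# Placeholder for GradientExplainer output on 2 sequences: one SHAP value
# per embedding dimension at each token position, shape (2, 100, 300).
rng = np.random.default_rng(0)
shap_per_dim = rng.normal(scale=0.001, size=(2, 100, 300))

# Collapse the 300 embedding dimensions into one value per word position.
shap_per_token = shap_per_dim.sum(axis=-1)   # shape (2, 100)
```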
|
|
| ### What the Numbers Mean |
|
|
| - **Positive SHAP value** → word pushes prediction **toward hate speech** (output closer to 1.0) |
| - **Negative SHAP value** → word pushes prediction **toward non-hate** (output closer to 0.0) |
| - **Magnitude** → how strongly that word influences the decision |
|
|
| ### Background Dataset |
|
|
| The explainer needs a "baseline" — what does the model output with no meaningful input? We use 200 randomly sampled training sequences as this baseline. SHAP measures each word's contribution *relative to this average*. |
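The "relative to this average" relationship is SHAP's additivity property: the baseline (mean model output over the background set) plus the per-word contributions recovers the model's output for the example. The numbers below are made up purely to illustrate the arithmetic.

```python
import numpy as np

# Made-up numbers illustrating SHAP's additivity: the average output over
# the background sequences plus the per-word contributions recovers this
# example's prediction.
base_value = 0.31                               # mean output on background
word_shap = np.array([+0.42, -0.05, +0.12, -0.02])

prediction = base_value + word_shap.sum()       # 0.78 -> leans toward hate
```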
|
|
| --- |
|
|
| ## 2. How the Top-20 Words Are Selected |
|
|
| This is **not random**. Here is the exact process: |
|
|
| **Step 1 — Run SHAP on the test set** |
| For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence → 100 SHAP values. |
|
|
| **Step 2 — Aggregate by word across all examples** |
| For each unique word appearing in those 500 examples, compute the **mean SHAP value** across all its occurrences: |
|
|
| ``` |
| word "blamed" appears 31 times across 500 examples |
| SHAP values: [+0.061, +0.058, +0.072, ..., +0.055] |
| Mean SHAP = sum / 31 = +0.064 |
| ``` |
|
|
| **Step 3 — Rank by absolute mean SHAP** |
| Words sorted by `|mean SHAP|` — captures both strong hate predictors (large positive) and strong non-hate predictors (large negative). |
|
|
| **Step 4 — Show top 20** |
| The 20 words with the highest `|mean SHAP|` are plotted: |
| - 🔴 Red bars extending right = pushes toward **hate** |
| - 🔵 Blue bars extending left = pushes toward **non-hate** |
|
|
| Padding tokens (ID=0) are always excluded. Only real words contribute. |
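Steps 2–4, including the padding exclusion, can be sketched as a short aggregation function. The helper name, token IDs, and SHAP values below are illustrative, not the project's actual code.

```python
from collections import defaultdict

import numpy as np

def top_words_by_mean_shap(token_ids, shap_vals, k=20, pad_id=0):
    """Rank words by |mean SHAP| across all of their occurrences."""
    per_word = defaultdict(list)
    for seq, vals in zip(token_ids, shap_vals):
        for tok, v in zip(seq, vals):
            if tok != pad_id:                 # padding never contributes
                per_word[int(tok)].append(float(v))
    means = {tok: sum(v) / len(v) for tok, v in per_word.items()}
    return sorted(means.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

# Tiny worked example: word 1 averages +0.07, word 3 averages +0.02,
# word 2 averages -0.01, and the pad token (ID 0) is ignored.
tokens = np.array([[1, 2, 0], [1, 3, 0]])
values = np.array([[0.06, -0.01, 0.0], [0.08, 0.02, 0.0]])
top = top_words_by_mean_shap(tokens, values, k=2)
```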
|
|
| --- |
|
|
| ## 3. Results by Evaluation Language |
|
|
| ### 3.1 English Test Set |
|
|
| | Top Hate Words | Top Non-Hate Words | |
| |---|---| |
| | nas, fags, sicko, sabotage, advocating | grow, barrel, homosexual, pak, join | |
|
|
|  |
|
|
The English hate markers skew toward **explicit aggression** (`sicko`, `fags`) and **intentional framing** (`sabotage`, `advocating`) — the 50-epoch English phase produced sharper English feature learning than the shorter-trained v1 models. The appearance of `homosexual` as a non-hate word reflects that this term appears more in informational/news text in the dataset than in targeted slurs.
|
|
| --- |
|
|
| ### 3.2 Hindi Test Set |
|
|
| | Top Hate Words | Top Non-Hate Words | |
| |---|---| |
| वादा (promise), वैज्ञानिकों (scientists), ऐ (hey), उतारा (took down), गला (throat) | जीतेगा (will win), घोंटने (strangling), जिहादी (jihadist), आपत्तिजनक (objectionable), चमचो (sycophants) |
|
|
|  |
|
|
| **Note on bar size:** Hindi SHAP values are approximately 10× smaller than English (range ~0.003–0.007 vs ~0.02–0.06). This is not a model failure — it reflects that GloVe was trained on English text and has near-zero vectors for Devanagari words. The model makes Hindi predictions from positional/sequential patterns rather than word semantics. The words `गला` (throat — as in "गला घोंटना", strangle) and `ऐ` (hey — aggressive address) showing as hate markers suggest the model did capture some Hindi-specific abuse patterns through the 50-epoch Hinglish-first training. |
|
|
| Interestingly, `जिहादी` (jihadist) appears as a **non-hate** predictor — this likely reflects news/reporting usage in the dataset rather than targeted hate, and shows how context-free embeddings can produce counterintuitive results. |
|
|
| --- |
|
|
| ### 3.3 Hinglish Test Set |
|
|
| | Top Hate Words | Top Non-Hate Words | |
| |---|---| |
| | arey, bahir, punish, papa, interior | online, member, mam, messages, asha | |
|
|
|  |
|
|
| The Hinglish markers are notably **semantically coherent** compared to v1 (8-epoch) models. `arey` is a Hindi-derived exclamation almost always used in confrontational tone in social media; `bahir` (outside/get out) reflects exclusionary language; `punish` as a verb signals threat. The non-hate side shows platform/conversation context (`online`, `messages`, `mam`). Starting on Hinglish for 50 epochs produced the strongest Hinglish-specific feature learning of any model trained in this project. |
|
|
| --- |
|
|
| ### 3.4 Full Test Set |
|
|
| | Top Hate Words | Top Non-Hate Words | |
| |---|---| |
| | blamed, criticized, syntax, grown, sine | underneath, smack, online, hole, clue | |
|
|
|  |
|
|
| The full-dataset view shows **accusatory framing** (`blamed`, `criticized`) as the dominant hate signal — more so than in any v1 model. The 50-epoch Full phase appears to have learned that hate speech in this multilingual corpus characteristically targets victims through blame and accusation, regardless of language. The spurious markers `syntax` and `sine` (very rare technical words) are outliers that co-occurred with hate content in specific training examples. |
|
|
| --- |
|
|
| ## 4. Key Findings |
|
|
| ### SHAP Magnitude Reflects Language Confidence |
|
|
| Language | Typical Top \|SHAP\| | Why |
|---|---|---|
| English | 0.02 – 0.06 | GloVe covers English comprehensively — strong, directional gradients |
| Hinglish | 0.04 – 0.46 | Roman-script words partially covered; model learned strong code-mixed patterns from Phase 1 |
| Hindi | 0.003 – 0.007 | Devanagari script is OOV in GloVe — near-zero vectors produce tiny gradients |
|
|
Hindi confidence is 10× lower than English. This is a fundamental limitation of using GloVe (English-trained) for multilingual text. A multilingual model (e.g. mBERT, MuRIL) would likely show much more balanced confidence across languages.
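Why near-zero vectors produce tiny attributions can be seen with a toy linear scorer (the weights and vectors below are hypothetical): for f(x) = w · x, an input-times-gradient attribution is just w * x elementwise, so shrinking the embedding toward zero shrinks its attribution proportionally.

```python
import numpy as np

# Toy linear scorer f(x) = w . x. For such a model, a gradient-based
# attribution (input * gradient) is just w * x elementwise, so an
# embedding scaled toward zero drags its attribution down with it.
w = np.array([0.5, -0.3, 0.8])              # hypothetical learned weights

in_vocab_vec = np.array([0.9, -0.7, 0.6])   # typical GloVe-covered word
oov_vec = in_vocab_vec * 0.01               # near-zero Devanagari-style OOV

attr_in = float((w * in_vocab_vec).sum())   # strong attribution
attr_oov = float((w * oov_vec).sum())       # 100x smaller attribution
```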
|
|
| ### Effect of 50-Epoch Training vs 8-Epoch |
|
|
| Compared to the equivalent v1 strategy (Hinglish → Hindi → English, 8 epochs), the v2 model shows: |
| - **More directional Hinglish markers** — `arey`, `bahir`, `punish` are contextually coherent; v1 had noisier Hinglish top words |
| - **Accusatory framing as primary hate signal** — `blamed`, `criticized` in full eval rather than random rare words; deeper training on the Full phase produced this generalisation |
| - **Similar English markers** — both models converge on English hate vocabulary after the English phase; the difference is mainly in non-English languages |
|
|
| ### Consistent Non-Hate Signal |
|
|
| **"online"** is the most stable non-hate predictor across all four evaluation sets. It appears in informational, conversational, and reporting contexts in all three languages — the model correctly identifies this as a non-toxic context word. |
|
|
| ### Spurious Correlations |
|
|
| `syntax` and `sine` appearing as hate markers in the full evaluation are clear spurious correlations — rare technical words that happened to co-occur with hateful content in a small number of training examples. The BiLSTM cannot distinguish this from meaningful signal because GloVe has no semantic grounding for these words in a hate/non-hate context. |
|
|
| This is the primary motivation for contextual models: BERT-based approaches would represent the same word differently depending on surrounding context, eliminating most spurious co-occurrence patterns. |
|
|