SHAP Explainability Report — Multilingual Hate Speech Detection (v1)
Models: All 6 sequential transfer learning strategies (GloVe + BiLSTM, 8 epochs/phase)
Explainer: shap.GradientExplainer on the BiLSTM sub-model
Background: 200 randomly sampled training sequences
Test samples: 500 per language (random subset), full test set = 8,852
Table of Contents
- What SHAP Is and How It Works Here
- How the Top-20 Words Are Selected
- Strategy 1 — English → Hindi → Hinglish
- Strategy 2 — English → Hinglish → Hindi
- Strategy 3 — Hindi → English → Hinglish ⭐ Best
- Strategy 4 — Hindi → Hinglish → English
- Strategy 5 — Hinglish → English → Hindi
- Strategy 6 — Hinglish → Hindi → English
- Cross-Strategy Comparison
- Key Findings
1. What SHAP Is and How It Works Here
SHAP (SHapley Additive exPlanations) answers the question: "For this specific prediction, how much did each word contribute to the output?"
It does this using game theory — it treats each word as a "player" and fairly distributes the model's final prediction score among all the words in the input.
The Technical Challenge
Our model takes integer token IDs as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each input's influence. But you cannot take a gradient through an integer operation.
Solution — split the model into two steps:
Step 1 (outside SHAP):
Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300)
Step 2 (SHAP runs here):
Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)
SHAP's GradientExplainer runs on Step 2 only. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then sum across all 300 dimensions for each position to get one number per word.
What the Numbers Mean
- Positive SHAP value → word pushes the prediction toward hate speech (output closer to 1.0)
- Negative SHAP value → word pushes the prediction toward non-hate (output closer to 0.0)
- Magnitude (how large the number is) → how strongly that word influences the decision
Background Dataset
The explainer needs a "baseline" — what does the model output when there is no meaningful input? We use 200 randomly sampled training sequences as this baseline. SHAP measures each word's contribution relative to this baseline.
2. How the Top-20 Words Are Selected
This is not random. Here is the exact process:
Step 1 — Run SHAP on the test set For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence produces 100 SHAP values.
Step 2 — Aggregate by word across all examples For each unique word that appears in those 500 examples, we compute the mean SHAP value across every occurrence of that word:
word "blame" appears 47 times across 500 examples
SHAP values: [+0.031, +0.028, +0.019, ..., +0.034]
Mean SHAP = sum of all values / 47 = +0.026
Step 3 — Rank by absolute mean SHAP
Words are sorted by |mean SHAP| — this captures both strong hate predictors (large positive) and strong non-hate predictors (large negative).
Step 4 — Show top 20
The 20 words with the highest |mean SHAP| are plotted. The colour tells you the direction:
- 🔴 Red bars extending right = pushes toward hate
- 🔵 Blue bars extending left = pushes toward non-hate
Important: padding tokens (ID=0) are excluded. Only real words contribute.
3. Strategy 1 — English → Hindi → Hinglish
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
| Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
| Hinglish | nawaz, dhawan, bashing, shareef, scn | gau, age, rajya, chori, channels |
| Full | molvi, chalo, molana, scn, elitist | rajya, coding, meat, haan, maine |
Note on Hindi: The bar lengths for Hindi are much shorter than English/Hinglish. This is expected — GloVe was trained on English text so Hindi words have near-zero embedding vectors. The model's gradients through those zero vectors are tiny, producing very small SHAP values. The model is making Hindi predictions mostly from position/sequence patterns rather than word meaning.
4. Strategy 2 — English → Hinglish → Hindi
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
| Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
| Hinglish | khi, dada, kiske, srk, chalo | gau, tk, online, liberty, taraf |
| Full | dada, roj, shamelessness, tujhe, epic | akash, sapna, proud, buddy, episcopal |
5. Strategy 3 — Hindi → English → Hinglish ⭐ Best
This is the best performing strategy (F1=0.6419, AUC=0.7528 on full test set).
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
| Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
| Hinglish | bacchi, bull, srk, bahana, behan | madrassa, zaida, gdp, bech, nd |
| Full | skua, brut, cleansing, captaincy, baar | taraf, pussy, directory, quran, kaha |
Starting with Hindi forced the model to develop pattern-based hate detection before seeing GloVe-rich English vocabulary. The Hinglish hate markers (bacchi, behan — familial insults common in South Asian abuse; srk — a celebrity name frequently targeted in communal hate) are more semantically coherent than in English-first strategies.
6. Strategy 4 — Hindi → Hinglish → English
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
| Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
| Hinglish | tatti, sham, dino, roko, krk | lac, online, ancestor, zaida, target |
| Full | ahh, moi, bj, fault, pan | asperger, hundred, database, wicked, nam |
7. Strategy 5 — Hinglish → English → Hindi
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
| Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
| Hinglish | huye, dada, abb, arey, abduction | rajya, bahu, parliament, code, music |
| Full | skua, praised, spic, sabse, plz | liberty, languages, speaks, bache, maine |
Interesting: Hindi non-hate words in this strategy include चूतिए and मुल्ले — which are slurs — yet the model assigns them negative SHAP. This reflects the model's confusion after ending on Hindi (last phase): it cannot consistently assign direction to Hindi vocabulary even when the words are inherently offensive.
8. Strategy 6 — Hinglish → Hindi → English
| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
| Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
| Hinglish | fav, janab, chori, cum, ruk | online, gau, dehli, 2017, rajya |
| Full | srk, roj, rhi, purana, aapke | nest, maine, hone, haired, barrel |
9. Cross-Strategy Comparison
The heatmap below shows the top-5 words from each strategy, evaluated across all 6 models simultaneously. Each cell is the mean SHAP value for that word in that model. Red = pushes toward hate, blue = pushes toward non-hate, white = word not seen or neutral.
10. Key Findings
SHAP Magnitude Reflects Language Confidence
| Language | Typical Top |SHAP| | Why | |---|---|---| | English | 0.03 – 0.08 | GloVe has 6B token English coverage — rich, meaningful vectors | | Hinglish | 0.03 – 0.07 | Roman-script words partially covered by GloVe; model learns strong patterns | | Hindi | 0.002 – 0.005 | Devanagari words are mostly OOV in GloVe — near-zero vectors, tiny gradients |
This directly explains the Hindi accuracy gap across all 6 strategies. The model is essentially guessing from context and position for Hindi, not from word meaning.
Consistent Non-Hate Signals Across All 6 Models
- "online" — appears as a top non-hate predictor in 4/6 strategies. Informational/conversational context: people discussing online behaviour, platforms, reporting rather than targeting.
- "rajya" (state/parliament) — consistent non-hate in Hinglish eval. Political discussion about government is distinguishable from targeted hate.
- "maine" (I/me in Hindi-Urdu) — first-person perspective associated with personal narrative rather than targeted attacks.
Hate Speech Markers Are Linguistically Coherent
English: Direct accusation verbs (blame, blaming, criticized) and violence vocabulary (massacres, cleansing, violence) are the most consistent. These reflect how hate speech in this dataset frames targets through accusation and dehumanisation.
Hinglish: Familial insults (behan = sister, bacchi = girl/child — used in gendered abuse), aggressive interjections (arey, abb, ruk), and celebrity/political names in abusive context (srk, krk, dada) — reflects South Asian social media abuse patterns.
Hindi: Body/violence metaphors (गला = throat, as in strangle; मूर्ख = fool) and politically charged proper nouns (भूमिपूजन = Ram Mandir ground-breaking ceremony — a highly polarising event in the dataset period).
Visible Spurious Correlations
Several high-SHAP words are clearly not meaningful hate indicators:
skua(a seabird) — appears as top hate word in 2 strategies. Rare word co-occurring with hateful text in training; model memorised the co-occurrence.ahh— exclamation appearing in multiple models as a hate signal. Aggressive tone marker rather than a meaningful word.coonappearing as non-hate in Strategy 2 — this is a racial slur, but in this model it was learned as a feature in a specific context (likely news/reporting usage in the dataset).
These spurious patterns are an inherent limitation of non-contextual GloVe embeddings: the model sees a word as a fixed vector regardless of the sentence it appears in.



























