
SHAP Explainability Report — Multilingual Hate Speech Detection (v1)

  • Models: all 6 sequential transfer learning strategies (GloVe + BiLSTM, 8 epochs/phase)
  • Explainer: `shap.GradientExplainer` on the BiLSTM sub-model
  • Background: 200 randomly sampled training sequences
  • Test samples: 500 per language (random subset); full test set = 8,852


Table of Contents

  1. What SHAP Is and How It Works Here
  2. How the Top-20 Words Are Selected
  3. Strategy 1 — English → Hindi → Hinglish
  4. Strategy 2 — English → Hinglish → Hindi
  5. Strategy 3 — Hindi → English → Hinglish ⭐ Best
  6. Strategy 4 — Hindi → Hinglish → English
  7. Strategy 5 — Hinglish → English → Hindi
  8. Strategy 6 — Hinglish → Hindi → English
  9. Cross-Strategy Comparison
  10. Key Findings

1. What SHAP Is and How It Works Here

SHAP (SHapley Additive exPlanations) answers the question: "For this specific prediction, how much did each word contribute to the output?"

It does this using game theory — it treats each word as a "player" and fairly distributes the model's final prediction score among all the words in the input.

The Technical Challenge

Our model takes integer token IDs as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each input's influence. But you cannot take a gradient through an integer operation.

Solution — split the model into two steps:

Step 1 (outside SHAP):
  Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300)

Step 2 (SHAP runs here):
  Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)

SHAP's GradientExplainer runs on Step 2 only. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then sum across all 300 dimensions for each position to get one number per word.
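The dimension-collapse step can be sketched in pure Python with toy numbers (the nested list stands in for real GradientExplainer output, and the tiny shapes replace the actual n × 100 × 300 arrays):

```python
# Toy stand-in for GradientExplainer output on Step 2: one SHAP value
# per (sample, token position, embedding dimension). Real arrays are
# (n, 100, 300); here 1 sample, 3 positions, 4 dimensions.
shap_values = [
    [[0.01, -0.02, 0.03, 0.00],
     [0.00,  0.01, 0.00, 0.01],
     [-0.03, 0.01, -0.01, 0.00]],
]

# Collapse the embedding dimensions into one score per token position.
per_token = [[round(sum(dims), 6) for dims in sample] for sample in shap_values]
print(per_token)  # [[0.02, 0.02, -0.03]]
```

Each inner sum is the total contribution of one token, which is what the top-words plots are built from.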

What the Numbers Mean

  • Positive SHAP value → word pushes the prediction toward hate speech (output closer to 1.0)
  • Negative SHAP value → word pushes the prediction toward non-hate (output closer to 0.0)
  • Magnitude (how large the number is) → how strongly that word influences the decision

Background Dataset

The explainer needs a "baseline" — what does the model output when there is no meaningful input? We use 200 randomly sampled training sequences as this baseline. SHAP measures each word's contribution relative to this baseline.
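To see what "relative to this baseline" means, here is a toy illustration with a made-up linear model, where exact SHAP values have a closed form (w_i × (x_i − mean background value of feature i)); the weights and data below are invented, not taken from the report:

```python
# Toy baseline illustration with a made-up linear "model"
# f(x) = sum(w_i * x_i). For a linear model the exact SHAP value of
# feature i is w_i * (x_i - mean background value of feature i), so
# the values sum to (prediction - baseline).
w = [0.5, -0.2, 0.1]
background = [[1.0, 0.0, 2.0],   # "200 training sequences" shrunk to 2
              [3.0, 2.0, 0.0]]
x = [2.0, 1.0, 0.0]              # the sample being explained

def f(v):
    return sum(wi * vi for wi, vi in zip(w, v))

baseline = sum(f(b) for b in background) / len(background)
bg_mean = [sum(col) / len(background) for col in zip(*background)]
shap = [wi * (xi - mi) for wi, xi, mi in zip(w, x, bg_mean)]

# Additivity: baseline + sum(SHAP values) recovers the model output for x.
print(abs(baseline + sum(shap) - f(x)) < 1e-9)  # True
```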


2. How the Top-20 Words Are Selected

This is not random. Here is the exact process:

**Step 1 — Run SHAP on the test set.** For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence produces 100 SHAP values.

**Step 2 — Aggregate by word across all examples.** For each unique word that appears in those 500 examples, we compute the mean SHAP value across every occurrence of that word:

word "blame" appears 47 times across 500 examples
SHAP values: [+0.031, +0.028, +0.019, ..., +0.034]
Mean SHAP = sum of all values / 47 = +0.026

**Step 3 — Rank by absolute mean SHAP.** Words are sorted by |mean SHAP| — this captures both strong hate predictors (large positive) and strong non-hate predictors (large negative).

**Step 4 — Show top 20.** The 20 words with the highest |mean SHAP| are plotted. The colour tells you the direction:

  • 🔴 Red bars extending right = pushes toward hate
  • 🔵 Blue bars extending left = pushes toward non-hate

Important: padding tokens (ID=0) are excluded. Only real words contribute.
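The whole selection process can be sketched as a small pure-Python aggregation (token strings and SHAP values here are invented; the real pipeline works on integer token IDs and excludes ID 0):

```python
from collections import defaultdict

# Toy version of Steps 1-4: mean SHAP per word across all occurrences,
# ranked by absolute value, padding excluded. Tokens/values are invented.
examples = [
    (["blame", "<pad>", "online"], [0.031, 0.000, -0.020]),
    (["blame", "online", "haan"],  [0.028, -0.025, -0.001]),
]

totals, counts = defaultdict(float), defaultdict(int)
for tokens, values in examples:
    for tok, val in zip(tokens, values):
        if tok == "<pad>":            # padding (token ID 0) is excluded
            continue
        totals[tok] += val
        counts[tok] += 1

# Step 2: mean SHAP per unique word; Steps 3-4: rank by |mean SHAP|.
mean_shap = {tok: totals[tok] / counts[tok] for tok in totals}
top = sorted(mean_shap.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(top[:20])  # "blame" first (mean about +0.0295), then "online", "haan"
```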


3. Strategy 1 — English → Hindi → Hinglish

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
| Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
| Hinglish | nawaz, dhawan, bashing, shareef, scn | gau, age, rajya, chori, channels |
| Full | molvi, chalo, molana, scn, elitist | rajya, coding, meat, haan, maine |
*(Top-20 SHAP bar plots for the English, Hindi, Hinglish, and Full evaluations; images omitted.)*

Note on Hindi: The bar lengths for Hindi are much shorter than English/Hinglish. This is expected — GloVe was trained on English text so Hindi words have near-zero embedding vectors. The model's gradients through those zero vectors are tiny, producing very small SHAP values. The model is making Hindi predictions mostly from position/sequence patterns rather than word meaning.


4. Strategy 2 — English → Hinglish → Hindi

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
| Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
| Hinglish | khi, dada, kiske, srk, chalo | gau, tk, online, liberty, taraf |
| Full | dada, roj, shamelessness, tujhe, epic | akash, sapna, proud, buddy, episcopal |
*(Top-20 SHAP bar plots for the English, Hindi, Hinglish, and Full evaluations; images omitted.)*

5. Strategy 3 — Hindi → English → Hinglish ⭐ Best

This is the best performing strategy (F1=0.6419, AUC=0.7528 on full test set).

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
| Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
| Hinglish | bacchi, bull, srk, bahana, behan | madrassa, zaida, gdp, bech, nd |
| Full | skua, brut, cleansing, captaincy, baar | taraf, pussy, directory, quran, kaha |
*(Top-20 SHAP bar plots for the English, Hindi, Hinglish, and Full evaluations; images omitted.)*

Starting with Hindi forced the model to develop pattern-based hate detection before seeing GloVe-rich English vocabulary. The Hinglish hate markers (bacchi, behan — familial insults common in South Asian abuse; srk — a celebrity name frequently targeted in communal hate) are more semantically coherent than in English-first strategies.


6. Strategy 4 — Hindi → Hinglish → English

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
| Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
| Hinglish | tatti, sham, dino, roko, krk | lac, online, ancestor, zaida, target |
| Full | ahh, moi, bj, fault, pan | asperger, hundred, database, wicked, nam |
*(Top-20 SHAP bar plots for the English, Hindi, Hinglish, and Full evaluations; images omitted.)*

7. Strategy 5 — Hinglish → English → Hindi

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
| Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
| Hinglish | huye, dada, abb, arey, abduction | rajya, bahu, parliament, code, music |
| Full | skua, praised, spic, sabse, plz | liberty, languages, speaks, bache, maine |
*(Top-20 SHAP bar plots for the English, Hindi, Hinglish, and Full evaluations; images omitted.)*

Interesting: Hindi non-hate words in this strategy include चूतिए and मुल्ले, both slurs, yet the model assigns them negative SHAP. This reflects the model's confusion when Hindi is the final training phase: it cannot consistently assign direction to Hindi vocabulary, even when the words are inherently offensive.


8. Strategy 6 — Hinglish → Hindi → English

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
| Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
| Hinglish | fav, janab, chori, cum, ruk | online, gau, dehli, 2017, rajya |
| Full | srk, roj, rhi, purana, aapke | nest, maine, hone, haired, barrel |
*(Top-20 SHAP bar plots for the English, Hindi, Hinglish, and Full evaluations; images omitted.)*

9. Cross-Strategy Comparison

The heatmap below shows the top-5 words from each strategy, evaluated across all 6 models simultaneously. Each cell is the mean SHAP value for that word in that model. Red = pushes toward hate, blue = pushes toward non-hate, white = word not seen or neutral.
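How such a matrix is assembled can be sketched as follows (model names, words, and values below are invented placeholders, not the report's actual numbers):

```python
# Sketch of the cross-strategy heatmap matrix: rows = models,
# columns = words, cells = mean SHAP of that word under that model,
# None where the word never appears (rendered white in the heatmap).
mean_shap_by_model = {
    "S1": {"blame": 0.026, "rajya": -0.018},
    "S3": {"blame": 0.020, "bacchi": 0.031},
}
words = ["blame", "rajya", "bacchi"]

matrix = [
    [model_scores.get(w) for w in words]      # missing word -> None
    for model_scores in mean_shap_by_model.values()
]
print(matrix)  # [[0.026, -0.018, None], [0.02, None, 0.031]]
```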

*(Cross-strategy heatmaps for the English, Hindi, Hinglish, and Full test sets; images omitted.)*


10. Key Findings

SHAP Magnitude Reflects Language Confidence

| Language | Typical Top \|SHAP\| | Why |
|---|---|---|
| English | 0.03 – 0.08 | GloVe has 6B-token English coverage — rich, meaningful vectors |
| Hinglish | 0.03 – 0.07 | Roman-script words partially covered by GloVe; model learns strong patterns |
| Hindi | 0.002 – 0.005 | Devanagari words are mostly OOV in GloVe — near-zero vectors, tiny gradients |

This directly explains the Hindi accuracy gap across all 6 strategies. The model is essentially guessing from context and position for Hindi, not from word meaning.

Consistent Non-Hate Signals Across All 6 Models

  • "online" — appears as a top non-hate predictor in 4/6 strategies. Informational/conversational context: people discussing online behaviour, platforms, reporting rather than targeting.
  • "rajya" (state/parliament) — consistent non-hate in Hinglish eval. Political discussion about government is distinguishable from targeted hate.
  • "maine" (I/me in Hindi-Urdu) — first-person perspective associated with personal narrative rather than targeted attacks.

Hate Speech Markers Are Linguistically Coherent

English: Direct accusation verbs (blame, blaming, criticized) and violence vocabulary (massacres, cleansing, violence) are the most consistent. These reflect how hate speech in this dataset frames targets through accusation and dehumanisation.

Hinglish: Familial insults (behan = sister, bacchi = girl/child — used in gendered abuse), aggressive interjections (arey, abb, ruk), and celebrity/political names in abusive context (srk, krk, dada) — reflects South Asian social media abuse patterns.

Hindi: Body/violence metaphors (गला = throat, as in strangle; मूर्ख = fool) and politically charged proper nouns (भूमिपूजन = Ram Mandir ground-breaking ceremony — a highly polarising event in the dataset period).

Visible Spurious Correlations

Several high-SHAP words are clearly not meaningful hate indicators:

  • skua (a seabird) — appears as top hate word in 2 strategies. Rare word co-occurring with hateful text in training; model memorised the co-occurrence.
  • ahh — exclamation appearing in multiple models as a hate signal. Aggressive tone marker rather than a meaningful word.
  • coon appearing as non-hate in Strategy 2 — this is a racial slur, but in this model it was learned as a feature in a specific context (likely news/reporting usage in the dataset).

These spurious patterns are an inherent limitation of non-contextual GloVe embeddings: the model sees a word as a fixed vector regardless of the sentence it appears in.