
SHAP Explainability Report — Multilingual Hate Speech Detection (v1)

  • Models: all 6 sequential transfer learning strategies (GloVe + BiLSTM, 8 epochs/phase)
  • Explainer: `shap.GradientExplainer` on the BiLSTM sub-model
  • Background: 200 randomly sampled training sequences
  • Test samples: 500 per language (random subset); full test set = 8,852


Table of Contents

  1. What SHAP Is and How It Works Here
  2. How the Top-20 Words Are Selected
  3. Strategy 1 — English → Hindi → Hinglish
  4. Strategy 2 — English → Hinglish → Hindi
  5. Strategy 3 — Hindi → English → Hinglish ⭐ Best
  6. Strategy 4 — Hindi → Hinglish → English
  7. Strategy 5 — Hinglish → English → Hindi
  8. Strategy 6 — Hinglish → Hindi → English
  9. Cross-Strategy Comparison
  10. Key Findings

1. What SHAP Is and How It Works Here

SHAP (SHapley Additive exPlanations) answers the question: "For this specific prediction, how much did each word contribute to the output?"

It does this using game theory — it treats each word as a "player" and fairly distributes the model's final prediction score among all the words in the input.

The Technical Challenge

Our model takes integer token IDs as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each input's influence. But you cannot take a gradient through an integer operation.

Solution — split the model into two steps:

Step 1 (outside SHAP):
  Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300)

Step 2 (SHAP runs here):
  Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)

SHAP's GradientExplainer runs on Step 2 only. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then sum across all 300 dimensions for each position to get one number per word.
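The dimension-collapse step can be sketched in pure Python with toy numbers (the nested list stands in for real GradientExplainer output, and the tiny shapes replace the actual n × 100 × 300 arrays):

```python
# Toy stand-in for GradientExplainer output on Step 2: one SHAP value
# per (sample, token position, embedding dimension). Real arrays are
# (n, 100, 300); here 1 sample, 3 positions, 4 dimensions.
shap_values = [
    [[0.01, -0.02, 0.03, 0.00],
     [0.00,  0.01, 0.00, 0.01],
     [-0.03, 0.01, -0.01, 0.00]],
]

# Collapse the embedding dimensions into one score per token position.
per_token = [[round(sum(dims), 6) for dims in sample] for sample in shap_values]
print(per_token)  # [[0.02, 0.02, -0.03]]
```

Each inner sum is the total contribution of one token, which is what the top-words plots are built from.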

What the Numbers Mean

  • Positive SHAP value → word pushes the prediction toward hate speech (output closer to 1.0)
  • Negative SHAP value → word pushes the prediction toward non-hate (output closer to 0.0)
  • Magnitude (how large the number is) → how strongly that word influences the decision

Background Dataset

The explainer needs a "baseline" — what does the model output when there is no meaningful input? We use 200 randomly sampled training sequences as this baseline. SHAP measures each word's contribution relative to this baseline.
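To see what "relative to this baseline" means, here is a toy illustration with a made-up linear model, where exact SHAP values have a closed form (w_i × (x_i − mean background value of feature i)); the weights and data below are invented, not taken from the report:

```python
# Toy baseline illustration with a made-up linear "model"
# f(x) = sum(w_i * x_i). For a linear model the exact SHAP value of
# feature i is w_i * (x_i - mean background value of feature i), so
# the values sum to (prediction - baseline).
w = [0.5, -0.2, 0.1]
background = [[1.0, 0.0, 2.0],   # "200 training sequences" shrunk to 2
              [3.0, 2.0, 0.0]]
x = [2.0, 1.0, 0.0]              # the sample being explained

def f(v):
    return sum(wi * vi for wi, vi in zip(w, v))

baseline = sum(f(b) for b in background) / len(background)
bg_mean = [sum(col) / len(background) for col in zip(*background)]
shap = [wi * (xi - mi) for wi, xi, mi in zip(w, x, bg_mean)]

# Additivity: baseline + sum(SHAP values) recovers the model output for x.
print(abs(baseline + sum(shap) - f(x)) < 1e-9)  # True
```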


2. How the Top-20 Words Are Selected

This is not random. Here is the exact process:

**Step 1 — Run SHAP on the test set.** For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence produces 100 SHAP values.

**Step 2 — Aggregate by word across all examples.** For each unique word that appears in those 500 examples, we compute the mean SHAP value across every occurrence of that word:

word "blame" appears 47 times across 500 examples
SHAP values: [+0.031, +0.028, +0.019, ..., +0.034]
Mean SHAP = sum of all values / 47 = +0.026

**Step 3 — Rank by absolute mean SHAP.** Words are sorted by |mean SHAP| — this captures both strong hate predictors (large positive) and strong non-hate predictors (large negative).

**Step 4 — Show top 20.** The 20 words with the highest |mean SHAP| are plotted. The colour tells you the direction:

  • 🔴 Red bars extending right = pushes toward hate
  • 🔵 Blue bars extending left = pushes toward non-hate

Important: padding tokens (ID=0) are excluded. Only real words contribute.
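The whole selection process can be sketched as a small pure-Python aggregation (token strings and SHAP values here are invented; the real pipeline works on integer token IDs and excludes ID 0):

```python
from collections import defaultdict

# Toy version of Steps 1-4: mean SHAP per word across all occurrences,
# ranked by absolute value, padding excluded. Tokens/values are invented.
examples = [
    (["blame", "<pad>", "online"], [0.031, 0.000, -0.020]),
    (["blame", "online", "haan"],  [0.028, -0.025, -0.001]),
]

totals, counts = defaultdict(float), defaultdict(int)
for tokens, values in examples:
    for tok, val in zip(tokens, values):
        if tok == "<pad>":            # padding (token ID 0) is excluded
            continue
        totals[tok] += val
        counts[tok] += 1

# Step 2: mean SHAP per unique word; Steps 3-4: rank by |mean SHAP|.
mean_shap = {tok: totals[tok] / counts[tok] for tok in totals}
top = sorted(mean_shap.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(top[:20])  # "blame" first (mean about +0.0295), then "online", "haan"
```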


3. Strategy 1 — English → Hindi → Hinglish

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | blame, cretin, blaming, unhelpful, upwards | nt, wwf, facebook, ahh, cum |
| Hindi | रे, पढ़ने, लाइव, गाला, औखत | आखिरकार, मुख, बिकुल, शायद, ह |
| Hinglish | nawaz, dhawan, bashing, shareef, scn | gau, age, rajya, chori, channels |
| Full | molvi, chalo, molana, scn, elitist | rajya, coding, meat, haan, maine |
*(Top-20 SHAP bar plots for the English, Hindi, Hinglish, and Full evaluations; images omitted.)*

Note on Hindi: The bar lengths for Hindi are much shorter than English/Hinglish. This is expected — GloVe was trained on English text so Hindi words have near-zero embedding vectors. The model's gradients through those zero vectors are tiny, producing very small SHAP values. The model is making Hindi predictions mostly from position/sequence patterns rather than word meaning.


4. Strategy 2 — English → Hinglish → Hindi

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | grave, svi, vox, ahh, grown | buried, coon, ane, million, normally |
| Hindi | नड्डा, हिसक, बड़े, सांसदों, रद्दीके | समाज, सबकी, सऊदी, समझा, अमिताभ |
| Hinglish | khi, dada, kiske, srk, chalo | gau, tk, online, liberty, taraf |
| Full | dada, roj, shamelessness, tujhe, epic | akash, sapna, proud, buddy, episcopal |
*(Top-20 SHAP bar plots for the English, Hindi, Hinglish, and Full evaluations; images omitted.)*

5. Strategy 3 — Hindi → English → Hinglish ⭐ Best

This is the best performing strategy (F1=0.6419, AUC=0.7528 on full test set).

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | credence, bj, rosario, ghazi, eni | plain, stranger, sarcasm, rubbish, comprise |
| Hindi | कॉल, भूमिपूजन, लें, आधी, मूर्ख | मैसेज, बेमन, पुलिसकर्मी, जाएगी, पड़े |
| Hinglish | bacchi, bull, srk, bahana, behan | madrassa, zaida, gdp, bech, nd |
| Full | skua, brut, cleansing, captaincy, baar | taraf, pussy, directory, quran, kaha |
*(Top-20 SHAP bar plots for the English, Hindi, Hinglish, and Full evaluations; images omitted.)*

Starting with Hindi forced the model to develop pattern-based hate detection before seeing GloVe-rich English vocabulary. The Hinglish hate markers (bacchi, behan — familial insults common in South Asian abuse; srk — a celebrity name frequently targeted in communal hate) are more semantically coherent than in English-first strategies.


6. Strategy 4 — Hindi → Hinglish → English

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | violence, ahh, potus, spic, undocumented | beginner, dollars, bih, messages, total |
| Hindi | जाएगी, दूसरों, इंटरव्यू, हवाई, अक्षय | नियम, डाला।ये, दर्शन, मुलाकात, उज्ज्वल |
| Hinglish | tatti, sham, dino, roko, krk | lac, online, ancestor, zaida, target |
| Full | ahh, moi, bj, fault, pan | asperger, hundred, database, wicked, nam |
*(Top-20 SHAP bar plots for the English, Hindi, Hinglish, and Full evaluations; images omitted.)*

7. Strategy 5 — Hinglish → English → Hindi

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | bastard, establishes, code, poo, hub | blatantly, languages, turkey, fags, gear |
| Hindi | रंजन, गोगोई, नड्डा, सांसदों, पित्त | चूतिए, मुल्ले, हथियार, उपनिषद, जन्मभूमि |
| Hinglish | huye, dada, abb, arey, abduction | rajya, bahu, parliament, code, music |
| Full | skua, praised, spic, sabse, plz | liberty, languages, speaks, bache, maine |
*(Top-20 SHAP bar plots for the English, Hindi, Hinglish, and Full evaluations; images omitted.)*

Interesting: Hindi non-hate words in this strategy include चूतिए and मुल्ले, both slurs, yet the model assigns them negative SHAP. This reflects the model's confusion when Hindi is the final training phase: it cannot consistently assign direction to Hindi vocabulary, even when the words are inherently offensive.


8. Strategy 6 — Hinglish → Hindi → English

| Eval On | Top Hate Words | Top Non-Hate Words |
|---|---|---|
| English | opponents, massacres, coon, ahh, fitness | annie, model, nearly, lloyd, nest |
| Hindi | लें, अमिताभ, मी, करेंगे, रखता | आंखें, जे, बुरे, लोगे, जायज़ा |
| Hinglish | fav, janab, chori, cum, ruk | online, gau, dehli, 2017, rajya |
| Full | srk, roj, rhi, purana, aapke | nest, maine, hone, haired, barrel |
*(Top-20 SHAP bar plots for the English, Hindi, Hinglish, and Full evaluations; images omitted.)*

9. Cross-Strategy Comparison

The heatmap below shows the top-5 words from each strategy, evaluated across all 6 models simultaneously. Each cell is the mean SHAP value for that word in that model. Red = pushes toward hate, blue = pushes toward non-hate, white = word not seen or neutral.
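How such a matrix is assembled can be sketched as follows (model names, words, and values below are invented placeholders, not the report's actual numbers):

```python
# Sketch of the cross-strategy heatmap matrix: rows = models,
# columns = words, cells = mean SHAP of that word under that model,
# None where the word never appears (rendered white in the heatmap).
mean_shap_by_model = {
    "S1": {"blame": 0.026, "rajya": -0.018},
    "S3": {"blame": 0.020, "bacchi": 0.031},
}
words = ["blame", "rajya", "bacchi"]

matrix = [
    [model_scores.get(w) for w in words]      # missing word -> None
    for model_scores in mean_shap_by_model.values()
]
print(matrix)  # [[0.026, -0.018, None], [0.02, None, 0.031]]
```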

*(Cross-strategy heatmaps for the English, Hindi, Hinglish, and Full test sets; images omitted.)*


10. Key Findings

SHAP Magnitude Reflects Language Confidence

| Language | Typical Top \|SHAP\| | Why |
|---|---|---|
| English | 0.03 – 0.08 | GloVe has 6B-token English coverage — rich, meaningful vectors |
| Hinglish | 0.03 – 0.07 | Roman-script words partially covered by GloVe; model learns strong patterns |
| Hindi | 0.002 – 0.005 | Devanagari words are mostly OOV in GloVe — near-zero vectors, tiny gradients |

This directly explains the Hindi accuracy gap across all 6 strategies. The model is essentially guessing from context and position for Hindi, not from word meaning.

Consistent Non-Hate Signals Across All 6 Models

  • "online" — appears as a top non-hate predictor in 4/6 strategies. Informational/conversational context: people discussing online behaviour, platforms, reporting rather than targeting.
  • "rajya" (state/parliament) — consistent non-hate in Hinglish eval. Political discussion about government is distinguishable from targeted hate.
  • "maine" (I/me in Hindi-Urdu) — first-person perspective associated with personal narrative rather than targeted attacks.

Hate Speech Markers Are Linguistically Coherent

English: Direct accusation verbs (blame, blaming, criticized) and violence vocabulary (massacres, cleansing, violence) are the most consistent. These reflect how hate speech in this dataset frames targets through accusation and dehumanisation.

Hinglish: Familial insults (behan = sister, bacchi = girl/child — used in gendered abuse), aggressive interjections (arey, abb, ruk), and celebrity/political names in abusive context (srk, krk, dada) — reflects South Asian social media abuse patterns.

Hindi: Body/violence metaphors (गला = throat, as in strangle; मूर्ख = fool) and politically charged proper nouns (भूमिपूजन = Ram Mandir ground-breaking ceremony — a highly polarising event in the dataset period).

Visible Spurious Correlations

Several high-SHAP words are clearly not meaningful hate indicators:

  • skua (a seabird) — appears as top hate word in 2 strategies. Rare word co-occurring with hateful text in training; model memorised the co-occurrence.
  • ahh — exclamation appearing in multiple models as a hate signal. Aggressive tone marker rather than a meaningful word.
  • coon appearing as non-hate in Strategy 2 — this is a racial slur, but in this model it was learned as a feature in a specific context (likely news/reporting usage in the dataset).

These spurious patterns are an inherent limitation of non-contextual GloVe embeddings: the model sees a word as a fixed vector regardless of the sentence it appears in.