
SHAP Explainability Report — Multilingual Hate Speech Detection (v2)

  • Model: Hinglish → Hindi → English → Full (GloVe + BiLSTM, 50 epochs/phase, 200 total)
  • Explainer: shap.GradientExplainer on the BiLSTM sub-model
  • Background: 200 randomly sampled training sequences
  • Test samples: 500 per language (random subset); full test set = 8,852


Table of Contents

  1. What SHAP Is and How It Works Here
  2. How the Top-20 Words Are Selected
  3. Results by Evaluation Language
  4. Key Findings

1. What SHAP Is and How It Works Here

SHAP (SHapley Additive exPlanations) answers the question: "For this specific prediction, how much did each word contribute to the output?"

It does this using game theory — it treats each word as a "player" and fairly distributes the model's final prediction score among all the words in the input.

The Technical Challenge

Our model takes integer token IDs as input (e.g. the word "hate" → token ID 4231). SHAP needs to compute gradients — mathematical slopes — through the model to measure each word's influence. But you cannot take a gradient through an integer operation.

Solution — split the model into two steps:

Step 1 (outside SHAP):
  Token IDs (integers) ──→ GloVe Embedding lookup ──→ Float vectors (n × 100 × 300)

Step 2 (SHAP runs here):
  Float embedding vectors ──→ BiLSTM ──→ Dropout ──→ Dense ──→ Sigmoid output (0 to 1)

SHAP's GradientExplainer runs on Step 2. It computes how much each of the 300 embedding dimensions at each of the 100 token positions contributed to the final score. We then sum across all 300 dimensions for each position to get one number per word — its overall contribution.
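The per-position summation described above can be sketched with a small NumPy example. The array shapes follow the report (100 token positions × 300 embedding dimensions), but the attribution values here are synthetic stand-ins for what shap.GradientExplainer would return:

```python
import numpy as np

# Shapes matching the report: n examples, 100 token positions,
# 300 embedding dimensions per position.
n_examples, seq_len, emb_dim = 4, 100, 300

# Stand-in for the explainer's output on the BiLSTM sub-model:
# one attribution per embedding dimension per token position.
rng = np.random.default_rng(0)
shap_values = rng.normal(scale=0.01, size=(n_examples, seq_len, emb_dim))

# Collapse the 300 per-dimension attributions at each position into
# a single per-word contribution by summing over the last axis.
per_word = shap_values.sum(axis=-1)  # shape: (n_examples, seq_len)

print(per_word.shape)
```

The sign convention of the summed values is interpreted in the next subsection.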

What the Numbers Mean

  • Positive SHAP value → word pushes prediction toward hate speech (output closer to 1.0)
  • Negative SHAP value → word pushes prediction toward non-hate (output closer to 0.0)
  • Magnitude → how strongly that word influences the decision

Background Dataset

The explainer needs a baseline: the model's expected output over typical inputs, against which each word's contribution is measured. We use 200 randomly sampled training sequences as this background; SHAP attributes the difference between the model's output for a given example and its average output over the background.
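Assembling the background can be sketched as follows. This is a minimal sketch under stated assumptions: the array shapes and the emb_train / bilstm_head names are illustrative, not the project's actual code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical pre-embedded training set: (num_train, 100, 300) floats,
# i.e. sequences already passed through the GloVe embedding lookup.
num_train = 1000
emb_train = rng.normal(size=(num_train, 100, 300)).astype("float32")

# Sample 200 sequences without replacement to serve as the SHAP
# background (baseline) distribution.
background = emb_train[rng.choice(num_train, size=200, replace=False)]

# This array would then be handed to the explainer, e.g.:
# explainer = shap.GradientExplainer(bilstm_head, background)
print(background.shape)
```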


2. How the Top-20 Words Are Selected

This is not random. Here is the exact process:

Step 1 — Run SHAP on the test set
For 500 randomly sampled test examples per language, we get a SHAP value for every word in every example. A 100-token sequence → 100 SHAP values.

Step 2 — Aggregate by word across all examples
For each unique word appearing in those 500 examples, compute the mean SHAP value across all its occurrences:

word "blamed" appears 31 times across 500 examples
SHAP values: [+0.061, +0.058, +0.072, ..., +0.055]
Mean SHAP = sum / 31 = +0.064

Step 3 — Rank by absolute mean SHAP
Words are sorted by |mean SHAP| — this captures both strong hate predictors (large positive) and strong non-hate predictors (large negative).

Step 4 — Show the top 20
The 20 words with the highest |mean SHAP| are plotted:

  • 🔴 Red bars extending right = pushes toward hate
  • 🔵 Blue bars extending left = pushes toward non-hate

Padding tokens (ID=0) are always excluded. Only real words contribute.
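Steps 2–4 can be sketched in a few lines of Python. The function and variable names below are illustrative assumptions, and the toy vocabulary stands in for the real tokenizer:

```python
from collections import defaultdict

PAD_ID = 0

def top_words(token_ids, shap_per_word, id_to_word, k=20):
    """Aggregate per-occurrence SHAP values into a per-word mean and
    return the k words with the largest absolute mean (Steps 2-4)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for ids, vals in zip(token_ids, shap_per_word):
        for tok, val in zip(ids, vals):
            if tok == PAD_ID:          # padding never contributes
                continue
            sums[tok] += val
            counts[tok] += 1
    means = {tok: sums[tok] / counts[tok] for tok in sums}
    ranked = sorted(means, key=lambda t: abs(means[t]), reverse=True)
    return [(id_to_word[t], means[t]) for t in ranked[:k]]

# Toy example: two 4-token sequences (0 = padding).
vocab = {1: "blamed", 2: "online", 3: "grow"}
ids  = [[1, 2, 0, 0], [1, 3, 2, 0]]
vals = [[0.06, -0.02, 0.0, 0.0], [0.07, -0.01, -0.03, 0.0]]
print(top_words(ids, vals, vocab, k=3))
```

In the toy data, "blamed" ranks first because its mean attribution has the largest magnitude, even though "online" is a (negative) non-hate predictor.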


3. Results by Evaluation Language

3.1 English Test Set

| Top Hate Words | Top Non-Hate Words |
|---|---|
| nas, fags, sicko, sabotage, advocating | grow, barrel, homosexual, pak, join |

SHAP Top Words — English

The English hate markers skew toward explicit aggression (sicko, fags) and intentional framing (sabotage, advocating) — the 50-epoch English phase produced sharper English feature learning than shorter-training models. The appearance of homosexual as a non-hate word reflects that this term appears more in informational/news text in the dataset than in targeted slurs.


3.2 Hindi Test Set

| Top Hate Words | Top Non-Hate Words |
|---|---|
| वादा, वैज्ञानिकों, ऐ, उतारा, गला | जीतेगा, घोंटने, जिहादी, आपत्तिजनक, चमचो |

SHAP Top Words — Hindi

Note on bar size: Hindi SHAP values are approximately 10× smaller than English (range ~0.003–0.007 vs ~0.02–0.06). This is not a model failure — it reflects that GloVe was trained on English text and has near-zero vectors for Devanagari words. The model makes Hindi predictions from positional/sequential patterns rather than word semantics. The words गला (throat — as in "गला घोंटना", strangle) and ऐ (hey — an aggressive form of address) showing as hate markers suggest the model did capture some Hindi-specific abuse patterns through the 50-epoch Hinglish-first training.

Interestingly, जिहादी (jihadist) appears as a non-hate predictor — this likely reflects news/reporting usage in the dataset rather than targeted hate, and shows how context-free embeddings can produce counterintuitive results.


3.3 Hinglish Test Set

| Top Hate Words | Top Non-Hate Words |
|---|---|
| arey, bahir, punish, papa, interior | online, member, mam, messages, asha |

SHAP Top Words — Hinglish

The Hinglish markers are notably semantically coherent compared to v1 (8-epoch) models. arey is a Hindi-derived exclamation almost always used in confrontational tone in social media; bahir (outside/get out) reflects exclusionary language; punish as a verb signals threat. The non-hate side shows platform/conversation context (online, messages, mam). Starting on Hinglish for 50 epochs produced the strongest Hinglish-specific feature learning of any model trained in this project.


3.4 Full Test Set

| Top Hate Words | Top Non-Hate Words |
|---|---|
| blamed, criticized, syntax, grown, sine | underneath, smack, online, hole, clue |

SHAP Top Words — Full

The full-dataset view shows accusatory framing (blamed, criticized) as the dominant hate signal — more so than in any v1 model. The 50-epoch Full phase appears to have learned that hate speech in this multilingual corpus characteristically targets victims through blame and accusation, regardless of language. The spurious markers syntax and sine (very rare technical words) are outliers that co-occurred with hate content in specific training examples.


4. Key Findings

SHAP Magnitude Reflects Language Confidence

| Language | Typical Top \|SHAP\| | Why |
|---|---|---|
| English | 0.02 – 0.06 | GloVe covers English comprehensively — strong, directional gradients |
| Hinglish | 0.04 – 0.46 | Roman-script words partially covered; model learned strong code-mixed patterns from Phase 1 |
| Hindi | 0.003 – 0.007 | Devanagari script is OOV in GloVe — near-zero vectors produce tiny gradients |

Hindi confidence is 10× lower than English. This is a fundamental limitation of using GloVe (English-trained) for multilingual text. A multilingual model (e.g. mBERT, MuRIL) would show much more balanced confidence across languages.
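One way to reproduce this magnitude comparison from raw per-word attributions is sketched below. The arrays are synthetic, scaled only roughly to the report's ranges, and "typical top |SHAP|" is summarised here as the mean of each example's largest absolute attribution — one plausible choice, not necessarily the report's exact statistic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-word SHAP arrays for each language's 500-sample
# evaluation: shape (examples, token positions).
per_word = {
    "english":  rng.normal(scale=0.03,  size=(500, 100)),
    "hinglish": rng.normal(scale=0.02,  size=(500, 100)),
    "hindi":    rng.normal(scale=0.003, size=(500, 100)),
}

# Mean of each example's largest absolute attribution.
for lang, vals in per_word.items():
    typical = np.abs(vals).max(axis=1).mean()
    print(f"{lang:9s} {typical:.4f}")
```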

Effect of 50-Epoch Training vs 8-Epoch

Compared to the equivalent v1 strategy (Hinglish → Hindi → English, 8 epochs), the v2 model shows:

  • More directional Hinglish markers: arey, bahir, punish are contextually coherent; v1 had noisier Hinglish top words
  • Accusatory framing as the primary hate signal: blamed, criticized dominate the full eval rather than random rare words; deeper training on the Full phase produced this generalisation
  • Similar English markers — both models converge on English hate vocabulary after the English phase; the difference is mainly in non-English languages

Consistent Non-Hate Signal

"online" is the most stable non-hate predictor across all four evaluation sets. It appears in informational, conversational, and reporting contexts in all three languages — the model correctly identifies this as a non-toxic context word.

Spurious Correlations

syntax and sine appearing as hate markers in the full evaluation are clear spurious correlations — rare technical words that happened to co-occur with hateful content in a small number of training examples. The BiLSTM cannot distinguish this from meaningful signal because GloVe has no semantic grounding for these words in a hate/non-hate context.

This is the primary motivation for contextual models: BERT-based approaches would represent the same word differently depending on surrounding context, eliminating most spurious co-occurrence patterns.