Commit 538649a by tuklu (verified, parent cfe8562): Add README

---
language:
- en
- hi
tags:
- hate-speech
- text-classification
- bilstm
- glove
- multilingual
- transfer-learning
- hinglish
- sequential-learning
datasets:
- tuklu/nprism
license: mit
model-index:
- name: hate-speech-multilingual-bilstm-v2
  results:
  - task:
      type: text-classification
      name: Hate Speech Detection
    dataset:
      name: nprism
      type: tuklu/nprism
    metrics:
    - type: f1
      value: 0.6566
      name: F1 Score (Full Phase — Full Test)
    - type: accuracy
      value: 0.6866
      name: Accuracy (Full Phase — Full Test)
    - type: roc_auc
      value: 0.7556
      name: ROC-AUC (Full Phase — Full Test)
---

# Multilingual Hate Speech Detection — GloVe + BiLSTM (v2)

**Task:** Binary text classification (Hate / Non-Hate)
**Languages:** English, Hindi, Hinglish (Hindi-English code-mixed)
**Architecture:** Bidirectional LSTM with frozen GloVe embeddings
**Strategy:** Hinglish → Hindi → English → Full (50 epochs per phase, 200 total)

---

## Table of Contents
1. [What This Experiment Does](#1-what-this-experiment-does)
2. [The Dataset](#2-the-dataset)
3. [Model Architecture](#3-model-architecture)
4. [Training Strategy](#4-training-strategy)
5. [Results](#5-results)
6. [Figures](#6-figures)
7. [How to Use](#7-how-to-use)

---

## 1. What This Experiment Does

This is **v2** of the SASC sequential transfer learning experiment.

While v1 tested all 6 possible language orderings with 8 epochs per phase, **v2 focuses on a single fixed strategy** — `Hinglish → Hindi → English → Full` — but trains for **50 epochs per phase (200 total)**. This deeper training reveals how well knowledge accumulates across languages when starting from the hardest (most data-scarce, code-mixed) language first.

After every phase the model is evaluated on **all three individual language test sets as well as the full test set**, giving a 4×4 cross-evaluation matrix.

---

## 2. The Dataset

Dataset: [tuklu/nprism](https://huggingface.co/datasets/tuklu/nprism)

| Split | Samples |
|---|---|
| Train | 17,704 |
| Validation | 2,950 |
| Test | 8,852 |
| **Total** | **29,506** |

| Language | Count | % |
|---|---|---|
| English | 14,994 | 50.8% |
| Hindi | 9,738 | 33.0% |
| Hinglish | 4,774 | 16.2% |

| Label | Count | % |
|---|---|---|
| Non-Hate (0) | 15,799 | 53.5% |
| Hate (1) | 13,707 | 46.5% |

![Language Distribution](figures/language_distribution.png)

---

## 3. Model Architecture

```
Embedding (GloVe 300d, frozen, vocab=50k, maxlen=100)

Bidirectional LSTM (128 units)

Dropout (0.5)

Dense (64, ReLU)

Dense (1, Sigmoid)
```

- **Optimizer:** Adam
- **Loss:** Binary Crossentropy
- **Batch size:** 32 (language phases), 64 (full phase)

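The stack above can be sketched in Keras. This is a minimal sketch, not the training script: the GloVe matrix below is a random stand-in (loading the real pretrained 300d vectors is assumed to happen elsewhere), and variable names are illustrative.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 50_000, 100, 300

# Stand-in for the pretrained GloVe 300d matrix; in the real experiment this
# is filled from the downloaded GloVe vectors.
glove_matrix = np.random.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype("float32")

embedding = layers.Embedding(VOCAB_SIZE, EMBED_DIM, trainable=False)  # frozen
model = models.Sequential([
    layers.Input(shape=(MAX_LEN,), dtype="int32"),
    embedding,
    layers.Bidirectional(layers.LSTM(128)),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
embedding.set_weights([glove_matrix])  # copy in the pretrained vectors
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Freezing the embedding keeps the 15M GloVe parameters fixed, so only the BiLSTM and dense heads are updated across the four phases.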
---

## 4. Training Strategy

| Phase | Data | Epochs | Batch Size |
|---|---|---|---|
| 1 — Hinglish | Hinglish train subset | 50 | 32 |
| 2 — Hindi | Hindi train subset | 50 | 32 |
| 3 — English | English train subset | 50 | 32 |
| 4 — Full | Full shuffled train | 50 | 64 |

The same model weights carry forward through all 4 phases — no reset between languages.

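The phase schedule and the per-phase cross-evaluation can be sketched as below. Everything here is a stand-in so the sketch runs: a tiny model, random data, and 1 epoch per phase instead of 50; `subset` is a hypothetical helper standing in for the real nprism language splits.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB, MAXLEN = 1_000, 20
rng = np.random.default_rng(0)

def subset(n):  # hypothetical stand-in for one language's (X, y) split
    X = rng.integers(0, VOCAB, size=(n, MAXLEN))
    y = rng.integers(0, 2, size=n).astype("float32")
    return X, y

model = models.Sequential([
    layers.Input(shape=(MAXLEN,), dtype="int32"),
    layers.Embedding(VOCAB, 16),
    layers.Bidirectional(layers.LSTM(8)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

eval_sets = {name: subset(64) for name in ("english", "hindi", "hinglish", "full")}

# One phase per language, then the full set; the same `model` object (and
# therefore the same weights) carries through every phase with no reset.
phases = [("hinglish", 32), ("hindi", 32), ("english", 32), ("full", 64)]
cross_eval = {}
for name, batch_size in phases:
    X, y = subset(128)
    model.fit(X, y, epochs=1, batch_size=batch_size, verbose=0)  # 50 epochs in the real run
    # Score every test set after each phase -> one row of the 4x4 matrix.
    cross_eval[name] = {lang: model.evaluate(Xe, ye, verbose=0)
                        for lang, (Xe, ye) in eval_sets.items()}
```

Because `model` is never rebuilt between phases, each `fit` call continues from the weights the previous phase left behind, which is the entire point of the sequential strategy.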
---

## 5. Results

Full cross-evaluation table (Phase × Eval Language):

| Phase | Eval On | Accuracy | Balanced Acc | Precision | Recall | Specificity | F1 | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| hinglish | english | 0.5171 | 0.5125 | 0.5738 | 0.0916 | 0.9334 | 0.1580 | 0.5620 |
| hinglish | hindi | 0.4493 | 0.5000 | 0.4493 | 1.0000 | 0.0000 | 0.6200 | 0.5234 |
| hinglish | hinglish | 0.6688 | 0.6378 | 0.6058 | 0.4848 | 0.7908 | 0.5386 | 0.6579 |
| hinglish | full | 0.5190 | 0.5133 | 0.4803 | 0.4331 | 0.5935 | 0.4555 | 0.5243 |
| hindi | english | 0.4711 | 0.4744 | 0.4789 | 0.7878 | 0.1611 | 0.5957 | 0.4292 |
| hindi | hindi | 0.5834 | 0.5730 | 0.5420 | 0.4705 | 0.6756 | 0.5037 | 0.5949 |
| hindi | hinglish | 0.5409 | 0.4885 | 0.3761 | 0.2299 | 0.7470 | 0.2854 | 0.4771 |
| hindi | full | 0.5190 | 0.5251 | 0.4859 | 0.6111 | 0.4390 | 0.5414 | 0.5255 |
| english | english | 0.7721 | 0.7726 | 0.7453 | 0.8190 | 0.7262 | 0.7804 | 0.8458 |
| english | hindi | 0.5424 | 0.5399 | 0.4912 | 0.5150 | 0.5648 | 0.5028 | 0.5377 |
| english | hinglish | 0.4115 | 0.4938 | 0.3955 | 0.9002 | 0.0875 | 0.5495 | 0.4572 |
| english | full | 0.6395 | 0.6458 | 0.5901 | 0.7337 | 0.5578 | 0.6541 | 0.6913 |
| **Full** | **english** | **0.7747** | **0.7746** | **0.7747** | **0.7678** | **0.7815** | **0.7712** | **0.8476** |
| **Full** | **hindi** | **0.5748** | **0.5676** | **0.5286** | **0.4958** | **0.6393** | **0.5117** | **0.5941** |
| **Full** | **hinglish** | **0.6326** | **0.6101** | **0.5426** | **0.4991** | **0.7210** | **0.5200** | **0.6161** |
| **Full** | **full** | **0.6866** | **0.6839** | **0.6687** | **0.6449** | **0.7228** | **0.6566** | **0.7556** |

### Key Observations

- **English phase is the turning point**: F1 on full test jumps from 0.541 → 0.654 after seeing English data, reflecting GloVe's English-centric embeddings.
- **Starting from Hinglish** forces the model to generalise from noisy code-mixed text first — the model reaches Hinglish F1=0.539 on the Hinglish test after just the Hinglish phase.
- **Final Full phase** improves balanced accuracy and specificity across all languages, reaching AUC=0.756 on the full test set.
- Hindi remains the hardest language to generalise to (F1=0.512 after Full phase), consistent with GloVe having limited Hindi coverage.

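The derived columns in the table follow the standard definitions: F1 is the harmonic mean of precision and recall, and balanced accuracy is the mean of recall (TPR) and specificity (TNR). A quick check against the Full-phase, full-test row:

```python
# Full phase, full test set (values from the table above)
precision, recall, specificity = 0.6687, 0.6449, 0.7228

f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
balanced_acc = (recall + specificity) / 2           # mean of TPR and TNR

print(f"F1 ~= {f1:.4f}")            # matches the reported 0.6566
print(f"Balanced ~= {balanced_acc:.4f}")  # close to the reported 0.6839
```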
---

## 6. Figures

Training curves and evaluation plots for every phase × language combination are in the `figures/hinglish_to_hindi_to_english/` directory.

**Training curves (Accuracy & Loss):**
- `Phase_hinglish_curves.png`
- `Phase_hindi_curves.png`
- `Phase_english_curves.png`
- `Phase_Full_curves.png`

**Per-phase evaluation (CM / ROC / PR / F1 curve) for each language + full:**
- `Phase_{phase}_eval_{lang}_cm.png`
- `Phase_{phase}_eval_{lang}_roc.png`
- `Phase_{phase}_eval_{lang}_pr.png`
- `Phase_{phase}_eval_{lang}_f1.png`

---

## 7. How to Use

```python
import json
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json

# Load model
model = load_model("hinglish_hindi_english_full.h5")

# Load tokenizer (tokenizer_from_json expects the JSON string
# produced by Tokenizer.to_json())
with open("tokenizer.json") as f:
    tokenizer = tokenizer_from_json(f.read())

# Predict
texts = ["your text here"]
seqs = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=100)
prob = model.predict(seqs)[0][0]
label = "Hate" if prob > 0.5 else "Non-Hate"
print(f"{label} ({prob:.4f})")
```

---

## Related

- **v1 (all 6 strategies, 8 epochs):** [tuklu/SASC](https://huggingface.co/tuklu/SASC)
- **Dataset:** [tuklu/nprism](https://huggingface.co/datasets/tuklu/nprism)