Commit 538649a by tuklu (verified, parent cfe8562): Add README

---
language:
- en
- hi
tags:
- hate-speech
- text-classification
- bilstm
- glove
- multilingual
- transfer-learning
- hinglish
- sequential-learning
datasets:
- tuklu/nprism
license: mit
model-index:
- name: hate-speech-multilingual-bilstm-v2
  results:
  - task:
      type: text-classification
      name: Hate Speech Detection
    dataset:
      name: nprism
      type: tuklu/nprism
    metrics:
    - type: f1
      value: 0.6566
      name: F1 Score (Full Phase — Full Test)
    - type: accuracy
      value: 0.6866
      name: Accuracy (Full Phase — Full Test)
    - type: roc_auc
      value: 0.7556
      name: ROC-AUC (Full Phase — Full Test)
---

# Multilingual Hate Speech Detection — GloVe + BiLSTM (v2)

**Task:** Binary text classification (Hate / Non-Hate)
**Languages:** English, Hindi, Hinglish (Hindi-English code-mixed)
**Architecture:** Bidirectional LSTM with frozen GloVe embeddings
**Strategy:** Hinglish → Hindi → English → Full (50 epochs per phase, 200 total)

---

## Table of Contents
1. [What This Experiment Does](#1-what-this-experiment-does)
2. [The Dataset](#2-the-dataset)
3. [Model Architecture](#3-model-architecture)
4. [Training Strategy](#4-training-strategy)
5. [Results](#5-results)
6. [Figures](#6-figures)
7. [How to Use](#7-how-to-use)

---

## 1. What This Experiment Does

This is **v2** of the SASC sequential transfer learning experiment.

While v1 tested all 6 possible language orderings with 8 epochs per phase, **v2 focuses on a single fixed strategy** — `Hinglish → Hindi → English → Full` — but trains for **50 epochs per phase (200 total)**. This deeper training reveals how well knowledge accumulates across languages when starting from the hardest (most data-scarce, code-mixed) language first.

After every phase the model is evaluated on **all three individual language test sets as well as the full test set**, giving a 4×4 cross-evaluation matrix.

---

## 2. The Dataset

Dataset: [tuklu/nprism](https://huggingface.co/datasets/tuklu/nprism)

| Split | Samples |
|---|---|
| Train | 17,704 |
| Validation | 2,950 |
| Test | 8,852 |
| **Total** | **29,506** |

| Language | Count | % |
|---|---|---|
| English | 14,994 | 50.8% |
| Hindi | 9,738 | 33.0% |
| Hinglish | 4,774 | 16.2% |

| Label | Count | % |
|---|---|---|
| Non-Hate (0) | 15,799 | 53.5% |
| Hate (1) | 13,707 | 46.5% |

![Language Distribution](figures/language_distribution.png)

---

## 3. Model Architecture

```
Embedding (GloVe 300d, frozen, vocab=50k, maxlen=100)

Bidirectional LSTM (128 units)

Dropout (0.5)

Dense (64, ReLU)

Dense (1, Sigmoid)
```

- **Optimizer:** Adam
- **Loss:** Binary Crossentropy
- **Batch size:** 32 (language phases), 64 (full phase)

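The stack above can be sketched in Keras. This is a minimal sketch, not the training script: the GloVe matrix below is a random stand-in (loading the real pretrained 300d vectors is assumed to happen elsewhere), and variable names are illustrative.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 50_000, 100, 300

# Stand-in for the pretrained GloVe 300d matrix; in the real experiment this
# is filled from the downloaded GloVe vectors.
glove_matrix = np.random.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype("float32")

embedding = layers.Embedding(VOCAB_SIZE, EMBED_DIM, trainable=False)  # frozen
model = models.Sequential([
    layers.Input(shape=(MAX_LEN,), dtype="int32"),
    embedding,
    layers.Bidirectional(layers.LSTM(128)),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
embedding.set_weights([glove_matrix])  # copy in the pretrained vectors
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Freezing the embedding keeps the 15M GloVe parameters fixed, so only the BiLSTM and dense heads are updated across the four phases.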
---

## 4. Training Strategy

| Phase | Data | Epochs | Batch Size |
|---|---|---|---|
| 1 — Hinglish | Hinglish train subset | 50 | 32 |
| 2 — Hindi | Hindi train subset | 50 | 32 |
| 3 — English | English train subset | 50 | 32 |
| 4 — Full | Full shuffled train | 50 | 64 |

The same model weights carry forward through all 4 phases — no reset between languages.

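The phase schedule and the per-phase cross-evaluation can be sketched as below. Everything here is a stand-in so the sketch runs: a tiny model, random data, and 1 epoch per phase instead of 50; `subset` is a hypothetical helper standing in for the real nprism language splits.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB, MAXLEN = 1_000, 20
rng = np.random.default_rng(0)

def subset(n):  # hypothetical stand-in for one language's (X, y) split
    X = rng.integers(0, VOCAB, size=(n, MAXLEN))
    y = rng.integers(0, 2, size=n).astype("float32")
    return X, y

model = models.Sequential([
    layers.Input(shape=(MAXLEN,), dtype="int32"),
    layers.Embedding(VOCAB, 16),
    layers.Bidirectional(layers.LSTM(8)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

eval_sets = {name: subset(64) for name in ("english", "hindi", "hinglish", "full")}

# One phase per language, then the full set; the same `model` object (and
# therefore the same weights) carries through every phase with no reset.
phases = [("hinglish", 32), ("hindi", 32), ("english", 32), ("full", 64)]
cross_eval = {}
for name, batch_size in phases:
    X, y = subset(128)
    model.fit(X, y, epochs=1, batch_size=batch_size, verbose=0)  # 50 epochs in the real run
    # Score every test set after each phase -> one row of the 4x4 matrix.
    cross_eval[name] = {lang: model.evaluate(Xe, ye, verbose=0)
                        for lang, (Xe, ye) in eval_sets.items()}
```

Because `model` is never rebuilt between phases, each `fit` call continues from the weights the previous phase left behind, which is the entire point of the sequential strategy.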
---

## 5. Results

Full cross-evaluation table (Phase × Eval Language):

| Phase | Eval On | Accuracy | Balanced Acc | Precision | Recall | Specificity | F1 | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| hinglish | english | 0.5171 | 0.5125 | 0.5738 | 0.0916 | 0.9334 | 0.1580 | 0.5620 |
| hinglish | hindi | 0.4493 | 0.5000 | 0.4493 | 1.0000 | 0.0000 | 0.6200 | 0.5234 |
| hinglish | hinglish | 0.6688 | 0.6378 | 0.6058 | 0.4848 | 0.7908 | 0.5386 | 0.6579 |
| hinglish | full | 0.5190 | 0.5133 | 0.4803 | 0.4331 | 0.5935 | 0.4555 | 0.5243 |
| hindi | english | 0.4711 | 0.4744 | 0.4789 | 0.7878 | 0.1611 | 0.5957 | 0.4292 |
| hindi | hindi | 0.5834 | 0.5730 | 0.5420 | 0.4705 | 0.6756 | 0.5037 | 0.5949 |
| hindi | hinglish | 0.5409 | 0.4885 | 0.3761 | 0.2299 | 0.7470 | 0.2854 | 0.4771 |
| hindi | full | 0.5190 | 0.5251 | 0.4859 | 0.6111 | 0.4390 | 0.5414 | 0.5255 |
| english | english | 0.7721 | 0.7726 | 0.7453 | 0.8190 | 0.7262 | 0.7804 | 0.8458 |
| english | hindi | 0.5424 | 0.5399 | 0.4912 | 0.5150 | 0.5648 | 0.5028 | 0.5377 |
| english | hinglish | 0.4115 | 0.4938 | 0.3955 | 0.9002 | 0.0875 | 0.5495 | 0.4572 |
| english | full | 0.6395 | 0.6458 | 0.5901 | 0.7337 | 0.5578 | 0.6541 | 0.6913 |
| **Full** | **english** | **0.7747** | **0.7746** | **0.7747** | **0.7678** | **0.7815** | **0.7712** | **0.8476** |
| **Full** | **hindi** | **0.5748** | **0.5676** | **0.5286** | **0.4958** | **0.6393** | **0.5117** | **0.5941** |
| **Full** | **hinglish** | **0.6326** | **0.6101** | **0.5426** | **0.4991** | **0.7210** | **0.5200** | **0.6161** |
| **Full** | **full** | **0.6866** | **0.6839** | **0.6687** | **0.6449** | **0.7228** | **0.6566** | **0.7556** |

### Key Observations

- **English phase is the turning point**: F1 on full test jumps from 0.541 → 0.654 after seeing English data, reflecting GloVe's English-centric embeddings.
- **Starting from Hinglish** forces the model to generalise from noisy code-mixed text first — the model reaches Hinglish F1=0.539 on the Hinglish test after just the Hinglish phase.
- **Final Full phase** improves balanced accuracy and specificity across all languages, reaching AUC=0.756 on the full test set.
- Hindi remains the hardest language to generalise to (F1=0.512 after Full phase), consistent with GloVe having limited Hindi coverage.

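The derived columns in the table follow the standard definitions: F1 is the harmonic mean of precision and recall, and balanced accuracy is the mean of recall (TPR) and specificity (TNR). A quick check against the Full-phase, full-test row:

```python
# Full phase, full test set (values from the table above)
precision, recall, specificity = 0.6687, 0.6449, 0.7228

f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
balanced_acc = (recall + specificity) / 2           # mean of TPR and TNR

print(f"F1 ~= {f1:.4f}")            # matches the reported 0.6566
print(f"Balanced ~= {balanced_acc:.4f}")  # close to the reported 0.6839
```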
---

## 6. Figures

Training curves and evaluation plots for every phase × language combination are in the `figures/hinglish_to_hindi_to_english/` directory.

**Training curves (Accuracy & Loss):**
- `Phase_hinglish_curves.png`
- `Phase_hindi_curves.png`
- `Phase_english_curves.png`
- `Phase_Full_curves.png`

**Per-phase evaluation (CM / ROC / PR / F1 curve) for each language + full:**
- `Phase_{phase}_eval_{lang}_cm.png`
- `Phase_{phase}_eval_{lang}_roc.png`
- `Phase_{phase}_eval_{lang}_pr.png`
- `Phase_{phase}_eval_{lang}_f1.png`

---

## 7. How to Use

```python
import json
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json

# Load model
model = load_model("hinglish_hindi_english_full.h5")

# Load tokenizer (tokenizer_from_json expects the JSON string
# produced by Tokenizer.to_json())
with open("tokenizer.json") as f:
    tokenizer = tokenizer_from_json(f.read())

# Predict
texts = ["your text here"]
seqs = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=100)
prob = model.predict(seqs)[0][0]
label = "Hate" if prob > 0.5 else "Non-Hate"
print(f"{label} ({prob:.4f})")
```

---

## Related

- **v1 (all 6 strategies, 8 epochs):** [tuklu/SASC](https://huggingface.co/tuklu/SASC)
- **Dataset:** [tuklu/nprism](https://huggingface.co/datasets/tuklu/nprism)