# Hate Speech Detection: Multilingual Sequential Transfer Learning
### GloVe Embeddings + Bidirectional LSTM (BiLSTM)

---

## What is this project about?

This project builds a system that can automatically detect **hate speech** in text written in three languages:
- **English**: standard English text
- **Hindi**: Hindi text (transliterated or native script)
- **Hinglish**: a mix of Hindi and English (very common in Indian social media)

The core question we are trying to answer is:

> **Does the order in which you teach a model different languages matter for how well it performs?**

For example, is a model that learns English first, then Hindi, then Hinglish better or worse than one that learns Hinglish first?

---

## The Dataset

| Property | Value |
|---|---|
| Total samples | 29,506 |
| English samples | 14,994 (50.8%) |
| Hindi samples | 9,738 (33.0%) |
| Hinglish samples | 4,774 (16.2%) |
| Hate speech (label=1) | 13,707 (46.5%) |
| Non-hate speech (label=0) | 15,799 (53.5%) |

![Language Distribution](output/figures/language_distribution.png)

The dataset was split into three parts:
- **Training set**: 17,704 samples (used to teach the model)
- **Validation set**: 2,950 samples (used to monitor learning during training)
- **Test set**: 8,852 samples (used only at the end to measure real performance)
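
The three counts above add up to 29,506 and correspond to roughly a 60% / 10% / 30% split. The README does not say how the split was drawn, so the following is only a sketch of a plain shuffled split. The function name, the seed, and the absence of stratification are assumptions, and the resulting sizes can differ from the quoted counts by a sample or so depending on rounding.

```python
import random


def split_indices(n_samples, train_frac=0.6, val_frac=0.1, seed=42):
    """Shuffle row indices and cut them into train / validation / test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed (an assumption) for repeatability
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # everything left over
    return train, val, test


# 29,506 = 17,704 + 2,950 + 8,852, the split sizes quoted in this README.
train_idx, val_idx, test_idx = split_indices(29_506)
print(len(train_idx), len(val_idx), len(test_idx))
```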

---

## The Model: What is GloVe + BiLSTM?

Think of the model as a two-part reading machine:

### Part 1: GloVe Embeddings (the dictionary)
Before the model can understand words, it needs to know what words *mean* relative to each other. GloVe (Global Vectors) is a pre-trained lookup table covering a **400,000-word English vocabulary**, where each word is represented as a list of 300 numbers that capture its meaning. Words with similar meanings end up with similar numbers.

- We used `glove.6B.300d.txt`: trained on a 6-billion-token corpus, 300 dimensions
- The embedding layer is **frozen** (not updated during training): we keep GloVe's knowledge as-is and only train the layers on top
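
To make the lookup table concrete: each line of a GloVe text file holds a word followed by its space-separated vector components. The sketch below parses that format and fills an embedding matrix. The `word_index` mapping (word to integer id, with 0 reserved for padding) is a hypothetical tokenizer vocabulary, and the 3-dimensional sample stands in for the real 300-dimensional file. Words missing from GloVe keep an all-zero row, which is worth bearing in mind for Hindi and Hinglish tokens.

```python
import io

import numpy as np


def load_glove(lines, dim=300):
    """Parse GloVe text lines of the form 'word v1 v2 ... v_dim' into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


def build_embedding_matrix(word_index, vectors, dim=300):
    """Row i holds the vector for the word with id i; unknown words stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")  # row 0 = padding
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix


# Tiny 3-dimensional stand-in for the real file, just to show the format.
# With the real file one would pass an open file handle instead, e.g.
# open("/root/glove.6B.300d.txt", encoding="utf-8"), and keep dim=300.
sample = io.StringIO("hate 0.1 0.2 0.3\nlove -0.1 0.0 0.4\n")
vectors = load_glove(sample, dim=3)
matrix = build_embedding_matrix({"hate": 1, "love": 2, "nafrat": 3}, vectors, dim=3)
print(matrix.shape)
```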

### Part 2: Bidirectional LSTM (the reader)
An LSTM (Long Short-Term Memory) is a type of neural network designed to read sequences, such as sentences, and remember what it read. **Bidirectional** means it reads the sentence both forwards and backwards, so it understands context from both directions.

```
Input sentence
      ↓
GloVe Embeddings (300d, frozen)
      ↓
BiLSTM (128 units, reads left→right AND right→left)
      ↓
Dropout (50%, randomly switches off neurons to prevent overfitting)
      ↓
Dense layer (64 neurons, ReLU activation)
      ↓
Output (1 neuron, Sigmoid: gives a probability 0 to 1)
      ↓
> 0.5 = Hate Speech, ≤ 0.5 = Not Hate Speech
```
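
The diagram above can be written as a short Keras model. This is a minimal sketch rather than the project's actual `build_model`: the sequence length (`max_len=100`), the Adam optimizer, the binary cross-entropy loss, and the reading of "128 units" as 128 per direction are all assumptions not stated in this README. The frozen GloVe weights are passed through a constant initializer, and the last lines apply the 0.5 threshold from the diagram.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_model(embedding_matrix, max_len=100):
    """Frozen embeddings -> BiLSTM(128) -> Dropout(0.5) -> Dense(64, relu) -> Dense(1, sigmoid)."""
    vocab_size, embed_dim = embedding_matrix.shape
    inputs = keras.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(
        vocab_size,
        embed_dim,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False,  # frozen: GloVe vectors are not updated during training
    )(inputs)
    x = layers.Bidirectional(layers.LSTM(128))(x)  # assumed: 128 units per direction
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    # Optimizer and loss are assumptions; the README does not name them.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


# Placeholder matrix (all zeros) just to check shapes; use the GloVe matrix in practice.
model = build_model(np.zeros((1000, 300), dtype="float32"))
probs = model.predict(np.zeros((2, 100), dtype="int32"), verbose=0)
labels = (probs > 0.5).astype(int)  # 1 = hate speech, 0 = not hate speech
```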

---

## The Training Strategy: What is Transfer Learning?

**Transfer learning** means the model carries what it learned from one task into the next. It is like a student who already knows French: learning Spanish is easier because both languages share Latin roots.

In our case, we train the model on one language, and instead of starting fresh for the next language, we **keep all the weights (knowledge)** from the previous training. The model continues learning from where it left off.

### The Bug We Fixed
The original code was creating a **brand new model** for every language, resetting all the weights each time. That is not transfer learning; it is just training three separate models. We fixed this by building the model **once** and sequentially fine-tuning it.

```python
# WRONG: model reset every loop iteration
for lang in languages:
    model = Sequential()   # ← new model = no transfer learning
    model.fit(...)

# CORRECT: model built once, weights carry forward
model = build_model()      # ← built once
for lang in languages:
    model.fit(...)         # ← continues learning from previous language
```

---

## Plan B: The Experiment

We ran all **6 possible orderings** of the three languages, each followed by a final training round on the complete shuffled dataset:

| # | Strategy |
|---|---|
| 1 | English → Hindi → Hinglish → Full |
| 2 | English → Hinglish → Hindi → Full |
| 3 | Hindi → English → Hinglish → Full |
| 4 | Hindi → Hinglish → English → Full |
| 5 | Hinglish → English → Hindi → Full |
| 6 | Hinglish → Hindi → English → Full |

For each strategy, training happens in 4 phases. **After each phase**, we immediately evaluate the model on that phase's test data (the matching language's test set, or the full test set in the final phase) and record all metrics. This tells us how well the model performs at each stage of the learning journey.

```
Phase 1: Train on Language A → Test on Language A test set → Record metrics + plots
Phase 2: Train on Language B → Test on Language B test set → Record metrics + plots
Phase 3: Train on Language C → Test on Language C test set → Record metrics + plots
Phase 4: Train on Full data  → Test on Full test set       → Record metrics + plots
```
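
The experiment loop can be sketched with `itertools.permutations`, which yields the six orderings listed in this section. `train_phase` and `evaluate_phase` are hypothetical callables standing in for the real training and evaluation code. In a real run they would share one model per strategy so that weights carry across phases, and it is assumed here that each strategy starts from a freshly built model. The epoch count and batch sizes are the ones quoted in this section.

```python
from itertools import permutations

LANGUAGES = ["english", "hindi", "hinglish"]
EPOCHS = 8


def run_strategy(order, train_phase, evaluate_phase):
    """Run the three language phases in `order`, then the final 'full' phase."""
    rows = []
    for phase in list(order) + ["full"]:
        batch_size = 64 if phase == "full" else 32
        train_phase(phase, epochs=EPOCHS, batch_size=batch_size)  # same model every phase
        metrics = evaluate_phase(phase)  # test split that matches this phase
        rows.append({"strategy": " -> ".join(order) + " -> full", "phase": phase, **metrics})
    return rows


# Stand-in callables (hypothetical) so the loop runs without a real model or data.
def fake_train(phase, epochs, batch_size):
    pass


def fake_evaluate(phase):
    return {"accuracy": 0.0}


all_rows = []
for order in permutations(LANGUAGES):
    all_rows.extend(run_strategy(order, fake_train, fake_evaluate))

print(len(all_rows))  # 6 orderings x 4 phases = 24 rows
```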

Each phase used **8 epochs** with batch size 32 (64 for the full phase).

---

## Metrics: What do we measure?

| Metric | What it means in plain English |
|---|---|
| **Accuracy** | Out of all predictions, how many were correct? |
| **Balanced Accuracy** | Average of Recall and Specificity, so each class counts equally even when the classes are imbalanced |
| **Precision** | Of everything the model flagged as hate speech, how much actually was? |
| **Recall** | Of all actual hate speech, how much did the model catch? |
| **Specificity** | Of all non-hate speech, how much did the model correctly leave unflagged? |
| **F1 Score** | Balance between Precision and Recall (harmonic mean) |
| **ROC-AUC** | Overall ability to distinguish hate from non-hate (1.0 = perfect, 0.5 = random guessing) |
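
All of the threshold-based metrics in the table follow from the four confusion-matrix counts, and ROC-AUC can be read as the chance that a randomly chosen hate-speech sample scores higher than a randomly chosen non-hate sample. The plain-Python sketch below illustrates the definitions on made-up labels and scores. The function names are illustrative, and scikit-learn's `sklearn.metrics` module offers equivalent, faster implementations.

```python
def confusion_metrics(y_true, y_pred):
    """Compute the threshold-based metrics from binary labels (1 = hate speech)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0  # avoid division by zero

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 8 samples with model probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [int(s > 0.5) for s in scores]

metrics = confusion_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```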

---

## Results Summary

Full results are in `output/results_tables/all_strategies_results.csv`. Key highlights:

### English phase performance across strategies (best language)

| Strategy | Accuracy | F1 | ROC-AUC |
|---|---|---|---|
| English → Hindi → Hinglish → Full | 0.7701 | 0.7696 | 0.8504 |
| English → Hinglish → Hindi → Full | 0.7721 | 0.7743 | 0.8525 |
| Hindi → English → Hinglish → Full | 0.7780 | 0.7830 | 0.8549 |
| Hindi → Hinglish → English → Full | 0.7780 | 0.7816 | 0.8563 |
| Hinglish → English → Hindi → Full | 0.7716 | 0.7829 | 0.8484 |
| Hinglish → Hindi → English → Full | 0.7765 | 0.7811 | 0.8534 |

### Full dataset phase (final performance)

| Strategy | Accuracy | F1 | ROC-AUC |
|---|---|---|---|
| English → Hindi → Hinglish → Full | 0.6796 | 0.5923 | 0.7599 |
| English → Hinglish → Hindi → Full | 0.6813 | 0.6244 | 0.7535 |
| Hindi → English → Hinglish → Full | 0.6854 | 0.6419 | 0.7528 |
| Hindi → Hinglish → English → Full | 0.6865 | 0.6364 | 0.7507 |
| Hinglish → English → Hindi → Full | 0.6778 | 0.6285 | 0.7521 |
| Hinglish → Hindi → English → Full | 0.6845 | 0.6301 | 0.7548 |

### Key observations
- **English** consistently achieves the highest accuracy (~77-78%) regardless of when it is trained, likely because GloVe embeddings are English-centric
- **Hindi** is the hardest language: accuracy hovers around 55-59% across all strategies
- **Hinglish** sits in the middle (~66-70%), which makes sense as it borrows heavily from English
- Strategies that train **Hindi first** (for example `Hindi → English → Hinglish`) tend to recover better in later phases: both Hindi-first orderings reach the highest English-phase accuracy (0.7780) and the highest Full-phase accuracy (0.6854 and 0.6865). This suggests the model benefits from tackling the hardest language early, though the margin over the other orderings is under one percentage point
- The **Full phase** shows consistent ~68% accuracy across all strategies (0.6778 to 0.6865), suggesting the final shuffled training largely evens out the differences introduced by ordering; F1 varies more (0.5923 to 0.6419)
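
How much the final numbers actually differ between orderings can be checked with a little arithmetic on the Full-phase table. The sketch below hard-codes those six rows and reports the minimum, maximum, and range per metric. It shows that accuracy and ROC-AUC vary by less than 0.01 across orderings, while F1 varies by about 0.05.

```python
# Full-phase numbers copied from the results table: (accuracy, F1, ROC-AUC).
full_phase = {
    "English -> Hindi -> Hinglish": (0.6796, 0.5923, 0.7599),
    "English -> Hinglish -> Hindi": (0.6813, 0.6244, 0.7535),
    "Hindi -> English -> Hinglish": (0.6854, 0.6419, 0.7528),
    "Hindi -> Hinglish -> English": (0.6865, 0.6364, 0.7507),
    "Hinglish -> English -> Hindi": (0.6778, 0.6285, 0.7521),
    "Hinglish -> Hindi -> English": (0.6845, 0.6301, 0.7548),
}


def spread(values):
    """Return (min, max, max - min), with the range rounded to 4 decimals."""
    lo, hi = min(values), max(values)
    return lo, hi, round(hi - lo, 4)


# zip(*rows) turns the six (acc, f1, auc) rows into three metric columns.
columns = dict(zip(("accuracy", "f1", "roc_auc"), zip(*full_phase.values())))
spreads = {name: spread(values) for name, values in columns.items()}

for name, (lo, hi, rng) in spreads.items():
    print(f"{name}: min={lo} max={hi} range={rng}")
```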

---

## Plots by Strategy

### Strategy 1: English → Hindi → Hinglish → Full

| Phase | Training Curves | Confusion Matrix | ROC Curve | PR Curve | F1 Curve |
|---|---|---|---|---|---|
| English |  |  |  |  |  |
| Hindi |  |  |  |  |  |
| Hinglish |  |  |  |  |  |
| Full |  |  |  |  |  |

---

### Strategy 2: English → Hinglish → Hindi → Full

| Phase | Training Curves | Confusion Matrix | ROC Curve | PR Curve | F1 Curve |
|---|---|---|---|---|---|
| English |  |  |  |  |  |
| Hinglish |  |  |  |  |  |
| Hindi |  |  |  |  |  |
| Full |  |  |  |  |  |

---

### Strategy 3: Hindi → English → Hinglish → Full

| Phase | Training Curves | Confusion Matrix | ROC Curve | PR Curve | F1 Curve |
|---|---|---|---|---|---|
| Hindi |  |  |  |  |  |
| English |  |  |  |  |  |
| Hinglish |  |  |  |  |  |
| Full |  |  |  |  |  |

---

### Strategy 4: Hindi → Hinglish → English → Full

| Phase | Training Curves | Confusion Matrix | ROC Curve | PR Curve | F1 Curve |
|---|---|---|---|---|---|
| Hindi |  |  |  |  |  |
| Hinglish |  |  |  |  |  |
| English |  |  |  |  |  |
| Full |  |  |  |  |  |

---

### Strategy 5: Hinglish → English → Hindi → Full

| Phase | Training Curves | Confusion Matrix | ROC Curve | PR Curve | F1 Curve |
|---|---|---|---|---|---|
| Hinglish |  |  |  |  |  |
| English |  |  |  |  |  |
| Hindi |  |  |  |  |  |
| Full |  |  |  |  |  |

---

### Strategy 6: Hinglish → Hindi → English → Full

| Phase | Training Curves | Confusion Matrix | ROC Curve | PR Curve | F1 Curve |
|---|---|---|---|---|---|
| Hinglish |  |  |  |  |  |
| Hindi |  |  |  |  |  |
| English |  |  |  |  |  |
| Full |  |  |  |  |  |

---

## Output Files

```
output/
├── dataset_splits/
│   ├── train.csv                       # 17,704 training samples
│   ├── val.csv                         # 2,950 validation samples
│   └── test.csv                        # 8,852 test samples
│
├── results_tables/
│   ├── all_strategies_results.csv      # All 24 rows (6 strategies × 4 phases)
│   ├── english_to_hindi_to_hinglish_results.csv
│   ├── english_to_hinglish_to_hindi_results.csv
│   ├── hindi_to_english_to_hinglish_results.csv
│   ├── hindi_to_hinglish_to_english_results.csv
│   ├── hinglish_to_english_to_hindi_results.csv
│   └── hinglish_to_hindi_to_english_results.csv
│
└── figures/
    ├── language_distribution.png       # Pie chart of dataset languages
    │
    ├── english_to_hindi_to_hinglish/   # One folder per strategy
    │   ├── *_[english]_curves.png      # Train/Val accuracy + loss
    │   ├── *_[english]_cm.png          # Confusion matrix
    │   ├── *_[english]_roc.png         # ROC curve
    │   ├── *_[english]_pr.png          # Precision-Recall curve
    │   ├── *_[english]_f1.png          # F1 vs Threshold curve
    │   ├── *_[hindi]_curves.png
    │   ├── *_[hindi]_cm.png  ...
    │   ├── *_[hinglish]_curves.png
    │   ├── *_[hinglish]_cm.png  ...
    │   ├── *_[Full]_curves.png
    │   └── *_[Full]_cm.png  ...
    │
    ├── english_to_hinglish_to_hindi/
    ├── hindi_to_english_to_hinglish/
    ├── hindi_to_hinglish_to_english/
    ├── hinglish_to_english_to_hindi/
    └── hinglish_to_hindi_to_english/
```
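
The `*_f1.png` files in the tree above plot F1 against the decision threshold. The README does not show how those curves are produced, so this is only an illustrative sketch on made-up labels and scores: it recomputes F1 for thresholds from 0.05 to 0.95 and picks the best one, rather than assuming the default 0.5 cut-off. All function names here are hypothetical.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when predicting 1 whenever score > threshold."""
    y_pred = [int(s > threshold) for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_curve(y_true, scores, steps=19):
    """Evaluate F1 on an evenly spaced threshold grid (0.05, 0.10, ..., 0.95 by default)."""
    thresholds = [(i + 1) / (steps + 1) for i in range(steps)]
    return [(t, f1_at_threshold(y_true, scores, t)) for t in thresholds]


# Made-up example: 8 samples with model probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

curve = f1_curve(y_true, scores)
best_threshold, best_f1 = max(curve, key=lambda pair: pair[1])
print(best_threshold, best_f1)
```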

---

## How to Run

### Requirements
```bash
pip install tensorflow scikit-learn pandas seaborn matplotlib
```

You also need GloVe embeddings (`glove.6B.300d.txt`) placed at `/root/glove.6B.300d.txt`:
```bash
wget http://nlp.stanford.edu/data/glove.6B.zip && unzip glove.6B.zip glove.6B.300d.txt -d /root
```

### Run
```bash
python main.py
```

Training was performed on an NVIDIA H200 GPU (Vast.ai); total runtime was approximately 15-20 minutes for all 6 strategies.

---

## Project Structure

```
SASC/
├── main.py              # Full training + evaluation pipeline
├── dataset.csv          # Raw dataset (29,506 samples)
├── README.md            # This file
└── output/              # All results, figures, and model checkpoints
```