tuklu committed · Commit fcc2612 · verified · 1 Parent(s): cbf0a6c

Upload README.md with huggingface_hub

  1. README.md +310 -3
README.md CHANGED
@@ -1,3 +1,310 @@
- ---
- license: mit
- ---
+ # Hate Speech Detection: Multilingual Sequential Transfer Learning
+ ### GloVe Embeddings + Bidirectional LSTM (BiLSTM)
+
+ ---
+
+ ## What is this project about?
+
+ This project builds a system that automatically detects **hate speech** in text written in three languages:
+ - **English**: standard English text
+ - **Hindi**: Hindi text (transliterated or native script)
+ - **Hinglish**: a mix of Hindi and English (very common in Indian social media)
+
+ The core question we are trying to answer is:
+
+ > **Does the order in which you teach a model different languages matter for how well it performs?**
+
+ For example, is a model that learns English first, then Hindi, then Hinglish better or worse than one that learns Hinglish first?
+
+ ---
+
+ ## The Dataset
+
+ | Property | Value |
+ |---|---|
+ | Total samples | 29,505 (the per-language, per-label, and split counts below each sum to 29,506) |
+ | English samples | 14,994 (50.8%) |
+ | Hindi samples | 9,738 (33.0%) |
+ | Hinglish samples | 4,774 (16.2%) |
+ | Hate speech (label=1) | 13,707 (46.5%) |
+ | Non-hate speech (label=0) | 15,799 (53.5%) |
+
+ ![Language Distribution](output/figures/language_distribution.png)
+
+ The dataset was split into three parts (roughly 60% / 10% / 30%):
+ - **Training set**: 17,704 samples (used to teach the model)
+ - **Validation set**: 2,950 samples (used to monitor learning during training)
+ - **Test set**: 8,852 samples (used only at the end to measure real performance)
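
The three split sizes are consistent with a 60/10/30 partition of 29,506 rows. The sketch below reproduces those sizes under the assumption of a plain shuffled split; the README does not say whether the real split was stratified by language or label, and `split_sizes`, `split_indices` and the seed are hypothetical names and values chosen for illustration.

```python
import random


def split_sizes(n_total, train_frac=0.6, test_frac=0.3):
    """Return (train, val, test) sizes; validation takes whatever is left."""
    n_train = round(n_total * train_frac)
    n_test = round(n_total * test_frac)
    n_val = n_total - n_train - n_test
    return n_train, n_val, n_test


def split_indices(n_total, seed=42):
    """Shuffle row indices once, then cut them into three disjoint parts."""
    n_train, n_val, _ = split_sizes(n_total)
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


sizes = split_sizes(29506)
train_idx, val_idx, test_idx = split_indices(100)
```

Because the test indices are cut from the same shuffle, no row can appear in more than one part, which is the property that keeps the test score honest.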
+
+ ---
+
+ ## The Model: What is GloVe + BiLSTM?
+
+ Think of the model as a two-part reading machine:
+
+ ### Part 1: GloVe Embeddings (the dictionary)
+ Before the model can understand words, it needs to know what words *mean* relative to each other. GloVe (Global Vectors) is a pre-trained lookup table. The `glove.6B` release used here covers a vocabulary of 400,000 lowercased words, mostly English, and represents each word as a vector of 300 numbers that captures its meaning. Words with similar meanings end up with similar vectors.
+
+ - We used `glove.6B.300d.txt`: trained on a corpus of 6 billion tokens, 300 dimensions
+ - The embedding layer is **frozen** (not updated during training): we keep GloVe's knowledge as-is and only train the layers on top
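
To make the "lookup table" idea concrete, here is a minimal sketch of parsing the GloVe text format and building an embedding matrix. It is not taken from `main.py`: the function names, the three-dimensional sample vectors and the `word_index` mapping are all made up for illustration. Words that GloVe does not contain keep an all-zero row, and with a frozen layer those rows never improve.

```python
import io


def load_glove(lines):
    """Parse GloVe text format: one word followed by its vector per line."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i; row 0 is padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    missing = []
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is None:
            missing.append(word)  # out-of-vocabulary: row stays all zeros
        else:
            matrix[idx] = vec
    return matrix, missing


# Tiny 3-dimensional stand-in for glove.6B.300d.txt (values are made up).
sample = io.StringIO("hate 0.1 0.2 0.3\nlove 0.4 0.5 0.6\n")
glove = load_glove(sample)
word_index = {"hate": 1, "love": 2, "नफरत": 3}  # hypothetical tokenizer output
matrix, missing = build_embedding_matrix(word_index, glove, dim=3)
```

In a Keras pipeline a matrix like this would typically be passed to an `Embedding` layer created with `trainable=False`. Counting the entries in `missing` per language is a quick way to see how much of the Hindi and Hinglish vocabulary an English-centric table actually covers.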
+
+ ### Part 2: Bidirectional LSTM (the reader)
+ An LSTM (Long Short-Term Memory) is a type of neural network designed to read sequences, such as sentences, and remember what it read. **Bidirectional** means it reads the sentence both forwards and backwards, so it uses context from both directions.
+
+ ```
+ Input sentence
+ ↓
+ GloVe Embeddings (300d, frozen)
+ ↓
+ BiLSTM (128 units, reads left→right AND right←left)
+ ↓
+ Dropout (50%: randomly switches off neurons to prevent overfitting)
+ ↓
+ Dense layer (64 neurons, ReLU activation)
+ ↓
+ Output (1 neuron, Sigmoid: gives a probability from 0 to 1)
+ ↓
+ > 0.5 = Hate Speech, ≤ 0.5 = Not Hate Speech
+ ```
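
As a rough sanity check on the size of this architecture, the trainable parameter count can be worked out by hand. The sketch assumes a standard LSTM cell with biases, that "128 units" means 128 per direction, and that the two directions are concatenated before the dense layer; none of this is confirmed by the README, and the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """One LSTM direction: 4 gates, each with input, recurrent and bias weights."""
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    """Fully connected layer: one weight per input plus one bias, per unit."""
    return units * (input_dim + 1)


EMBED_DIM = 300   # GloVe vector size (frozen, so not counted as trainable)
LSTM_UNITS = 128  # assumed to be the size of each direction

bilstm = 2 * lstm_params(EMBED_DIM, LSTM_UNITS)  # forward + backward
hidden = dense_params(2 * LSTM_UNITS, 64)        # concatenated BiLSTM output
output = dense_params(64, 1)                     # single sigmoid neuron
trainable_total = bilstm + hidden + output       # dropout has no weights
```

Under these assumptions the BiLSTM holds 439,296 of the 455,809 trainable weights, so almost everything the model learns from each language is stored in the recurrent layer.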
+
+ ---
+
+ ## The Training Strategy: What is Transfer Learning?
+
+ **Transfer learning** means the model carries what it learned from one task into the next. It is like a student who already knows French: learning Spanish is easier because both share Latin roots.
+
+ In our case, we train the model on one language and, instead of starting fresh for the next language, we **keep all the weights (knowledge)** from the previous training. The model continues learning from where it left off.
+
+ ### The Bug We Fixed
+ The original code created a **brand new model** for every language, resetting all the weights each time. That is not transfer learning; it is just training three separate models. We fixed this by building the model **once** and fine-tuning it sequentially.
+
+ ```python
+ # WRONG: model reset on every loop iteration
+ for lang in languages:
+     model = Sequential()  # ← new model = no transfer learning
+     model.fit(...)
+
+ # CORRECT: model built once, weights carry forward
+ model = build_model()  # ← built once
+ for lang in languages:
+     model.fit(...)  # ← continues learning from the previous language
+ ```
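
The difference between the two loops can be shown with a toy model that runs without TensorFlow. `ToyModel` is a hypothetical stand-in (one weight per feature, plain SGD), not the BiLSTM from this project; the point is only to show which weights survive each pattern.

```python
class ToyModel:
    """Stand-in for a neural network: one weight per feature, trained by SGD."""

    def __init__(self, n_features):
        self.weights = [0.0] * n_features
        self.updates = 0  # number of gradient steps taken so far

    def fit(self, samples, epochs=1, lr=0.1):
        for _ in range(epochs):
            for features, label in samples:
                pred = sum(w * x for w, x in zip(self.weights, features))
                error = pred - label
                self.weights = [w - lr * error * x
                                for w, x in zip(self.weights, features)]
                self.updates += 1
        return self


def train_with_reset(phases, n_features):
    """The buggy pattern: every phase starts from freshly initialised weights."""
    model = None
    for samples in phases:
        model = ToyModel(n_features)
        model.fit(samples)
    return model


def train_sequentially(phases, n_features):
    """The fixed pattern: one model, fine-tuned phase after phase."""
    model = ToyModel(n_features)
    for samples in phases:
        model.fit(samples)
    return model


# Three hypothetical "languages", each touching a different feature.
phases = [
    [([1.0, 0.0, 0.0], 1.0)],
    [([0.0, 1.0, 0.0], 1.0)],
    [([0.0, 0.0, 1.0], 1.0)],
]
reset_model = train_with_reset(phases, n_features=3)
sequential_model = train_sequentially(phases, n_features=3)
```

After the loop, `reset_model` has only learned from the last phase (one update, one non-zero weight), while `sequential_model` retains something from all three phases.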
+
+ ---
+
+ ## Plan B: The Experiment
+
+ We ran all **6 possible orderings** of the three languages, each followed by a final training round on the complete shuffled dataset:
+
+ | # | Strategy |
+ |---|---|
+ | 1 | English → Hindi → Hinglish → Full |
+ | 2 | English → Hinglish → Hindi → Full |
+ | 3 | Hindi → English → Hinglish → Full |
+ | 4 | Hindi → Hinglish → English → Full |
+ | 5 | Hinglish → English → Hindi → Full |
+ | 6 | Hinglish → Hindi → English → Full |
+
+ For each strategy, training happens in 4 phases. **After each phase**, we immediately evaluate the model on the test data for the language it was just trained on (the full test set after the final phase) and record all metrics. This shows how well the model performs at each stage of the learning journey.
+
+ ```
+ Phase 1: Train on Language A → Test on Language A test set → Record metrics + plots
+ Phase 2: Train on Language B → Test on Language B test set → Record metrics + plots
+ Phase 3: Train on Language C → Test on Language C test set → Record metrics + plots
+ Phase 4: Train on Full data → Test on Full test set → Record metrics + plots
+ ```
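
The six strategies are simply the permutations of the three languages with a final `Full` phase appended. A short sketch of how such a plan could be generated follows; the function names are hypothetical, while the `_to_` naming mirrors the folder names used under `output/figures/`.

```python
from itertools import permutations

LANGUAGES = ("english", "hindi", "hinglish")


def build_strategies(languages=LANGUAGES):
    """Return one phase list per ordering, each ending with the full dataset."""
    return [list(order) + ["Full"] for order in permutations(languages)]


def strategy_name(phases):
    """Folder-style name built from the language phases, e.g. 'a_to_b_to_c'."""
    return "_to_".join(p for p in phases if p != "Full")


strategies = build_strategies()
for number, phases in enumerate(strategies, start=1):
    print(number, " -> ".join(phases))
```

`itertools.permutations` yields orderings in lexicographic order of the input tuple, so with the languages listed alphabetically the output matches the numbering in the strategy table.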
+
+ Each phase used **8 epochs** with batch size 32 (64 for the full phase).
+
+ ---
+
+ ## Metrics: What do we measure?
+
+ | Metric | What it means in plain English |
+ |---|---|
+ | **Accuracy** | Out of all predictions, how many were correct? |
+ | **Balanced Accuracy** | Average of Recall and Specificity, so neither class dominates the score |
+ | **Precision** | Of everything the model flagged as hate speech, how much actually was? |
+ | **Recall** | Of all actual hate speech, how much did the model catch? |
+ | **Specificity** | Of all non-hate speech, how much did the model correctly ignore? |
+ | **F1 Score** | Balance between Precision and Recall (harmonic mean) |
+ | **ROC-AUC** | Overall ability to distinguish hate from non-hate (1.0 = perfect) |
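
The definitions in the table translate directly into code. The sketch below computes them from the four confusion-matrix counts, plus a pairwise ROC-AUC, in plain Python; it illustrates the formulas and is not the evaluation code from `main.py`, which presumably relies on scikit-learn. The example counts and scores are made up.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Threshold-based metrics from the four confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return safe_div(wins, len(positives) * len(negatives))


# Made-up example: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.2])
```

Note that ROC-AUC is computed from the raw sigmoid scores, so it does not depend on the 0.5 threshold, whereas every other metric in the table does.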
+
+ ---
+
+ ## Results Summary
+
+ Full results are in `output/results_tables/all_strategies_results.csv`. Key highlights:
+
+ ### English phase performance across strategies (best-performing language)
+
+ | Strategy | Accuracy | F1 | ROC-AUC |
+ |---|---|---|---|
+ | English → Hindi → Hinglish → Full | 0.7701 | 0.7696 | 0.8504 |
+ | English → Hinglish → Hindi → Full | 0.7721 | 0.7743 | 0.8525 |
+ | Hindi → English → Hinglish → Full | 0.7780 | 0.7830 | 0.8549 |
+ | Hindi → Hinglish → English → Full | 0.7780 | 0.7816 | 0.8563 |
+ | Hinglish → English → Hindi → Full | 0.7716 | 0.7829 | 0.8484 |
+ | Hinglish → Hindi → English → Full | 0.7765 | 0.7811 | 0.8534 |
+
+ ### Full dataset phase (final performance)
+
+ | Strategy | Accuracy | F1 | ROC-AUC |
+ |---|---|---|---|
+ | English → Hindi → Hinglish → Full | 0.6796 | 0.5923 | 0.7599 |
+ | English → Hinglish → Hindi → Full | 0.6813 | 0.6244 | 0.7535 |
+ | Hindi → English → Hinglish → Full | 0.6854 | 0.6419 | 0.7528 |
+ | Hindi → Hinglish → English → Full | 0.6865 | 0.6364 | 0.7507 |
+ | Hinglish → English → Hindi → Full | 0.6778 | 0.6285 | 0.7521 |
+ | Hinglish → Hindi → English → Full | 0.6845 | 0.6301 | 0.7548 |
+
+ ### Key observations
+ - **English** consistently achieves the highest accuracy (~77%) regardless of when it is trained, likely because GloVe embeddings are English-centric
+ - **Hindi** is the hardest language: accuracy hovers around 55–59% across all strategies
+ - **Hinglish** sits in the middle (~66–70%), which makes sense as it borrows heavily from English
+ - Strategies that train **Hindi first** score marginally highest in both tables above (English phase: 0.7780 vs 0.7701–0.7765; Full phase: 0.6854–0.6865 vs 0.6778–0.6845). This may suggest a benefit from tackling the hardest language early, although the gap is under one percentage point
+ - The **Full phase** lands at ~68% accuracy for every strategy (0.6778–0.6865), suggesting the final shuffled training largely evens out the differences introduced by ordering; F1 varies more (0.5923–0.6419)
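
How much ordering matters can be quantified from the Full-phase table with a few lines of Python. The numbers below are copied from that table; the helper names are hypothetical, and the spread is descriptive only, since the README reports a single run per strategy.

```python
# (accuracy, F1, ROC-AUC) for the Full phase, copied from the results table.
FULL_PHASE = {
    "English -> Hindi -> Hinglish": (0.6796, 0.5923, 0.7599),
    "English -> Hinglish -> Hindi": (0.6813, 0.6244, 0.7535),
    "Hindi -> English -> Hinglish": (0.6854, 0.6419, 0.7528),
    "Hindi -> Hinglish -> English": (0.6865, 0.6364, 0.7507),
    "Hinglish -> English -> Hindi": (0.6778, 0.6285, 0.7521),
    "Hinglish -> Hindi -> English": (0.6845, 0.6301, 0.7548),
}
METRICS = {"accuracy": 0, "f1": 1, "roc_auc": 2}


def best_strategy(results, metric):
    """Name of the strategy with the highest value for the given metric."""
    column = METRICS[metric]
    return max(results, key=lambda name: results[name][column])


def spread(results, metric):
    """Gap between the best and worst strategy, rounded to 4 decimals."""
    column = METRICS[metric]
    values = [row[column] for row in results.values()]
    return round(max(values) - min(values), 4)


accuracy_spread = spread(FULL_PHASE, "accuracy")
f1_spread = spread(FULL_PHASE, "f1")
best_f1 = best_strategy(FULL_PHASE, "f1")
```

The accuracy spread is under one percentage point while the F1 spread is about five, and a different strategy comes out on top depending on the metric. That makes it hard to name a single winning order without repeated runs.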
+
+ ---
+
+ ## Plots by Strategy
+
+ ### Strategy 1: English → Hindi → Hinglish → Full
+
+ | Phase | Training Curves | Confusion Matrix | ROC Curve | PR Curve | F1 Curve |
+ |---|---|---|---|---|---|
+ | English | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[english]_curves.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[english]_cm.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[english]_roc.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[english]_pr.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[english]_f1.png) |
+ | Hindi | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[hindi]_curves.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[hindi]_cm.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[hindi]_roc.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[hindi]_pr.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[hindi]_f1.png) |
+ | Hinglish | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[hinglish]_curves.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[hinglish]_cm.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[hinglish]_roc.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[hinglish]_pr.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[hinglish]_f1.png) |
+ | Full | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[Full]_curves.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[Full]_cm.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[Full]_roc.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[Full]_pr.png) | ![](output/figures/english_to_hindi_to_hinglish/english_to_hindi_to_hinglish_[Full]_f1.png) |
+
+ ---
+
+ ### Strategy 2: English → Hinglish → Hindi → Full
+
+ | Phase | Training Curves | Confusion Matrix | ROC Curve | PR Curve | F1 Curve |
+ |---|---|---|---|---|---|
+ | English | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[english]_curves.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[english]_cm.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[english]_roc.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[english]_pr.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[english]_f1.png) |
+ | Hinglish | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[hinglish]_curves.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[hinglish]_cm.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[hinglish]_roc.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[hinglish]_pr.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[hinglish]_f1.png) |
+ | Hindi | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[hindi]_curves.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[hindi]_cm.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[hindi]_roc.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[hindi]_pr.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[hindi]_f1.png) |
+ | Full | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[Full]_curves.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[Full]_cm.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[Full]_roc.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[Full]_pr.png) | ![](output/figures/english_to_hinglish_to_hindi/english_to_hinglish_to_hindi_[Full]_f1.png) |
+
+ ---
+
+ ### Strategy 3: Hindi → English → Hinglish → Full
+
+ | Phase | Training Curves | Confusion Matrix | ROC Curve | PR Curve | F1 Curve |
+ |---|---|---|---|---|---|
+ | Hindi | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[hindi]_curves.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[hindi]_cm.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[hindi]_roc.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[hindi]_pr.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[hindi]_f1.png) |
+ | English | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[english]_curves.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[english]_cm.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[english]_roc.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[english]_pr.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[english]_f1.png) |
+ | Hinglish | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[hinglish]_curves.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[hinglish]_cm.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[hinglish]_roc.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[hinglish]_pr.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[hinglish]_f1.png) |
+ | Full | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[Full]_curves.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[Full]_cm.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[Full]_roc.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[Full]_pr.png) | ![](output/figures/hindi_to_english_to_hinglish/hindi_to_english_to_hinglish_[Full]_f1.png) |
+
+ ---
+
+ ### Strategy 4: Hindi → Hinglish → English → Full
+
+ | Phase | Training Curves | Confusion Matrix | ROC Curve | PR Curve | F1 Curve |
+ |---|---|---|---|---|---|
+ | Hindi | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[hindi]_curves.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[hindi]_cm.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[hindi]_roc.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[hindi]_pr.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[hindi]_f1.png) |
+ | Hinglish | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[hinglish]_curves.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[hinglish]_cm.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[hinglish]_roc.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[hinglish]_pr.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[hinglish]_f1.png) |
+ | English | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[english]_curves.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[english]_cm.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[english]_roc.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[english]_pr.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[english]_f1.png) |
+ | Full | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[Full]_curves.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[Full]_cm.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[Full]_roc.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[Full]_pr.png) | ![](output/figures/hindi_to_hinglish_to_english/hindi_to_hinglish_to_english_[Full]_f1.png) |
+
+ ---
+
+ ### Strategy 5: Hinglish → English → Hindi → Full
+
+ | Phase | Training Curves | Confusion Matrix | ROC Curve | PR Curve | F1 Curve |
+ |---|---|---|---|---|---|
+ | Hinglish | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[hinglish]_curves.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[hinglish]_cm.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[hinglish]_roc.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[hinglish]_pr.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[hinglish]_f1.png) |
+ | English | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[english]_curves.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[english]_cm.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[english]_roc.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[english]_pr.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[english]_f1.png) |
+ | Hindi | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[hindi]_curves.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[hindi]_cm.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[hindi]_roc.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[hindi]_pr.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[hindi]_f1.png) |
+ | Full | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[Full]_curves.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[Full]_cm.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[Full]_roc.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[Full]_pr.png) | ![](output/figures/hinglish_to_english_to_hindi/hinglish_to_english_to_hindi_[Full]_f1.png) |
+
+ ---
+
+ ### Strategy 6: Hinglish → Hindi → English → Full
+
+ | Phase | Training Curves | Confusion Matrix | ROC Curve | PR Curve | F1 Curve |
+ |---|---|---|---|---|---|
+ | Hinglish | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[hinglish]_curves.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[hinglish]_cm.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[hinglish]_roc.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[hinglish]_pr.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[hinglish]_f1.png) |
+ | Hindi | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[hindi]_curves.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[hindi]_cm.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[hindi]_roc.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[hindi]_pr.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[hindi]_f1.png) |
+ | English | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[english]_curves.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[english]_cm.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[english]_roc.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[english]_pr.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[english]_f1.png) |
+ | Full | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[Full]_curves.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[Full]_cm.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[Full]_roc.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[Full]_pr.png) | ![](output/figures/hinglish_to_hindi_to_english/hinglish_to_hindi_to_english_[Full]_f1.png) |
+
+ ---
+
+ ## Output Files
+
+ ```
+ output/
+ ├── dataset_splits/
+ │   ├── train.csv                      # 17,704 training samples
+ │   ├── val.csv                        # 2,950 validation samples
+ │   └── test.csv                       # 8,852 test samples
+ │
+ ├── results_tables/
+ │   ├── all_strategies_results.csv     # All 24 rows (6 strategies × 4 phases)
+ │   ├── english_to_hindi_to_hinglish_results.csv
+ │   ├── english_to_hinglish_to_hindi_results.csv
+ │   ├── hindi_to_english_to_hinglish_results.csv
+ │   ├── hindi_to_hinglish_to_english_results.csv
+ │   ├── hinglish_to_english_to_hindi_results.csv
+ │   └── hinglish_to_hindi_to_english_results.csv
+ │
+ └── figures/
+     ├── language_distribution.png      # Pie chart of dataset languages
+     │
+     ├── english_to_hindi_to_hinglish/  # One folder per strategy
+     │   ├── *_[english]_curves.png     # Train/Val accuracy + loss
+     │   ├── *_[english]_cm.png         # Confusion matrix
+     │   ├── *_[english]_roc.png        # ROC curve
+     │   ├── *_[english]_pr.png         # Precision-Recall curve
+     │   ├── *_[english]_f1.png         # F1 vs Threshold curve
+     │   ├── *_[hindi]_curves.png
+     │   ├── *_[hindi]_cm.png ...
+     │   ├── *_[hinglish]_curves.png
+     │   ├── *_[hinglish]_cm.png ...
+     │   ├── *_[Full]_curves.png
+     │   └── *_[Full]_cm.png ...
+     │
+     ├── english_to_hinglish_to_hindi/
+     ├── hindi_to_english_to_hinglish/
+     ├── hindi_to_hinglish_to_english/
+     ├── hinglish_to_english_to_hindi/
+     └── hinglish_to_hindi_to_english/
+ ```
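
The figure names follow a regular pattern, `<strategy>/<strategy>_[<phase>]_<kind>.png`, which can be read off the image links in this README. Below is a small helper for building those paths; the function name and the validation are hypothetical additions, not part of `main.py`.

```python
from pathlib import PurePosixPath

PLOT_KINDS = ("curves", "cm", "roc", "pr", "f1")
PHASES = ("english", "hindi", "hinglish", "Full")


def figure_path(strategy, phase, kind, root="output/figures"):
    """Build the relative path of one plot for a given strategy and phase."""
    if kind not in PLOT_KINDS:
        raise ValueError(f"unknown plot kind: {kind!r}")
    filename = f"{strategy}_[{phase}]_{kind}.png"
    return str(PurePosixPath(root) / strategy / filename)


def all_figures(strategy):
    """Every plot expected for one strategy: 4 phases x 5 plot kinds."""
    return [figure_path(strategy, phase, kind)
            for phase in PHASES for kind in PLOT_KINDS]


example = figure_path("english_to_hindi_to_hinglish", "english", "roc")
expected_count = len(all_figures("hindi_to_english_to_hinglish"))
```

With 20 plots per strategy, a complete run should leave 120 strategy figures plus `language_distribution.png`, which gives a quick completeness check after training.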
+
+ ---
+
+ ## How to Run
+
+ ### Requirements
+ ```bash
+ pip install tensorflow scikit-learn pandas seaborn matplotlib
+ ```
+
+ You also need the GloVe embeddings file `glove.6B.300d.txt`, placed at `/root/glove.6B.300d.txt`. The archive below unpacks into the current directory, so run the command from `/root` or move the file there afterwards:
+ ```bash
+ wget http://nlp.stanford.edu/data/glove.6B.zip && unzip glove.6B.zip
+ ```
+
+ ### Run
+ ```bash
+ python main.py
+ ```
+
+ Training was performed on an NVIDIA H200 GPU (Vast.ai); total runtime was approximately 15–20 minutes for all 6 strategies.
+
+ ---
+
+ ## Project Structure
+
+ ```
+ SASC/
+ ├── main.py          # Full training + evaluation pipeline
+ ├── dataset.csv      # Raw dataset (29,505 samples)
+ ├── README.md        # This file
+ └── output/          # All results, figures, and model checkpoints
+ ```