Pavankumar9026 committed on
Commit
ffdce64
·
1 Parent(s): 478bb6f

Add README config

Files changed (1)
  1. README.md +7 -357
README.md CHANGED
@@ -1,357 +1,7 @@
1
- <!-- anchor tag for back-to-top links -->
2
- <a name="readme-top"></a>
3
-
4
- <!-- HEADER IMAGE -->
5
- <img src="images/header_image.png">
6
-
7
- <!-- SHORT SUMMARY -->
8
- Implemented a hate speech detector for social media comments using deep learning. The fine-tuned BERT model achieved 78% accuracy on the Ethos Hate Speech Dataset, outperforming SimpleRNN/LSTM baselines, and was deployed via a web application and API.
9
-
10
- ---
11
-
12
- ## Table of Contents
13
- <ol>
14
- <li>
15
- <a href="#about-the-project">About The Project</a>
16
- <ul>
17
- <li><a href="#summary">Summary</a></li>
18
- <li><a href="#built-with">Built With</a></li>
19
- </ul>
20
- </li>
21
- <li>
22
- <a href="#motivation">Motivation</a>
23
- </li>
24
- <li>
25
- <a href="#data">Data</a>
26
- </li>
27
- <li>
28
- <a href="#model-building">Model Building</a>
29
- </li>
30
- <li>
31
- <a href="#model-performance">Model Performance</a>
32
- </li>
33
- <ul>
34
- <li><a href="#accuracy">Accuracy</a></li>
35
- <li><a href="#classification-report">Classification Report</a></li>
36
- <li><a href="#confusion-matrix">Confusion Matrix</a></li>
37
- <li><a href="#illustrative-examples">Illustrative Examples</a></li>
38
- </ul>
39
- <li>
40
- <a href="#model-deployment">Model Deployment</a>
41
- </li>
42
- <ul>
43
- <li><a href="#web-application">Web Application</a></li>
44
- <li><a href="#api">API</a></li>
45
- </ul>
46
- <li>
47
- <a href="#getting-started">Getting Started</a>
48
- <ul>
49
- <li><a href="#prerequisites-for-model-training">Prerequisites for Model Training</a></li>
50
- <li><a href="#prerequisites-for-model-deployment">Prerequisites for Model Deployment</a></li>
51
- </ul>
52
- </li>
53
- <li>
54
- <a href="#appendix">Appendix</a>
55
- <ul>
56
- <li><a href="#simplernn-preprocessing-model-architecture-and-hyperparameters">SimpleRNN: Preprocessing, Model Architecture and Hyperparameters</a></li>
57
- </ul>
58
- <ul>
59
- <li><a href="#lstm-preprocessing-model-architecture-and-hyperparameters">LSTM: Preprocessing, Model Architecture and Hyperparameters</a></li>
60
- </ul>
61
- <ul>
62
- <li><a href="#fine-tuned-bert-preprocessing-model-architecture-and-hyperparameters">Fine-Tuned BERT: Preprocessing, Model Architecture and Hyperparameters</a></li>
63
- </ul>
64
- </li>
65
- </ol>
66
-
67
-
68
- <!-- ABOUT THE PROJECT -->
69
- ## About The Project
70
-
71
- ### Summary
72
- + Motivation: Develop a hate speech detector for social media comments.
73
- + Data: Utilized the [ETHOS Hate Speech Detection Dataset](https://github.com/intelligence-csd-auth-gr/Ethos-Hate-Speech-Dataset).
74
- + Models: The fine-tuned BERT model demonstrated superior performance (78.0% accuracy) compared to the SimpleRNN (66.3%) and LSTM (70.7%) models.
75
- + Deployment: The fine-tuned BERT model was prepared for production by integrating it into a web application and an API endpoint.
76
-
77
- ### Built With
78
- * [![TensorFlow][TensorFlow-badge]][TensorFlow-url]
79
- * [![scikit-learn][scikit-learn-badge]][scikit-learn-url]
80
- * [![NumPy][NumPy-badge]][NumPy-url]
81
- * [![Pandas][Pandas-badge]][Pandas-url]
82
- * [![Matplotlib][Matplotlib-badge]][Matplotlib-url]
83
- * [![Flask][Flask-badge]][Flask-url]
84
- * [![Python][Python-badge]][Python-url]
85
- * [![Spyder][Spyder-badge]][Spyder-url]
86
- * ![HTML5][HTML5-badge]
87
- * ![CSS3][CSS3-badge]
88
-
89
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
90
-
91
-
92
- <!-- Motivation -->
93
- ## Motivation
94
- + Problem: Hate speech is on the rise globally, especially on social media platforms (source: [United Nations](https://www.un.org/en/hate-speech/understanding-hate-speech/what-is-hate-speech)).
95
- + Project goal: Utilize deep learning for hate speech detection in social media comments.
96
- + Definition of hate speech: Insulting public speech directed at specific individuals or groups on the basis of characteristics such as race, religion, ethnic origin, national origin, sex, disability, sexual orientation, or gender identity ([Mollas, Chrysopoulou, Karlos, & Tsoumakas, 2022](https://link.springer.com/article/10.1007/s40747-021-00608-2)).
97
-
98
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
99
-
100
-
101
- <!-- Data -->
102
- ## Data
103
- + 998 comments from YouTube and Reddit validated using the Figure-Eight crowdsourcing platform.
104
- + Dataset: [ETHOS Hate Speech Detection Dataset](https://github.com/intelligence-csd-auth-gr/Ethos-Hate-Speech-Dataset).
105
- + Balanced data: 43.4% hate speech.
106
- + Comment length: Mean = 112 words (std = 160).
107
-
108
- ![Comment length histogram](plots/histogram_comment_length.png)
109
-
110
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
111
-
112
-
113
- <!-- Model Building -->
114
- ## Model Building
115
- Benchmark models ([Mollas, Chrysopoulou, Karlos, & Tsoumakas, 2022](https://link.springer.com/article/10.1007/s40747-021-00608-2)):
116
- + Random Forest: 65.0% Accuracy
117
- + Support Vector Machine: 66.4% Accuracy
118
-
119
- Comparison of three deep learning models:
120
- + SimpleRNN
121
- + Preprocessing, model architecture and hyperparameters: [See details](#simplernn-preprocessing-model-architecture-and-hyperparameters)
122
- + LSTM
123
- + Preprocessing, model architecture and hyperparameters: [See details](#lstm-preprocessing-model-architecture-and-hyperparameters)
124
- + Fine-tuned BERT
125
- + Implementation with TensorFlow Hub
126
- + Small BERT model: [small_bert/bert_en_uncased_L-4_H-512_A-8](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2)
127
- + Preprocessing, model architecture and hyperparameters: [See details](#fine-tuned-bert-preprocessing-model-architecture-and-hyperparameters)
128
-
129
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
130
-
131
-
132
- <!-- Model Performance -->
133
- ## Model Performance
134
- ### Accuracy
135
- | | SimpleRNN | LSTM | Fine-Tuned BERT |
136
- |-------------------|-----------|----------|-----------------|
137
- | Training Accuracy | 91.8% | 100% | 99.9% |
138
- | Test Accuracy | 66.3% | 70.7% | 78.0% |
139
-
140
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
141
-
142
- ### Classification Report
143
- The following classification reports present the performance metrics of the trained models on the test data.
144
-
145
- **SimpleRNN**
146
- | | Precision | Recall | F1 Score |
147
- |-----------------|-----------|--------|----------|
148
- | No Hate Speech | 0.69 | 0.71 | 0.70 |
149
- | Hate Speech | 0.63 | 0.61 | 0.62 |
150
- | Accuracy | | | 0.66 |
151
-
152
- **LSTM**
153
- | | Precision | Recall | F1 Score |
154
- |-----------------|-----------|--------|----------|
155
- | No Hate Speech | 0.73 | 0.75 | 0.74 |
156
- | Hate Speech | 0.68 | 0.66 | 0.67 |
157
- | Accuracy | | | 0.71 |
158
-
159
- **Fine-Tuned BERT**
160
- | | Precision | Recall | F1 Score |
161
- |-----------------|-----------|--------|----------|
162
- | No Hate Speech | 0.84 | 0.74 | 0.79 |
163
- | Hate Speech | 0.72 | 0.83 | 0.77 |
164
- | Accuracy | | | 0.78 |
165
-
166
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
167
-
168
- ### Confusion Matrix
169
- Confusion matrices of the three models based on the test data.
170
- | SimpleRNN | LSTM | Fine-Tuned BERT |
171
- |-----------------------|-----------------------|-----------------------|
172
- | ![Model 1 Confusion Matrix](./plots/model1_confusion_matrix.png) | ![Model 2 Confusion Matrix](./plots/model2_confusion_matrix.png) | ![Model 3 Confusion Matrix](./plots/model3_confusion_matrix.png) |
173
-
174
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
175
-
176
- ### Illustrative Examples
177
- Here are some example comments from the test data along with their true label and model predictions.
178
-
179
- | | True Label | SimpleRNN | LSTM | Fine-Tuned BERT |
180
- |----------|----------|----------|----------|----------|
181
- | Women deserve to be abused, I guess. | Hate | Hate | Hate | Hate |
182
- | Christians are all demonic. | Hate | No Hate | Hate | Hate |
183
- | Muslims want war! So take it to them first | Hate | Hate | No Hate | Hate |
184
- | Fat lesbians | Hate | Hate | Hate | Hate |
185
- | What imaginary gender are you? | Hate | No Hate | No Hate | Hate |
186
-
187
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
188
-
189
-
190
- <!-- MODEL DEPLOYMENT -->
191
- ## Model Deployment
192
- I deployed the fine-tuned BERT model through both a web application and an API.
193
-
194
- ### Web Application
195
- Utilizing the Flask framework, HTML, and CSS, I created a web application with a user-friendly interface for users to input text and receive predictions on whether it contains hate speech.
196
-
197
- <img src="./images/model_deployment_01.PNG" alt="Deployment example 1" style="width: 300px;"> <img src="./images/model_deployment_02.PNG" alt="Deployment example 2" style="width: 300px;">
198
-
199
- ### API
200
- I developed an API endpoint to enable integration with other applications or services by leveraging the Flask framework and utilized <a href="https://www.postman.com/">Postman</a> for testing and documenting the API.
201
-
202
- API documentation: [See here](https://documenter.getpostman.com/view/28394113/2s946eBERv)
203
-
204
- ![Model deployment API](/images/model_deployment_api.gif)
205
-
206
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
207
-
208
-
209
- <!-- GETTING STARTED -->
210
- ## Getting Started
211
-
212
- ### Prerequisites for Model Training
213
- This is a list of the Python packages you need.
214
- <ul>
215
- <li>TensorFlow</li>
216
- <li>TensorFlow Hub</li>
217
- <li>TensorFlow Text</li>
218
- <li>Scikit-Learn</li>
219
- <li>NumPy</li>
220
- <li>Pandas</li>
221
- <li>Matplotlib</li>
222
- </ul>
223
-
224
- ### Prerequisites for Model Deployment
225
- This is a list of the Python packages you need.
226
- <ul>
227
- <li>TensorFlow</li>
228
- <li>TensorFlow Text</li>
229
- <li>NumPy</li>
230
- <li>Flask</li>
231
- <li>Flask-WTF</li>
232
- <li>WTForms</li>
233
- <li>Python-dotenv</li>
234
- </ul>
235
-
236
- To enhance security, create a `.env` file and create a secret key for the Flask application. Store the secret key in the `.env` file and utilize the `python-dotenv` library to retrieve it.
237
- ```
238
- SECRET_KEY = "Your_secret_key_here"
239
- ```
240
-
241
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
242
-
243
-
244
- <!-- APPENDIX -->
245
- ## Appendix
246
- ### SimpleRNN: Preprocessing, Model Architecture and Hyperparameters
247
-
248
- **Preprocessing**
249
- Tokenizer vocabulary size: 5000
250
- Padded sequence length: 15
251
- Embedding dimension: 50
252
-
253
- **Model Architecture**
254
- | Layer (type) | Output Shape | Param # | Activation |
255
- | ------------ | --------------- | ------- | ---------- |
256
- | Embedding | (None, 15, 50) | 250050 | |
257
- | SimpleRNN | (None, 15, 128) | 22912 | tanh |
258
- | SimpleRNN | (None, 128) | 32896 | tanh |
259
- | Dense | (None, 64) | 8256 | relu |
260
- | Dense | (None, 1) | 65 | sigmoid |
261
-
262
- Total params: 314,179
263
- Trainable params: 314,179
264
- Non-trainable params: 0
265
-
266
- **Hyperparameters**
267
- Optimizer: Adam
268
- Learning rate: 0.001
269
- Loss: Binary Crossentropy
270
- Epochs: 100
271
- Batch size: 8
272
- Dropout rate: 50%
273
- Early stopping metric: Accuracy
274
-
275
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
276
-
277
- ### LSTM: Preprocessing, Model Architecture and Hyperparameters
278
-
279
-
280
- **Preprocessing**
281
- Tokenizer vocabulary size: 5000
282
- Padded sequence length: 150
283
- Embedding dimension: 50
284
-
285
- **Model Architecture**
286
- | Layer (type) | Output Shape | Param # | Activation |
287
- | ------------ | ---------------- | ------- | ---------- |
288
- | Embedding | (None, 150, 50) | 250050 | |
289
- | LSTM | (None, 150, 128) | 91648 | tanh |
290
- | LSTM | (None, 128) | 131584 | tanh |
291
- | Dense | (None, 64) | 8256 | relu |
292
- | Dense | (None, 1) | 65 | sigmoid |
293
-
294
- Total params: 481,603
295
- Trainable params: 481,603
296
- Non-trainable params: 0
297
-
298
- **Hyperparameters**
299
- Optimizer: Adam
300
- Learning rate: 0.001
301
- Loss: Binary Crossentropy
302
- Epochs: 100
303
- Batch size: 32
304
- Dropout rate: 50%
305
- Early stopping metric: Accuracy
306
-
307
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
308
-
309
- ### Fine-Tuned BERT: Preprocessing, Model Architecture and Hyperparameters
310
-
311
- **Preprocessing**
312
- Text preprocessing for BERT models: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3
313
-
314
- **Model Architecture**
315
- | Layer (type) | Output Shape | Param # | Activation |
316
- | ------------- | ---------------- | -------- | ---------- |
317
- | Text Input | [(None,)] | 0 | |
318
- | Preprocessing | input_type_ids: (None, 128)<br> input_mask: (None, 128)<br> input_word_ids: (None, 128) | 0 | |
319
- | BERT | (None, 512) | 28763649 | |
320
- | Dropout | (None, 512) | 0 | |
321
- | Dense | (None, 128) | 65664 | relu |
322
- | Dense | (None, 1) | 129 | sigmoid |
323
-
324
- Total params: 28,829,442
325
- Trainable params: 28,829,441
326
- Non-trainable params: 1
327
-
328
- **Hyperparameters**
329
- Optimizer: Adam
330
- Learning rate: 0.0001
331
- Loss: Binary Crossentropy
332
- Epochs: 100
333
- Batch size: 8
334
- Dropout rate: 50%
335
- Early stopping metric: Accuracy
336
-
337
- <p align="right">(<a href="#readme-top">back to top</a>)</p>
338
-
339
- <!-- MARKDOWN LINKS -->
340
- [TensorFlow-badge]: https://img.shields.io/badge/TensorFlow-%23FF6F00.svg?style=for-the-badge&logo=TensorFlow&logoColor=white
341
- [TensorFlow-url]: https://www.tensorflow.org/
342
- [scikit-learn-badge]: https://img.shields.io/badge/scikit--learn-%23F7931E.svg?style=for-the-badge&logo=scikit-learn&logoColor=white
343
- [scikit-learn-url]: https://scikit-learn.org/stable/
344
- [NumPy-badge]: https://img.shields.io/badge/numpy-%23013243.svg?style=for-the-badge&logo=numpy&logoColor=white
345
- [NumPy-url]: https://numpy.org/
346
- [Pandas-badge]: https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white
347
- [Pandas-url]: https://pandas.pydata.org/
348
- [Matplotlib-badge]: https://img.shields.io/badge/Matplotlib-%23ffffff.svg?style=for-the-badge&logo=Matplotlib&logoColor=black
349
- [Matplotlib-url]: https://matplotlib.org/
350
- [Flask-badge]: https://img.shields.io/badge/flask-%23000.svg?style=for-the-badge&logo=flask&logoColor=white
351
- [Flask-url]: https://flask.palletsprojects.com/en/2.3.x/
352
- [Python-badge]: https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54
353
- [Python-url]: https://www.python.org/
354
- [Spyder-badge]: https://img.shields.io/badge/Spyder-838485?style=for-the-badge&logo=spyder%20ide&logoColor=maroon
355
- [Spyder-url]: https://www.spyder-ide.org/
356
- [HTML5-badge]: https://img.shields.io/badge/html5-%23E34F26.svg?style=for-the-badge&logo=html5&logoColor=white
357
- [CSS3-badge]: https://img.shields.io/badge/css3-%231572B6.svg?style=for-the-badge&logo=css3&logoColor=white
 
1
+ ---
2
+ title: Hate Speech Detector
3
+ sdk: gradio
4
+ sdk_version: "4.0.0"
5
+ app_file: app.py
6
+ pinned: false
7
+ ---
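
The parameter counts in the appendix tables removed by this commit can be reproduced from the listed layer shapes. The following is a minimal sketch, assuming the usual parameter formulas for embedding, simple recurrent, LSTM, and dense layers, and assuming the embedding table holds the 5000-word vocabulary plus one reserved index (inferred from 250050 = 5001 × 50). The helper names are hypothetical and are not taken from the project code.

```python
# Sketch: recompute the "Param #" columns of the removed SimpleRNN and LSTM tables.
# All function names here are illustrative, not part of the original repository.

def embedding_params(vocab_size: int, dim: int) -> int:
    # One vector of length `dim` per vocabulary index.
    return vocab_size * dim

def simple_rnn_params(input_dim: int, units: int) -> int:
    # Input weights + recurrent weights + bias, per unit.
    return units * (input_dim + units + 1)

def lstm_params(input_dim: int, units: int) -> int:
    # Four gates, each shaped like a simple recurrent layer.
    return 4 * simple_rnn_params(input_dim, units)

def dense_params(input_dim: int, units: int) -> int:
    # Weights + bias, per unit.
    return units * (input_dim + 1)

# Assumption: 5000 tokenizer words plus one reserved index.
VOCAB_SIZE = 5000 + 1
EMBED_DIM = 50
UNITS = 128

# Both recurrent models share the same Dense(64) -> Dense(1) head.
head = dense_params(UNITS, 64) + dense_params(64, 1)
embedding = embedding_params(VOCAB_SIZE, EMBED_DIM)

simple_rnn_total = (
    embedding
    + simple_rnn_params(EMBED_DIM, UNITS)
    + simple_rnn_params(UNITS, UNITS)
    + head
)
lstm_total = (
    embedding
    + lstm_params(EMBED_DIM, UNITS)
    + lstm_params(UNITS, UNITS)
    + head
)

print(simple_rnn_total)  # 314179
print(lstm_total)        # 481603
```

Both totals agree with the "Total params" lines in the removed tables (314,179 and 481,603). The padded sequence length (15 or 150) does not appear in the formulas because recurrent weights are shared across time steps.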