fc63 committed on
Commit f2dc063 · verified · 1 Parent(s): 06488b6

Update README.md

Files changed (1)
  1. README.md +77 -1
README.md CHANGED
@@ -33,6 +33,7 @@ model-index:
33
  value: 0.69
34
  ---
35
 
 
36
  # Gender Prediction from Text ✍️ → 👩‍🦰👨
37
 
38
  This model **predicts** the likely **gender** of an anonymous speaker or writer based solely on the content of an English text. It is built upon [DeBERTa-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and fine-tuned on a diverse, multilingual, and multi-domain dataset with both formal and informal texts.
@@ -55,7 +56,7 @@ This model **predicts** the likely **gender** of an anonymous speaker or writer
55
  - Precision: 0.69
56
  - Recall: 0.69
57
 
58
- 📂 **Evaluation**: [View on GitHub](https://github.com/fc63/gender-classification/blob/main/Evaluate/modelv3.ipynb)
59
 
60
  ---
61
 
@@ -135,6 +136,81 @@ Female (Confidence: 84.1%)
135
 
136
  ---
137
 
138
  ## 👨‍🔬 Author & License
139
 
140
  **Author**: Furkan Çoban
 
33
  value: 0.69
34
  ---
35
 
36
+
37
  # Gender Prediction from Text ✍️ → 👩‍🦰👨
38
 
39
  This model **predicts** the likely **gender** of an anonymous speaker or writer based solely on the content of an English text. It is built upon [DeBERTa-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and fine-tuned on a diverse, multilingual, and multi-domain dataset with both formal and informal texts.
 
56
  - Precision: 0.69
57
  - Recall: 0.69
58
 
59
+ 📂 **Evaluation**: [View Notebook](https://github.com/fc63/gender-classification/blob/main/Evaluate/modelv3.ipynb)
60
 
61
  ---
62
 
 
136
 
137
  ---
138
 
139
+ ## 📂 Execution Order & Source Code
140
+
141
+ To reproduce the results, it is recommended to run the code in **Google Colab** and **mount your Google Drive**.
142
+ You will need access to the `datasets/` and `models/` folders inside your Drive, which contain preprocessed `.pkl` files and trained checkpoints.
143
+ If you don't have these, you can request them from the author.
144
+
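The Colab setup described above can be sketched roughly as follows. This is a minimal sketch, not the author's code: the helper names `mount_drive` and `load_pickle`, the `DRIVE_ROOT` location, and the file name `example.pkl` are assumptions made for illustration. Only the `datasets/` folder name and the `.pkl` format come from the text.

```python
import pickle
from pathlib import Path

# Assumed default location of "My Drive" after mounting in Colab.
DRIVE_ROOT = Path("/content/drive/MyDrive")


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def load_pickle(path):
    """Load one preprocessed .pkl file. Only unpickle files you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    if mount_drive():
        # "example.pkl" is a hypothetical file name; use the actual file
        # from your own datasets/ folder.
        data = load_pickle(DRIVE_ROOT / "datasets" / "example.pkl")
        print(type(data))
```

Outside Colab, `mount_drive()` simply returns `False`, so the same helper can be reused with local paths.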
145
+ The Jupyter notebooks in the [GitHub repository](https://github.com/fc63/gender-classification) are designed to be run in the following order:
146
+
147
+ 1. **EuroParl Dataset Normalization**
148
+ ➤ [`europarl_normalized.ipynb`](https://github.com/fc63/gender-classification/blob/main/europarl_normalized/europarl_normalized.ipynb)
149
+
150
+ 2. **Learning Rate Finder on Normalized EuroParl**
151
+ ➤ [`lrfinder.ipynb`](https://github.com/fc63/gender-classification/blob/main/lr_finder/lrfinder.ipynb)
152
+
153
+ 3. **Training on Normalized Dataset (First Model)**
154
+ ➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/gp_model_first_3_epoch/1.ipynb)
155
+
156
+ 4. **Best model at step 24750 saved to Drive**
157
+
158
+ 5. **Polish (Lehçe) Dataset Creation**
159
+ ➤ [`lehce1.ipynb`](https://github.com/fc63/gender-classification/blob/main/lehce%20dataset/lehce1.ipynb)
160
+ ➤ [`lehce dataset`](https://github.com/fc63/gender-classification/tree/main/lehce%20dataset) (the resulting dataset is stored here as a pickle file; it has been renamed but is otherwise the same dataset.)
161
+
162
+ 6. **Polish (Lehçe) → English Translation**
163
+ ➤ [`lehce-eng.ipynb`](https://github.com/fc63/gender-classification/blob/main/pl%20to%20eng%20translate/lehce-eng.ipynb)
164
+
165
+ 7. **Russian Dataset Creation**
166
+ ➤ [`rus_gender.ipynb`](https://github.com/fc63/gender-classification/blob/main/rus_gender/rus_gender.ipynb)
167
+
168
+ 8. **Russian → English Translation**
169
+ ➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/rus_translate/1.ipynb)
170
+
171
+ 9. **NPTEL Dataset Preprocessing**
172
+ ➤ [`nptel.ipynb`](https://github.com/fc63/gender-classification/blob/main/nptel%20dataset/nptel.ipynb)
173
+
174
+ 10. **Combining Polish (Lehçe) + Russian + NPTEL**
175
+ ➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/combined_3_datasets/1.ipynb)
176
+
177
+ 11. **Blog Dataset (XML → Pickle)**
178
+ ➤ [`g_blogs.ipynb`](https://github.com/fc63/gender-classification/blob/main/g_blogs/g_blogs.ipynb)
179
+
180
+ 12. **Blog Dataset Cleaning & Merging with 3 Datasets**
181
+ ➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/combine_informal/1.ipynb)
182
+
183
+ 13. **Merging EuroParl + Combined Informal Dataset**
184
+ ➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/mergealldatasets/1.ipynb)
185
+
186
+ 14. **Evaluation of Model Step 24750**
187
+ ➤ [`model24750.ipynb`](https://github.com/fc63/gender-classification/blob/main/Evaluate/model24750.ipynb)
188
+
189
+ 15. **Phase 2: Fine-tuning on Merged Dataset**
190
+ ➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/gpmodel_v3/1.ipynb)
191
+
192
+ 16. **Evaluation of Fine-tuned Final Model (gp_modelv3)**
193
+ ➤ [`modelv3.ipynb`](https://github.com/fc63/gender-classification/blob/main/Evaluate/modelv3.ipynb)
194
+
195
+ 🧠 **Note:** The final published model on Hugging Face is the one fine-tuned in step 15 (Phase 2) and referred to as `gp_modelv3`.
196
+
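The execution order above can also be kept as a small manifest, which makes it easy to check that a local clone contains every notebook before starting. This is a hedged sketch: the relative paths are taken from the links above (with `%20` decoded to spaces), while the `missing_notebooks` helper and the clone location `gender-classification` are hypothetical.

```python
from pathlib import Path

# Notebook paths relative to the repository root, in the documented order.
# "Best model at step 24750 saved to Drive" has no notebook and is omitted.
NOTEBOOKS = [
    "europarl_normalized/europarl_normalized.ipynb",
    "lr_finder/lrfinder.ipynb",
    "gp_model_first_3_epoch/1.ipynb",
    "lehce dataset/lehce1.ipynb",
    "pl to eng translate/lehce-eng.ipynb",
    "rus_gender/rus_gender.ipynb",
    "rus_translate/1.ipynb",
    "nptel dataset/nptel.ipynb",
    "combined_3_datasets/1.ipynb",
    "g_blogs/g_blogs.ipynb",
    "combine_informal/1.ipynb",
    "mergealldatasets/1.ipynb",
    "Evaluate/model24750.ipynb",
    "gpmodel_v3/1.ipynb",
    "Evaluate/modelv3.ipynb",
]


def missing_notebooks(repo_root):
    """Return the notebooks from NOTEBOOKS that are absent under repo_root."""
    root = Path(repo_root)
    return [nb for nb in NOTEBOOKS if not (root / nb).is_file()]


if __name__ == "__main__":
    # "gender-classification" is an assumed local clone directory.
    for position, notebook in enumerate(NOTEBOOKS, start=1):
        print(f"{position:2d}. {notebook}")
    print("Missing:", missing_notebooks("gender-classification"))
```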
197
+ ---
198
+
199
+ ## 📌 Future Work & Limitations
200
+
201
+
202
+ I do not intend to leave this model at 0.69 accuracy and F1 score.
203
+
204
+ As far as I can tell at this point, the model is biased toward predicting emotional, psychological, and introspective texts as female, and more direct, result-oriented writing as male. Countering this requires a large, carefully labeled dataset that reflects the opposite pattern, for example introspective texts by male authors and direct, result-oriented texts by female authors.
205
+
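One way to quantify a skew like the one described above is to compare per-class precision and recall on a subset of texts, for instance only introspective ones. The following is a minimal sketch under stated assumptions: the `per_class_report` helper and the toy labels are made up for illustration and are not results from this model.

```python
from collections import Counter


def per_class_report(y_true, y_pred, labels=("female", "male")):
    """Return precision, recall and F1 for each label."""
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


if __name__ == "__main__":
    # Toy labels only: one male-authored text is predicted as female.
    y_true = ["female", "female", "male", "male"]
    y_pred = ["female", "female", "female", "male"]
    for label, scores in per_class_report(y_true, y_pred).items():
        print(label, scores)
```

A large gap between female recall and male recall on such a subset would be consistent with the bias described, although a proper check needs a carefully labeled evaluation set.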
206
+ The datasets used to train this model had to be obtained from open-source platforms, which limited the range of accessible data.
207
+
208
+ To make further progress, I need to create and label a larger dataset myself, which requires a significant amount of time, effort, and cost.
209
+
210
+ Before moving to dataset creation, I plan to try a few more approaches using the current dataset. So far, alternative techniques have not improved the scores without causing overfitting. If the remaining methods also fail, the only step left will be building a new dataset, and that will likely be the point where I stop development, as it would be both labor-intensive and costly for me.
211
+
212
+ ---
213
+
214
  ## 👨‍🔬 Author & License
215
 
216
  **Author**: Furkan Çoban