fc63
/

gender_prediction_model_from_text

@@ -33,6 +33,7 @@ model-index:
             value: 0.69
 ---
 # Gender Prediction from Text ✍️ → 👩‍🦰👨
 This model **predicts** the likely **gender** of an anonymous speaker or writer based solely on the content of an English text. It is built upon [DeBERTa-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and fine-tuned on a diverse, multilingual, and multi-domain dataset with both formal and informal texts.
@@ -55,7 +56,7 @@ This model **predicts** the likely **gender** of an anonymous speaker or writer
   - Precision: 0.69
   - Recall: 0.69
-📂 **Evaluation**: [View on GitHub](https://github.com/fc63/gender-classification/blob/main/Evaluate/modelv3.ipynb)
 ---
@@ -135,6 +136,81 @@ Female (Confidence: 84.1%)
 ---
 ## 👨‍🔬 Author & License
 **Author**: Furkan Çoban

+📂 **Evaluation**: [View the notebook](https://github.com/fc63/gender-classification/blob/main/Evaluate/modelv3.ipynb)
 ---
 ---
+## 📂 Execution Order & Source Code
+To reproduce the results, run the notebooks in **Google Colab** with **your Google Drive mounted**.
+The notebooks expect `datasets/` and `models/` folders in your Drive, which contain the preprocessed `.pkl` files and trained checkpoints.
+If you don't have these, you can request them from the author.
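As a minimal sketch of the setup described above: the snippet mounts Google Drive when it runs inside Colab and falls back to a local folder otherwise. The location of `datasets/` and `models/` directly under `MyDrive`, and the helper names `resolve_data_root` and `load_pickle`, are assumptions for illustration; adjust them to wherever the folders live in your Drive.

```python
import pickle
from pathlib import Path


def resolve_data_root(local_fallback: str = ".") -> Path:
    """Return the folder assumed to hold `datasets/` and `models/`.

    Inside Colab this mounts Google Drive; elsewhere it returns a local path.
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return Path(local_fallback)
    drive.mount("/content/drive")
    # Assumption: the folders sit at the top level of "My Drive".
    return Path("/content/drive/MyDrive")


def load_pickle(path: Path):
    """Load one preprocessed `.pkl` file. Only unpickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


root = resolve_data_root()
datasets_dir = root / "datasets"  # preprocessed .pkl files
models_dir = root / "models"      # trained checkpoints
```

A notebook would then call `load_pickle(datasets_dir / "<file>.pkl")` with the actual file name from the shared folder.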
+The Jupyter notebooks in the [GitHub repository](https://github.com/fc63/gender-classification) are designed to be run in the following order:
+1. **EuroParl Dataset Normalization**
+   ➤ [`europarl_normalized.ipynb`](https://github.com/fc63/gender-classification/blob/main/europarl_normalized/europarl_normalized.ipynb)
+2. **Learning Rate Finder on Normalized EuroParl**
+   ➤ [`lrfinder.ipynb`](https://github.com/fc63/gender-classification/blob/main/lr_finder/lrfinder.ipynb)
+3. **Training on Normalized Dataset (First Model)**
+   ➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/gp_model_first_3_epoch/1.ipynb)
+4. **Best checkpoint (step 24750) saved to Drive**
+5. **Polish (Lehçe) Dataset Creation**
+   ➤ [`lehce1.ipynb`](https://github.com/fc63/gender-classification/blob/main/lehce%20dataset/lehce1.ipynb)
+   ➤ [`lehce dataset`](https://github.com/fc63/gender-classification/tree/main/lehce%20dataset) (the resulting dataset is stored here as a pickle file; only the file name was changed, the contents are the same)
+6. **Polish (Lehçe) → English Translation**
+   ➤ [`lehce-eng.ipynb`](https://github.com/fc63/gender-classification/blob/main/pl%20to%20eng%20translate/lehce-eng.ipynb)
+7. **Russian Dataset Creation**
+   ➤ [`rus_gender.ipynb`](https://github.com/fc63/gender-classification/blob/main/rus_gender/rus_gender.ipynb)
+8. **Russian → English Translation**
+   ➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/rus_translate/1.ipynb)
+9. **NPTEL Dataset Preprocessing**
+   ➤ [`nptel.ipynb`](https://github.com/fc63/gender-classification/blob/main/nptel%20dataset/nptel.ipynb)
+10. **Combining Polish (Lehçe) + Russian + NPTEL**
+    ➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/combined_3_datasets/1.ipynb)
+11. **Blog Dataset (XML → Pickle)**
+    ➤ [`g_blogs.ipynb`](https://github.com/fc63/gender-classification/blob/main/g_blogs/g_blogs.ipynb)
+12. **Blog Dataset Cleaning & Merging with the 3 Combined Datasets**
+    ➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/combine_informal/1.ipynb)
+13. **Merging EuroParl + Combined Informal Dataset**
+    ➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/mergealldatasets/1.ipynb)
+14. **Evaluation of the Step 24750 Checkpoint**
+    ➤ [`model24750.ipynb`](https://github.com/fc63/gender-classification/blob/main/Evaluate/model24750.ipynb)
+15. **Phase 2: Fine-tuning on Merged Dataset**
+    ➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/gpmodel_v3/1.ipynb)
+16. **Evaluation of Fine-tuned Final Model (gp_modelv3)**
+    ➤ [`modelv3.ipynb`](https://github.com/fc63/gender-classification/blob/main/Evaluate/modelv3.ipynb)
+🧠 **Note:** The final model published on Hugging Face is the one produced by the Phase 2 fine-tuning step (fine-tuning on the merged dataset); it is referred to as `gp_modelv3`.
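The merge steps in the list above (combining the Lehçe (Polish), Russian and NPTEL data, then adding the blog data and EuroParl) can be pictured with a small standard-library sketch. It is not the code from the linked notebooks: the `(text, label)` record layout, the duplicate removal, the shuffling, and the name `merge_datasets` are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

# Assumption: each example is a (text, label) pair; the real pickles may differ.
Example = Tuple[str, str]


def merge_datasets(*parts: List[Example], seed: int = 42) -> List[Example]:
    """Concatenate datasets, drop empty and exactly repeated texts, then shuffle."""
    seen = set()
    merged: List[Example] = []
    for part in parts:
        for text, label in part:
            key = text.strip()
            if key and key not in seen:
                seen.add(key)
                merged.append((key, label))
    # A fixed seed keeps the shuffled order reproducible between runs.
    random.Random(seed).shuffle(merged)
    return merged
```

Calling `merge_datasets(formal, informal)` on two such lists returns a single shuffled list in which each distinct text appears once.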
+---
+## 📌 Future Work & Limitations
+I do not want to leave this model at 0.69 accuracy and F1.
+As far as I can tell at this point, the model is biased toward predicting emotional, psychological, and introspective texts as female, and more direct, result-oriented texts as male. Correcting this requires a large, carefully labeled dataset that includes counterexamples to this pattern.
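One way to put a number on the bias described above is to compare per-class precision and recall, for example on a hand-picked slice of introspective texts versus a slice of direct, result-oriented texts. The sketch below is a generic metric helper, not the author's evaluation code; the function name `per_class_report` and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_report(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall and F1 for each label from paired predictions."""
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy data: one male-authored text is predicted as female.
toy_true = ["female", "female", "male", "male"]
toy_pred = ["female", "female", "female", "male"]
toy_report = per_class_report(toy_true, toy_pred)
```

A large gap between recall for `male` on introspective texts and recall for `male` on direct texts would be consistent with the pattern described here.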
+The datasets used to train this model had to come from open-source platforms, which limited the range of accessible data.
+To make further progress, I would need to create and label a larger dataset myself, which requires a significant amount of time, effort, and cost.
+Before moving to dataset creation, I plan to try a few more approaches with the current dataset. So far, alternative techniques have not improved the scores without causing overfitting. If the remaining methods also fail, the only step left will be building a new dataset, and that is likely where I will stop development, since it would be both labor-intensive and costly for me.
+---
 ## 👨‍🔬 Author & License
 **Author**: Furkan Çoban