Update README.md
README.md
CHANGED
|
@@ -33,6 +33,7 @@ model-index:
|
|
| 33 |
value: 0.69
|
| 34 |
---
|
| 35 |
|
|
|
|
| 36 |
# Gender Prediction from Text ✍️ → 👩🦰👨
|
| 37 |
|
| 38 |
This model **predicts** the likely **gender** of an anonymous speaker or writer based solely on the content of an English text. It is built upon [DeBERTa-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and fine-tuned on a diverse, multilingual, and multi-domain dataset with both formal and informal texts.
|
|
@@ -55,7 +56,7 @@ This model **predicts** the likely **gender** of an anonymous speaker or writer
|
|
| 55 |
- Precision: 0.69
|
| 56 |
- Recall: 0.69
|
| 57 |
|
| 58 |
-
📂 **Evaluation**: [View on
|
| 59 |
|
| 60 |
---
|
| 61 |
|
|
@@ -135,6 +136,81 @@ Female (Confidence: 84.1%)
|
|
| 135 |
|
| 136 |
---
|
| 137 |
|
| 138 |
## 👨🔬 Author & License
|
| 139 |
|
| 140 |
**Author**: Furkan Çoban
|
|
|
|
| 33 |
value: 0.69
|
| 34 |
---
|
| 35 |
|
| 36 |
+
|
| 37 |
# Gender Prediction from Text ✍️ → 👩🦰👨
|
| 38 |
|
| 39 |
This model **predicts** the likely **gender** of an anonymous speaker or writer based solely on the content of an English text. It is built upon [DeBERTa-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and fine-tuned on a diverse, multilingual, and multi-domain dataset with both formal and informal texts.
|
|
|
|
| 56 |
- Precision: 0.69
|
| 57 |
- Recall: 0.69
|
| 58 |
|
| 59 |
+
📂 **Evaluation**: [View on Notebook](https://github.com/fc63/gender-classification/blob/main/Evaluate/modelv3.ipynb)
|
| 60 |
|
| 61 |
---
|
| 62 |
|
|
|
|
| 136 |
|
| 137 |
---
|
| 138 |
|
| 139 |
+
## 📂 Execution Order & Source Code
|
| 140 |
+
|
| 141 |
+
To reproduce the results, it is recommended to run the code in **Google Colab** and **mount your Google Drive**.
|
| 142 |
+
You will need access to the `datasets/` and `models/` folders inside your Drive, which contain preprocessed `.pkl` files and trained checkpoints.
|
| 143 |
+
If you don't have these, you can request them from the author.
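A minimal sketch of the setup check described above. It assumes Drive is mounted at Colab's conventional `/content/drive/MyDrive` path and that `datasets/` and `models/` sit directly under it. `DRIVE_ROOT`, `missing_folders`, and `mount_drive_if_in_colab` are illustrative names, not part of the repository.

```python
from pathlib import Path

# Assumption: conventional Colab mount point; adjust if your folders live elsewhere.
DRIVE_ROOT = Path("/content/drive/MyDrive")
REQUIRED_FOLDERS = ("datasets", "models")


def missing_folders(root, required=REQUIRED_FOLDERS):
    """Return the names of required folders that do not exist under root."""
    root = Path(root)
    return [name for name in required if not (root / name).is_dir()]


def mount_drive_if_in_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


if __name__ == "__main__":
    mount_drive_if_in_colab()
    missing = missing_folders(DRIVE_ROOT)
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Found datasets/ and models/; ready to run the notebooks.")
```

Running this in the first Colab cell shows early whether the preprocessed `.pkl` files and checkpoints still need to be requested.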
|
| 144 |
+
|
| 145 |
+
The Jupyter notebooks in the [GitHub repository](https://github.com/fc63/gender-classification) are designed to be run in the following order:
|
| 146 |
+
|
| 147 |
+
1. **EuroParl Dataset Normalization**
|
| 148 |
+
➤ [`europarl_normalized.ipynb`](https://github.com/fc63/gender-classification/blob/main/europarl_normalized/europarl_normalized.ipynb)
|
| 149 |
+
|
| 150 |
+
2. **Learning Rate Finder on Normalized EuroParl**
|
| 151 |
+
➤ [`lrfinder.ipynb`](https://github.com/fc63/gender-classification/blob/main/lr_finder/lrfinder.ipynb)
|
| 152 |
+
|
| 153 |
+
3. **Training on Normalized Dataset (First Model)**
|
| 154 |
+
➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/gp_model_first_3_epoch/1.ipynb)
|
| 155 |
+
|
| 156 |
+
4. **Best model at step 24750 saved to Drive**
|
| 157 |
+
|
| 158 |
+
5. **Polish (Lehçe) Dataset Creation**
|
| 159 |
+
➤ [`lehce1.ipynb`](https://github.com/fc63/gender-classification/blob/main/lehce%20dataset/lehce1.ipynb)
|
| 160 |
+
➤ [`lehce dataset`](https://github.com/fc63/gender-classification/tree/main/lehce%20dataset) (the resulting dataset is stored here as a pickle file; only the file name was changed, the contents are the same.)
|
| 161 |
+
|
| 162 |
+
7. **Polish (Lehçe) → English Translation**
|
| 163 |
+
➤ [`lehce-eng.ipynb`](https://github.com/fc63/gender-classification/blob/main/pl%20to%20eng%20translate/lehce-eng.ipynb)
|
| 164 |
+
|
| 165 |
+
8. **Russian Dataset Creation**
|
| 166 |
+
➤ [`rus_gender.ipynb`](https://github.com/fc63/gender-classification/blob/main/rus_gender/rus_gender.ipynb)
|
| 167 |
+
|
| 168 |
+
9. **Russian → English Translation**
|
| 169 |
+
➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/rus_translate/1.ipynb)
|
| 170 |
+
|
| 171 |
+
10. **NPTEL Dataset Preprocessing**
|
| 172 |
+
➤ [`nptel.ipynb`](https://github.com/fc63/gender-classification/blob/main/nptel%20dataset/nptel.ipynb)
|
| 173 |
+
|
| 174 |
+
11. **Combining Polish (Lehçe) + Russian + NPTEL**
|
| 175 |
+
➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/combined_3_datasets/1.ipynb)
|
| 176 |
+
|
| 177 |
+
12. **Blog Dataset (XML → Pickle)**
|
| 178 |
+
➤ [`g_blogs.ipynb`](https://github.com/fc63/gender-classification/blob/main/g_blogs/g_blogs.ipynb)
|
| 179 |
+
|
| 180 |
+
13. **Blog Dataset Cleaning & Merging with 3 Datasets**
|
| 181 |
+
➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/combine_informal/1.ipynb)
|
| 182 |
+
|
| 183 |
+
14. **Merging EuroParl + Combined Informal Dataset**
|
| 184 |
+
➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/mergealldatasets/1.ipynb)
|
| 185 |
+
|
| 186 |
+
15. **Evaluation of Model Step 24750**
|
| 187 |
+
➤ [`model24750.ipynb`](https://github.com/fc63/gender-classification/blob/main/Evaluate/model24750.ipynb)
|
| 188 |
+
|
| 189 |
+
16. **Phase 2: Fine-tuning on Merged Dataset**
|
| 190 |
+
➤ [`1.ipynb`](https://github.com/fc63/gender-classification/blob/main/gpmodel_v3/1.ipynb)
|
| 191 |
+
|
| 192 |
+
17. **Evaluation of Fine-tuned Final Model (gp_modelv3)**
|
| 193 |
+
➤ [`modelv3.ipynb`](https://github.com/fc63/gender-classification/blob/main/Evaluate/modelv3.ipynb)
|
| 194 |
+
|
| 195 |
+
🧠 **Note:** The final model published on Hugging Face is the one fine-tuned in Phase 2 (**Fine-tuning on Merged Dataset**, `gpmodel_v3/1.ipynb`) and is referred to as `gp_modelv3`.
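The execution order in the list above can also be written down as data, which makes it easy to check a local clone or to print the links in sequence. This is only a sketch: the paths are copied from the repository links in the list, while `github_url` and `missing_notebooks` are hypothetical helper names that do not exist in the repository.

```python
from pathlib import Path
from urllib.parse import quote

REPO_BLOB_URL = "https://github.com/fc63/gender-classification/blob/main/"

# Notebook paths taken from the links in the list above, in run order.
NOTEBOOKS_IN_ORDER = [
    "europarl_normalized/europarl_normalized.ipynb",
    "lr_finder/lrfinder.ipynb",
    "gp_model_first_3_epoch/1.ipynb",
    "lehce dataset/lehce1.ipynb",
    "pl to eng translate/lehce-eng.ipynb",
    "rus_gender/rus_gender.ipynb",
    "rus_translate/1.ipynb",
    "nptel dataset/nptel.ipynb",
    "combined_3_datasets/1.ipynb",
    "g_blogs/g_blogs.ipynb",
    "combine_informal/1.ipynb",
    "mergealldatasets/1.ipynb",
    "Evaluate/model24750.ipynb",
    "gpmodel_v3/1.ipynb",
    "Evaluate/modelv3.ipynb",
]


def github_url(relative_path):
    """Build the GitHub link for a notebook, percent-encoding spaces."""
    return REPO_BLOB_URL + quote(relative_path)


def missing_notebooks(clone_dir):
    """Return the notebooks from the list that are absent in a local clone."""
    clone_dir = Path(clone_dir)
    return [p for p in NOTEBOOKS_IN_ORDER if not (clone_dir / p).is_file()]


if __name__ == "__main__":
    for position, path in enumerate(NOTEBOOKS_IN_ORDER, start=1):
        print(f"{position:2d}. {github_url(path)}")
```

The step that only saves the step-24750 checkpoint to Drive has no notebook of its own, so it does not appear in the list.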
|
| 196 |
+
|
| 197 |
+
---
|
| 198 |
+
|
| 199 |
+
## 📌 Future Work & Limitations
|
| 200 |
+
|
| 201 |
+
|
| 202 |
+
I do not want to leave this model at an accuracy and F1 score of 0.69.
|
| 203 |
+
|
| 204 |
+
From what I have observed so far, the model is biased toward predicting emotional, psychological, and introspective texts as female, and more direct, result-oriented texts as male. Correcting this requires a large, carefully labeled dataset that contains counterexamples to both patterns.
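One way to put a number on the pattern described above is to compare how often each style of text is predicted as female. The sketch below is an illustration under stated assumptions, not the evaluation used for this model: `female_rate_by_group` is a hypothetical helper, the probe sentences are invented, and `keyword_stub` is a stand-in for the real classifier that exists only so the example runs without downloading the model. The `"Female"` and `"Male"` label strings are assumed to match the model's output labels.

```python
from collections import Counter


def female_rate_by_group(texts_by_group, classify):
    """Return, per group, the fraction of texts that classify() labels "Female"."""
    rates = {}
    for group, texts in texts_by_group.items():
        counts = Counter(classify(text) for text in texts)
        rates[group] = counts["Female"] / len(texts) if texts else 0.0
    return rates


def keyword_stub(text):
    """Hypothetical stand-in for the real model; NOT how the model works."""
    return "Female" if "feel" in text.lower() else "Male"


# Invented probe sentences, grouped by writing style.
probe = {
    "introspective": [
        "I feel uncertain about everything lately.",
        "Some days I feel like nobody listens.",
    ],
    "result_oriented": [
        "Ship the report by Friday.",
        "The budget was cut by ten percent.",
    ],
}

print(female_rate_by_group(probe, keyword_stub))
```

With the real model, `classify` would wrap the prediction call. If the true author genders are balanced within each group, a large gap between the two rates would indicate the style bias rather than a genuine signal.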
|
| 205 |
+
|
| 206 |
+
The datasets used to train this model had to be obtained from open-source platforms, which limited the range of accessible data.
|
| 207 |
+
|
| 208 |
+
To make further progress, I need to create and label a larger dataset myself, which requires a significant amount of time, effort, and money.
|
| 209 |
+
|
| 210 |
+
Before creating a new dataset, I plan to try a few more approaches with the current one. So far, alternative techniques have not improved the scores without causing overfitting. If the remaining methods also fail, the only step left will be building a new dataset, and that is likely where I will stop development, since it would be both labor-intensive and costly for me.
|
| 211 |
+
|
| 212 |
+
---
|
| 213 |
+
|
| 214 |
## 👨🔬 Author & License
|
| 215 |
|
| 216 |
**Author**: Furkan Çoban
|