language: en
tags:
  - text-classification
  - gender
  - gender-prediction
  - transformers
  - deberta
license: mit
datasets:
  - samzirbo/europarl.en-es.gendered
  - czyzi0/luna-speech-dataset
  - czyzi0/pwr-azon-speech-dataset
  - sagteam/author_profiling
  - kaushalgawri/nptel-en-tags-and-gender-v0
metrics:
  - accuracy
  - f1
  - precision
  - recall
base_model: microsoft/deberta-v3-large
pipeline_tag: text-classification
model-index:
  - name: gender_prediction_model_from_text
    results:
      - task:
          type: text-classification
          name: Text Classification
        metrics:
          - type: f1
            value: 0.69
          - type: accuracy
            value: 0.69
citations:
  - |-
    @misc{fc63_gender1_2025,
      title = {Gender Prediction from Text},
      author = {Çoban, Furkan},
      year = {2025},
      howpublished = {\url{https://doi.org/10.5281/zenodo.15619489}},
      note = {DeBERTa-v3-large model fine-tuned on multi-domain gender-labeled texts}
    }

Gender Prediction from Text ✍️ → 👩‍🦰👨

This model predicts the likely gender of an anonymous speaker or writer based solely on the content of an English text. It is built on DeBERTa-v3-large and fine-tuned on a diverse, multi-domain dataset of formal and informal texts, part of which was translated into English from Polish and Russian.

📍 Space link: 🔗 Try it out on Hugging Face Spaces
📁 Model repo: 🔗 View on Hugging Face Hub
🧠 Source code: GitHub


📊 Model Summary

  • Base model: microsoft/deberta-v3-large
  • Fine-tuned on: binary gender classification task (female vs male)
  • Best F1 Score: 0.69 on a balanced multi-domain test set
  • Max token length: 128
  • Evaluation Metrics:
    • F1: 0.69
    • Accuracy: 0.69
    • Precision: 0.69
    • Recall: 0.69

📂 Evaluation: View on Notebook


🧾 Datasets Used

| Dataset | Domain | Language |
| --- | --- | --- |
| samzirbo/europarl.en-es.gendered | Formal speech (Parliament) | English |
| czyzi0/luna-speech-dataset | Phone conversations | Polish → translated to English |
| czyzi0/pwr-azon-speech-dataset | Phone conversations | Polish → translated to English |
| sagteam/author_profiling | Social posts | Russian → translated to English |
| kaushalgawri/nptel-en-tags-and-gender-v0 | Spoken transcripts | English |
| Blog Authorship Corpus | Blog posts | English |

All datasets were normalized, translated if necessary, deduplicated, and balanced via random undersampling to ensure equal representation of both genders.
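
The cleanup, deduplication, and balancing steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the cleanup rules, the function names `normalize` and `dedupe_and_balance`, the fixed seed, and the sample rows are all assumptions made for the example.

```python
import html
import random
import re
from collections import Counter, defaultdict

# Hypothetical cleanup rules; the model card does not list the exact ones.
_CHAR_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})

def normalize(text):
    """Unescape entities, drop HTML tags, unify quotes/dashes, collapse whitespace."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.translate(_CHAR_MAP)
    return re.sub(r"\s+", " ", text).strip()

def dedupe_and_balance(samples, seed=42):
    """samples: iterable of (text, label). Returns a balanced list of (text, label)."""
    seen = set()
    by_label = defaultdict(list)
    for text, label in samples:
        clean = normalize(text)
        if clean and clean not in seen:      # keep the first occurrence only
            seen.add(clean)
            by_label[label].append((clean, label))
    if not by_label:
        return []
    # Random undersampling: shrink every class to the size of the smallest one.
    rng = random.Random(seed)
    target = min(len(items) for items in by_label.values())
    balanced = [pair for items in by_label.values() for pair in rng.sample(items, target)]
    rng.shuffle(balanced)
    return balanced

# Made-up rows, for illustration only.
raw = [
    ("<p>I went home.</p>", "female"),
    ("I went home.", "female"),              # duplicate after normalization
    ("We met on Monday.", "female"),
    ("Thanks for calling.", "female"),
    ("The vote is tomorrow.", "male"),
    ("See you \u2013 later.", "male"),
]
balanced = dedupe_and_balance(raw)
print(sorted(Counter(label for _, label in balanced).items()))
# [('female', 2), ('male', 2)]
```

Undersampling discards data from the larger class, which is the price paid for an exactly balanced label distribution.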


🛠️ Preprocessing & Training

  • Normalization: Cleaned quotes, dashes, placeholders, noise, and HTML/code from all datasets.
  • Translation: Used Helsinki-NLP/opus-mt-* models for Polish and Russian data.
  • Undersampling: Random undersampling to balance male and female samples.
  • Training Strategy:
    • LR Finder used to select the learning rate (2.66e-6)
    • Fine-tuned using early stopping on both F1 and loss
    • Step-based evaluation every 250 steps
    • Best checkpoint at step 24,750 saved and evaluated
  • Second Phase Fine-tuning:
    • Performed on full merged dataset for 2 epochs
    • Used cosine learning rate scheduler and warm-up steps
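
To make the training strategy more concrete, here is a small pure-Python sketch of a cosine schedule with linear warm-up and of an early-stopping check that watches both F1 and loss. The learning rate and the 250-step evaluation interval come from the list above; the total step count, the warm-up length, the patience value, and the rule "stop when neither metric improves" are assumptions, since the card does not specify them. It is also an assumption that the second phase reused the same peak learning rate. In Hugging Face Transformers, `get_cosine_schedule_with_warmup` produces a schedule of this shape.

```python
import math

def cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear warm-up to peak_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


class EarlyStopper:
    """Signal a stop when neither validation F1 nor validation loss has
    improved for `patience` consecutive evaluations (one possible reading
    of "early stopping on both F1 and loss")."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_f1 = float("-inf")
        self.best_loss = float("inf")
        self.bad_evals = 0

    def update(self, f1, loss):
        improved = False
        if f1 > self.best_f1:
            self.best_f1 = f1
            improved = True
        if loss < self.best_loss:
            self.best_loss = loss
            improved = True
        self.bad_evals = 0 if improved else self.bad_evals + 1
        return self.bad_evals >= self.patience  # True means "stop training"


PEAK_LR = 2.66e-6       # learning rate reported in the model card
EVAL_EVERY = 250        # evaluation interval reported in the model card
TOTAL_STEPS = 30_000    # hypothetical
WARMUP_STEPS = 1_500    # hypothetical

# Learning rate at a few points of the schedule.
for step in (0, 750, 1_500, 15_750, 30_000):
    print(step, cosine_lr_with_warmup(step, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR))
```

In a training loop, `EarlyStopper.update` would be called once every `EVAL_EVERY` steps with the current validation F1 and loss, and training would end the first time it returns `True`.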

📈 Performance (on full merged test set)

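| Class | Precision | Recall | F1-Score | Accuracy | Support |
| --- | --- | --- | --- | --- | --- |
| Female | 0.70 | 0.65 | 0.68 | | 591,027 |
| Male | 0.68 | 0.72 | 0.70 | | 591,027 |
| Macro Avg | 0.69 | 0.69 | 0.69 | | 1,182,054 |
| Accuracy | | | | 0.69 | 1,182,054 |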
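
As a quick consistency check, the aggregate rows can be recomputed from the per-class rows. The sketch below uses only the rounded values from the table, so the results agree with the reported figures to about two decimals rather than exactly. The variable names are illustrative.

```python
# Per-class values copied from the table (already rounded to two decimals there).
report = {
    "Female": {"precision": 0.70, "recall": 0.65, "f1": 0.68, "support": 591_027},
    "Male":   {"precision": 0.68, "recall": 0.72, "f1": 0.70, "support": 591_027},
}

total = sum(row["support"] for row in report.values())

# Macro average: unweighted mean of each metric over the classes.
macro = {
    key: sum(row[key] for row in report.values()) / len(report)
    for key in ("precision", "recall", "f1")
}

# Accuracy equals the support-weighted mean of per-class recall,
# because recall * support is the number of correct predictions per class.
accuracy = sum(row["recall"] * row["support"] for row in report.values()) / total

print(total)      # 1182054
print(macro)
print(accuracy)
```

Because the two classes have equal support, accuracy and macro recall coincide, which is why both land at roughly 0.69.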

📦 Usage Example

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Use a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "fc63/gender_prediction_model_from_text"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval().to(device)

def predict(text):
    # Truncate to 128 tokens, the maximum length listed in the model summary.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
        probs = F.softmax(outputs.logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()
    confidence = round(probs[0][pred].item() * 100, 1)
    # Class index 0 is Female, index 1 is Male.
    gender = "Female" if pred == 0 else "Male"
    return f"{gender} (Confidence: {confidence}%)"

sample_text = "I love writing in my journal every night. It helps me reflect on the day and plan for tomorrow."
print(predict(sample_text))

Output of this sample:

Female (Confidence: 84.1%)
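
The confidence value is the softmax probability of the predicted class, expressed as a percentage. The following dependency-free sketch reproduces that post-processing step on plain Python lists. The helper name `decode` and the logit values are hypothetical; only the label order (0 for Female, 1 for Male) and the output format are taken from the usage example.

```python
import math

ID2LABEL = {0: "Female", 1: "Male"}  # same order as in the usage example

def decode(logits):
    """Turn a list of two raw logits into the label/confidence string."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the largest probability; a tie resolves to index 0 here.
    pred = max(range(len(probs)), key=probs.__getitem__)
    confidence = round(probs[pred] * 100, 1)
    return f"{ID2LABEL[pred]} (Confidence: {confidence}%)"

print(decode([1.2, -0.4]))   # hypothetical logits favouring class 0
print(decode([-1.0, 1.0]))   # hypothetical logits favouring class 1
```

With only two classes, a confidence close to 50% means the logits were nearly equal, so such predictions carry little information.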

📌 Future Work & Limitations

I do not want to leave this model at an accuracy and F1 score of 0.69.

As far as I can tell at this point, the model is biased towards predicting emotional, psychological, and introspective texts as female, and more direct, result-oriented writing as male. Correcting this would require a large, carefully labeled dataset that contains many counterexamples to this pattern.

The datasets used to train this model had to be obtained from open-source platforms, which limited the range of accessible data.

To make further progress, I need to create and label a larger dataset myself, which requires a significant amount of time, effort, and cost.

Before moving to dataset creation, I plan to try a few more approaches using the current dataset. So far, alternative techniques have not improved the scores without causing overfitting. If none of the remaining methods work, the only step left will be building a new dataset. That will likely be the point where I stop development, as it would be both labor-intensive and costly for me.


👨‍🔬 Author & License

Author: Furkan Çoban
Project: CENG-481 Gender Prediction Model
License: MIT