Wav2Vec2-XLS-R-300M for Gender Classification in Brazilian Portuguese Speech

A multi-phase fine-tuning approach for robust binary gender classification from raw audio, leveraging cross-domain adaptation for improved generalization.


1. Abstract

This work presents a fine-tuned Wav2Vec2-XLS-R-300M model for binary gender classification (Male / Female) from Brazilian Portuguese speech. The model was trained with a three-phase curriculum (linear probing, full fine-tuning, and cross-domain adaptation) and evaluated on two fully held-out benchmarks, reaching 93.32% accuracy on FalaBrasil CETUC2 (100,998 samples) and 90.45% on emoUERJ (377 samples). Audio inputs are resampled to 16 kHz and processed as raw waveforms.

| Label | Class  |
|-------|--------|
| 0     | Male   |
| 1     | Female |

2. Training

The model was trained in three incremental phases, each building on the previous checkpoint:

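| Phase | Strategy          | Encoder  | LR   | Batch | Epochs | Dataset                   | Val Acc |
|-------|-------------------|----------|------|-------|--------|---------------------------|---------|
| 1     | Linear Probing    | Frozen   | 2e-5 | 8     | 5      | small subset              | 86.63%  |
| 2     | Full Fine-Tuning  | Unfrozen | 2e-5 | 8     | 4 (ES) | 111,212 PT-BR samples     | 99.56%  |
| 3     | Domain Adaptation | Unfrozen | 5e-6 | 4     | 2 (ES) | CV PT-BR, 4,372 balanced  | 98.51%  |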

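Domain Shift. Phase 2 reached 99.56% on in-domain validation data but only 63.65% on Common Voice, which points to overfitting to the acoustic conditions of the training corpus. Phase 3 addressed this by adapting on a balanced Common Voice subset with a reduced learning rate (5e-6) to limit catastrophic forgetting, and reached 98.51% validation accuracy in that phase.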
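
The three-phase schedule in the training table can be written down as a small configuration, which makes the difference between the phases explicit: only the encoder's trainable state, the learning rate, and the batch size change. This is a minimal sketch and not the author's training script. The names `Phase`, `PHASES`, and `apply_phase` are hypothetical, "(ES)" is assumed to mean early stopping, and `model.wav2vec2` is assumed to be the encoder attribute, as it is in `Wav2Vec2ForSequenceClassification` from transformers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    freeze_encoder: bool   # True = linear probing (only the head trains)
    learning_rate: float
    batch_size: int
    epochs: int            # treated as a cap where the table marks "(ES)"


# Values copied from the training table; the names are illustrative.
PHASES = (
    Phase("linear_probing", True, 2e-5, 8, 5),
    Phase("full_fine_tuning", False, 2e-5, 8, 4),
    Phase("domain_adaptation", False, 5e-6, 4, 2),
)


def apply_phase(model, phase):
    """Freeze or unfreeze the encoder and return the optimizer settings.

    Assumes `model.wav2vec2` holds the encoder and exposes `parameters()`.
    """
    for param in model.wav2vec2.parameters():
        param.requires_grad = not phase.freeze_encoder
    return {
        "lr": phase.learning_rate,
        "batch_size": phase.batch_size,
        "epochs": phase.epochs,
    }
```

Each phase would start from the checkpoint produced by the previous one. The lower learning rate in the last phase is what keeps the adaptation conservative.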


3. Evaluation

Both benchmarks below are fully out-of-domain: no samples from either corpus were used during training or validation.

3.1 FalaBrasil CETUC2

Large-scale evaluation on 100,998 samples from the FalaBrasil CETUC2 read-speech corpus (50,000 male / 50,998 female).

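| Metric          | Value  |
|-----------------|--------|
| Accuracy        | 93.32% |
| F1-Macro        | 93.31% |
| Mean Confidence | 95.31% |

| Class  | Precision | Recall | F1-Score | Support |
|--------|-----------|--------|----------|---------|
| Male   | 89.51%    | 97.99% | 93.56%   | 50,000  |
| Female | 97.83%    | 88.74% | 93.06%   | 50,998  |

Confusion Matrix:

|             | Pred Male | Pred Female |
|-------------|-----------|-------------|
| True Male   | 48,996    | 1,004       |
| True Female | 5,744     | 45,254      |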

Note. The model shows higher recall for Male (97.99%) and higher precision for Female (97.83%), which indicates a bias toward predicting Male: 5,744 Female samples were predicted as Male, against 1,004 Male samples predicted as Female. All ten of the highest-confidence errors were Female samples misclassified as Male.
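
The per-class figures for CETUC2 follow directly from the four confusion-matrix counts. The sketch below recomputes them with plain arithmetic as a consistency check. The variable and function names are illustrative and are not taken from the author's evaluation code.

```python
# Counts copied from the CETUC2 confusion matrix.
TRUE_MALE_PRED_MALE = 48_996
TRUE_MALE_PRED_FEMALE = 1_004
TRUE_FEMALE_PRED_MALE = 5_744
TRUE_FEMALE_PRED_FEMALE = 45_254


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# For Male, a false positive is a Female sample predicted as Male.
male = precision_recall_f1(
    tp=TRUE_MALE_PRED_MALE,
    fp=TRUE_FEMALE_PRED_MALE,
    fn=TRUE_MALE_PRED_FEMALE,
)
# For Female, a false positive is a Male sample predicted as Female.
female = precision_recall_f1(
    tp=TRUE_FEMALE_PRED_FEMALE,
    fp=TRUE_MALE_PRED_FEMALE,
    fn=TRUE_FEMALE_PRED_MALE,
)

total = (
    TRUE_MALE_PRED_MALE + TRUE_MALE_PRED_FEMALE
    + TRUE_FEMALE_PRED_MALE + TRUE_FEMALE_PRED_FEMALE
)
accuracy = (TRUE_MALE_PRED_MALE + TRUE_FEMALE_PRED_FEMALE) / total
macro_f1 = (male[2] + female[2]) / 2  # unweighted mean of the class F1 scores

for name, (p, r, f) in (("Male", male), ("Female", female)):
    print(f"{name}: precision={p:.2%} recall={r:.2%} f1={f:.2%}")
print(f"accuracy={accuracy:.2%} macro_f1={macro_f1:.2%}")
```

The printed values agree with the tables to two decimal places. The asymmetry between the two off-diagonal counts is what produces the gap between Male and Female recall.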

3.2 emoUERJ

Evaluated on 377 samples from the emoUERJ emotion-in-speech dataset, which was recorded under acoustic conditions entirely different from those of the training data.

| Class  | Precision | Recall | F1-Score |
|--------|-----------|--------|----------|
| Male   | 0.94      | 0.85   | 0.89     |
| Female | 0.87      | 0.95   | 0.91     |
| Macro  | 0.91      | 0.90   | 0.90     |

Accuracy: 90.45%


4. Usage

Install the dependencies:

pip install transformers librosa torch

Then run inference on a single file:

import librosa
import torch
import torch.nn.functional as F
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_id  = "Soltsuky/wav2vec2-gender-classification-pt-br"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model     = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
model.eval()  # disable dropout for inference

# Load as mono and resample to the 16 kHz rate the model expects
audio, _ = librosa.load("audio.wav", sr=16000)
inputs   = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    # Softmax over the two class logits; [0] selects the single item in the batch
    probs = F.softmax(model(**inputs).logits, dim=-1)[0]

# Index order follows the label table: 0 = Male, 1 = Female
label = ["MALE", "FEMALE"][torch.argmax(probs).item()]
print(f"{label} — {probs.max().item()*100:.2f}%")

5. Limitations

  • Trained exclusively on Brazilian Portuguese; other variants (PT-PT) were not evaluated.
  • Audio shorter than 1 second may produce lower-confidence predictions.
  • The model exhibits a Male prediction bias (higher Male recall, lower Female recall), likely due to distributional differences between training and evaluation data.
  • Voice-based gender classification carries ethical implications. This model is for research purposes only and should not be used to identify individuals without consent.
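
One way to act on the short-audio limitation is to abstain when a clip is under 1 second or when the top probability is low. The following is a minimal sketch under stated assumptions: the function `decide`, the `"UNCERTAIN"` label, and the 0.80 confidence threshold are hypothetical and are not part of the released model. Only the 16 kHz rate, the 1-second figure, and the label order come from this card.

```python
SAMPLE_RATE = 16_000          # input rate stated in the abstract
MIN_SECONDS = 1.0             # from the short-audio limitation
MIN_CONFIDENCE = 0.80         # hypothetical threshold, tune on your own data
LABELS = ("MALE", "FEMALE")   # index order from the label table


def decide(num_samples, probs,
           min_seconds=MIN_SECONDS, min_confidence=MIN_CONFIDENCE):
    """Return a label, or "UNCERTAIN" for short or low-confidence clips.

    `num_samples` is the waveform length at 16 kHz; `probs` is a sequence
    of two class probabilities (for example `probs.tolist()` from Usage).
    """
    duration = num_samples / SAMPLE_RATE
    if duration < min_seconds:
        return "UNCERTAIN"
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < min_confidence:
        return "UNCERTAIN"
    return LABELS[best]
```

A confidence threshold will not remove every mistake, since the evaluation above reports high-confidence errors, so this only filters the least reliable cases.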

6. Citation

@misc{soltsuky2026wav2vec2gender,
  title  = {Wav2Vec2-XLS-R-300M for Gender Classification in Brazilian Portuguese Speech},
  author = {Soltsuky},
  year   = {2026},
  url    = {https://huggingface.co/Soltsuky/wav2vec2-gender-classification-pt-br}
}

7. Acknowledgments

