Wav2Vec2-XLS-R-300M for Gender Classification in Brazilian Portuguese Speech
A multi-phase fine-tuning approach for robust binary gender classification from raw audio, leveraging cross-domain adaptation for improved generalization.
1. Abstract
This work presents a fine-tuned Wav2Vec2-XLS-R-300M model for binary gender classification (Male / Female) from Brazilian Portuguese speech. The model was trained through a three-phase curriculum — linear probing, full fine-tuning, and cross-domain adaptation — and evaluated on two fully held-out benchmarks: 93.32% accuracy on FalaBrasil CETUC2 (100k+ samples) and 90.45% on emoUERJ. Audio inputs are resampled to 16 kHz and processed as raw waveforms.
| Label | Class |
|---|---|
| 0 | Male |
| 1 | Female |
2. Training
The model was trained in three incremental phases, each building on the previous checkpoint:
| Phase | Strategy | Encoder | LR | Batch | Epochs | Dataset | Val Acc |
|---|---|---|---|---|---|---|---|
| 1 | Linear Probing | Frozen | 2e-5 | 8 | 5 | small subset | 86.63% |
| 2 | Full Fine-Tuning | Unfrozen | 2e-5 | 8 | 4 (ES) | 111,212 PT-BR samples | 99.56% |
| 3 | Domain Adaptation | Unfrozen | 5e-6 | 4 | 2 (ES) | CV PT-BR — 4,372 balanced | 98.51% |
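The Encoder column separates linear probing (frozen encoder, only the classification head trains) from full fine-tuning (all weights train). The card does not say how the freezing was implemented; the sketch below is one way to do it with `transformers`, using `freeze_base_model()` for Phase 1 and re-enabling gradients for the later phases. The tiny randomly initialised config is an assumption made so the sketch runs without downloading weights; the actual model starts from `facebook/wav2vec2-xls-r-300m`.

```python
from transformers import Wav2Vec2Config, Wav2Vec2ForSequenceClassification

# Tiny random config (illustrative sizes, NOT the XLS-R-300M architecture).
# For real training one would instead call
# Wav2Vec2ForSequenceClassification.from_pretrained(
#     "facebook/wav2vec2-xls-r-300m", num_labels=2)
tiny_config = Wav2Vec2Config(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    conv_dim=(32, 32),
    conv_stride=(5, 2),
    conv_kernel=(10, 3),
    num_conv_pos_embeddings=16,
    num_conv_pos_embedding_groups=2,
    classifier_proj_size=16,
    num_labels=2,
)
model = Wav2Vec2ForSequenceClassification(tiny_config)


def trainable_parameter_names(m):
    """Names of all parameters that will receive gradient updates."""
    return [name for name, p in m.named_parameters() if p.requires_grad]


# Phase 1 (linear probing): freeze the encoder, train only the head.
model.freeze_base_model()
phase1_trainable = trainable_parameter_names(model)

# Phases 2 and 3 (encoder unfrozen): re-enable gradients on the encoder.
# The table lists LR 2e-5 for Phase 2 and 5e-6 for Phase 3.
for p in model.wav2vec2.parameters():
    p.requires_grad = True
later_trainable = trainable_parameter_names(model)

print(len(phase1_trainable), "trainable tensors in Phase 1")
print(len(later_trainable), "trainable tensors once unfrozen")
```

After `freeze_base_model()`, only the `projector` and `classifier` layers remain trainable, which matches the "Frozen" entry for Phase 1.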
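Domain Shift. Phase 2 achieved 99.56% on in-domain data but only 63.65% on Common Voice, revealing overfitting to the acoustic conditions of the training data. Phase 3 addressed this by continuing training on 4,372 balanced Common Voice PT-BR samples at a reduced learning rate (5e-6 instead of 2e-5), a conservative setting chosen to limit catastrophic forgetting.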
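The Phase 3 subset is described as balanced, but the card does not say how it was built. A minimal sketch, assuming the common approach of downsampling every class to the size of the smallest one, is shown here. The function name, the `(item, label)` sample format, and the toy file names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(samples, seed=0):
    """Return a class-balanced subset of (item, label) pairs.

    Every class is randomly downsampled to the size of the smallest class.
    """
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))

    n_per_class = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible

    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n_per_class))
    rng.shuffle(balanced)
    return balanced


# Toy data: 10 clips labelled 0 (Male) and 4 labelled 1 (Female).
toy = [(f"clip_{i}.wav", 0) for i in range(10)]
toy += [(f"clip_{i}.wav", 1) for i in range(10, 14)]

subset = balance_by_downsampling(toy)
print(Counter(label for _, label in subset))
```

With two classes, an exactly even split of 4,372 samples would be 2,186 per class.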
3. Evaluation
Both benchmarks below are fully out-of-domain — no samples were used during training or validation.
3.1 FalaBrasil CETUC2
Large-scale evaluation on 100,998 samples from the FalaBrasil CETUC2 read-speech corpus (50,000 male / 50,998 female).
| Metric | Value |
|---|---|
| Accuracy | 93.32% |
| F1-Macro | 93.31% |
| Mean Confidence | 95.31% |
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Male | 89.51% | 97.99% | 93.56% | 50,000 |
| Female | 97.83% | 88.74% | 93.06% | 50,998 |
Confusion Matrix:
| | Pred Male | Pred Female |
|---|---|---|
| True Male | 48,996 | 1,004 |
| True Female | 5,744 | 45,254 |
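Note. The errors are asymmetric: 5,744 Female samples were predicted as Male, against 1,004 Male samples predicted as Female. This gives higher recall for Male (97.99%) and higher precision for Female (97.83%), i.e. a bias toward predicting Male on this corpus. All top-10 highest-confidence errors were Female samples misclassified as Male.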
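The per-class figures in this section follow directly from the confusion matrix. The sketch below recomputes them in plain Python as a consistency check; the helper names are illustrative and not part of the model's code.

```python
# Confusion matrix from section 3.1: outer key = true class, inner key = prediction.
confusion = {
    "Male": {"Male": 48_996, "Female": 1_004},
    "Female": {"Male": 5_744, "Female": 45_254},
}


def per_class_metrics(cm, cls):
    """Return (precision, recall, f1) for one class."""
    tp = cm[cls][cls]
    fn = sum(v for pred, v in cm[cls].items() if pred != cls)
    fp = sum(cm[true][cls] for true in cm if true != cls)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(cm[c][c] for c in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total


metrics = {cls: per_class_metrics(confusion, cls) for cls in confusion}
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

for cls, (p, r, f1) in metrics.items():
    print(f"{cls}: precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
print(f"accuracy={accuracy(confusion):.2%} macro_f1={macro_f1:.2%}")
```

The recomputed values agree with the tables above: 93.32% accuracy and 93.31% macro F1.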
3.2 emoUERJ
Evaluated on 377 samples from the emoUERJ emotion-in-speech dataset — recorded under entirely different acoustic conditions.
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Male | 0.94 | 0.85 | 0.89 |
| Female | 0.87 | 0.95 | 0.91 |
| Macro | 0.91 | 0.90 | 0.90 |
Accuracy: 90.45%
4. Usage
pip install transformers librosa torch
import librosa, torch, torch.nn.functional as F
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification
model_id = "Soltsuky/wav2vec2-gender-classification-pt-br"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
model.eval()
audio, _ = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=-1)[0]
label = ["MALE", "FEMALE"][torch.argmax(probs).item()]
print(f"{label} — {probs.max().item()*100:.2f}%")
5. Limitations
- Trained exclusively on Brazilian Portuguese; other variants (PT-PT) were not evaluated.
- Audio shorter than 1 second may produce lower confidence.
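- On CETUC2 the model exhibits a Male prediction bias (97.99% Male recall vs. 88.74% Female recall), likely due to distributional differences between training and evaluation data. On emoUERJ the pattern reverses (Female recall 0.95 vs. Male recall 0.85), so the direction of the bias depends on the recording domain.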
- Voice-based gender classification carries ethical implications. This model is for research purposes only and should not be used to identify individuals without consent.
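For the short-audio limitation above, a caller may want to flag clips under one second before trusting the prediction. A minimal sketch, assuming the 16 kHz sample rate used in the Usage section; the helper names and the choice to only flag (rather than reject) short clips are illustrative assumptions, not part of the released model.

```python
def duration_seconds(num_samples, sample_rate=16_000):
    """Clip length in seconds for a mono waveform with `num_samples` samples."""
    return num_samples / sample_rate


def is_too_short(num_samples, sample_rate=16_000, min_seconds=1.0):
    """True when the clip is shorter than the 1-second threshold from the limitations list."""
    return duration_seconds(num_samples, sample_rate) < min_seconds


# With the Usage snippet, `len(audio)` gives num_samples after librosa.load(..., sr=16000).
print(is_too_short(8_000))   # half a second at 16 kHz
print(is_too_short(32_000))  # two seconds at 16 kHz
```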
6. Citation
@misc{soltsuky2026wav2vec2gender,
title = {Wav2Vec2-XLS-R-300M for Gender Classification in Brazilian Portuguese Speech},
author = {Soltsuky},
year = {2026},
url = {https://huggingface.co/Soltsuky/wav2vec2-gender-classification-pt-br}
}
7. Acknowledgments
- Base Model: facebook/wav2vec2-xls-r-300m — Meta AI (MIT)
- Evaluation: FalaBrasil CETUC2, emoUERJ, Mozilla Common Voice (CC-0)
- Fine-tuning: Soltsuky