Wav2Vec2-XLS-R-300M for Gender Classification in Brazilian Portuguese Speech

A multi-phase fine-tuning approach for robust binary gender classification from raw audio, leveraging cross-domain adaptation for improved generalization.

1. Abstract

This work presents a fine-tuned Wav2Vec2-XLS-R-300M model for binary gender classification (Male / Female) from Brazilian Portuguese speech. The model was trained through a three-phase curriculum — linear probing, full fine-tuning, and cross-domain adaptation — and evaluated on two fully held-out benchmarks: 93.32% accuracy on FalaBrasil CETUC2 (100k+ samples) and 90.45% on emoUERJ. Audio inputs are resampled to 16 kHz and processed as raw waveforms.

Label	Class
`0`	Male
`1`	Female

2. Training

The model was trained in three incremental phases, each building on the previous checkpoint:

Phase	Strategy	Encoder	Learning Rate	Batch Size	Epochs (ES = early stopping)	Dataset	Val Acc
1	Linear Probing	Frozen	2e-5	8	5	small subset	86.63%
2	Full Fine-Tuning	Unfrozen	2e-5	8	4 (ES)	111,212 PT-BR samples	99.56%
3	Domain Adaptation	Unfrozen	5e-6	4	2 (ES)	Common Voice PT-BR, 4,372 balanced samples	98.51%
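
Phase 3 trains on a class-balanced subset of Common Voice. The card does not say how the balance was obtained. One common approach, shown here purely as an assumption, is to downsample the majority class to the size of the minority class. The function name `balance_by_downsampling` and the toy records are hypothetical.

```python
import random
from collections import defaultdict

def balance_by_downsampling(records, label_key="label", seed=0):
    """Return a shuffled subset with the same number of records per label."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    # Every class is cut down to the size of the smallest class
    n = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced

# Toy data (hypothetical): 7 clips labelled 0 (Male) and 3 labelled 1 (Female)
records = [{"path": f"clip_{i}.wav", "label": 0 if i < 7 else 1} for i in range(10)]
subset = balance_by_downsampling(records)
```

Downsampling discards data from the larger class; whether that trade-off was made here is not stated in the card.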

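Domain Shift. Phase 2 reached 99.56% validation accuracy on in-domain data but only 63.65% on Common Voice, which points to overfitting to the acoustic conditions of the training data. Phase 3 addressed this with a conservative adaptation step on Common Voice: a lower learning rate (5e-6) and at most two epochs, chosen to limit catastrophic forgetting of what Phase 2 had learned.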
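
The difference between the three phases comes down to which parameters receive gradients and which learning rate is used. The sketch below illustrates that on a toy model, not on the real Wav2Vec2 network. `ToyClassifier`, `configure_phase`, and `count_trainable` are hypothetical names, and AdamW is an assumption, since the card does not name the optimizer.

```python
import torch
from torch import nn

class ToyClassifier(nn.Module):
    """Stand-in for an encoder plus classification head (not the real Wav2Vec2)."""
    def __init__(self, in_dim=8, hidden=16, num_labels=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.head(self.encoder(x))

def configure_phase(model, freeze_encoder, lr):
    """Freeze or unfreeze the encoder, then build an optimizer over trainable params."""
    for param in model.encoder.parameters():
        param.requires_grad = not freeze_encoder
    trainable = [p for p in model.parameters() if p.requires_grad]
    # AdamW is an assumption; the card does not state which optimizer was used.
    return torch.optim.AdamW(trainable, lr=lr)

def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = ToyClassifier()

opt_phase1 = configure_phase(model, freeze_encoder=True, lr=2e-5)   # linear probing
n_phase1 = count_trainable(model)                                   # head only

opt_phase2 = configure_phase(model, freeze_encoder=False, lr=2e-5)  # full fine-tuning
n_phase2 = count_trainable(model)                                   # encoder + head

opt_phase3 = configure_phase(model, freeze_encoder=False, lr=5e-6)  # domain adaptation
```

With the actual `Wav2Vec2ForSequenceClassification` class, the `freeze_base_model()` method in Transformers should give the Phase 1 behaviour; whether the author used it is not stated.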

3. Evaluation

Both benchmarks below are fully out-of-domain — no samples were used during training or validation.

3.1 FalaBrasil CETUC2

Large-scale evaluation on 100,998 samples from the FalaBrasil CETUC2 read-speech corpus (50,000 male / 50,998 female).

Metric	Value
Accuracy	93.32%
F1-Macro	93.31%
Mean Confidence	95.31%

Class	Precision	Recall	F1-Score	Support
Male	89.51%	97.99%	93.56%	50,000
Female	97.83%	88.74%	93.06%	50,998

Confusion Matrix:
             |  Pred Male  |  Pred Female
True Male    |    48,996   |     1,004
True Female  |     5,744   |    45,254

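Note. The errors are asymmetric: 5,744 Female samples were predicted as Male, against 1,004 Male samples predicted as Female. This yields high Male recall (97.99%) with lower Male precision (89.51%), and high Female precision (97.83%) with lower Female recall (88.74%), which indicates a bias toward predicting Male. All ten of the highest-confidence errors were Female samples misclassified as Male.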
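
The per-class figures above follow directly from the confusion matrix. As a check, the sketch below recomputes accuracy, precision, recall, and F1 from the four counts using only the standard definitions; the function name `classification_report` is a hypothetical helper, not part of the author's evaluation code.

```python
def classification_report(matrix, labels):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {}
    for i, name in enumerate(labels):
        predicted_as_i = sum(row[i] for row in matrix)  # column sum
        support = sum(matrix[i])                        # row sum
        precision = matrix[i][i] / predicted_as_i
        recall = matrix[i][i] / support
        f1 = 2 * precision * recall / (precision + recall)
        report[name] = {"precision": precision, "recall": recall,
                        "f1": f1, "support": support}
    macro_f1 = sum(r["f1"] for r in report.values()) / len(labels)
    return correct / total, macro_f1, report

# Counts taken from the CETUC2 confusion matrix in this card
matrix = [[48_996, 1_004],
          [5_744, 45_254]]
accuracy, macro_f1, report = classification_report(matrix, ["Male", "Female"])

print(f"accuracy={accuracy:.2%}  macro-F1={macro_f1:.2%}")
for name, r in report.items():
    print(f"{name}: P={r['precision']:.2%}  R={r['recall']:.2%}  "
          f"F1={r['f1']:.2%}  n={r['support']}")
```

Rounded to two decimals, the output matches the values reported in the tables above.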

3.2 emoUERJ

Evaluated on 377 samples from the emoUERJ emotion-in-speech dataset — recorded under entirely different acoustic conditions.

Class	Precision	Recall	F1-Score
Male	0.94	0.85	0.89
Female	0.87	0.95	0.91
Macro	0.91	0.90	0.90

Accuracy: 90.45%

4. Usage

pip install transformers librosa torch

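import librosa
import torch
import torch.nn.functional as F
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

# Load the feature extractor and the fine-tuned classifier from the Hub
model_id  = "Soltsuky/wav2vec2-gender-classification-pt-br"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model     = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
model.eval()

# Load the audio as mono and resample it to the 16 kHz the model expects
audio, _ = librosa.load("audio.wav", sr=16000)
inputs   = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

# Forward pass without gradients; softmax turns logits into class probabilities
with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=-1)[0]

# Index 0 = Male, index 1 = Female (see the label table above)
label = ["MALE", "FEMALE"][torch.argmax(probs).item()]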
print(f"{label} — {probs.max().item()*100:.2f}%")
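
When several clips are classified at once, the post-processing step can be separated from the model call. The sketch below maps a batch of logits to label and confidence pairs. `decode_logits` and `ID2LABEL` are hypothetical names, and the logits are made up so that the example runs without downloading the model.

```python
import torch
import torch.nn.functional as F

# Label indices as given in the label table of this card
ID2LABEL = {0: "MALE", 1: "FEMALE"}

def decode_logits(logits):
    """Map a (batch, 2) logits tensor to a list of (label, confidence) pairs."""
    probs = F.softmax(logits, dim=-1)
    confidences, indices = probs.max(dim=-1)  # top probability and its class index
    return [(ID2LABEL[i], c)
            for i, c in zip(indices.tolist(), confidences.tolist())]

# Made-up logits for two clips, used only to exercise the function
fake_logits = torch.tensor([[2.0, 0.0],
                            [-1.0, 1.0]])
results = decode_logits(fake_logits)

for label, confidence in results:
    print(f"{label}: {confidence * 100:.2f}%")
```

With the real model, `model(**inputs).logits` computed under `torch.no_grad()` would take the place of `fake_logits`.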

5. Limitations

Trained exclusively on Brazilian Portuguese; other variants (PT-PT) were not evaluated.
Audio shorter than 1 second may produce lower confidence.
The model exhibits a Male prediction bias (higher Male recall, lower Female recall), likely due to distributional differences between training and evaluation data.
Voice-based gender classification carries ethical implications. This model is for research purposes only and should not be used to identify individuals without consent.
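
The short-audio limitation above can be handled with a simple guard around inference. The sketch below rejects clips under 1 second and predictions below a confidence threshold. The 1.0 s limit comes from the limitation listed above. The 0.80 confidence threshold and the function names are illustrative assumptions, not values recommended by the author.

```python
def clip_duration_seconds(num_samples, sampling_rate=16_000):
    """Duration of a mono clip given its sample count."""
    return num_samples / sampling_rate

def accept_prediction(num_samples, confidence, sampling_rate=16_000,
                      min_seconds=1.0, min_confidence=0.80):
    """Return (accepted, reason) for a single prediction.

    min_confidence=0.80 is an arbitrary illustrative threshold.
    """
    if clip_duration_seconds(num_samples, sampling_rate) < min_seconds:
        return False, "clip shorter than minimum duration"
    if confidence < min_confidence:
        return False, "confidence below threshold"
    return True, "ok"

# Hypothetical cases: a 0.5 s clip, a 2 s clip with low confidence, a 2 s clip with high confidence
print(accept_prediction(8_000, 0.99))
print(accept_prediction(32_000, 0.50))
print(accept_prediction(32_000, 0.95))
```

A suitable confidence threshold would have to be chosen on validation data for the target domain.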

6. Citation

@misc{soltsuky2026wav2vec2gender,
  title  = {Wav2Vec2-XLS-R-300M for Gender Classification in Brazilian Portuguese Speech},
  author = {Soltsuky},
  year   = {2026},
  url    = {https://huggingface.co/Soltsuky/wav2vec2-gender-classification-pt-br}
}

7. Acknowledgments

Base Model: facebook/wav2vec2-xls-r-300m — Meta AI (MIT)
Evaluation: FalaBrasil CETUC2, emoUERJ, Mozilla Common Voice (CC-0)
Fine-tuning: Soltsuky
