Gender Voice Classifier: Sub-1MB Bi-LSTM
A lightweight voice gender classifier designed as a preprocessing component for real-time voice AI pipelines. It is exported to ONNX, runs on CPU in about 4 ms per 3-second clip (3.7 to 6.6 ms across the benchmark sets below), and needs no PyTorch at inference time.
Model size: 0.64 MB | Parameters: 166K | Inference: ~4 ms (CPU, single-threaded)
Motivation
We build voice AI assistants for clients across European markets. In languages with grammatical gender (Polish, German, French, Spanish, Italian), addressing someone requires correct inflection of adjectives, verb forms, and honorifics. Human agents recognise the caller's gender from their voice in the first seconds of a call and adjust naturally. This model gives voice pipelines the same capability.
Usage
import numpy as np
import librosa
import onnxruntime as ort
# Load model
session = ort.InferenceSession("gender_classifier_200k.onnx")
# Load and preprocess audio (16kHz mono, 3s clip)
audio, _ = librosa.load("your_audio.wav", sr=16000, mono=True)
audio = audio[:48000] # truncate to 3s
# Extract MFCCs
mfcc = librosa.feature.mfcc(
y=audio, sr=16000, n_mfcc=40, n_fft=512, hop_length=160, n_mels=80
)
mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
mfcc = mfcc[np.newaxis, :, :].astype(np.float32) # (1, 40, T)
# Predict
logit = session.run(["logits"], {"mfcc": mfcc})[0][0, 0]
prob_female = 1 / (1 + np.exp(-logit))
gender = "female" if prob_female > 0.5 else "male"
print(gender, f"{prob_female:.2%}")
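The snippet above truncates long recordings but passes shorter ones through unchanged. The source does not say whether the exported model accepts inputs shorter than the 3-second window listed under Architecture, so one cautious option is to bring every clip to exactly 48000 samples before extracting MFCCs. A minimal sketch: `fix_clip_length` is a hypothetical helper, and filling short clips with trailing zeros is an assumption, not the author's documented preprocessing.

```python
import numpy as np

SAMPLE_RATE = 16_000
CLIP_SAMPLES = 3 * SAMPLE_RATE  # 48000 samples = 3 s at 16 kHz


def fix_clip_length(audio: np.ndarray, target: int = CLIP_SAMPLES) -> np.ndarray:
    """Truncate or zero-pad a mono waveform to exactly `target` samples."""
    if len(audio) >= target:
        return audio[:target]
    # Assumption: trailing silence is an acceptable way to fill short clips.
    return np.pad(audio, (0, target - len(audio)))


short = np.ones(16_000, dtype=np.float32)  # 1 s clip
long = np.ones(80_000, dtype=np.float32)   # 5 s clip
print(fix_clip_length(short).shape, fix_clip_length(long).shape)  # (48000,) (48000,)
```

With `hop_length=160` and librosa's default centered framing, a 48000-sample clip yields 1 + 48000 // 160 = 301 frames, so the model input would have shape (1, 40, 301). `librosa.util.fix_length` performs the same truncate-or-pad operation if you prefer a library call.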
Benchmark Results
Evaluated on four held-out test sets (none seen during training):
| Dataset | Accuracy | Male Acc | Female Acc | F1 | Avg Inference |
|---|---|---|---|---|---|
| LibriSpeech test-clean | 94.4% | 95.0% | 93.8% | 0.947 | 4.2 ms |
| LibriSpeech test-other | 90.9% | 83.6% | 99.3% | 0.911 | 3.8 ms |
| FLEURS test (EN/DE/FR/ES/IT) | 94.3% | 90.4% | 99.5% | 0.938 | 6.6 ms |
| Edinburgh International Accents (EdAcc) | 75.6% | 86.1% | 50.7% | 0.551 | 3.7 ms |
Inference measured on CPU, single-threaded ONNX Runtime.
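The source does not describe how the latency figures were collected. As a rough way to reproduce that kind of measurement, the sketch below times any zero-argument callable after a few warm-up runs. `time_inference`, the warm-up count, and the run count are all illustrative choices, not the author's benchmark script.

```python
import statistics
import time


def time_inference(run_once, n_warmup: int = 10, n_runs: int = 100):
    """Return (mean_ms, median_ms) over `n_runs` calls of `run_once`."""
    for _ in range(n_warmup):
        run_once()  # warm-up calls are executed but not timed
    samples_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples_ms), statistics.median(samples_ms)


# Hypothetical usage with ONNX Runtime pinned to one thread (assumed setup):
#   opts = ort.SessionOptions()
#   opts.intra_op_num_threads = 1
#   opts.inter_op_num_threads = 1
#   session = ort.InferenceSession("gender_classifier_200k.onnx", sess_options=opts)
#   mean_ms, median_ms = time_inference(
#       lambda: session.run(["logits"], {"mfcc": mfcc}))
```

Reporting the median alongside the mean helps when a few runs are slowed by unrelated system activity.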
Scope: The target distribution is standard-accent speech in the five training languages (EN, DE, FR, ES, IT). EdAcc is included as an out-of-scope stress test on strongly accented international English; it is not representative of the production deployment target. For speaker populations beyond the target distribution, retrain with accented corpora such as VCTK.
Architecture
- 2-layer Bidirectional LSTM, hidden size 64 per direction
- Soft attention pooling over time steps
- Classifier head: Linear(128→32) → ReLU → Dropout → Linear(32→1)
- Input: 40 MFCC coefficients, 3-second clips at 16 kHz
- Output: single logit, sigmoid > 0.5 → female
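As a sanity check on the 166K parameter figure, the layer sizes above can be tallied by hand. The Bi-LSTM and classifier head follow the list directly. The attention scorer's shape is not given in the source, so the `Linear(128, 64) -> tanh -> Linear(64, 1)` form used here is purely an assumption, chosen because it is a common design and happens to land near the stated total.

```python
def lstm_params(input_size: int, hidden: int) -> int:
    """Parameters of one LSTM direction: 4 gates, each with input weights,
    recurrent weights, and two bias vectors (the PyTorch layout)."""
    return 4 * (input_size * hidden + hidden * hidden + 2 * hidden)


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


HIDDEN = 64
N_MFCC = 40

# Two bidirectional layers; layer 2 sees the concatenated 2 * HIDDEN outputs.
bilstm = 2 * lstm_params(N_MFCC, HIDDEN) + 2 * lstm_params(2 * HIDDEN, HIDDEN)

# Classifier head: Linear(128, 32) and Linear(32, 1).
head = linear_params(128, 32) + linear_params(32, 1)

# ASSUMPTION: attention scorer is Linear(128, 64) -> tanh -> Linear(64, 1).
attention = linear_params(128, 64) + linear_params(64, 1)

total = bilstm + head + attention
print(bilstm, head, attention, total)             # 153600 4161 8321 166082
print(f"{total * 4 / 2**20:.2f} MiB as float32")  # 0.63 MiB as float32
```

Under that assumption the weights alone come to roughly 0.63 MiB in float32, which is in line with the reported 0.64 MB file once ONNX graph metadata is included.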
Training
- Data: LibriSpeech train-clean-100 (EN) + FLEURS train split (EN/DE/FR/ES/IT)
- Balanced: 50/50 male/female by undersampling
- Optimizer: AdamW, lr=1e-3, cosine annealing, 20 epochs
- Infrastructure: Single T4 GPU via Modal.com
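Two of the training choices above, class balancing by undersampling and cosine annealing, can be sketched in a few lines. The function names, the seed, the per-epoch stepping, and the final learning rate of 0 are all assumptions for illustration; the source only states a 50/50 split, lr=1e-3, and 20 epochs.

```python
import math

import numpy as np


def undersample_balanced(labels: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return shuffled indices with equal counts of label 0 and label 1."""
    rng = np.random.default_rng(seed)
    idx0 = np.flatnonzero(labels == 0)
    idx1 = np.flatnonzero(labels == 1)
    n = min(len(idx0), len(idx1))  # size of the minority class
    keep = np.concatenate([
        rng.choice(idx0, size=n, replace=False),
        rng.choice(idx1, size=n, replace=False),
    ])
    rng.shuffle(keep)
    return keep


def cosine_lr(epoch: int, total_epochs: int = 20, base_lr: float = 1e-3,
              min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate, stepped once per epoch (assumption)."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cos)


labels = np.array([0] * 70 + [1] * 30)  # toy labels: 0 = male, 1 = female
keep = undersample_balanced(labels)
print(np.bincount(labels[keep]))        # [30 30]
print([cosine_lr(e) for e in (0, 10, 20)])
```

Undersampling discards majority-class clips instead of duplicating minority ones, which keeps the training set free of repeated utterances at the cost of using less data.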
Limitations
- Accented speech: The model targets standard-accent speech in the five training languages. On strongly accented international English (EdAcc), accuracy drops to 75.6% overall and 50.7% for female speakers. Retrain with accented corpora such as VCTK for broader speaker populations.
- Binary classification only: Does not accommodate non-binary, transgender, or intersex individuals. Suitable for cases where a binary routing signal is sufficient.
- 5 Western European languages: Not tested on tonal languages or non-European speech.
- Clean audio only: Not benchmarked under heavy noise or telephony compression.
Citation
@misc{bidus2026gender,
title = {A Sub-1MB Bi-LSTM Gender Classifier for Real-Time Voice Pipelines},
author = {Bidu{\'s}, Kamil},
year = {2026},
howpublished = {arXiv preprint},
note = {arXiv identifier to be added once published}
}
Paper: link will be added once the arXiv submission is public.