---
license: mit
datasets:
- ProgramComputer/voxceleb
- mozilla-foundation/common_voice_17_0
language:
- en
base_model:
- openai/whisper-base
tags:
- embedding
- audio
- speech
---
## Model Description
Hey there! This is `voice-embedder-base`, a model that generates speaker embeddings—compact vectors that capture unique vocal characteristics for tasks like speaker verification, clustering, or voice similarity retrieval. It’s built by fine-tuning the `openai/whisper-base` encoder with a contrastive learning approach, using a mix of triplet loss and NT-Xent loss to make embeddings robust and speaker-discriminative. Trained on English speech from Common Voice 17 and evaluated on both Common Voice and VoxCeleb2, it shines in clean studio settings, though accuracy drops noticeably in noisier recordings.
- **Developed by**: John Backsund
- **Model Type**: Speaker Embedding
- **Base Model**: `openai/whisper-base` encoder
- **Embedding Size**: 256
- **Training Data**: Dataset derived from the Common Voice 17 (en) train split
- **License**: MIT
## Intended Use
This model is great for:
- **Speaker Clustering**: Grouping audio samples by speaker (e.g., for diarization).
- **Speaker Verification**: Checking if two audio clips are from the same speaker.
- **Voice Retrieval**: Finding similar voices in a dataset.
It’s best for clean audio (like studio recordings) but can handle some noise, though performance drops in very noisy settings (e.g., crowded interviews).
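Voice retrieval boils down to a nearest-neighbour search over embeddings. As a minimal sketch (the `top_k_matches` helper is hypothetical, not part of any library), assuming query and gallery embeddings are already computed as NumPy arrays:

```python
import numpy as np

def top_k_matches(query_emb, gallery_embs, k=5):
    """Return indices of the k gallery embeddings closest to the
    query by L2 distance (smaller distance = more similar voice)."""
    dists = np.linalg.norm(gallery_embs - query_emb, axis=1)
    return np.argsort(dists)[:k]
```

Because the reported average same-speaker distance (~0.55 on Common Voice) sits well below the different-speaker distance (~1.36), the nearest neighbours of a query are usually clips of the same speaker.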
## How to Use
To use this model, you’ll need the `voice-finder` library, which includes the custom `VoiceEmbedder` and `VoiceEmbedderFeatureExtractor` classes. Install it from GitHub, then load the model and processor with `transformers`.
```bash
pip install git+https://github.com/johBac97/voice-finder.git
```
```python
import torch
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("johbac/voice-embedder-base")
model = AutoModel.from_pretrained("johbac/voice-embedder-base")

# Example: process audio and get embeddings
audio = [...]  # list of 1-D audio arrays sampled at 16 kHz
features = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    embeddings = model(**features)  # shape: [batch_size, 256]
```
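For speaker verification, a simple approach is to threshold the L2 distance between two embeddings. The helper below is a hypothetical sketch, not part of the `voice-finder` library; the 0.95 threshold is just the midpoint between the average same-speaker (~0.55) and different-speaker (~1.36) distances reported on Common Voice, and should be tuned on your own data:

```python
import numpy as np

def same_speaker(emb_a, emb_b, threshold=0.95):
    """Decide whether two embeddings come from the same speaker by
    thresholding their L2 distance; returns (decision, distance)."""
    dist = float(np.linalg.norm(np.asarray(emb_a) - np.asarray(emb_b)))
    return dist < threshold, dist
```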
## Training Details
- **Architecture**: Whisper encoder + MLP projector (512 → 256 dims).
- **Loss**: Combined hard-mining triplet loss (supervised) and NT-Xent loss (self-supervised).
- **Augmentations**: Gaussian noise (for NT-Xent).
- **Validation Datasets**:
  - Common Voice 17 (en) derived: 1,257 speakers (train); 6,270 samples, 2,090 speakers (dev).
  - VoxCeleb2 (en) derived: 12,756 noisy samples, 4,252 speakers (dev, filtered for English).
- **Preprocessing**: Audio resampled to 16kHz, processed with Whisper Feature Extractor, stored in Zarr archives.
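For reference, NT-Xent (normalized temperature-scaled cross-entropy) treats the clean and noise-augmented views of the same clip as a positive pair and every other clip in the batch as a negative. A minimal NumPy sketch, assuming row `i` of `z1` pairs with row `i` of `z2` and a hypothetical temperature of 0.5 (the actual training hyperparameters are in the report):

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent over two batches of paired embeddings, each shape [N, D]."""
    z = np.concatenate([z1, z2], axis=0)               # [2N, D]
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize for cosine sim
    sim = z @ z.T / tau                                # temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = z1.shape[0]
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index per row
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float((logsumexp - sim[np.arange(2 * n), pos]).mean())
```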
See the full [report](https://github.com/johBac97/voice-finder/blob/master/report.md) for details.
## Performance
Here’s how the model performs on the dev sets:
| Dataset | Top-1 Accuracy | Top-5 Accuracy | Equal Error Rate | Avg Same L2 Dist | Avg Diff L2 Dist |
|---------|----------------|----------------|------------------|------------------|------------------|
| Common Voice 17 (en) | 94.13% | 98.17% | 1.05% | 0.5456 | 1.3617 |
| VoxCeleb2 (en) | 14.21% | 22.87% | 18.20% | 0.8152 | 1.1514 |
- **Strengths**: Nails speaker identification in clean audio (Common Voice), with high accuracy and low EER.
- **Weaknesses**: Struggles with noisy audio (VoxCeleb2) due to limited training on real-world noise. Top-5 accuracy (~23%) is still way better than random (~0.1%).
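The equal error rate above is the operating point where the false-accept rate (impostor pairs accepted) equals the false-reject rate (genuine pairs rejected) as a distance threshold is swept. A sketch of how it can be computed from same-speaker and different-speaker distance arrays (the helper name is ours, not from the report):

```python
import numpy as np

def equal_error_rate(same_dists, diff_dists):
    """Sweep a distance threshold; return the average of FAR and FRR at
    the threshold where the two rates are closest."""
    thresholds = np.sort(np.concatenate([same_dists, diff_dists]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(diff_dists < t)    # impostor distances below threshold
        frr = np.mean(same_dists >= t)   # genuine distances at/above threshold
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)
```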
## Limitations
- Performs best on clean, studio-quality audio. Noisy environments (e.g., street interviews) reduce accuracy.
- Trained only on English speech, so performance on other languages is untested.
- Self-supervised loss (NT-Xent) didn’t boost performance as expected, possibly due to augmentations not matching real-world noise.
## Future Improvements
- Add augmentations for real-world noise (e.g., city sounds, background voices).
- Train on more diverse, noisy datasets to improve robustness.
## Citation
Inspired by [Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings](https://arxiv.org/html/2503.10446).