---
license: mit
datasets:
- ProgramComputer/voxceleb
- mozilla-foundation/common_voice_17_0
language:
- en
base_model:
- openai/whisper-base
tags:
- embedding
- audio
- speech
---
|
|
|
|
|
## Model Description |
|
|
Hey there! This is `voice-embedder-base`, a model that generates speaker embeddings: compact vectors that capture a speaker's unique vocal characteristics for tasks like speaker verification, clustering, or voice-similarity retrieval. It was built by fine-tuning the `openai/whisper-base` encoder with a contrastive objective combining triplet loss and NT-Xent loss, to make the embeddings robust and speaker-discriminative. Trained on English speech from Common Voice 17 and evaluated on both Common Voice and VoxCeleb2, it shines on clean studio-quality audio and holds its own in moderately noisy environments too.
|
|
|
|
|
- **Developed by**: John Backsund |
|
|
- **Model Type**: Speaker Embedding |
|
|
- **Base Model**: `openai/whisper-base` encoder |
|
|
- **Embedding Size**: 256 |
|
|
- **Training Data**: Dataset derived from the Common Voice 17 (en) train split
|
|
- **License**: MIT |
|
|
|
|
|
## Intended Use |
|
|
This model is great for: |
|
|
- **Speaker Clustering**: Grouping audio samples by speaker (e.g., for diarization). |
|
|
- **Speaker Verification**: Checking if two audio clips are from the same speaker. |
|
|
- **Voice Retrieval**: Finding similar voices in a dataset. |
|
|
|
|
|
It’s best for clean audio (like studio recordings) but can handle some noise, though performance drops in very noisy settings (e.g., crowded interviews). |
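For speaker verification, a pair of embeddings (see "How to Use" below) can be compared with a simple distance threshold. A minimal sketch; the 0.95 default is illustrative, roughly midway between the average same-speaker (~0.55) and different-speaker (~1.36) L2 distances reported in the Performance section, not a tuned value:

```python
import torch

def same_speaker(emb_a, emb_b, threshold=0.95):
    """Accept the pair if the L2 distance between the two embeddings
    falls below the threshold. The 0.95 default is illustrative, not tuned."""
    return torch.dist(emb_a, emb_b).item() < threshold
```

In practice you would tune the threshold on held-out data for your target false-accept/false-reject trade-off.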
|
|
|
|
|
## How to Use |
|
|
To use this model, you’ll need the `voice-finder` library, which includes the custom `VoiceEmbedder` and `VoiceEmbedderFeatureExtractor` classes. Install it from GitHub, then load the model and processor with `transformers`. |
|
|
|
|
|
```bash
pip install git+https://github.com/johBac97/voice-finder.git
```
|
|
|
|
|
```python
import numpy as np
from transformers import AutoProcessor, AutoModel

# Requires the voice-finder package, which provides the custom
# VoiceEmbedder and VoiceEmbedderFeatureExtractor classes.
processor = AutoProcessor.from_pretrained("johbac/voice-embedder-base")
model = AutoModel.from_pretrained("johbac/voice-embedder-base")

# Example: process audio and get embeddings
audio = [np.random.randn(16000).astype(np.float32)]  # replace with real 16 kHz mono audio arrays
features = processor(audio, sampling_rate=16000, return_tensors="pt")
embeddings = model(**features)  # Shape: [batch_size, 256]
```
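With embeddings in hand, voice retrieval is a nearest-neighbor search over a gallery of previously embedded clips. A minimal sketch with random stand-in tensors (the shapes assume the 256-dim embeddings produced above):

```python
import torch

gallery = torch.randn(100, 256)  # stand-in for embeddings of an indexed audio collection
query = torch.randn(1, 256)      # stand-in for the query clip's embedding

dists = torch.cdist(query, gallery)          # pairwise L2 distances, shape [1, 100]
top5 = dists.topk(5, largest=False).indices  # indices of the 5 most similar voices
```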
|
|
|
|
|
## Training Details |
|
|
- **Architecture**: Whisper encoder + MLP projector (512 → 256 dims). |
|
|
- **Loss**: Combined hard-mining triplet loss (supervised) and NT-Xent loss (self-supervised).
|
|
- **Augmentations**: Gaussian noise (for NT-Xent). |
|
|
- **Datasets**:
  - Derived from Common Voice 17 (en): 1,257 speakers (train); 6,270 samples, 2,090 speakers (dev).
  - Derived from VoxCeleb2 (en): 12,756 noisy samples, 4,252 speakers (dev, filtered for English).
|
|
- **Preprocessing**: Audio resampled to 16kHz, processed with Whisper Feature Extractor, stored in Zarr archives. |
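The combined objective can be sketched as follows. The batch-hard mining strategy, margin, temperature, and equal weighting of the two terms are illustrative assumptions here, not values from the report:

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet(emb, labels, margin=0.3):
    """Hard-mining triplet loss: for each anchor, take its farthest
    same-speaker embedding and nearest different-speaker embedding."""
    d = torch.cdist(emb, emb)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = d.masked_fill(~same, 0.0).max(dim=1).values
    hardest_neg = d.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent over paired views (e.g., clean vs. Gaussian-noise-augmented):
    each embedding's positive is its counterpart in the other view."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # [2N, D], unit-normalized
    sim = z @ z.t() / temperature                  # cosine similarity logits
    sim.fill_diagonal_(float("-inf"))              # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)
```

During training the two terms would be summed (here with an assumed 1:1 weighting) into a single loss.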
|
|
|
|
|
See the full [report](https://github.com/johBac97/voice-finder/blob/master/report.md) for details. |
|
|
|
|
|
## Performance |
|
|
Here’s how the model performs on the dev sets: |
|
|
|
|
|
| Dataset | Top-1 Accuracy | Top-5 Accuracy | Equal Error Rate | Avg Same-Speaker L2 Dist | Avg Diff-Speaker L2 Dist |
|---------|----------------|----------------|------------------|--------------------------|--------------------------|
| Common Voice 17 (en) | 94.13% | 98.17% | 1.05% | 0.5456 | 1.3617 |
| VoxCeleb2 (en) | 14.21% | 22.87% | 18.20% | 0.8152 | 1.1514 |
|
|
|
|
|
- **Strengths**: Nails speaker identification in clean audio (Common Voice), with high accuracy and low EER. |
|
|
- **Weaknesses**: Struggles with noisy audio (VoxCeleb2), reflecting limited exposure to real-world noise during training. That said, top-5 accuracy (~23%) is still far better than random guessing (~0.1% with ~4,250 speakers).
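For reference, the equal error rate is the operating point where the false-accept and false-reject rates coincide. Given arrays of same-speaker and different-speaker distance scores (the model's actual score sets are not included here), it can be computed as:

```python
import torch

def equal_error_rate(same_dists, diff_dists):
    """Sweep thresholds over the observed scores and return the rate at the
    point where false-accept and false-reject rates are closest."""
    thresholds = torch.sort(torch.cat([same_dists, diff_dists])).values
    far = torch.stack([(diff_dists < t).float().mean() for t in thresholds])  # impostors accepted
    frr = torch.stack([(same_dists >= t).float().mean() for t in thresholds])  # genuine pairs rejected
    i = torch.argmin((far - frr).abs())
    return ((far[i] + frr[i]) / 2).item()
```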
|
|
|
|
|
## Limitations |
|
|
- Performs best on clean, studio-quality audio. Noisy environments (e.g., street interviews) reduce accuracy. |
|
|
- Trained only on English speech, so performance on other languages is untested. |
|
|
- Self-supervised loss (NT-Xent) didn’t boost performance as expected, possibly due to augmentations not matching real-world noise. |
|
|
|
|
|
## Future Improvements |
|
|
- Add augmentations for real-world noise (e.g., city sounds, background voices). |
|
|
- Train on more diverse, noisy datasets to improve robustness. |
|
|
|
|
|
## Citation |
|
|
Inspired by [Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings](https://arxiv.org/html/2503.10446). |