---
license: mit
datasets:
- ProgramComputer/voxceleb
- mozilla-foundation/common_voice_17_0
language:
- en
base_model:
- openai/whisper-base
tags:
- embedding
- audio
- speech
---

## Model Description

Hey there! This is `voice-embedder-base`, a model that generates speaker embeddings: compact vectors that capture unique vocal characteristics for tasks like speaker verification, clustering, or voice similarity retrieval. It's built by fine-tuning the `openai/whisper-base` encoder with a contrastive learning approach, using a mix of triplet loss and NT-Xent loss to make the embeddings robust and speaker-discriminative. Trained on English speech from the Common Voice 17 and VoxCeleb2 datasets, it shines in clean studio settings but holds its own in noisier environments too.

- **Developed by**: John Backsund
- **Model Type**: Speaker Embedding
- **Base Model**: `openai/whisper-base` encoder
- **Embedding Size**: 256
- **Training Data**: Dataset derived from the Common Voice 17 (en) train split
- **License**: MIT

## Intended Use

This model is great for:

- **Speaker Clustering**: Grouping audio samples by speaker (e.g., for diarization).
- **Speaker Verification**: Checking whether two audio clips are from the same speaker.
- **Voice Retrieval**: Finding similar voices in a dataset.

It's best for clean audio (like studio recordings) but can handle some noise, though performance drops in very noisy settings (e.g., crowded interviews).

## How to Use

To use this model, you'll need the `voice-finder` library, which includes the custom `VoiceEmbedder` and `VoiceEmbedderFeatureExtractor` classes. Install it from GitHub, then load the model and processor with `transformers`:

```bash
pip install git+https://github.com/johBac97/voice-finder.git
```

```python
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("johbac/voice-embedder-base")
model = AutoModel.from_pretrained("johbac/voice-embedder-base")

# Example: process audio and get embeddings
audio = [...]  # list of audio arrays sampled at 16 kHz
features = processor(audio, sampling_rate=16000, return_tensors="pt")
embeddings = model(**features)  # shape: [batch_size, 256]
```

## Training Details

- **Architecture**: Whisper encoder + MLP projector (512 → 256 dims).
- **Loss**: Combined hard-mining triplet loss (supervised) + NT-Xent loss.
- **Augmentations**: Gaussian noise (for NT-Xent).
- **Validation Datasets**:
  - Common Voice 17 (en) derived: 1,257 speakers (train); 6,270 samples, 2,090 speakers (dev).
  - VoxCeleb2 (en) derived: 12,756 noisy samples, 4,252 speakers (dev, filtered for English).
- **Preprocessing**: Audio resampled to 16 kHz, processed with the Whisper feature extractor, stored in Zarr archives.

See the full [report](https://github.com/johBac97/voice-finder/blob/master/report.md) for details.

## Performance

Here's how the model performs on the dev sets:

| Dataset | Top-1 Accuracy | Top-5 Accuracy | Equal Error Rate | Avg Same L2 Dist | Avg Diff L2 Dist |
|---------|----------------|----------------|------------------|------------------|------------------|
| Common Voice 17 (en) | 94.13% | 98.17% | 1.05% | 0.5456 | 1.3617 |
| VoxCeleb2 (en) | 14.21% | 22.87% | 18.20% | 0.8152 | 1.1514 |

- **Strengths**: Nails speaker identification in clean audio (Common Voice), with high accuracy and low EER.
- **Weaknesses**: Struggles with noisy audio (VoxCeleb2) due to limited training on real-world noise. Top-5 accuracy (~23%) is still far better than random (~0.1%).

## Limitations

- Performs best on clean, studio-quality audio. Noisy environments (e.g., street interviews) reduce accuracy.
- Trained only on English speech, so performance on other languages is untested.
- The self-supervised loss (NT-Xent) didn't boost performance as expected, possibly because the Gaussian-noise augmentation doesn't match real-world noise.

## Future Improvements

- Add augmentations for real-world noise (e.g., city sounds, background voices).
- Train on more diverse, noisy datasets to improve robustness.
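As a rough sketch of the verification use case, you can threshold the L2 distance between two embeddings somewhere between the reported same-speaker (~0.55) and different-speaker (~1.36) averages on Common Voice. The snippet below is not part of this repo: `same_speaker` is a hypothetical helper, and synthetic unit vectors stand in for real model outputs. Tune the threshold on your own data.

```python
import numpy as np

def same_speaker(emb_a, emb_b, threshold=0.95):
    """Hypothetical helper: call it 'same speaker' when the L2 distance
    falls below a threshold sitting between the reported same-speaker
    (~0.55) and different-speaker (~1.36) averages on Common Voice."""
    return float(np.linalg.norm(emb_a - emb_b)) < threshold

def unit(v):
    return v / np.linalg.norm(v)

# Synthetic unit vectors standing in for real 256-dim embeddings.
rng = np.random.default_rng(0)
anchor = unit(rng.normal(size=256))
close = unit(anchor + rng.normal(scale=0.02, size=256))  # small perturbation of anchor
far = unit(rng.normal(size=256))                         # unrelated voice

print(same_speaker(anchor, close))  # True
print(same_speaker(anchor, far))    # False
```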
## Citation

Inspired by [Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings](https://arxiv.org/html/2503.10446).
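For the speaker-clustering use case, one common recipe (a sketch, not part of this repo) is average-linkage agglomerative clustering, cutting the dendrogram at an L2 distance threshold between the reported same- and different-speaker averages. This example uses SciPy's hierarchical clustering on synthetic embeddings in place of real model outputs.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_speakers(embeddings, threshold=0.95):
    """Group embeddings by speaker: average-linkage agglomerative
    clustering, cut at an L2 distance threshold (hypothetical helper)."""
    tree = linkage(embeddings, method="average", metric="euclidean")
    return fcluster(tree, t=threshold, criterion="distance")

# Synthetic stand-ins: three "speakers", four noisy samples each.
rng = np.random.default_rng(1)
centers = rng.normal(size=(3, 256))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
samples = np.repeat(centers, 4, axis=0) + rng.normal(scale=0.02, size=(12, 256))

labels = cluster_speakers(samples)
print(labels)  # each speaker's four samples share one cluster id
```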