---
license: mit
datasets:
- ProgramComputer/voxceleb
- mozilla-foundation/common_voice_17_0
language:
- en
base_model:
- openai/whisper-base
tags:
- embedding
- audio
- speech
---
|
|
|
|
|
## Model Description |
|
|
Hey there! This is `voice-embedder-base`, a model that generates speaker embeddings: compact vectors that capture a speaker's unique vocal characteristics for tasks like speaker verification, clustering, or voice-similarity retrieval. It was built by fine-tuning the `openai/whisper-base` encoder with a contrastive objective combining triplet loss and NT-Xent loss, to make the embeddings robust and speaker-discriminative. Trained on English speech from Common Voice 17 and evaluated on both Common Voice and VoxCeleb2, it shines on clean studio-quality audio and holds its own in moderately noisy environments too.
|
|
|
|
|
- **Developed by**: John Backsund |
|
|
- **Model Type**: Speaker Embedding |
|
|
- **Base Model**: `openai/whisper-base` encoder |
|
|
- **Embedding Size**: 256 |
|
|
- **Training Data**: Dataset derived from the Common Voice 17 (en) train split
|
|
- **License**: MIT |
|
|
|
|
|
## Intended Use |
|
|
This model is great for: |
|
|
- **Speaker Clustering**: Grouping audio samples by speaker (e.g., for diarization). |
|
|
- **Speaker Verification**: Checking if two audio clips are from the same speaker. |
|
|
- **Voice Retrieval**: Finding similar voices in a dataset. |
|
|
|
|
|
It’s best for clean audio (like studio recordings) but can handle some noise, though performance drops in very noisy settings (e.g., crowded interviews). |
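For speaker verification, a pair of embeddings (see "How to Use" below) can be compared with a simple distance threshold. A minimal sketch; the 0.95 default is illustrative, roughly midway between the average same-speaker (~0.55) and different-speaker (~1.36) L2 distances reported in the Performance section, not a tuned value:

```python
import torch

def same_speaker(emb_a, emb_b, threshold=0.95):
    """Accept the pair if the L2 distance between the two embeddings
    falls below the threshold. The 0.95 default is illustrative, not tuned."""
    return torch.dist(emb_a, emb_b).item() < threshold
```

In practice you would tune the threshold on held-out data for your target false-accept/false-reject trade-off.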
|
|
|
|
|
## How to Use |
|
|
To use this model, you’ll need the `voice-finder` library, which includes the custom `VoiceEmbedder` and `VoiceEmbedderFeatureExtractor` classes. Install it from GitHub, then load the model and processor with `transformers`. |
|
|
|
|
|
```bash
pip install git+https://github.com/johBac97/voice-finder.git
```
|
|
|
|
|
```python
import numpy as np
from transformers import AutoProcessor, AutoModel

# Requires the voice-finder package, which provides the custom
# VoiceEmbedder and VoiceEmbedderFeatureExtractor classes.
processor = AutoProcessor.from_pretrained("johbac/voice-embedder-base")
model = AutoModel.from_pretrained("johbac/voice-embedder-base")

# Example: process audio and get embeddings
audio = [np.random.randn(16000).astype(np.float32)]  # replace with real 16 kHz mono audio arrays
features = processor(audio, sampling_rate=16000, return_tensors="pt")
embeddings = model(**features)  # Shape: [batch_size, 256]
```
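With embeddings in hand, voice retrieval is a nearest-neighbor search over a gallery of previously embedded clips. A minimal sketch with random stand-in tensors (the shapes assume the 256-dim embeddings produced above):

```python
import torch

gallery = torch.randn(100, 256)  # stand-in for embeddings of an indexed audio collection
query = torch.randn(1, 256)      # stand-in for the query clip's embedding

dists = torch.cdist(query, gallery)          # pairwise L2 distances, shape [1, 100]
top5 = dists.topk(5, largest=False).indices  # indices of the 5 most similar voices
```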
|
|
|
|
|
## Training Details |
|
|
- **Architecture**: Whisper encoder + MLP projector (512 → 256 dims). |
|
|
- **Loss**: Combined hard-mining triplet loss (supervised) and NT-Xent loss (self-supervised).
|
|
- **Augmentations**: Gaussian noise (for NT-Xent). |
|
|
- **Datasets**:
  - Derived from Common Voice 17 (en): 1,257 speakers (train); 6,270 samples, 2,090 speakers (dev).
  - Derived from VoxCeleb2 (en): 12,756 noisy samples, 4,252 speakers (dev, filtered for English).
|
|
- **Preprocessing**: Audio resampled to 16kHz, processed with Whisper Feature Extractor, stored in Zarr archives. |
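The combined objective can be sketched as follows. The batch-hard mining strategy, margin, temperature, and equal weighting of the two terms are illustrative assumptions here, not values from the report:

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet(emb, labels, margin=0.3):
    """Hard-mining triplet loss: for each anchor, take its farthest
    same-speaker embedding and nearest different-speaker embedding."""
    d = torch.cdist(emb, emb)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = d.masked_fill(~same, 0.0).max(dim=1).values
    hardest_neg = d.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent over paired views (e.g., clean vs. Gaussian-noise-augmented):
    each embedding's positive is its counterpart in the other view."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # [2N, D], unit-normalized
    sim = z @ z.t() / temperature                  # cosine similarity logits
    sim.fill_diagonal_(float("-inf"))              # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)
```

During training the two terms would be summed (here with an assumed 1:1 weighting) into a single loss.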
|
|
|
|
|
See the full [report](https://github.com/johBac97/voice-finder/blob/master/report.md) for details. |
|
|
|
|
|
## Performance |
|
|
Here’s how the model performs on the dev sets: |
|
|
|
|
|
| Dataset | Top-1 Accuracy | Top-5 Accuracy | Equal Error Rate | Avg Same-Speaker L2 Dist | Avg Diff-Speaker L2 Dist |
|---------|----------------|----------------|------------------|--------------------------|--------------------------|
| Common Voice 17 (en) | 94.13% | 98.17% | 1.05% | 0.5456 | 1.3617 |
| VoxCeleb2 (en) | 14.21% | 22.87% | 18.20% | 0.8152 | 1.1514 |
|
|
|
|
|
- **Strengths**: Nails speaker identification in clean audio (Common Voice), with high accuracy and low EER. |
|
|
- **Weaknesses**: Struggles with noisy audio (VoxCeleb2), reflecting limited exposure to real-world noise during training. That said, top-5 accuracy (~23%) is still far better than random guessing (~0.1% with ~4,250 speakers).
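For reference, the equal error rate is the operating point where the false-accept and false-reject rates coincide. Given arrays of same-speaker and different-speaker distance scores (the model's actual score sets are not included here), it can be computed as:

```python
import torch

def equal_error_rate(same_dists, diff_dists):
    """Sweep thresholds over the observed scores and return the rate at the
    point where false-accept and false-reject rates are closest."""
    thresholds = torch.sort(torch.cat([same_dists, diff_dists])).values
    far = torch.stack([(diff_dists < t).float().mean() for t in thresholds])  # impostors accepted
    frr = torch.stack([(same_dists >= t).float().mean() for t in thresholds])  # genuine pairs rejected
    i = torch.argmin((far - frr).abs())
    return ((far[i] + frr[i]) / 2).item()
```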
|
|
|
|
|
## Limitations |
|
|
- Performs best on clean, studio-quality audio. Noisy environments (e.g., street interviews) reduce accuracy. |
|
|
- Trained only on English speech, so performance on other languages is untested. |
|
|
- Self-supervised loss (NT-Xent) didn’t boost performance as expected, possibly due to augmentations not matching real-world noise. |
|
|
|
|
|
## Future Improvements |
|
|
- Add augmentations for real-world noise (e.g., city sounds, background voices). |
|
|
- Train on more diverse, noisy datasets to improve robustness. |
|
|
|
|
|
## Citation |
|
|
Inspired by [Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings](https://arxiv.org/html/2503.10446). |