---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-model
- audio
- voice
- speech
- speaker
- speaker-diarization
- speaker-separation
- speech-separation
license: mit
inference: false
extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote."
extra_gated_fields:
  Company/university: text
  Website: text
---

Using this open-source model in production?
Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.

# 🎹 ToTaToNet / joint speaker diarization and speech separation

This model ingests 5 seconds of mono audio sampled at 16 kHz and outputs both speaker diarization and speech separation for up to 3 speakers.

![](example.png)

It was trained by [Joonas Kalda](https://www.linkedin.com/in/joonas-kalda-996499133) with [pyannote.audio](https://github.com/pyannote/pyannote-audio) `3.3.0` on the [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) dataset (single distant microphone, SDM). The [paper](https://www.isca-archive.org/odyssey_2024/kalda24_odyssey.html) and its [companion repository](https://github.com/joonaskalda/PixIT) describe the approach in more detail.

## Requirements

1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.3.0` with `pip install pyannote.audio[separation]==3.3.0`
2. Accept [`pyannote/separation-ami-1.0`](https://hf.co/pyannote/separation-ami-1.0) user conditions
3. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens)

```python
from pyannote.audio import Model

model = Model.from_pretrained(
    "pyannote/separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
```
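
Since the model is a regular PyTorch module under the hood, it can be moved to a GPU before inference; a minimal sketch (not part of the original card, assuming standard PyTorch device handling):

```python
import torch

# optional: run inference on GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```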

## Usage

```python
import torch

# model ingests 5s of mono audio sampled at 16kHz...
duration = 5.0
num_channels = 1
sample_rate = 16000

batch_size = 1  # any batch size works; 1 is used here for illustration
num_samples = int(duration * sample_rate)

waveforms = torch.randn(batch_size, num_channels, num_samples)
waveforms.shape
# (batch_size, num_channels = 1, num_samples = 80000)

# ... and outputs both speaker diarization and separation
with torch.inference_mode():
    diarization, sources = model(waveforms)

diarization.shape
# (batch_size, num_frames = 624, max_num_speakers = 3)
# with values between 0 (speaker inactive) and 1 (speaker active)

sources.shape
# (batch_size, num_samples = 80000, max_num_speakers = 3)
```
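
To listen to the separated sources, each one can be written to its own audio file; a minimal sketch (not from the original card), assuming `torchaudio` is installed:

```python
import torchaudio

# write each separated source of the first batch item to its own WAV file
for speaker in range(sources.shape[2]):
    # sources[0, :, speaker] has shape (num_samples,);
    # torchaudio.save expects (num_channels, num_samples)
    source = sources[0, :, speaker].unsqueeze(0)
    torchaudio.save(f"speaker_{speaker}.wav", source, sample_rate=16000)
```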

## Limitations

This model only processes 5s chunks and therefore cannot perform speaker diarization and speech separation of full recordings on its own: see the [pyannote/speech-separation-ami-1.0](https://hf.co/pyannote/speech-separation-ami-1.0) pipeline, which combines it with an additional speaker embedding model to do that, as sketched below.
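
A minimal sketch of using that pipeline on a full recording (not from this card; it assumes the pipeline follows the usual `Pipeline.from_pretrained` API of pyannote.audio, and `"audio.wav"` is a placeholder path):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# the pipeline handles chunking the recording and stitching results back
# together internally (assumption based on the pipeline's stated purpose)
diarization, sources = pipeline("audio.wav")
```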

## Citations

```bibtex
@inproceedings{Kalda24,
  author={Joonas Kalda and Clément Pagés and Ricard Marxer and Tanel Alumäe and Hervé Bredin},
  title={{PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings}},
  year={2024},
  booktitle={Proc. Odyssey 2024},
}
```

```bibtex
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year={2023},
  booktitle={Proc. INTERSPEECH 2023},
}
```