---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-model
- audio
- voice
- speech
- speaker
- speaker-diarization
- speaker-separation
- speech-separation
license: mit
inference: false
extra_gated_prompt: "The collected information will help us better understand the pyannote.audio user base and help its maintainers improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote."
extra_gated_fields:
Company/university: text
Website: text
---
Using this open-source model in production?
Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.
# 🎹 ToTaToNet / joint speaker diarization and speech separation
This model ingests 5 seconds of mono audio sampled at 16 kHz and outputs speaker diarization AND speech separation for up to 3 speakers.

It was trained by [Joonas Kalda](https://www.linkedin.com/in/joonas-kalda-996499133) with [pyannote.audio](https://github.com/pyannote/pyannote-audio) `3.3.0` on the [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) dataset (single distant microphone, SDM). The [paper](https://www.isca-archive.org/odyssey_2024/kalda24_odyssey.html) and its [companion repository](https://github.com/joonaskalda/PixIT) describe the approach in more detail.
## Requirements
1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.3.0` with `pip install pyannote.audio[separation]==3.3.0`
2. Accept [`pyannote/separation-ami-1.0`](https://hf.co/pyannote/separation-ami-1.0) user conditions
3. Create access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).
```python
from pyannote.audio import Model

model = Model.from_pretrained(
    "pyannote/separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
```
## Usage
```python
import torch

# the model ingests 5s of mono audio sampled at 16kHz ...
batch_size = 2  # any batch size works; 2 is just an example
duration = 5.0
num_channels = 1
sample_rate = 16000
num_samples = int(duration * sample_rate)
waveforms = torch.randn(batch_size, num_channels, num_samples)
waveforms.shape
# (batch_size, num_channels = 1, num_samples = 80000)

# ... and outputs both speaker diarization and speech separation
with torch.inference_mode():
    diarization, sources = model(waveforms)

diarization.shape
# (batch_size, num_frames = 624, max_num_speakers = 3)
# with values between 0 (speaker inactive) and 1 (speaker active)

sources.shape
# (batch_size, num_samples = 80000, max_num_speakers = 3)
```
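In practice, the random tensor above stands in for real audio. Below is a minimal sketch of feeding the model a 5-second excerpt of an actual file; `torchaudio` and the `audio.wav` path are assumptions for illustration, and any loader that yields a 16 kHz mono float tensor works just as well.
```python
import torch
import torchaudio  # assumption: any audio loading library would do

# load an audio file as a (num_channels, num_samples) tensor
waveform, sample_rate = torchaudio.load("audio.wav")  # "audio.wav" is a placeholder

# downmix to mono and resample to 16 kHz if needed
waveform = waveform.mean(dim=0, keepdim=True)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# keep the first 5 seconds and add a batch dimension: (1, 1, 80000)
chunk = waveform[:, : 5 * 16000].unsqueeze(0)

with torch.inference_mode():
    diarization, sources = model(chunk)
```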
## Limitations
This model cannot perform speaker diarization and speech separation of full recordings on its own, as it only processes 5s chunks. See the [pyannote/speech-separation-ami-1.0](https://hf.co/pyannote/speech-separation-ami-1.0) pipeline, which combines this model with an additional speaker embedding model to handle recordings of arbitrary length.
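For reference, here is a minimal sketch of using that pipeline on a full recording, assuming it follows the standard pyannote.audio `Pipeline` interface and returns a diarization plus separated sources; `audio.wav` is a placeholder, and the pipeline card documents the exact output types.
```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# process a full recording in one call (no manual 5s chunking needed);
# the (diarization, sources) return shape is an assumption based on the pipeline card
diarization, sources = pipeline("audio.wav")

# dump the diarization output to disk using the RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```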
## Citations
```bibtex
@inproceedings{Kalda24,
author={Joonas Kalda and Clément Pagés and Ricard Marxer and Tanel Alumäe and Hervé Bredin},
title={{PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings}},
year=2024,
booktitle={Proc. Odyssey 2024},
}
```
```bibtex
@inproceedings{Bredin23,
author={Hervé Bredin},
title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
}
```