---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-model
- audio
- voice
- speech
- speaker
- speaker-diarization
- speaker-separation
- speech-separation
license: mit
inference: false
extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote."
extra_gated_fields:
  Company/university: text
  Website: text
---

Using this open-source model in production?
Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.

# 🎹 ToTaToNet / joint speaker diarization and speech separation

This model ingests 5 seconds of mono audio sampled at 16 kHz and outputs both speaker diarization and speech separation for up to 3 speakers.

![](example.png)

It was trained by [Joonas Kalda](https://www.linkedin.com/in/joonas-kalda-996499133) with [pyannote.audio](https://github.com/pyannote/pyannote-audio) `3.3.0` on the [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) dataset (single distant microphone, SDM). The [paper](https://www.isca-archive.org/odyssey_2024/kalda24_odyssey.html) and its [companion repository](https://github.com/joonaskalda/PixIT) describe the approach in more detail.

## Requirements

1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.3.0` with `pip install pyannote.audio[separation]==3.3.0`
2. Accept [`pyannote/separation-ami-1.0`](https://hf.co/pyannote/separation-ami-1.0) user conditions
3. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens)

```python
from pyannote.audio import Model

model = Model.from_pretrained(
    "pyannote/separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
```
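
Since the model is a regular PyTorch module under the hood, it can be moved to a GPU before inference; a minimal sketch (not part of the original card, assuming standard PyTorch device handling):

```python
import torch

# optional: run inference on GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```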

## Usage

```python
import torch

# model ingests 5s of mono audio sampled at 16kHz...
duration = 5.0
num_channels = 1
sample_rate = 16000

batch_size = 1  # any batch size works; 1 is used here for illustration
num_samples = int(duration * sample_rate)

waveforms = torch.randn(batch_size, num_channels, num_samples)
waveforms.shape
# (batch_size, num_channels = 1, num_samples = 80000)

# ... and outputs both speaker diarization and separation
with torch.inference_mode():
    diarization, sources = model(waveforms)

diarization.shape
# (batch_size, num_frames = 624, max_num_speakers = 3)
# with values between 0 (speaker inactive) and 1 (speaker active)

sources.shape
# (batch_size, num_samples = 80000, max_num_speakers = 3)
```
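
To listen to the separated sources, each one can be written to its own audio file; a minimal sketch (not from the original card), assuming `torchaudio` is installed:

```python
import torchaudio

# write each separated source of the first batch item to its own WAV file
for speaker in range(sources.shape[2]):
    # sources[0, :, speaker] has shape (num_samples,);
    # torchaudio.save expects (num_channels, num_samples)
    source = sources[0, :, speaker].unsqueeze(0)
    torchaudio.save(f"speaker_{speaker}.wav", source, sample_rate=16000)
```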

## Limitations

This model only processes 5s chunks and therefore cannot perform speaker diarization and speech separation of full recordings on its own: see the [pyannote/speech-separation-ami-1.0](https://hf.co/pyannote/speech-separation-ami-1.0) pipeline, which combines it with an additional speaker embedding model to do that, as sketched below.
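
A minimal sketch of using that pipeline on a full recording (not from this card; it assumes the pipeline follows the usual `Pipeline.from_pretrained` API of pyannote.audio, and `"audio.wav"` is a placeholder path):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# the pipeline handles chunking the recording and stitching results back
# together internally (assumption based on the pipeline's stated purpose)
diarization, sources = pipeline("audio.wav")
```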

## Citations

```bibtex
@inproceedings{Kalda24,
  author={Joonas Kalda and Clément Pagés and Ricard Marxer and Tanel Alumäe and Hervé Bredin},
  title={{PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings}},
  year={2024},
  booktitle={Proc. Odyssey 2024},
}
```

```bibtex
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year={2023},
  booktitle={Proc. INTERSPEECH 2023},
}
```