Hervé BREDIN · commit 256e037 (parent d8c7e22) · doc: update README
- overlapped-speech-detection
- automatic-speech-recognition
license: cc-by-4.0
extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote user base and help maintainers improve it further. Though this pipeline uses the CC-BY-4.0 license and will always remain open-source, we will occasionally email you about premium pipelines and paid services around pyannote."
extra_gated_fields:
  Company/university: text
  Website: text
---

# `Community-1` speaker diarization

This pipeline ingests mono audio sampled at 16kHz and outputs speaker diarization as an [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance:

- stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
- audio files sampled at a different rate are automatically resampled to 16kHz upon loading.
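The channel-averaging behaviour can be sketched in plain Python (a toy illustration of sample-wise averaging, not the pipeline's actual loading code):

```python
def downmix_to_mono(channels):
    """Average any number of equal-length channels into one mono signal."""
    return [sum(samples) / len(samples) for samples in zip(*channels)]

left = [0.2, 0.4, -0.1]
right = [0.0, 0.2, 0.3]
mono = downmix_to_mono([left, right])  # one averaged sample per time step
```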

The main improvements brought by `Community-1` are:

- [improved](#benchmark) speaker assignment and counting
- simpler reconciliation with transcription timestamps, thanks to the new [*exclusive*](#exclusive-speaker-diarization) speaker diarization
- easy [offline use](#offline-use) (i.e. without internet connection)
- (optional) [hosting](https://hf.co/pyannote/speaker-diarization-community-1-cloud) on the pyannoteAI cloud

## Setup

1. `pip install pyannote.audio`
2. Accept user conditions
3. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens)

## Quick start

```python
# download the pipeline from Hugging Face
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token="{huggingface-token}")

# run the pipeline locally on your computer
output = pipeline("audio.wav")

# print the predicted speaker diarization
for turn, _, speaker in output.speaker_diarization.itertracks(yield_label=True):
    print(f"{speaker} speaks between t={turn.start:.3f}s and t={turn.end:.3f}s")
```
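`itertracks(yield_label=True)` yields `(segment, track, label)` triples, so the output can be post-processed like any Python iterable. As a toy illustration (hard-coded turns standing in for an actual pipeline output), total speaking time per speaker can be tallied as:

```python
from collections import defaultdict

# hypothetical (start, end, speaker) turns, as printed by the loop above
turns = [(0.0, 1.5, "SPEAKER_00"), (1.5, 3.0, "SPEAKER_01"), (3.5, 4.0, "SPEAKER_00")]

speaking_time = defaultdict(float)
for start, end, speaker in turns:
    speaking_time[speaker] += end - start

print(dict(speaking_time))  # {'SPEAKER_00': 2.0, 'SPEAKER_01': 1.5}
```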

## Benchmark

Out of the box, `Community-1` is much better than `speaker-diarization-3.1`.

We report [diarization error rates](http://pyannote.github.io/pyannote-metrics/reference.html#diarization) (in %) on a large collection of academic benchmarks (fully automatic processing, with no forgiveness collar and no skipping of overlapping speech).

| Benchmark (last updated in 2025-09) | <a href="https://hf.co/pyannote/speaker-diarization-3.1">3.1</a> | <a href="https://hf.co/pyannote/speaker-diarization-community-1">Community-1</a> | <a href="https://docs.pyannote.ai">Precision-2</a> |
| --- | --- | --- | --- |
| [AISHELL-4](https://arxiv.org/abs/2104.03603) | 12.2 | 11.7 | 11.8 |
| [AliMeeting](https://www.openslr.org/119/) (channel 1) | 24.5 | 20.3 | 16.3 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (IHM) | 18.8 | 17.0 | 13.2 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (SDM) | 22.7 | 19.9 | 15.8 |
| [AVA-AVD](https://arxiv.org/abs/2111.14448) | 49.7 | 44.6 | 40.7 |
| [CALLHOME](https://catalog.ldc.upenn.edu/LDC2001S97) ([part 2](https://github.com/BUTSpeechFIT/CALLHOME_sublists/issues/1)) | 28.5 | 26.7 | 17.6 |
| [DIHARD 3](https://catalog.ldc.upenn.edu/LDC2022S14) ([full](https://arxiv.org/abs/2012.01477)) | 21.4 | 20.2 | 15.7 |
| [Ego4D](https://arxiv.org/abs/2110.07058) (dev.) | 51.2 | 46.8 | 44.7 |
| [MSDWild](https://github.com/X-LANCE/MSDWILD) | 25.4 | 22.8 | 17.9 |
| [RAMC](https://www.openslr.org/123/) | 22.2 | 20.8 | 10.6 |
| [REPERE](https://www.islrn.org/resources/360-758-359-485-0/) (phase 2) | 7.9 | 8.9 | 7.3 |
| [VoxConverse](https://github.com/joonson/voxconverse) (v0.3) | 11.2 | 11.2 | 9.0 |
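For reference, the diarization error rate sums three error types over the whole file. A minimal sketch of the standard definition, with made-up toy durations (the numbers above were computed with `pyannote.metrics`, not this snippet):

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total_speech):
    """DER (%) = (false alarm + missed detection + speaker confusion) / total reference speech."""
    return 100 * (false_alarm + missed_detection + confusion) / total_speech

# toy durations, in seconds
der = diarization_error_rate(false_alarm=3.0, missed_detection=5.0, confusion=4.0, total_speech=60.0)
print(f"{der:.1f}%")  # 20.0%
```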

The `Precision-2` model is even better and can be tested in two steps:

1. Create an API key on the [pyannoteAI dashboard](https://dashboard.pyannote.ai) (free credits included)
2. Change one line of code

```diff
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
-    'pyannote/speaker-diarization-community-1', token="{huggingface-token}")
+    'pyannote/speaker-diarization-precision-2', token="{pyannoteAI-api-key}")
diarization = pipeline("audio.wav")  # runs on pyannoteAI servers
```

Pre-loading audio files in memory may result in faster processing:

```python
import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
output = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```

## Monitoring progress

Hooks are available to monitor the progress of the pipeline:

```python
from pyannote.audio.pipelines.utils.hook import ProgressHook

with ProgressHook() as hook:
    output = pipeline("audio.wav", hook=hook)
```

## Controlling the number of speakers

In case the number of speakers is known in advance, one can use the `num_speakers` option:

```python
output = pipeline("audio.wav", num_speakers=2)
```

One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:

```python
output = pipeline("audio.wav", min_speakers=2, max_speakers=5)
```

## Exclusive speaker diarization

On top of the regular speaker diarization, the `Community-1` pretrained pipeline returns a new *exclusive* speaker diarization, available as `output.exclusive_speaker_diarization`.

This feature is [backported from our latest commercial model](https://www.pyannote.ai/blog/precision-2) and simplifies the reconciliation between fine-grained speaker diarization timestamps and (sometimes not so precise) transcription timestamps.
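As a toy sketch of why this helps: with non-overlapping speaker turns, each transcribed word can be assigned to the single speaker active at its midpoint. Everything below is hard-coded for illustration; `assign_words` is a hypothetical helper, not part of the pyannote API:

```python
def assign_words(words, turns):
    """Assign each (word, start, end) to the turn covering its midpoint."""
    assigned = []
    for word, start, end in words:
        midpoint = (start + end) / 2
        speaker = next(
            (spk for turn_start, turn_end, spk in turns
             if turn_start <= midpoint < turn_end),
            None)
        assigned.append((word, speaker))
    return assigned

# exclusive (non-overlapping) speaker turns: (start, end, speaker)
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.5, "SPEAKER_01")]
# transcription output: (word, start, end)
words = [("hello", 0.1, 0.4), ("world", 1.8, 2.2), ("bye", 4.0, 4.4)]

assigned_words = assign_words(words, turns)
print(assigned_words)
# [('hello', 'SPEAKER_00'), ('world', 'SPEAKER_01'), ('bye', 'SPEAKER_01')]
```

Because the turns never overlap, each midpoint matches at most one turn, so the assignment is unambiguous.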

## Offline use

1. In the terminal, copy the pipeline on disk:

```bash
mkdir /path/to/directory

# when prompted for a password, use an access token with write permissions.
# generate one from your settings: https://huggingface.co/settings/tokens
git clone https://hf.co/pyannote/speaker-diarization-community-1 /path/to/directory/pyannote-speaker-diarization-community-1
```

2. In Python, use the pipeline without internet connection:

```python
# load pipeline from disk (works without internet connection)
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained('/path/to/directory/pyannote-speaker-diarization-community-1')

# run the pipeline locally on your computer
output = pipeline("audio.wav")
```

## Citations