Hervé BREDIN committed on
Commit 256e037 · 1 Parent(s): d8c7e22

doc: update README

Files changed (1)
  1. README.md +49 -57
README.md CHANGED
@@ -13,39 +13,31 @@ tags:
  - overlapped-speech-detection
  - automatic-speech-recognition
  license: cc-by-4.0
- extra_gated_prompt: "The collected information will help acquire a better knowledge of pyannote.audio user base and help maintainers improve it further. Though this pipeline uses CC-BY-4.0 license and will always remain open-source, we will occasionnally email you about premium pipelines and paid services around pyannote."
  extra_gated_fields:
  Company/university: text
  Website: text
  ---

- <p align="center">
- <a href="https://pyannote.ai/" target="blank"><img src="https://avatars.githubusercontent.com/u/162698670" width="64" /></a>
- </p>
-
- <div align="center">
- <h1>Speaker diarization 4.0</h1>
- </div>
-
- <p align="center">
- <img src="diarization.gif"/>
- </p>
 
  This pipeline ingests mono audio sampled at 16kHz and outputs speaker diarization as an [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance:

  - stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
  - audio files sampled at a different rate are resampled to 16kHz automatically upon loading.

- The main improvements brought by 4.0 over previous version 3.1 are

- - much [better](#benchmark) speaker counting and assignment
- - much easier [offline use](#offline-use) (i.e. without internet connection)

  ## Setup

- 1. Accept user conditions
- 2. Create access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).
- 3. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `>=4.0.0` with `pip install pyannote.audio`

  ## Quick start
 
@@ -53,54 +45,48 @@ The main improvements brought by 4.0 over previous version 3.1 are
  ```python
  # download the pipeline from Huggingface
  from pyannote.audio import Pipeline
  pipeline = Pipeline.from_pretrained(
-     "pyannote/speaker-diarization-4.0", token="{huggingface-token}")

  # run the pipeline locally on your computer
- diarization = pipeline("audio.wav")

  # print the predicted speaker diarization
- for turn, _, speaker in diarization.itertracks(yield_label=True):
      print(f"{speaker} speaks between t={turn.start:.3f}s and t={turn.end:.3f}s")
  ```
 
  ## Benchmark

- Out of the box, `pyannote.audio` speaker diarization pipeline v4.0 is expected to be much better than v3.1.

  We report [diarization error rates](http://pyannote.github.io/pyannote-metrics/reference.html#diarization) (in %) on a large collection of academic benchmarks (fully automatic processing, no forgiveness collar, nor skipping overlapping speech).

- | Benchmark (last updated in 2025-08) | <a href="https://hf.co/pyannote/speaker-diarization-3.1"><img src="https://avatars.githubusercontent.com/u/7559051" width="32" /><br/>v3.1</a> | <a href="https://hf.co/pyannote/speaker-diarization-4.0"><img src="https://avatars.githubusercontent.com/u/7559051" width="32" /><br/>v4.0</a> | <a href="https://docs.pyannote.ai"><img src="https://avatars.githubusercontent.com/u/162698670" width="32" /><br/>API</a> | <a href="https://docs.pyannote.ai"><img src="https://avatars.githubusercontent.com/u/162698670" width="32" /><br/>labs</a> |
- | --- | --- | --- | --- | --- |
- | [AISHELL-4](https://arxiv.org/abs/2104.03603) | 12.2 | 11.7 | 11.8 | 11.4 |
- | [AliMeeting](https://www.openslr.org/119/) (channel 1) | 24.5 | 20.3 | 16.3 | 15.2 |
- | [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (IHM) | 18.8 | 17.0 | 13.2 | 12.9 |
- | [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (SDM) | 22.7 | 19.9 | 15.8 | 15.6 |
- | [AVA-AVD](https://arxiv.org/abs/2111.14448) | 49.7 | 44.6 | 40.7 | 37.1 |
- | [CALLHOME](https://catalog.ldc.upenn.edu/LDC2001S97) ([part 2](https://github.com/BUTSpeechFIT/CALLHOME_sublists/issues/1)) | 28.5 | 26.7 | 17.6 | 16.6 |
- | [DIHARD 3](https://catalog.ldc.upenn.edu/LDC2022S14) ([full](https://arxiv.org/abs/2012.01477)) | 21.4 | 20.2 | 15.7 | 14.7 |
- | [Ego4D](https://arxiv.org/abs/2110.07058) (dev.) | 51.2 | 46.8 | 44.7 | 39.0 |
- | [MSDWild](https://github.com/X-LANCE/MSDWILD) | 25.4 | 22.8 | 17.9 | 17.3 |
- | [RAMC](https://www.openslr.org/123/) | 22.2 | 20.8 | 10.6 | 10.5 |
- | [REPERE](https://www.islrn.org/resources/360-758-359-485-0/) (phase2) | 7.9 | 8.9 | 7.3 | 7.4 |
- | [VoxConverse](https://github.com/joonson/voxconverse) (v0.3) | 11.2 | 11.2 | 9.0 | 8.5 |
-
- We also report processing speed on a NVIDIA H100 80GB HBM3:
-
- | Benchmark (last updated in 2025-08) | <img src="https://avatars.githubusercontent.com/u/7559051" width="32" /> | <a href="https://docs.pyannote.ai"><img src="https://avatars.githubusercontent.com/u/162698670" width="32" /></a> | Speed up |
- | --- | --- | --- | --- |
- | [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (IHM), ~1h files | 31s per hour of audio | 14s per hour of audio | 2.2x faster |
- | [DIHARD 3](https://catalog.ldc.upenn.edu/LDC2022S14) ([full](https://arxiv.org/abs/2012.01477)), ~5min files | 37s per hour of audio | 14s per hour of audio | 2.6x faster |
-
- `pyannoteAI` premium models are even better (and also 2x faster). `labs` model is currently in private beta.
-
- 1. Create pyannoteAI API key at [`dashboard.pyannote.ai`](https://dashboard.pyannote.ai)
- 2. Enjoy 150 hours of free credits by changing one single line of code!

  ```diff
  from pyannote.audio import Pipeline
  pipeline = Pipeline.from_pretrained(
- -     'pyannote/speaker-diarization-4.0', token="{huggingface-token}")
- +     'pyannoteAI/speaker-diarization-precision', token="{pyannoteAI-api-key}")
  diarization = pipeline("audio.wav") # runs on pyannoteAI servers
  ```

@@ -120,7 +106,7 @@ Pre-loading audio files in memory may result in faster processing:

  ```python
  import torchaudio

  waveform, sample_rate = torchaudio.load("audio.wav")
- diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
  ```

  ## Monitoring progress
@@ -130,7 +116,7 @@ Hooks are available to monitor the progress of the pipeline:
  ```python
  from pyannote.audio.pipelines.utils.hook import ProgressHook
  with ProgressHook() as hook:
-     diarization = pipeline("audio.wav", hook=hook)
  ```

  ## Controlling the number of speakers
@@ -138,15 +124,21 @@ with ProgressHook() as hook:
  In case the number of speakers is known in advance, one can use the `num_speakers` option:

  ```python
- diarization = pipeline("audio.wav", num_speakers=2)
  ```

  One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:

  ```python
- diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
  ```

  ## Offline use

  1. In the terminal, copy the pipeline to disk:
@@ -160,7 +152,7 @@ mkdir /path/to/directory

  # when prompted for a password, use an access token with write permissions.
  # generate one from your settings: https://huggingface.co/settings/tokens
- git clone https://hf.co/pyannote/speaker-diarization-4.0 /path/to/directory/pyannote-speaker-diarization-4.0
  ```

  2. In Python, use the pipeline without internet connection:
@@ -168,10 +160,10 @@ git clone https://hf.co/pyannote/speaker-diarization-4.0 /path/to/directory/pyan
  ```python
  # load pipeline from disk (works without internet connection)
  from pyannote.audio import Pipeline
- pipeline = Pipeline.from_pretrained('/path/to/directory/pyannote-speaker-diarization-4.0')

  # run the pipeline locally on your computer
- diarization = pipeline("audio.wav")
  ```

  ## Citations
 
  - overlapped-speech-detection
  - automatic-speech-recognition
  license: cc-by-4.0
+ extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote user base and help maintainers improve it further. Though this pipeline uses the CC-BY-4.0 license and will always remain open-source, we will occasionally email you about premium pipelines and paid services around pyannote."
  extra_gated_fields:
  Company/university: text
  Website: text
  ---

+ # `Community-1` speaker diarization

  This pipeline ingests mono audio sampled at 16kHz and outputs speaker diarization as an [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance:

  - stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
  - audio files sampled at a different rate are resampled to 16kHz automatically upon loading.
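The downmixing step mentioned above amounts to a per-sample average across channels. A minimal pure-Python sketch of that idea (an illustration only, not the pipeline's actual implementation, which operates on tensors):

```python
def downmix_to_mono(channels):
    """Average equal-length channels into a single mono channel."""
    assert channels and all(len(ch) == len(channels[0]) for ch in channels)
    # one output sample per time step: the mean across channels
    return [sum(samples) / len(channels) for samples in zip(*channels)]

# downmix a toy stereo signal (two channels) to mono
left = [1.0, 3.0, -2.0]
right = [3.0, 5.0, -4.0]
print(downmix_to_mono([left, right]))  # [2.0, 4.0, -3.0]
```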

+ The main improvements brought by `Community-1` are:

+ - [improved](#benchmark) speaker assignment and counting
+ - simpler reconciliation with transcription timestamps thanks to [*exclusive*](#exclusive-speaker-diarization) speaker diarization
+ - easy [offline use](#offline-use) (i.e. without internet connection)
+ - (optionally) [hosted](https://hf.co/pyannote/speaker-diarization-community-1-cloud) on the pyannoteAI cloud
 
  ## Setup

+ 1. `pip install pyannote.audio`
+ 2. Accept user conditions
+ 3. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens)

  ## Quick start

  ```python
  # download the pipeline from Hugging Face
  from pyannote.audio import Pipeline
  pipeline = Pipeline.from_pretrained(
+     "pyannote/speaker-diarization-community-1",
+     token="{huggingface-token}")

  # run the pipeline locally on your computer
+ output = pipeline("audio.wav")

  # print the predicted speaker diarization
+ for turn, _, speaker in output.speaker_diarization.itertracks(yield_label=True):
      print(f"{speaker} speaks between t={turn.start:.3f}s and t={turn.end:.3f}s")
  ```
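Downstream tools often consume diarization as RTTM files. A minimal writer for the (start, end, speaker) turns printed above — a plain-Python sketch, not part of the pipeline API — could look like this:

```python
def to_rttm(turns, uri="audio"):
    """Serialize (start, end, speaker) turns to RTTM, one SPEAKER line per turn."""
    lines = []
    for start, end, speaker in turns:
        # RTTM fields: type, file id, channel, onset, duration, placeholders, label
        lines.append(
            f"SPEAKER {uri} 1 {start:.3f} {end - start:.3f} <NA> <NA> {speaker} <NA> <NA>")
    return "\n".join(lines)

turns = [(0.0, 1.5, "SPEAKER_00"), (1.2, 3.0, "SPEAKER_01")]
print(to_rttm(turns))
```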

  ## Benchmark

+ Out of the box, `Community-1` is much better than `speaker-diarization-3.1`.

  We report [diarization error rates](http://pyannote.github.io/pyannote-metrics/reference.html#diarization) (in %) on a large collection of academic benchmarks (fully automatic processing, no forgiveness collar, nor skipping overlapping speech).

+ | Benchmark (last updated in 2025-09) | <a href="https://hf.co/pyannote/speaker-diarization-3.1">3.1</a> | <a href="https://hf.co/pyannote/speaker-diarization-community-1">Community-1</a> | <a href="https://docs.pyannote.ai">Precision-2</a> |
+ | --- | --- | --- | --- |
+ | [AISHELL-4](https://arxiv.org/abs/2104.03603) | 12.2 | 11.7 | 11.8 |
+ | [AliMeeting](https://www.openslr.org/119/) (channel 1) | 24.5 | 20.3 | 16.3 |
+ | [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (IHM) | 18.8 | 17.0 | 13.2 |
+ | [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (SDM) | 22.7 | 19.9 | 15.8 |
+ | [AVA-AVD](https://arxiv.org/abs/2111.14448) | 49.7 | 44.6 | 40.7 |
+ | [CALLHOME](https://catalog.ldc.upenn.edu/LDC2001S97) ([part 2](https://github.com/BUTSpeechFIT/CALLHOME_sublists/issues/1)) | 28.5 | 26.7 | 17.6 |
+ | [DIHARD 3](https://catalog.ldc.upenn.edu/LDC2022S14) ([full](https://arxiv.org/abs/2012.01477)) | 21.4 | 20.2 | 15.7 |
+ | [Ego4D](https://arxiv.org/abs/2110.07058) (dev.) | 51.2 | 46.8 | 44.7 |
+ | [MSDWild](https://github.com/X-LANCE/MSDWILD) | 25.4 | 22.8 | 17.9 |
+ | [RAMC](https://www.openslr.org/123/) | 22.2 | 20.8 | 10.6 |
+ | [REPERE](https://www.islrn.org/resources/360-758-359-485-0/) (phase2) | 7.9 | 8.9 | 7.3 |
+ | [VoxConverse](https://github.com/joonson/voxconverse) (v0.3) | 11.2 | 11.2 | 9.0 |
+
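The diarization error rates above sum false alarm, missed detection, and speaker confusion, divided by the total duration of reference speech. A toy illustration of that arithmetic (not the actual `pyannote.metrics` implementation, which also computes the optimal speaker mapping):

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total_speech):
    """DER = (false alarm + missed detection + speaker confusion) / total reference speech."""
    return (false_alarm + missed_detection + confusion) / total_speech

# toy durations in seconds, over 60s of reference speech
der = diarization_error_rate(
    false_alarm=2.0, missed_detection=3.0, confusion=1.0, total_speech=60.0)
print(f"{100 * der:.1f}%")  # 10.0%
```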
+ The `Precision-2` model is even better and can be tested like this:
+
+ 1. Create an API key on the [pyannoteAI dashboard](https://dashboard.pyannote.ai) (free credits included)
+ 2. Change one line of code

  ```diff
  from pyannote.audio import Pipeline
  pipeline = Pipeline.from_pretrained(
+ -     'pyannote/speaker-diarization-community-1', token="{huggingface-token}")
+ +     'pyannote/speaker-diarization-precision-2', token="{pyannoteAI-api-key}")
  diarization = pipeline("audio.wav") # runs on pyannoteAI servers
  ```

  Pre-loading audio files in memory may result in faster processing:

  ```python
  import torchaudio

  waveform, sample_rate = torchaudio.load("audio.wav")
+ output = pipeline({"waveform": waveform, "sample_rate": sample_rate})
  ```

  ## Monitoring progress

  Hooks are available to monitor the progress of the pipeline:

  ```python
  from pyannote.audio.pipelines.utils.hook import ProgressHook
  with ProgressHook() as hook:
+     output = pipeline("audio.wav", hook=hook)
  ```
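Any callable can serve as a hook. A minimal custom hook that records progress instead of displaying it — assuming the `(step_name, step_artifact, *, file=None, total=None, completed=None)` calling convention followed by `ProgressHook`; treat that signature as an assumption and check the `pyannote.audio` docs:

```python
class RecordingHook:
    """Record pipeline progress events instead of printing them."""

    def __init__(self):
        self.events = []

    def __call__(self, step_name, step_artifact, file=None, total=None, completed=None):
        # invoked by the pipeline at each internal step
        self.events.append((step_name, completed, total))

hook = RecordingHook()
# output = pipeline("audio.wav", hook=hook)  # would populate hook.events
hook("segmentation", None, total=10, completed=3)  # simulate one pipeline callback
print(hook.events)  # [('segmentation', 3, 10)]
```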

  ## Controlling the number of speakers

  In case the number of speakers is known in advance, one can use the `num_speakers` option:

  ```python
+ output = pipeline("audio.wav", num_speakers=2)
  ```

  One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:

  ```python
+ output = pipeline("audio.wav", min_speakers=2, max_speakers=5)
  ```

+ ## Exclusive speaker diarization
+
+ On top of the regular speaker diarization, the `Community-1` pretrained pipeline returns a new *exclusive* speaker diarization, available as `output.exclusive_speaker_diarization`.
+
+ This feature is [backported from our latest commercial model](https://www.pyannote.ai/blog/precision-2) and simplifies the reconciliation between fine-grained speaker diarization timestamps and (sometimes not so precise) transcription timestamps.
+
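With non-overlapping (exclusive) segments, reconciliation boils down to assigning each transcribed word to the speaker whose segment contains its midpoint. A minimal sketch with hypothetical data — not the pipeline's actual API:

```python
def assign_words_to_speakers(words, segments):
    """Assign each (start, end, word) to the exclusive segment containing its midpoint.

    `segments` is a list of (start, end, speaker) tuples that never overlap,
    as in an *exclusive* speaker diarization.
    """
    assigned = []
    for start, end, word in words:
        midpoint = (start + end) / 2
        speaker = next(
            (spk for seg_start, seg_end, spk in segments if seg_start <= midpoint < seg_end),
            None)  # None if the word falls outside any speech segment
        assigned.append((word, speaker))
    return assigned

segments = [(0.0, 2.5, "SPEAKER_00"), (2.5, 5.0, "SPEAKER_01")]
words = [(0.3, 0.8, "hello"), (2.4, 2.9, "world")]
print(assign_words_to_speakers(words, segments))
# [('hello', 'SPEAKER_00'), ('world', 'SPEAKER_01')]
```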
  ## Offline use

  1. In the terminal, copy the pipeline to disk:

  # when prompted for a password, use an access token with write permissions.
  # generate one from your settings: https://huggingface.co/settings/tokens
+ git clone https://hf.co/pyannote/speaker-diarization-community-1 /path/to/directory/pyannote-speaker-diarization-community-1
  ```

  2. In Python, use the pipeline without internet connection:
 
  ```python
  # load pipeline from disk (works without internet connection)
  from pyannote.audio import Pipeline
+ pipeline = Pipeline.from_pretrained('/path/to/directory/pyannote-speaker-diarization-community-1')

  # run the pipeline locally on your computer
+ output = pipeline("audio.wav")
  ```

  ## Citations