---
license: other
license_name: mixed-mit-cc-by-4-apache-2
license_link: LICENSE
language:
- en
- multilingual
library_name: onnx
tags:
- speaker-diarization
- diarization
- pyannote
- speaker-embedding
- wespeaker
- segmentation
pipeline_tag: voice-activity-detection
---

# dia-models — pyannote community-1 model bundle for the `dia` Rust crate

A single-repo distribution of every model artifact the
[`dia`](https://github.com/al8n/diarization) Rust crate needs to run
end-to-end speaker diarization with **pyannote-community-1** parity:

- The **segmentation-3.0** powerset speaker network (16 kHz audio →
  per-frame speaker activations).
- The **WeSpeaker ResNet34-LM** speaker-embedding network, in three
  forms (external-data ONNX, single-file ONNX, TorchScript).
- The **PLDA** whitening + LDA weights from the
  [`pyannote/speaker-diarization-community-1`](https://huggingface.co/pyannote/speaker-diarization-community-1)
  pipeline, in both `.npz` (build-time) and raw little-endian f64
  `.bin` (runtime) form.

`dia` already embeds the segmentation model and the PLDA weights into
the compiled binary via `include_bytes!`; the **WeSpeaker** ONNX is
the only artifact callers must download separately. This repo lets
callers grab any individual model — or the whole bundle — without
spelunking through the upstream pyannote / WeSpeaker repos.

> **Attribution: this is a redistribution, not new model training.**
> All weights come from upstream pyannote / WeSpeaker / BUT Speech@FIT.
> The licenses below MUST be preserved by anyone redistributing.

## Files

| File | Size | Format | License |
|---|---:|---|---|
| `segmentation-3.0.onnx` | 5.99 MiB | ONNX (single file) | MIT |
| `wespeaker_resnet34_lm.onnx` | 256 KiB | ONNX header (external data) | Apache-2.0 |
| `wespeaker_resnet34_lm.onnx.data` | 25.3 MiB | external-data weights | Apache-2.0 |
| `wespeaker_resnet34_lm_packed.onnx` | 25.5 MiB | ONNX (single file, repacked) | Apache-2.0 |
| `wespeaker_resnet34_lm.pt` | 25.6 MiB | TorchScript | Apache-2.0 |
| `plda/eigenvectors_desc.bin` | 128 KiB | f64 (128×128 row-major) | CC-BY-4.0 |
| `plda/lda.bin` | 256 KiB | f64 (256×128 row-major) | CC-BY-4.0 |
| `plda/mean1.bin` | 2 KiB | f64 (256,) | CC-BY-4.0 |
| `plda/mean2.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/mu.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/phi_desc.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/psi.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/tr.bin` | 128 KiB | f64 (128×128 row-major) | CC-BY-4.0 |
| `plda/plda.npz` | 131 KiB | numpy (`mu`, `tr`, `psi`) | CC-BY-4.0 |
| `plda/xvec_transform.npz` | 131 KiB | numpy (`mean1`, `mean2`, `lda`) | CC-BY-4.0 |

## Which file do I want?

### Segmentation
Use `segmentation-3.0.onnx`. It feeds `dia::segment::SegmentModel`
(or any pyannote-segmentation-compatible runtime). Single file, no
external data, works on every ORT execution provider.

### Embedding (WeSpeaker)
Three forms, same weights, pick by use case:

- **`wespeaker_resnet34_lm.onnx` + `wespeaker_resnet34_lm.onnx.data`**
  — the default ONNX layout. Loads on CPU / TensorRT / CUDA / OpenVINO
  / DirectML. The `.onnx` and `.onnx.data` files MUST sit next to
  each other on disk; ORT resolves the external pointer by relative
  path.
- **`wespeaker_resnet34_lm_packed.onnx`** — same model with all
  weights inlined into one file. Use this if you want a single-file
  artifact, or if the runtime is **CoreML** (Apple Silicon — Apple's
  graph optimizer chokes on external initializers and reports
  `model_path must not be empty`; the packed form sidesteps it).
  Otherwise functionally identical.
- **`wespeaker_resnet34_lm.pt`** — TorchScript export for the
  `tch` backend. Bit-exact to upstream PyTorch on hard cases (heavy-
  overlap fixtures where the ONNX→ORT path can drift by O(1) per
  element). Pulls in libtorch (~600 MB shared library).

### PLDA
The eight `.bin` files are the runtime data — raw little-endian f64
blobs that `dia::plda` embeds via `include_bytes!`. The two `.npz`
files are the build-time sources (`xvec_transform.npz` exposes
`mean1` / `mean2` / `lda`; `plda.npz` exposes `mu` / `tr` /
`psi`); they are mirrored from the upstream pyannote-community-1
snapshot for traceability and so the `.bin` extraction can be
re-run via `scripts/extract-plda-blobs.sh` in the dia repo.
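
The `.bin` layout described above (raw little-endian f64, row-major, no header) can be sketched with a numpy round-trip over synthetic data — the shapes are taken from the table above; the real files are not fetched here:

```python
import numpy as np

# The .bin blobs carry no header; shape and dtype come from the file
# table (e.g. plda/lda.bin is 256x128 row-major little-endian f64).
# Round-trip a synthetic matrix through the same layout:
rng = np.random.default_rng(0)
lda = rng.standard_normal((256, 128))

raw = lda.astype("<f8").tobytes()      # "<f8" = little-endian f64, row-major
assert len(raw) == 256 * 128 * 8       # 262,144 bytes = 256 KiB, matching the table

restored = np.frombuffer(raw, dtype="<f8").reshape(256, 128)
assert np.array_equal(restored, lda)
```

Reading a real blob is the same `np.frombuffer` (or `np.fromfile`) call over the file's bytes, reshaped per the table.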

`eigenvectors_desc.bin` and `phi_desc.bin` are scipy-derived
eigenvectors of the PLDA generalized eigenproblem `(B, W)` — pinned
to avoid LAPACK eigenvector-sign indeterminism (which produced a
38% DER divergence on three-speaker fixtures when nalgebra and
scipy disagreed on 67 of 128 column signs). See
[`models/plda/SOURCE.md`](https://github.com/al8n/diarization/blob/main/models/plda/SOURCE.md)
in the dia repo for the regeneration procedure.
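
To make the sign indeterminism concrete, here is a minimal numpy illustration. The canonicalization shown (flip each column so its dominant entry is positive) is one common convention, not necessarily the procedure `SOURCE.md` prescribes:

```python
import numpy as np

# Eigenvectors are only defined up to sign: if v solves the eigenproblem,
# so does -v, and different LAPACK builds / libraries may return either.
# One illustrative fix: canonicalize each column so its largest-magnitude
# entry is positive.
def canonicalize_signs(V: np.ndarray) -> np.ndarray:
    idx = np.abs(V).argmax(axis=0)                 # dominant entry per column
    signs = np.sign(V[idx, np.arange(V.shape[1])])
    return V * signs

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
A = A + A.T                                        # symmetric test matrix
_, V = np.linalg.eigh(A)

# Simulate a build that disagrees on some column signs:
flips = rng.choice([-1.0, 1.0], size=8)
assert np.allclose(canonicalize_signs(V), canonicalize_signs(V * flips))
```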

## Provenance

### segmentation-3.0.onnx
- **Upstream:** [`pyannote/segmentation-3.0`](https://huggingface.co/pyannote/segmentation-3.0)
- **Original layout:** `pytorch_model.onnx` in the upstream HF repo.
- **License:** MIT — Copyright (c) 2023 CNRS
- **Author:** Hervé Bredin (CNRS / IRIT), pyannote.audio author and
  lead trainer.
- **SHA-256:** `057ee564753071c0b09b5b611648b50ac188d50846bff5f01e9f7bbf1591ea25`

### wespeaker_resnet34_lm.onnx (+ .data) / .pt / _packed.onnx
- **Upstream model architecture:** WeSpeaker ResNet34 with
  large-margin (LM) angular fine-tuning, trained on VoxCeleb-2.
- **Upstream sources:**
  - [WeSpeaker project](https://github.com/wenet-e2e/wespeaker) (Apache-2.0)
  - [`onnx-community/wespeaker_resnet34_lm`](https://huggingface.co/onnx-community/wespeaker_resnet34_lm)
    for the ONNX export.
- **License:** Apache-2.0.
- **`_packed.onnx` derivative:** produced by loading
  `wespeaker_resnet34_lm.onnx` + `.onnx.data` via the `onnx` Python
  library (`onnx.load(path, load_external_data=True)`) and re-saving
  with `save_as_external_data=False`. Same weights, no external file.
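
  The repack step can be reproduced with the `onnx` Python library. The sketch below builds a trivial throwaway model (the real input is `wespeaker_resnet34_lm.onnx` + `.onnx.data`) so it runs without downloading anything:

  ```python
  import os
  import tempfile

  import numpy as np
  import onnx
  from onnx import TensorProto, helper

  # Build a tiny one-node model standing in for the real WeSpeaker export.
  w = helper.make_tensor("W", TensorProto.FLOAT, [4, 4],
                         np.eye(4, dtype=np.float32).tobytes(), raw=True)
  node = helper.make_node("MatMul", ["x", "W"], ["y"])
  graph = helper.make_graph(
      [node], "g",
      [helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 4])],
      [helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 4])],
      initializer=[w])
  model = helper.make_model(graph)

  with tempfile.TemporaryDirectory() as d:
      ext = os.path.join(d, "m.onnx")
      # Save with external data (the .onnx + .onnx.data layout)...
      onnx.save_model(model, ext, save_as_external_data=True,
                      all_tensors_to_one_file=True, size_threshold=0)
      # ...then reload with weights resolved and re-save as one file,
      # the same way _packed.onnx was produced.
      reloaded = onnx.load(ext, load_external_data=True)
      packed = os.path.join(d, "m_packed.onnx")
      onnx.save_model(reloaded, packed, save_as_external_data=False)
      onnx.checker.check_model(onnx.load(packed))
  ```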

### plda/
- **Upstream:** [`pyannote/speaker-diarization-community-1`](https://huggingface.co/pyannote/speaker-diarization-community-1)
- **License:** CC-BY-4.0
- **Snapshot revision:** `3533c8cf8e369892e6b79ff1bf80f7b0286a54ee`
- **Original layout in the upstream HF repo:**
  `plda/xvec_transform.npz` and `plda/plda.npz`.
- **Attribution (per upstream `plda/README.md`):**
  PLDA model trained by [BUT Speech@FIT](https://speech.fit.vut.cz/);
  integration of VBx in pyannote.audio by Jiangyu Han and Petr Pálka.

## Usage

### From `dia` (Rust)
```rust
use diarization::{
  embed::EmbedModel,
  plda::PldaTransform,
  segment::SegmentModel,
};
// Segmentation + PLDA are bundled by default — no download needed.
let mut seg = SegmentModel::bundled()?;
let plda = PldaTransform::new()?;
// WeSpeaker is BYO; download from this repo.
let mut emb = EmbedModel::from_file("wespeaker_resnet34_lm.onnx")?;
# Ok::<(), Box<dyn std::error::Error>>(())
```

### Direct download
```bash
# whole bundle
hf download FinDIT-Studio/dia-models --local-dir ./dia-models

# just the embedding model (default ONNX form)
hf download FinDIT-Studio/dia-models \
  wespeaker_resnet34_lm.onnx wespeaker_resnet34_lm.onnx.data \
  --local-dir ./models

# CoreML-friendly single-file form
hf download FinDIT-Studio/dia-models \
  wespeaker_resnet34_lm_packed.onnx --local-dir ./models
```

## Licenses

This repository **redistributes** model artifacts under three different
licenses. Each artifact retains its upstream license. By using this
bundle you agree to comply with **all three**:

- **MIT** for `segmentation-3.0.onnx` (Copyright © 2023 CNRS, Hervé Bredin).
  See `LICENSE.MIT`.
- **Apache-2.0** for the WeSpeaker artifacts. See `LICENSE.APACHE-2.0`.
- **CC-BY-4.0** for everything under `plda/`. See `LICENSE.CC-BY-4.0`.
  Required attribution: *PLDA model trained by BUT Speech@FIT;
  integration of VBx in pyannote.audio by Jiangyu Han and Petr Pálka.*

The `dia` Rust crate that consumes these models is itself dual-licensed
MIT OR Apache-2.0; that licensing applies to the source code, not to the
model weights bundled here.

## Citation

If you use these weights in academic work, please cite the upstream
papers / model cards:

- **Segmentation-3.0:** Hervé Bredin, *pyannote.audio 2.1 speaker
  diarization pipeline: principle, benchmark, and recipe*, Interspeech
  2023.
- **WeSpeaker:** Wang et al., *WeSpeaker: A research and production
  oriented speaker embedding learning toolkit*, ICASSP 2023.
- **PLDA / VBx:** Landini et al., *Bayesian HMM clustering of x-vector
  sequences (VBx) in speaker diarization: theory, implementation and
  analysis on standard tasks*, Computer Speech & Language, 2022.

## Issues / questions

This repo is a **redistribution** of upstream artifacts. Please file
issues against:

- The dia Rust crate: <https://github.com/al8n/diarization/issues>
- The pyannote.audio project: <https://github.com/pyannote/pyannote-audio/issues>
- The WeSpeaker project: <https://github.com/wenet-e2e/wespeaker/issues>