---
license: other
license_name: mixed-mit-cc-by-4-apache-2
license_link: LICENSE
language:
- en
- multilingual
library_name: onnx
tags:
- speaker-diarization
- diarization
- pyannote
- speaker-embedding
- wespeaker
- segmentation
pipeline_tag: voice-activity-detection
---
# dia-models — pyannote community-1 model bundle for the `dia` Rust crate
A single-repo distribution of every model artifact the
[`dia`](https://github.com/al8n/diarization) Rust crate needs to run
end-to-end speaker diarization with **pyannote-community-1** parity:
- The **segmentation-3.0** powerset speaker network (16 kHz audio →
per-frame speaker activations).
- The **WeSpeaker ResNet34-LM** speaker-embedding network, in three
forms (external-data ONNX, single-file ONNX, TorchScript).
- The **PLDA** whitening + LDA weights from the
[`pyannote/speaker-diarization-community-1`](https://huggingface.co/pyannote/speaker-diarization-community-1)
pipeline, in both `.npz` (build-time) and raw little-endian f64
`.bin` (runtime) form.
`dia` already embeds the segmentation model and the PLDA weights into
the compiled binary via `include_bytes!`; the **WeSpeaker** ONNX is
the only artifact callers must download separately. This repo lets
callers grab any individual model — or the whole bundle — without
spelunking through the upstream pyannote / WeSpeaker repos.
> **Attribution: this is a redistribution, not new model training.**
> All weights come from upstream pyannote / WeSpeaker / BUT Speech@FIT.
> The licenses below MUST be preserved by anyone redistributing.
## Files
| File | Size | Format | License |
|---|---:|---|---|
| `segmentation-3.0.onnx` | 5.99 MiB | ONNX (single file) | MIT |
| `wespeaker_resnet34_lm.onnx` | 256 KiB | ONNX header (external data) | Apache-2.0 |
| `wespeaker_resnet34_lm.onnx.data` | 25.3 MiB | external-data weights | Apache-2.0 |
| `wespeaker_resnet34_lm_packed.onnx` | 25.5 MiB | ONNX (single file, repacked) | Apache-2.0 |
| `wespeaker_resnet34_lm.pt` | 25.6 MiB | TorchScript | Apache-2.0 |
| `plda/eigenvectors_desc.bin` | 128 KiB | f64 (128×128 row-major) | CC-BY-4.0 |
| `plda/lda.bin` | 256 KiB | f64 (256×128 row-major) | CC-BY-4.0 |
| `plda/mean1.bin` | 2 KiB | f64 (256,) | CC-BY-4.0 |
| `plda/mean2.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/mu.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/phi_desc.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/psi.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/tr.bin` | 128 KiB | f64 (128×128 row-major) | CC-BY-4.0 |
| `plda/plda.npz` | 131 KiB | numpy (`mu`, `tr`, `psi`) | CC-BY-4.0 |
| `plda/xvec_transform.npz` | 131 KiB | numpy (`mean1`, `mean2`, `lda`) | CC-BY-4.0 |
## Which file do I want?
### Segmentation
Use `segmentation-3.0.onnx`. It feeds `dia::segment::SegmentModel`
(or any pyannote-segmentation-compatible runtime). Single file, no
external data, works on every ORT execution provider.
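The "powerset" in the model name refers to how the network encodes overlap: instead of independent per-speaker sigmoids, each frame is classified into one subset of up to three local speakers (at most two active at once), giving 7 classes. A minimal sketch of decoding that output, assuming the upstream class ordering (∅, {1}, {2}, {3}, {1,2}, {1,3}, {2,3}); the actual decoding lives inside `dia::segment::SegmentModel`:

```rust
/// Powerset classes for segmentation-3.0: 3 local speakers, at most 2
/// active per frame. Class ordering here is an assumption based on the
/// upstream model card, not read from the ONNX graph.
const POWERSET: [[bool; 3]; 7] = [
    [false, false, false], // ∅
    [true, false, false],  // {1}
    [false, true, false],  // {2}
    [false, false, true],  // {3}
    [true, true, false],   // {1,2}
    [true, false, true],   // {1,3}
    [false, true, true],   // {2,3}
];

/// Argmax over one frame's 7 class logits, mapped back to
/// per-speaker activity.
fn decode_frame(logits: &[f64; 7]) -> [bool; 3] {
    let best = logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap();
    POWERSET[best]
}

fn main() {
    // A frame where the "{1,2}" overlap class dominates.
    let logits = [0.1, 0.3, 0.2, 0.0, 2.5, 0.4, 0.1];
    println!("{:?}", decode_frame(&logits));
}
```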
### Embedding (WeSpeaker)
Three forms, same weights, pick by use case:
- **`wespeaker_resnet34_lm.onnx` + `wespeaker_resnet34_lm.onnx.data`**
— the default ONNX layout. Loads on CPU / TensorRT / CUDA / OpenVINO
/ DirectML. The `.onnx` and `.onnx.data` files MUST sit next to
each other on disk; ORT resolves the external pointer by relative
path.
- **`wespeaker_resnet34_lm_packed.onnx`** — same model with all
weights inlined into one file. Use this if you want a single-file
artifact, or if the runtime is **CoreML** (Apple Silicon — Apple's
graph optimizer chokes on external initializers and reports
`model_path must not be empty`; the packed form sidesteps it).
Otherwise functionally identical.
- **`wespeaker_resnet34_lm.pt`** — TorchScript export for the
`tch` backend. Bit-exact to upstream PyTorch on hard cases (heavy-
overlap fixtures where the ONNX→ORT path can drift by O(1) per
element). Pulls in libtorch (~600 MB shared library).
### PLDA
The eight `.bin` files are the runtime data — raw little-endian f64
blobs that `dia::plda` embeds via `include_bytes!`. The two `.npz`
files are the build-time sources (`xvec_transform.npz` exposes
`mean1` / `mean2` / `lda`; `plda.npz` exposes `mu` / `tr` /
`psi`); they are mirrored from the upstream pyannote-community-1
snapshot for traceability and so the `.bin` extraction can be
re-run via `scripts/extract-plda-blobs.sh` in the dia repo.
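The `.bin` layout is simple enough to read with the standard library alone. A minimal sketch of decoding a blob and indexing it row-major (e.g. 128×128 for `tr.bin`); `dia::plda` embeds these via `include_bytes!` rather than reading files, so this only illustrates the on-disk format:

```rust
/// Decode a raw little-endian f64 blob (the `plda/*.bin` layout).
fn read_f64_le(bytes: &[u8]) -> Vec<f64> {
    bytes
        .chunks_exact(8)
        .map(|c| f64::from_le_bytes(c.try_into().unwrap()))
        .collect()
}

/// Row-major index into a flat matrix with `cols` columns.
fn at(m: &[f64], cols: usize, row: usize, col: usize) -> f64 {
    m[row * cols + col]
}

fn main() {
    // In practice: read_f64_le(&std::fs::read("plda/mu.bin")?)
    // (mu.bin is 128 f64 values = 1,024 bytes). Demo on synthetic data:
    let demo: Vec<u8> = [1.0f64, 2.0, 3.0, 4.0]
        .iter()
        .flat_map(|v| v.to_le_bytes())
        .collect();
    let m = read_f64_le(&demo);
    // Row 1, col 0 of a 2×2 row-major matrix:
    println!("{}", at(&m, 2, 1, 0));
}
```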
`eigenvectors_desc.bin` and `phi_desc.bin` are scipy-derived
eigenvectors of the PLDA generalized eigenproblem `(B, W)` — pinned
to avoid LAPACK eigenvector-sign indeterminism (which produced a
38% DER divergence on three-speaker fixtures when nalgebra and
scipy disagreed on 67 of 128 column signs). See
[`models/plda/SOURCE.md`](https://github.com/al8n/diarization/blob/main/models/plda/SOURCE.md)
in the dia repo for the regeneration procedure.
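To make the sign problem concrete: eigenvectors are only defined up to sign, so two LAPACK backends can legitimately return column-wise negations of each other. One common pinning convention (flip any column whose largest-magnitude entry is negative) looks like the sketch below; this is illustrative of the technique, not necessarily the exact rule used to produce `eigenvectors_desc.bin` (see `SOURCE.md` for that):

```rust
/// Pin eigenvector signs in a row-major `rows × cols` matrix by
/// flipping any column whose largest-magnitude entry is negative.
/// One common convention for LAPACK sign indeterminism; illustrative
/// only, not necessarily the rule behind `eigenvectors_desc.bin`.
fn pin_column_signs(m: &mut [f64], rows: usize, cols: usize) {
    for c in 0..cols {
        // Find the entry with the largest absolute value in column c.
        let mut best = 0.0f64;
        for r in 0..rows {
            let v = m[r * cols + c];
            if v.abs() > best.abs() {
                best = v;
            }
        }
        // Negative dominant entry: flip the whole column.
        if best < 0.0 {
            for r in 0..rows {
                m[r * cols + c] = -m[r * cols + c];
            }
        }
    }
}

fn main() {
    let mut evecs = vec![-3.0, 1.0, 2.0, -0.5]; // 2×2 row-major demo
    pin_column_signs(&mut evecs, 2, 2);
    println!("{evecs:?}");
}
```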
## Provenance
### segmentation-3.0.onnx
- **Upstream:** [`pyannote/segmentation-3.0`](https://huggingface.co/pyannote/segmentation-3.0)
- **Original layout:** `pytorch_model.onnx` in the upstream HF repo.
- **License:** MIT — Copyright (c) 2023 CNRS
- **Author:** Hervé Bredin (CNRS / IRIT), pyannote.audio author and
lead trainer.
- **SHA-256:** `057ee564753071c0b09b5b611648b50ac188d50846bff5f01e9f7bbf1591ea25`
### wespeaker_resnet34_lm.onnx (+ .data) / .pt / _packed.onnx
- **Upstream model architecture:** WeSpeaker ResNet34 with
large-margin (LM) angular fine-tuning, trained on VoxCeleb-2.
- **Upstream sources:**
- [WeSpeaker project](https://github.com/wenet-e2e/wespeaker) (Apache-2.0)
- [`onnx-community/wespeaker_resnet34_lm`](https://huggingface.co/onnx-community/wespeaker_resnet34_lm)
for the ONNX export.
- **License:** Apache-2.0.
- **`_packed.onnx` derivative:** produced by loading
`wespeaker_resnet34_lm.onnx` + `.onnx.data` via the `onnx` Python
library (`onnx.load(path, load_external_data=True)`) and re-saving
with `save_as_external_data=False`. Same weights, no external file.
### plda/
- **Upstream:** [`pyannote/speaker-diarization-community-1`](https://huggingface.co/pyannote/speaker-diarization-community-1)
- **License:** CC-BY-4.0
- **Snapshot revision:** `3533c8cf8e369892e6b79ff1bf80f7b0286a54ee`
- **Original layout in the upstream HF repo:**
`plda/xvec_transform.npz` and `plda/plda.npz`.
- **Attribution (per upstream `plda/README.md`):**
PLDA model trained by [BUT Speech@FIT](https://speech.fit.vut.cz/);
integration of VBx in pyannote.audio by Jiangyu Han and Petr Pálka.
## Usage
### From `dia` (Rust)
```rust
use diarization::{
    embed::EmbedModel,
    plda::PldaTransform,
    segment::SegmentModel,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Segmentation + PLDA are bundled by default — no download needed.
    let mut seg = SegmentModel::bundled()?;
    let plda = PldaTransform::new()?;

    // WeSpeaker is BYO; download from this repo.
    let mut emb = EmbedModel::from_file("wespeaker_resnet34_lm.onnx")?;
    Ok(())
}
```
### Direct download
```bash
# whole bundle
hf download FinDIT-Studio/dia-models --local-dir ./dia-models
# just the embedding model (default ONNX form)
hf download FinDIT-Studio/dia-models \
wespeaker_resnet34_lm.onnx wespeaker_resnet34_lm.onnx.data \
--local-dir ./models
# CoreML-friendly single-file form
hf download FinDIT-Studio/dia-models \
wespeaker_resnet34_lm_packed.onnx --local-dir ./models
```
## Licenses
This repository **redistributes** model artifacts under three different
licenses. Each artifact retains its upstream license. By using this
bundle you agree to comply with **all three**:
- **MIT** for `segmentation-3.0.onnx` (Copyright © 2023 CNRS, Hervé Bredin).
See `LICENSE.MIT`.
- **Apache-2.0** for the WeSpeaker artifacts. See `LICENSE.APACHE-2.0`.
- **CC-BY-4.0** for everything under `plda/`. See `LICENSE.CC-BY-4.0`.
Required attribution: *PLDA model trained by BUT Speech@FIT;
integration of VBx in pyannote.audio by Jiangyu Han and Petr Pálka.*
The `dia` Rust crate that consumes these models is itself dual-licensed
MIT OR Apache-2.0; that licensing applies to the source code, not to the
model weights bundled here.
## Citation
If you use these weights in academic work, please cite the upstream
papers / model cards:
- **Segmentation-3.0:** Hervé Bredin, *pyannote.audio 2.1 speaker
diarization pipeline: principle, benchmark, and recipe*, Interspeech
2023.
- **WeSpeaker:** Wang et al., *WeSpeaker: A research and production
oriented speaker embedding learning toolkit*, ICASSP 2023.
- **PLDA / VBx:** Landini et al., *Bayesian HMM clustering of x-vector
sequences (VBx) in speaker diarization: theory, implementation and
analysis on standard tasks*, Computer Speech & Language, 2022.
## Issues / questions
This repo is a **redistribution** of upstream artifacts. Please file
issues against:
- The dia Rust crate: <https://github.com/al8n/diarization/issues>
- The pyannote.audio project: <https://github.com/pyannote/pyannote-audio/issues>
- The WeSpeaker project: <https://github.com/wenet-e2e/wespeaker/issues>