---
license: other
license_name: mixed-mit-cc-by-4-apache-2
license_link: LICENSE
language:
- en
- multilingual
library_name: onnx
tags:
- speaker-diarization
- diarization
- pyannote
- speaker-embedding
- wespeaker
- segmentation
pipeline_tag: voice-activity-detection
---

# dia-models — pyannote community-1 model bundle for the `dia` Rust crate


A single-repo distribution of every model artifact the
[`dia`](https://github.com/al8n/diarization) Rust crate needs to run
end-to-end speaker diarization with **pyannote-community-1** parity:


- The **segmentation-3.0** powerset speaker network (16 kHz audio →
  per-frame speaker activations).
- The **WeSpeaker ResNet34-LM** speaker-embedding network, in three
  forms (external-data ONNX, single-file ONNX, TorchScript).
- The **PLDA** whitening + LDA weights from the
  [`pyannote/speaker-diarization-community-1`](https://huggingface.co/pyannote/speaker-diarization-community-1)
  pipeline, in both `.npz` (build-time) and raw little-endian f64
  `.bin` (runtime) form.


`dia` already embeds the segmentation model and the PLDA weights into
the compiled binary via `include_bytes!`; the **WeSpeaker** ONNX is
the only artifact callers must download separately. This repo lets
callers grab any individual model — or the whole bundle — without
spelunking through the upstream pyannote / WeSpeaker repos.


> **Attribution: this is a redistribution, not new model training.**
> All weights come from upstream pyannote / WeSpeaker / BUT Speech@FIT.
> The licenses below MUST be preserved by anyone redistributing.


## Files


| File | Size | Format | License |
|---|---:|---|---|
| `segmentation-3.0.onnx` | 5.99 MiB | ONNX (single file) | MIT |
| `wespeaker_resnet34_lm.onnx` | 256 KiB | ONNX header (external data) | Apache-2.0 |
| `wespeaker_resnet34_lm.onnx.data` | 25.3 MiB | external-data weights | Apache-2.0 |
| `wespeaker_resnet34_lm_packed.onnx` | 25.5 MiB | ONNX (single file, repacked) | Apache-2.0 |
| `wespeaker_resnet34_lm.pt` | 25.6 MiB | TorchScript | Apache-2.0 |
| `plda/eigenvectors_desc.bin` | 128 KiB | f64 (128×128 row-major) | CC-BY-4.0 |
| `plda/lda.bin` | 256 KiB | f64 (256×128 row-major) | CC-BY-4.0 |
| `plda/mean1.bin` | 2 KiB | f64 (256,) | CC-BY-4.0 |
| `plda/mean2.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/mu.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/phi_desc.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/psi.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/tr.bin` | 128 KiB | f64 (128×128 row-major) | CC-BY-4.0 |
| `plda/plda.npz` | 131 KiB | numpy (`mu`, `tr`, `psi`) | CC-BY-4.0 |
| `plda/xvec_transform.npz` | 131 KiB | numpy (`mean1`, `mean2`, `lda`) | CC-BY-4.0 |
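
The `.bin` layout is deliberately trivial: raw row-major little-endian f64 with no
header, so the byte count alone pins down the shape (128 × 128 × 8 B = 131072 B =
128 KiB for `tr.bin`). A minimal round-trip sketch with synthetic data, not the
actual weights:

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for plda/tr.bin: a 128x128 row-major matrix of
# little-endian f64, exactly as the table above describes.
mat = np.arange(128 * 128, dtype="<f8").reshape(128, 128)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "tr.bin")
    mat.tofile(path)  # raw bytes, no header: 128 * 128 * 8 = 131072 B
    assert os.path.getsize(path) == 128 * 128 * 8

    # Reading it back is a single fromfile + reshape.
    back = np.fromfile(path, dtype="<f8").reshape(128, 128)
    assert np.array_equal(back, mat)
```

The same `fromfile` + `reshape` pattern reads any of the eight blobs, given the
shapes in the table above.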


## Which file do I want?


### Segmentation
Use `segmentation-3.0.onnx`. It feeds `dia::segment::SegmentModel`
(or any pyannote-segmentation-compatible runtime). Single file, no
external data, works on every ORT execution provider.


### Embedding (WeSpeaker)
Three forms, same weights, pick by use case:


- **`wespeaker_resnet34_lm.onnx` + `wespeaker_resnet34_lm.onnx.data`**
  — the default ONNX layout. Loads on CPU / TensorRT / CUDA / OpenVINO
  / DirectML. The `.onnx` and `.onnx.data` files MUST sit next to
  each other on disk; ORT resolves the external pointer by relative
  path.
- **`wespeaker_resnet34_lm_packed.onnx`** — same model with all
  weights inlined into one file. Use this if you want a single-file
  artifact, or if the runtime is **CoreML** (Apple Silicon — Apple's
  graph optimizer chokes on external initializers and reports
  `model_path must not be empty`; the packed form sidesteps it).
  Otherwise functionally identical.
- **`wespeaker_resnet34_lm.pt`** — TorchScript export for the
  `tch` backend. Bit-exact to upstream PyTorch on hard cases
  (heavy-overlap fixtures where the ONNX→ORT path can drift by O(1)
  per element). Pulls in libtorch (~600 MB shared library).


### PLDA
The eight `.bin` files are the runtime data — raw little-endian f64
blobs that `dia::plda` embeds via `include_bytes!`. The two `.npz`
files are the build-time sources (`xvec_transform.npz` exposes
`mean1` / `mean2` / `lda`; `plda.npz` exposes `mu` / `tr` /
`psi`); they are mirrored from the upstream pyannote-community-1
snapshot for traceability and so the `.bin` extraction can be
re-run via `scripts/extract-plda-blobs.sh` in the dia repo.
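
The extraction amounts to loading each `.npz` key and dumping its raw little-endian
f64 buffer. A sketch with synthetic arrays (the canonical procedure is
`scripts/extract-plda-blobs.sh` in the dia repo; key names here match the `.npz`
layout described above):

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as d:
    # Synthetic stand-in for plda/plda.npz with the documented keys.
    npz_path = os.path.join(d, "plda.npz")
    np.savez(npz_path, mu=np.zeros(128), tr=np.eye(128), psi=np.ones(128))

    data = np.load(npz_path)
    for key in ("mu", "tr", "psi"):
        # Force little-endian f64 and dump the raw row-major buffer.
        arr = np.ascontiguousarray(data[key], dtype="<f8")
        arr.tofile(os.path.join(d, f"{key}.bin"))

    # Sizes line up with the Files table: tr is 128x128, psi is (128,).
    assert os.path.getsize(os.path.join(d, "tr.bin")) == 128 * 128 * 8
    assert os.path.getsize(os.path.join(d, "psi.bin")) == 128 * 8
```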


`eigenvectors_desc.bin` and `phi_desc.bin` are scipy-derived
eigenvectors of the PLDA generalized eigenproblem `(B, W)` — pinned
to avoid LAPACK eigenvector-sign indeterminism (which produced a
38% DER divergence on three-speaker fixtures when nalgebra and
scipy disagreed on 67 of 128 column signs). See
[`models/plda/SOURCE.md`](https://github.com/al8n/diarization/blob/main/models/plda/SOURCE.md)
in the dia repo for the regeneration procedure.
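
The indeterminism exists because if `v` solves the generalized eigenproblem then so
does `-v`, so two LAPACK builds can legitimately return opposite column signs. A
small scipy sketch with synthetic `(B, W)` matrices; the pinning convention shown
(flip each column so its largest-magnitude entry is positive) is an illustrative
assumption, not necessarily the one used to generate the `*_desc.bin` files:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Small symmetric positive-definite stand-ins for the (B, W) pair.
A = rng.standard_normal((8, 8))
B = A @ A.T + 8 * np.eye(8)
A = rng.standard_normal((8, 8))
W = A @ A.T + 8 * np.eye(8)

# Generalized eigenproblem B v = lambda W v; columns of `vecs` are
# eigenvectors, each determined only up to sign.
vals, vecs = eigh(B, W)

# One pinning convention (an assumption): make the largest-|entry|
# of each column positive, so the choice of sign is deterministic.
idx = np.argmax(np.abs(vecs), axis=0)
signs = np.sign(vecs[idx, np.arange(vecs.shape[1])])
pinned = vecs * signs

# Both v and -v solve the eigenproblem; pinning only fixes the choice.
assert np.allclose(B @ pinned, (W @ pinned) * vals)
```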


## Provenance


### segmentation-3.0.onnx
- **Upstream:** [`pyannote/segmentation-3.0`](https://huggingface.co/pyannote/segmentation-3.0)
- **Original layout:** `pytorch_model.onnx` in the upstream HF repo.
- **License:** MIT — Copyright (c) 2023 CNRS
- **Author:** Hervé Bredin (CNRS / IRIT), pyannote.audio author and
  lead trainer.
- **SHA-256:** `057ee564753071c0b09b5b611648b50ac188d50846bff5f01e9f7bbf1591ea25`


### wespeaker_resnet34_lm.onnx (+ .data) / .pt / _packed.onnx
- **Upstream model architecture:** WeSpeaker ResNet34 with
  large-margin (LM) angular fine-tuning, trained on VoxCeleb-2.
- **Upstream sources:**
  - [WeSpeaker project](https://github.com/wenet-e2e/wespeaker) (Apache-2.0)
  - [`onnx-community/wespeaker_resnet34_lm`](https://huggingface.co/onnx-community/wespeaker_resnet34_lm)
    for the ONNX export.
- **License:** Apache-2.0.
- **`_packed.onnx` derivative:** produced by loading
  `wespeaker_resnet34_lm.onnx` + `.onnx.data` via the `onnx` Python
  library (`onnx.load(path, load_external_data=True)`) and re-saving
  with `save_as_external_data=False`. Same weights, no external file.


### plda/
- **Upstream:** [`pyannote/speaker-diarization-community-1`](https://huggingface.co/pyannote/speaker-diarization-community-1)
- **License:** CC-BY-4.0
- **Snapshot revision:** `3533c8cf8e369892e6b79ff1bf80f7b0286a54ee`
- **Original layout in the upstream HF repo:**
  `plda/xvec_transform.npz` and `plda/plda.npz`.
- **Attribution (per upstream `plda/README.md`):**
  PLDA model trained by [BUT Speech@FIT](https://speech.fit.vut.cz/);
  integration of VBx in pyannote.audio by Jiangyu Han and Petr Pálka.


## Usage


### From `dia` (Rust)
```rust
use diarization::{
    embed::EmbedModel,
    plda::PldaTransform,
    segment::SegmentModel,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Segmentation + PLDA are bundled by default — no download needed.
    let mut seg = SegmentModel::bundled()?;
    let plda = PldaTransform::new()?;
    // WeSpeaker is BYO; download from this repo.
    let mut emb = EmbedModel::from_file("wespeaker_resnet34_lm.onnx")?;
    Ok(())
}
```


### Direct download
```bash
# whole bundle
hf download FinDIT-Studio/dia-models --local-dir ./dia-models

# just the embedding model (default ONNX form)
hf download FinDIT-Studio/dia-models \
  wespeaker_resnet34_lm.onnx wespeaker_resnet34_lm.onnx.data \
  --local-dir ./models

# CoreML-friendly single-file form
hf download FinDIT-Studio/dia-models \
  wespeaker_resnet34_lm_packed.onnx --local-dir ./models
```


## Licenses


This repository **redistributes** model artifacts under three different
licenses. Each artifact retains its upstream license. By using this
bundle you agree to comply with **all three**:


- **MIT** for `segmentation-3.0.onnx` (Copyright © 2023 CNRS, Hervé Bredin).
  See `LICENSE.MIT`.
- **Apache-2.0** for the WeSpeaker artifacts. See `LICENSE.APACHE-2.0`.
- **CC-BY-4.0** for everything under `plda/`. See `LICENSE.CC-BY-4.0`.
  Required attribution: *PLDA model trained by BUT Speech@FIT;
  integration of VBx in pyannote.audio by Jiangyu Han and Petr Pálka.*


The `dia` Rust crate that consumes these models is itself dual-licensed
MIT OR Apache-2.0; that licensing applies to the source code, not to the
model weights bundled here.


## Citation


If you use these weights in academic work, please cite the upstream
papers / model cards:


- **Segmentation-3.0:** Hervé Bredin, *pyannote.audio 2.1 speaker
  diarization pipeline: principle, benchmark, and recipe*, Interspeech
  2023.
- **WeSpeaker:** Wang et al., *WeSpeaker: A research and production
  oriented speaker embedding learning toolkit*, ICASSP 2023.
- **PLDA / VBx:** Landini et al., *Bayesian HMM clustering of x-vector
  sequences (VBx) in speaker diarization: theory, implementation and
  analysis on standard tasks*, Computer Speech & Language, 2022.


## Issues / questions


This repo is a **redistribution** of upstream artifacts. Please file
issues against:


- The dia Rust crate: <https://github.com/al8n/diarization/issues>
- The pyannote.audio project: <https://github.com/pyannote/pyannote-audio/issues>
- The WeSpeaker project: <https://github.com/wenet-e2e/wespeaker/issues>