---
license: other
license_name: mixed-mit-cc-by-4-apache-2
license_link: LICENSE
language:
- en
- multilingual
library_name: onnx
tags:
- speaker-diarization
- diarization
- pyannote
- speaker-embedding
- wespeaker
- segmentation
pipeline_tag: voice-activity-detection
---
# dia-models — pyannote community-1 model bundle for the `dia` Rust crate
A single-repo distribution of every model artifact the
[`dia`](https://github.com/al8n/diarization) Rust crate needs to run
end-to-end speaker diarization with **pyannote-community-1** parity:
- The **segmentation-3.0** powerset speaker network (16 kHz audio →
per-frame speaker activations).
- The **WeSpeaker ResNet34-LM** speaker-embedding network, in three
forms (external-data ONNX, single-file ONNX, TorchScript).
- The **PLDA** whitening + LDA weights from the
[`pyannote/speaker-diarization-community-1`](https://huggingface.co/pyannote/speaker-diarization-community-1)
pipeline, in both `.npz` (build-time) and raw little-endian f64
`.bin` (runtime) form.
`dia` already embeds the segmentation model and the PLDA weights into
the compiled binary via `include_bytes!`; the **WeSpeaker** ONNX is
the only artifact callers must download separately. This repo lets
callers grab any individual model — or the whole bundle — without
spelunking through the upstream pyannote / WeSpeaker repos.
> **Attribution: this is a redistribution, not new model training.**
> All weights come from upstream pyannote / WeSpeaker / BUT Speech@FIT.
> The licenses below MUST be preserved by anyone redistributing.
## Files
| File | Size | Format | License |
|---|---:|---|---|
| `segmentation-3.0.onnx` | 5.99 MiB | ONNX (single file) | MIT |
| `wespeaker_resnet34_lm.onnx` | 256 KiB | ONNX header (external data) | Apache-2.0 |
| `wespeaker_resnet34_lm.onnx.data` | 25.3 MiB | external-data weights | Apache-2.0 |
| `wespeaker_resnet34_lm_packed.onnx` | 25.5 MiB | ONNX (single file, repacked) | Apache-2.0 |
| `wespeaker_resnet34_lm.pt` | 25.6 MiB | TorchScript | Apache-2.0 |
| `plda/eigenvectors_desc.bin` | 128 KiB | f64 (128×128 row-major) | CC-BY-4.0 |
| `plda/lda.bin` | 256 KiB | f64 (256×128 row-major) | CC-BY-4.0 |
| `plda/mean1.bin` | 2 KiB | f64 (256,) | CC-BY-4.0 |
| `plda/mean2.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/mu.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/phi_desc.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/psi.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/tr.bin` | 128 KiB | f64 (128×128 row-major) | CC-BY-4.0 |
| `plda/plda.npz` | 131 KiB | numpy (`mu`, `tr`, `psi`) | CC-BY-4.0 |
| `plda/xvec_transform.npz` | 131 KiB | numpy (`mean1`, `mean2`, `lda`) | CC-BY-4.0 |
## Which file do I want?
### Segmentation
Use `segmentation-3.0.onnx`. It feeds `dia::segment::SegmentModel`
(or any pyannote-segmentation-compatible runtime). Single file, no
external data, works on every ORT execution provider.
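The "powerset" in the model name refers to how the network encodes overlap: instead of independent per-speaker sigmoids, each frame is classified into one subset of up to three local speakers (at most two active at once), giving 7 classes. A minimal sketch of decoding that output, assuming the upstream class ordering (∅, {1}, {2}, {3}, {1,2}, {1,3}, {2,3}); the actual decoding lives inside `dia::segment::SegmentModel`:

```rust
/// Powerset classes for segmentation-3.0: 3 local speakers, at most 2
/// active per frame. Class ordering here is an assumption based on the
/// upstream model card, not read from the ONNX graph.
const POWERSET: [[bool; 3]; 7] = [
    [false, false, false], // ∅
    [true, false, false],  // {1}
    [false, true, false],  // {2}
    [false, false, true],  // {3}
    [true, true, false],   // {1,2}
    [true, false, true],   // {1,3}
    [false, true, true],   // {2,3}
];

/// Argmax over one frame's 7 class logits, mapped back to
/// per-speaker activity.
fn decode_frame(logits: &[f64; 7]) -> [bool; 3] {
    let best = logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap();
    POWERSET[best]
}

fn main() {
    // A frame where the "{1,2}" overlap class dominates.
    let logits = [0.1, 0.3, 0.2, 0.0, 2.5, 0.4, 0.1];
    println!("{:?}", decode_frame(&logits));
}
```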
### Embedding (WeSpeaker)
Three forms, same weights, pick by use case:
- **`wespeaker_resnet34_lm.onnx` + `wespeaker_resnet34_lm.onnx.data`**
— the default ONNX layout. Loads on CPU / TensorRT / CUDA / OpenVINO
/ DirectML. The `.onnx` and `.onnx.data` files MUST sit next to
each other on disk; ORT resolves the external pointer by relative
path.
- **`wespeaker_resnet34_lm_packed.onnx`** — same model with all
weights inlined into one file. Use this if you want a single-file
artifact, or if the runtime is **CoreML** (Apple Silicon — Apple's
graph optimizer chokes on external initializers and reports
`model_path must not be empty`; the packed form sidesteps it).
Otherwise functionally identical.
- **`wespeaker_resnet34_lm.pt`** — TorchScript export for the
`tch` backend. Bit-exact to upstream PyTorch on hard cases (heavy-
overlap fixtures where the ONNX→ORT path can drift by O(1) per
element). Pulls in libtorch (~600 MB shared library).
### PLDA
The eight `.bin` files are the runtime data — raw little-endian f64
blobs that `dia::plda` embeds via `include_bytes!`. The two `.npz`
files are the build-time sources (`xvec_transform.npz` exposes
`mean1` / `mean2` / `lda`; `plda.npz` exposes `mu` / `tr` /
`psi`); they are mirrored from the upstream pyannote-community-1
snapshot for traceability and so the `.bin` extraction can be
re-run via `scripts/extract-plda-blobs.sh` in the dia repo.
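The `.bin` layout is simple enough to read with the standard library alone. A minimal sketch of decoding a blob and indexing it row-major (e.g. 128×128 for `tr.bin`); `dia::plda` embeds these via `include_bytes!` rather than reading files, so this only illustrates the on-disk format:

```rust
/// Decode a raw little-endian f64 blob (the `plda/*.bin` layout).
fn read_f64_le(bytes: &[u8]) -> Vec<f64> {
    bytes
        .chunks_exact(8)
        .map(|c| f64::from_le_bytes(c.try_into().unwrap()))
        .collect()
}

/// Row-major index into a flat matrix with `cols` columns.
fn at(m: &[f64], cols: usize, row: usize, col: usize) -> f64 {
    m[row * cols + col]
}

fn main() {
    // In practice: read_f64_le(&std::fs::read("plda/mu.bin")?)
    // (mu.bin is 128 f64 values = 1,024 bytes). Demo on synthetic data:
    let demo: Vec<u8> = [1.0f64, 2.0, 3.0, 4.0]
        .iter()
        .flat_map(|v| v.to_le_bytes())
        .collect();
    let m = read_f64_le(&demo);
    // Row 1, col 0 of a 2×2 row-major matrix:
    println!("{}", at(&m, 2, 1, 0));
}
```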
`eigenvectors_desc.bin` and `phi_desc.bin` are scipy-derived
eigenvectors of the PLDA generalized eigenproblem `(B, W)` — pinned
to avoid LAPACK eigenvector-sign indeterminism (which produced a
38% DER divergence on three-speaker fixtures when nalgebra and
scipy disagreed on 67 of 128 column signs). See
[`models/plda/SOURCE.md`](https://github.com/al8n/diarization/blob/main/models/plda/SOURCE.md)
in the dia repo for the regeneration procedure.
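To make the sign problem concrete: eigenvectors are only defined up to sign, so two LAPACK backends can legitimately return column-wise negations of each other. One common pinning convention (flip any column whose largest-magnitude entry is negative) looks like the sketch below; this is illustrative of the technique, not necessarily the exact rule used to produce `eigenvectors_desc.bin` (see `SOURCE.md` for that):

```rust
/// Pin eigenvector signs in a row-major `rows × cols` matrix by
/// flipping any column whose largest-magnitude entry is negative.
/// One common convention for LAPACK sign indeterminism; illustrative
/// only, not necessarily the rule behind `eigenvectors_desc.bin`.
fn pin_column_signs(m: &mut [f64], rows: usize, cols: usize) {
    for c in 0..cols {
        // Find the entry with the largest absolute value in column c.
        let mut best = 0.0f64;
        for r in 0..rows {
            let v = m[r * cols + c];
            if v.abs() > best.abs() {
                best = v;
            }
        }
        // Negative dominant entry: flip the whole column.
        if best < 0.0 {
            for r in 0..rows {
                m[r * cols + c] = -m[r * cols + c];
            }
        }
    }
}

fn main() {
    let mut evecs = vec![-3.0, 1.0, 2.0, -0.5]; // 2×2 row-major demo
    pin_column_signs(&mut evecs, 2, 2);
    println!("{evecs:?}");
}
```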
## Provenance
### segmentation-3.0.onnx
- **Upstream:** [`pyannote/segmentation-3.0`](https://huggingface.co/pyannote/segmentation-3.0)
- **Original layout:** `pytorch_model.onnx` in the upstream HF repo.
- **License:** MIT — Copyright (c) 2023 CNRS
- **Author:** Hervé Bredin (CNRS / IRIT), pyannote.audio author and
lead trainer.
- **SHA-256:** `057ee564753071c0b09b5b611648b50ac188d50846bff5f01e9f7bbf1591ea25`
### wespeaker_resnet34_lm.onnx (+ .data) / .pt / _packed.onnx
- **Upstream model architecture:** WeSpeaker ResNet34 with
large-margin (LM) angular fine-tuning, trained on VoxCeleb-2.
- **Upstream sources:**
- [WeSpeaker project](https://github.com/wenet-e2e/wespeaker) (Apache-2.0)
- [`onnx-community/wespeaker_resnet34_lm`](https://huggingface.co/onnx-community/wespeaker_resnet34_lm)
for the ONNX export.
- **License:** Apache-2.0.
- **`_packed.onnx` derivative:** produced by loading
`wespeaker_resnet34_lm.onnx` + `.onnx.data` via the `onnx` Python
library (`onnx.load(path, load_external_data=True)`) and re-saving
with `save_as_external_data=False`. Same weights, no external file.
### plda/
- **Upstream:** [`pyannote/speaker-diarization-community-1`](https://huggingface.co/pyannote/speaker-diarization-community-1)
- **License:** CC-BY-4.0
- **Snapshot revision:** `3533c8cf8e369892e6b79ff1bf80f7b0286a54ee`
- **Original layout in the upstream HF repo:**
`plda/xvec_transform.npz` and `plda/plda.npz`.
- **Attribution (per upstream `plda/README.md`):**
PLDA model trained by [BUT Speech@FIT](https://speech.fit.vut.cz/);
integration of VBx in pyannote.audio by Jiangyu Han and Petr Pálka.
## Usage
### From `dia` (Rust)
```rust
use diarization::{
    embed::EmbedModel,
    plda::PldaTransform,
    segment::SegmentModel,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Segmentation + PLDA are bundled by default — no download needed.
    let mut seg = SegmentModel::bundled()?;
    let plda = PldaTransform::new()?;

    // WeSpeaker is BYO; download from this repo.
    let mut emb = EmbedModel::from_file("wespeaker_resnet34_lm.onnx")?;
    Ok(())
}
```
### Direct download
```bash
# whole bundle
hf download FinDIT-Studio/dia-models --local-dir ./dia-models
# just the embedding model (default ONNX form)
hf download FinDIT-Studio/dia-models \
wespeaker_resnet34_lm.onnx wespeaker_resnet34_lm.onnx.data \
--local-dir ./models
# CoreML-friendly single-file form
hf download FinDIT-Studio/dia-models \
wespeaker_resnet34_lm_packed.onnx --local-dir ./models
```
## Licenses
This repository **redistributes** model artifacts under three different
licenses. Each artifact retains its upstream license. By using this
bundle you agree to comply with **all three**:
- **MIT** for `segmentation-3.0.onnx` (Copyright © 2023 CNRS, Hervé Bredin).
See `LICENSE.MIT`.
- **Apache-2.0** for the WeSpeaker artifacts. See `LICENSE.APACHE-2.0`.
- **CC-BY-4.0** for everything under `plda/`. See `LICENSE.CC-BY-4.0`.
Required attribution: *PLDA model trained by BUT Speech@FIT;
integration of VBx in pyannote.audio by Jiangyu Han and Petr Pálka.*
The `dia` Rust crate that consumes these models is itself dual-licensed
MIT OR Apache-2.0; that licensing applies to the source code, not to the
model weights bundled here.
## Citation
If you use these weights in academic work, please cite the upstream
papers / model cards:
- **Segmentation-3.0:** Hervé Bredin, *pyannote.audio 2.1 speaker
diarization pipeline: principle, benchmark, and recipe*, Interspeech
2023.
- **WeSpeaker:** Wang et al., *WeSpeaker: A research and production
oriented speaker embedding learning toolkit*, ICASSP 2023.
- **PLDA / VBx:** Landini et al., *Bayesian HMM clustering of x-vector
sequences (VBx) in speaker diarization: theory, implementation and
analysis on standard tasks*, Computer Speech & Language, 2022.
## Issues / questions
This repo is a **redistribution** of upstream artifacts. Please file
issues against:
- The dia Rust crate: <https://github.com/al8n/diarization/issues>
- The pyannote.audio project: <https://github.com/pyannote/pyannote-audio/issues>
- The WeSpeaker project: <https://github.com/wenet-e2e/wespeaker/issues>