---
license: other
license_name: mixed-mit-cc-by-4-apache-2
license_link: LICENSE
language:
- en
- multilingual
library_name: onnx
tags:
- speaker-diarization
- diarization
- pyannote
- speaker-embedding
- wespeaker
- segmentation
pipeline_tag: voice-activity-detection
---

# dia-models — pyannote community-1 model bundle for the `dia` Rust crate


A single-repo distribution of every model artifact the
[`dia`](https://github.com/al8n/diarization) Rust crate needs to run
end-to-end speaker diarization with **pyannote-community-1** parity:


- The **segmentation-3.0** powerset speaker network (16 kHz audio →
  per-frame speaker activations).
- The **WeSpeaker ResNet34-LM** speaker-embedding network, in three
  forms (external-data ONNX, single-file ONNX, TorchScript).
- The **PLDA** whitening + LDA weights from the
  [`pyannote/speaker-diarization-community-1`](https://huggingface.co/pyannote/speaker-diarization-community-1)
  pipeline, in both `.npz` (build-time) and raw little-endian f64
  `.bin` (runtime) form.


`dia` already embeds the segmentation model and the PLDA weights into
the compiled binary via `include_bytes!`; the **WeSpeaker** ONNX is
the only artifact callers must download separately. This repo lets
callers grab any individual model — or the whole bundle — without
spelunking through the upstream pyannote / WeSpeaker repos.


> **Attribution: this is a redistribution, not new model training.**
> All weights come from upstream pyannote / WeSpeaker / BUT Speech@FIT.
> The licenses below MUST be preserved by anyone redistributing.


## Files


| File | Size | Format | License |
|---|---:|---|---|
| `segmentation-3.0.onnx` | 5.99 MiB | ONNX (single file) | MIT |
| `wespeaker_resnet34_lm.onnx` | 256 KiB | ONNX header (external data) | Apache-2.0 |
| `wespeaker_resnet34_lm.onnx.data` | 25.3 MiB | external-data weights | Apache-2.0 |
| `wespeaker_resnet34_lm_packed.onnx` | 25.5 MiB | ONNX (single file, repacked) | Apache-2.0 |
| `wespeaker_resnet34_lm.pt` | 25.6 MiB | TorchScript | Apache-2.0 |
| `plda/eigenvectors_desc.bin` | 128 KiB | f64 (128×128 row-major) | CC-BY-4.0 |
| `plda/lda.bin` | 256 KiB | f64 (256×128 row-major) | CC-BY-4.0 |
| `plda/mean1.bin` | 2 KiB | f64 (256,) | CC-BY-4.0 |
| `plda/mean2.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/mu.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/phi_desc.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/psi.bin` | 1 KiB | f64 (128,) | CC-BY-4.0 |
| `plda/tr.bin` | 128 KiB | f64 (128×128 row-major) | CC-BY-4.0 |
| `plda/plda.npz` | 131 KiB | numpy (`mu`, `tr`, `psi`) | CC-BY-4.0 |
| `plda/xvec_transform.npz` | 131 KiB | numpy (`mean1`, `mean2`, `lda`) | CC-BY-4.0 |
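
The `.bin` layout is deliberately trivial: raw row-major little-endian f64 with no
header, so the byte count alone pins down the shape (128 × 128 × 8 B = 131072 B =
128 KiB for `tr.bin`). A minimal round-trip sketch with synthetic data, not the
actual weights:

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for plda/tr.bin: a 128x128 row-major matrix of
# little-endian f64, exactly as the table above describes.
mat = np.arange(128 * 128, dtype="<f8").reshape(128, 128)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "tr.bin")
    mat.tofile(path)  # raw bytes, no header: 128 * 128 * 8 = 131072 B
    assert os.path.getsize(path) == 128 * 128 * 8

    # Reading it back is a single fromfile + reshape.
    back = np.fromfile(path, dtype="<f8").reshape(128, 128)
    assert np.array_equal(back, mat)
```

The same `fromfile` + `reshape` pattern reads any of the eight blobs, given the
shapes in the table above.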


## Which file do I want?


### Segmentation
Use `segmentation-3.0.onnx`. It feeds `dia::segment::SegmentModel`
(or any pyannote-segmentation-compatible runtime). Single file, no
external data, works on every ORT execution provider.


### Embedding (WeSpeaker)
Three forms, same weights, pick by use case:


- **`wespeaker_resnet34_lm.onnx` + `wespeaker_resnet34_lm.onnx.data`**
  — the default ONNX layout. Loads on CPU / TensorRT / CUDA / OpenVINO
  / DirectML. The `.onnx` and `.onnx.data` files MUST sit next to
  each other on disk; ORT resolves the external pointer by relative
  path.
- **`wespeaker_resnet34_lm_packed.onnx`** — same model with all
  weights inlined into one file. Use this if you want a single-file
  artifact, or if the runtime is **CoreML** (Apple Silicon — Apple's
  graph optimizer chokes on external initializers and reports
  `model_path must not be empty`; the packed form sidesteps it).
  Otherwise functionally identical.
- **`wespeaker_resnet34_lm.pt`** — TorchScript export for the
  `tch` backend. Bit-exact to upstream PyTorch on hard cases
  (heavy-overlap fixtures where the ONNX→ORT path can drift by O(1)
  per element). Pulls in libtorch (~600 MB shared library).


### PLDA
The eight `.bin` files are the runtime data — raw little-endian f64
blobs that `dia::plda` embeds via `include_bytes!`. The two `.npz`
files are the build-time sources (`xvec_transform.npz` exposes
`mean1` / `mean2` / `lda`; `plda.npz` exposes `mu` / `tr` /
`psi`); they are mirrored from the upstream pyannote-community-1
snapshot for traceability and so the `.bin` extraction can be
re-run via `scripts/extract-plda-blobs.sh` in the dia repo.
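
The extraction amounts to loading each `.npz` key and dumping its raw little-endian
f64 buffer. A sketch with synthetic arrays (the canonical procedure is
`scripts/extract-plda-blobs.sh` in the dia repo; key names here match the `.npz`
layout described above):

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as d:
    # Synthetic stand-in for plda/plda.npz with the documented keys.
    npz_path = os.path.join(d, "plda.npz")
    np.savez(npz_path, mu=np.zeros(128), tr=np.eye(128), psi=np.ones(128))

    data = np.load(npz_path)
    for key in ("mu", "tr", "psi"):
        # Force little-endian f64 and dump the raw row-major buffer.
        arr = np.ascontiguousarray(data[key], dtype="<f8")
        arr.tofile(os.path.join(d, f"{key}.bin"))

    # Sizes line up with the Files table: tr is 128x128, psi is (128,).
    assert os.path.getsize(os.path.join(d, "tr.bin")) == 128 * 128 * 8
    assert os.path.getsize(os.path.join(d, "psi.bin")) == 128 * 8
```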


`eigenvectors_desc.bin` and `phi_desc.bin` are scipy-derived
eigenvectors of the PLDA generalized eigenproblem `(B, W)` — pinned
to avoid LAPACK eigenvector-sign indeterminism (which produced a
38% DER divergence on three-speaker fixtures when nalgebra and
scipy disagreed on 67 of 128 column signs). See
[`models/plda/SOURCE.md`](https://github.com/al8n/diarization/blob/main/models/plda/SOURCE.md)
in the dia repo for the regeneration procedure.
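
The indeterminism exists because if `v` solves the generalized eigenproblem then so
does `-v`, so two LAPACK builds can legitimately return opposite column signs. A
small scipy sketch with synthetic `(B, W)` matrices; the pinning convention shown
(flip each column so its largest-magnitude entry is positive) is an illustrative
assumption, not necessarily the one used to generate the `*_desc.bin` files:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Small symmetric positive-definite stand-ins for the (B, W) pair.
A = rng.standard_normal((8, 8))
B = A @ A.T + 8 * np.eye(8)
A = rng.standard_normal((8, 8))
W = A @ A.T + 8 * np.eye(8)

# Generalized eigenproblem B v = lambda W v; columns of `vecs` are
# eigenvectors, each determined only up to sign.
vals, vecs = eigh(B, W)

# One pinning convention (an assumption): make the largest-|entry|
# of each column positive, so the choice of sign is deterministic.
idx = np.argmax(np.abs(vecs), axis=0)
signs = np.sign(vecs[idx, np.arange(vecs.shape[1])])
pinned = vecs * signs

# Both v and -v solve the eigenproblem; pinning only fixes the choice.
assert np.allclose(B @ pinned, (W @ pinned) * vals)
```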


## Provenance


### segmentation-3.0.onnx
- **Upstream:** [`pyannote/segmentation-3.0`](https://huggingface.co/pyannote/segmentation-3.0)
- **Original layout:** `pytorch_model.onnx` in the upstream HF repo.
- **License:** MIT — Copyright (c) 2023 CNRS
- **Author:** Hervé Bredin (CNRS / IRIT), pyannote.audio author and
  lead trainer.
- **SHA-256:** `057ee564753071c0b09b5b611648b50ac188d50846bff5f01e9f7bbf1591ea25`


### wespeaker_resnet34_lm.onnx (+ .data) / .pt / _packed.onnx
- **Upstream model architecture:** WeSpeaker ResNet34 with
  large-margin (LM) angular fine-tuning, trained on VoxCeleb-2.
- **Upstream sources:**
  - [WeSpeaker project](https://github.com/wenet-e2e/wespeaker) (Apache-2.0)
  - [`onnx-community/wespeaker_resnet34_lm`](https://huggingface.co/onnx-community/wespeaker_resnet34_lm)
    for the ONNX export.
- **License:** Apache-2.0.
- **`_packed.onnx` derivative:** produced by loading
  `wespeaker_resnet34_lm.onnx` + `.onnx.data` via the `onnx` Python
  library (`onnx.load(path, load_external_data=True)`) and re-saving
  with `save_as_external_data=False`. Same weights, no external file.


### plda/
- **Upstream:** [`pyannote/speaker-diarization-community-1`](https://huggingface.co/pyannote/speaker-diarization-community-1)
- **License:** CC-BY-4.0
- **Snapshot revision:** `3533c8cf8e369892e6b79ff1bf80f7b0286a54ee`
- **Original layout in the upstream HF repo:**
  `plda/xvec_transform.npz` and `plda/plda.npz`.
- **Attribution (per upstream `plda/README.md`):**
  PLDA model trained by [BUT Speech@FIT](https://speech.fit.vut.cz/);
  integration of VBx in pyannote.audio by Jiangyu Han and Petr Pálka.


## Usage


### From `dia` (Rust)
```rust
use diarization::{
    embed::EmbedModel,
    plda::PldaTransform,
    segment::SegmentModel,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Segmentation + PLDA are bundled by default — no download needed.
    let mut seg = SegmentModel::bundled()?;
    let plda = PldaTransform::new()?;
    // WeSpeaker is BYO; download from this repo.
    let mut emb = EmbedModel::from_file("wespeaker_resnet34_lm.onnx")?;
    Ok(())
}
```


### Direct download
```bash
# whole bundle
hf download FinDIT-Studio/dia-models --local-dir ./dia-models

# just the embedding model (default ONNX form)
hf download FinDIT-Studio/dia-models \
  wespeaker_resnet34_lm.onnx wespeaker_resnet34_lm.onnx.data \
  --local-dir ./models

# CoreML-friendly single-file form
hf download FinDIT-Studio/dia-models \
  wespeaker_resnet34_lm_packed.onnx --local-dir ./models
```


## Licenses


This repository **redistributes** model artifacts under three different
licenses. Each artifact retains its upstream license. By using this
bundle you agree to comply with **all three**:


- **MIT** for `segmentation-3.0.onnx` (Copyright © 2023 CNRS, Hervé Bredin).
  See `LICENSE.MIT`.
- **Apache-2.0** for the WeSpeaker artifacts. See `LICENSE.APACHE-2.0`.
- **CC-BY-4.0** for everything under `plda/`. See `LICENSE.CC-BY-4.0`.
  Required attribution: *PLDA model trained by BUT Speech@FIT;
  integration of VBx in pyannote.audio by Jiangyu Han and Petr Pálka.*


The `dia` Rust crate that consumes these models is itself dual-licensed
MIT OR Apache-2.0; that licensing applies to the source code, not to the
model weights bundled here.


## Citation


If you use these weights in academic work, please cite the upstream
papers / model cards:


- **Segmentation-3.0:** Hervé Bredin, *pyannote.audio 2.1 speaker
  diarization pipeline: principle, benchmark, and recipe*, Interspeech
  2023.
- **WeSpeaker:** Wang et al., *WeSpeaker: A research and production
  oriented speaker embedding learning toolkit*, ICASSP 2023.
- **PLDA / VBx:** Landini et al., *Bayesian HMM clustering of x-vector
  sequences (VBx) in speaker diarization: theory, implementation and
  analysis on standard tasks*, Computer Speech & Language, 2022.


## Issues / questions


This repo is a **redistribution** of upstream artifacts. Please file
issues against:


- The dia Rust crate: <https://github.com/al8n/diarization/issues>
- The pyannote.audio project: <https://github.com/pyannote/pyannote-audio/issues>
- The WeSpeaker project: <https://github.com/wenet-e2e/wespeaker/issues>