	---
	license: mit
	tags:
	- mlx
	- voice-activity-detection
	- speaker-segmentation
	- speaker-diarization
	- pyannote
	- apple-silicon
	base_model: pyannote/segmentation-3.0
	library_name: mlx
	pipeline_tag: voice-activity-detection
	---

	# Pyannote Segmentation 3.0 — MLX

	MLX-compatible weights for [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0) (PyanNet), converted from the official PyTorch Lightning checkpoint. The SincNet filters are pre-computed during conversion.

	## Model

	PyanNet is a speaker segmentation model (~1.5M parameters) that processes 10-second audio windows and outputs 7-class powerset probabilities for up to 3 speakers per window, of whom at most 2 can be active at the same time. It is used for both voice activity detection (binary speech/non-speech) and speaker diarization (per-speaker activity).

	Architecture: SincNet → BiLSTM(4 layers) → Linear(2 layers) → 7-class softmax

	Output classes: non-speech, spk1, spk2, spk3, spk1+2, spk1+3, spk2+3
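
The class list above can be turned into per-frame decisions with a small lookup table. The following is a minimal sketch, not code from the Swift package: the names `POWERSET_CLASSES`, `decode_frame`, and `speech_probability` are hypothetical, speaker indices are 0-based, and picking the most probable class (argmax) is an assumed decoding rule.

```python
# Class order follows the "Output classes" line of this card.
# Each entry lists the 0-based speakers that are active for that class.
POWERSET_CLASSES = [
    (),      # non-speech
    (0,),    # spk1
    (1,),    # spk2
    (2,),    # spk3
    (0, 1),  # spk1+2
    (0, 2),  # spk1+3
    (1, 2),  # spk2+3
]


def decode_frame(probs):
    """Return (is_speech, active_speakers) for one frame of 7 probabilities.

    Assumption: the frame is decoded by taking the most probable class.
    """
    if len(probs) != len(POWERSET_CLASSES):
        raise ValueError("expected 7 class probabilities")
    best = max(range(len(probs)), key=lambda i: probs[i])
    speakers = POWERSET_CLASSES[best]
    return len(speakers) > 0, speakers


def speech_probability(probs):
    """Binary VAD score: probability mass on the six speech classes."""
    return sum(probs[1:])
```

For voice activity detection only the speech/non-speech split matters, so `speech_probability` collapses the six speaker classes into one score; diarization keeps the speaker tuple.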

	## Usage (Swift / MLX)

	```swift
	import SpeechVAD

	// Voice Activity Detection
	let vad = try await PyannoteVADModel.fromPretrained()
	let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
	for seg in segments {
	    print("Speech: \(seg.startTime)s - \(seg.endTime)s")
	}

	// Speaker Diarization (with WeSpeaker embeddings)
	let pipeline = try await DiarizationPipeline.fromPretrained()
	let result = pipeline.diarize(audio: samples, sampleRate: 16000)
	for seg in result.segments {
	    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
	}
	```

	Part of [qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift).

	## Conversion

	```bash
	python3 scripts/convert_pyannote.py --token YOUR_HF_TOKEN --upload
	```

	The script converts the gated pyannote/segmentation-3.0 checkpoint using a custom unpickler, so pyannote.audio does not need to be installed. Key transformations:

	- SincNet: pre-compute 80 sinc bandpass filters (40 cos + 40 sin) from 40 learned `(low_hz, band_hz)` parameter pairs
	- Conv1d: transpose weights `[O, I, K]` → `[O, K, I]` for MLX channels-last
	- BiLSTM: split into forward/backward stacks, sum `bias_ih + bias_hh`
	- Linear/classifier: kept as-is
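
Two of the transformations above are plain array operations. The sketch below illustrates them with NumPy on toy shapes; it is not the contents of `scripts/convert_pyannote.py`, and the function names are hypothetical. Summing the biases is safe because an LSTM gate computes `W_ih x + b_ih + W_hh h + b_hh`, so the two bias vectors only ever appear as a sum.

```python
import numpy as np


def conv1d_to_channels_last(weight):
    """Reorder a PyTorch Conv1d weight [O, I, K] to channels-last [O, K, I]."""
    return np.ascontiguousarray(np.transpose(weight, (0, 2, 1)))


def merge_lstm_biases(bias_ih, bias_hh):
    """Fold the two PyTorch LSTM bias vectors into a single bias."""
    return bias_ih + bias_hh


# Toy tensors for illustration only (not the real checkpoint shapes).
w = np.arange(2 * 3 * 5, dtype=np.float32).reshape(2, 3, 5)  # [O=2, I=3, K=5]
w_mlx = conv1d_to_channels_last(w)                            # [O=2, K=5, I=3]

# 512 = 4 gates x 128 hidden units, matching the bias shape in this card.
bias = merge_lstm_biases(np.ones(512, dtype=np.float32),
                         np.full(512, 2.0, dtype=np.float32))
```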

	## Weight Mapping

	| PyTorch Key | MLX Key | Shape |
	|-------------|---------|-------|
	| `sincnet.conv1d.0.filterbank.*` (computed) | `sincnet.conv.0.weight` | [80, 251, 1] |
	| `sincnet.conv1d.{1,2}.weight` | `sincnet.conv.{1,2}.weight` | [O, K, I] |
	| `sincnet.norm1d.{0-2}.*` | `sincnet.norm.{0-2}.*` | varies |
	| `lstm.weight_ih_l{i}` | `lstm_fwd.layers.{i}.Wx` | [512, I] |
	| `lstm.weight_hh_l{i}` | `lstm_fwd.layers.{i}.Wh` | [512, 128] |
	| `lstm.bias_ih_l{i} + bias_hh_l{i}` | `lstm_fwd.layers.{i}.bias` | [512] |
	| `lstm.*_reverse` | `lstm_bwd.layers.{i}.*` | same |
	| `linear.{0,1}.*` | `linear.{0,1}.*` | varies |
	| `classifier.*` | `classifier.*` | [7, 128] |
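
The LSTM rows of the table can be read as a renaming rule. The helper below is a hypothetical sketch of that rule for the weight matrices only; it assumes the backward-direction keys use the same `Wx`/`Wh` suffixes as the forward ones (the table says "same") and is not taken from the conversion script. Bias keys are left out because they are summed pairwise rather than renamed one to one.

```python
import re

# PyTorch names LSTM weights like "lstm.weight_ih_l0" and, for the
# backward direction, "lstm.weight_ih_l0_reverse".
_LSTM_WEIGHT_KEY = re.compile(r"^lstm\.(weight_ih|weight_hh)_l(\d+)(_reverse)?$")
_SUFFIX = {"weight_ih": "Wx", "weight_hh": "Wh"}


def map_lstm_weight_key(key):
    """Map a PyTorch LSTM weight key to the MLX key used in the table."""
    match = _LSTM_WEIGHT_KEY.match(key)
    if match is None:
        raise KeyError(f"not an LSTM weight key: {key}")
    kind, layer, reverse = match.groups()
    stack = "lstm_bwd" if reverse else "lstm_fwd"
    return f"{stack}.layers.{layer}.{_SUFFIX[kind]}"
```

For example, `map_lstm_weight_key("lstm.weight_hh_l3_reverse")` yields `lstm_bwd.layers.3.Wh` under these assumptions.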

	## License

	The original pyannote segmentation model is released under the [MIT License](https://github.com/pyannote/pyannote-audio/blob/develop/LICENSE).