Duplicate from aufklarer/Pyannote-Segmentation-MLX

license: mit
tags:
  - mlx
  - voice-activity-detection
  - speaker-segmentation
  - speaker-diarization
  - pyannote
  - apple-silicon
base_model: pyannote/segmentation-3.0
library_name: mlx
pipeline_tag: voice-activity-detection

Pyannote Segmentation 3.0 — MLX

MLX-compatible weights for pyannote/segmentation-3.0 (PyanNet), converted from the official PyTorch Lightning checkpoint with pre-computed SincNet filters.

Model

PyanNet is a speaker segmentation model (~1.5M params) that processes 10-second audio windows and outputs 7-class powerset probabilities covering up to 3 speakers per window, with at most 2 active at the same time. It is used for both voice activity detection (binary speech/non-speech) and speaker diarization (per-speaker activity).

Architecture: SincNet → BiLSTM(4 layers) → Linear(2 layers) → 7-class softmax

Output classes: non-speech, spk1, spk2, spk3, spk1+2, spk1+3, spk2+3
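To make the powerset encoding concrete, here is a minimal Python sketch of how one output frame of 7 class probabilities could be reduced to a speech probability and per-speaker activity. The helper names (`speech_probability`, `speaker_probabilities`) and the example numbers are illustrative assumptions, not part of this repository; the Swift library may decode differently (for example, by taking the argmax class).

```python
# Powerset classes in the order listed above.
# Each entry is the tuple of speaker indices active in that class.
POWERSET = [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]


def speech_probability(frame_probs):
    """Binary VAD score for one frame: everything except the non-speech class."""
    return 1.0 - frame_probs[0]


def speaker_probabilities(frame_probs, num_speakers=3):
    """Marginal activity per speaker: sum the classes that include that speaker."""
    activity = [0.0] * num_speakers
    for prob, speakers in zip(frame_probs, POWERSET):
        for speaker in speakers:
            activity[speaker] += prob
    return activity


# Hypothetical softmax output for a single frame (sums to 1.0).
frame = [0.10, 0.60, 0.05, 0.05, 0.15, 0.03, 0.02]

print(speech_probability(frame))     # speech vs. non-speech
print(speaker_probabilities(frame))  # activity for spk1, spk2, spk3
```

Because the overlap classes (spk1+2, spk1+3, spk2+3) contribute to two speakers each, the per-speaker values can sum to more than 1.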

Usage (Swift / MLX)

import SpeechVAD

// Voice Activity Detection
let vad = try await PyannoteVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for seg in segments {
    print("Speech: \(seg.startTime)s - \(seg.endTime)s")
}

// Speaker Diarization (with WeSpeaker embeddings)
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}

This model is part of qwen3-asr-swift.

Conversion

python3 scripts/convert_pyannote.py --token YOUR_HF_TOKEN --upload

The script converts the gated pyannote/segmentation-3.0 checkpoint using a custom unpickler, so pyannote.audio does not need to be installed. Key transformations:

SincNet: pre-compute 80 sinc bandpass filters (40 cos + 40 sin) from 40 learned (low_hz, band_hz) parameter pairs
Conv1d: transpose weights [O, I, K] → [O, K, I] for MLX channels-last
BiLSTM: split into forward/backward stacks, sum bias_ih + bias_hh
Linear/classifier: kept as-is
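The Conv1d and BiLSTM steps above can be sketched on plain nested lists. This is an illustration only, not the code in `scripts/convert_pyannote.py`: the function names are hypothetical, and a real converter would most likely work on arrays (with NumPy, the layout change is a transpose with axes `(0, 2, 1)`). Summing the two LSTM biases is safe because PyTorch adds `bias_ih` and `bias_hh` to the same gate pre-activation.

```python
def transpose_conv1d(weight):
    """Reorder a Conv1d weight from [O, I, K] (PyTorch) to [O, K, I] (channels-last)."""
    # For each output channel, zip(*...) turns I rows of K taps into K rows of I values.
    return [[list(taps) for taps in zip(*out_channel)] for out_channel in weight]


def fuse_lstm_bias(bias_ih, bias_hh):
    """Merge the two PyTorch LSTM bias vectors into one by elementwise addition."""
    return [a + b for a, b in zip(bias_ih, bias_hh)]


# Toy weight with O=2 output channels, I=3 input channels, K=2 kernel taps.
w_torch = [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 10], [11, 12]],
]
w_mlx = transpose_conv1d(w_torch)  # shape is now [2, 2, 3]

bias = fuse_lstm_bias([1.0, 2.0], [0.5, -2.0])
```

In this layout, `w_mlx[o][k][i]` holds the same value as `w_torch[o][i][k]`.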

Weight Mapping

| PyTorch Key | MLX Key | Shape |
|---|---|---|
| `sincnet.conv1d.0.filterbank.*` (computed) | `sincnet.conv.0.weight` | [80, 251, 1] |
| `sincnet.conv1d.{1,2}.weight` | `sincnet.conv.{1,2}.weight` | [O, K, I] |
| `sincnet.norm1d.{0-2}.*` | `sincnet.norm.{0-2}.*` | varies |
| `lstm.weight_ih_l{i}` | `lstm_fwd.layers.{i}.Wx` | [512, I] |
| `lstm.weight_hh_l{i}` | `lstm_fwd.layers.{i}.Wh` | [512, 128] |
| `lstm.bias_ih_l{i} + bias_hh_l{i}` | `lstm_fwd.layers.{i}.bias` | [512] |
| `lstm.*_reverse` | `lstm_bwd.layers.{i}.*` | same |
| `linear.{0,1}.*` | `linear.{0,1}.*` | varies |
| `classifier.*` | `classifier.*` | [7, 128] |
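The pure renames in this mapping can be expressed as a small key-translation function. The sketch below is an assumption about how such a mapping might be written, not the actual conversion script, and `map_key` is a hypothetical name. It leaves out the two rows that need computation rather than renaming: the pre-computed SincNet filterbank and the summed LSTM biases.

```python
import re

# Matches e.g. "lstm.weight_ih_l2" or "lstm.weight_hh_l0_reverse".
_LSTM_KEY = re.compile(r"^lstm\.(weight_ih|weight_hh)_l(\d+)(_reverse)?$")
_LSTM_SUFFIX = {"weight_ih": "Wx", "weight_hh": "Wh"}


def map_key(pt_key):
    """Translate a PyTorch state-dict key to the MLX key from the table."""
    match = _LSTM_KEY.match(pt_key)
    if match:
        kind, layer, reverse = match.groups()
        stack = "lstm_bwd" if reverse else "lstm_fwd"
        return f"{stack}.layers.{layer}.{_LSTM_SUFFIX[kind]}"
    if pt_key.startswith("sincnet.conv1d."):
        return pt_key.replace("sincnet.conv1d.", "sincnet.conv.", 1)
    if pt_key.startswith("sincnet.norm1d."):
        return pt_key.replace("sincnet.norm1d.", "sincnet.norm.", 1)
    # linear.* and classifier.* keep their names.
    return pt_key


for key in ["lstm.weight_ih_l0", "lstm.weight_hh_l3_reverse",
            "sincnet.conv1d.1.weight", "classifier.weight"]:
    print(key, "->", map_key(key))
```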

License

The original pyannote segmentation model is released under the MIT License.