Duplicate from aufklarer/Pyannote-Segmentation-MLX
license: mit
tags:
  - mlx
  - voice-activity-detection
  - speaker-segmentation
  - speaker-diarization
  - pyannote
  - apple-silicon
base_model: pyannote/segmentation-3.0
library_name: mlx
pipeline_tag: voice-activity-detection

Pyannote Segmentation 3.0 — MLX

MLX-compatible weights for pyannote/segmentation-3.0 (PyanNet), converted from the official PyTorch Lightning checkpoint. The SincNet filters are pre-computed during conversion.

Model

PyanNet is a speaker segmentation model (~1.5M parameters) that processes 10-second audio windows and outputs 7-class powerset probabilities covering up to 3 speakers per window, with at most 2 speaking simultaneously. It is used for both voice activity detection (binary speech/non-speech) and speaker diarization (per-speaker activity).
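As a rough illustration of the 10-second windowing, the sketch below splits a mono sample buffer into fixed-length windows and zero-pads the last one. The 16 kHz rate matches the usage example in this card; the function name `split_into_windows`, the `hop_seconds` parameter, and the zero-padding policy are assumptions made for illustration, not details taken from the converted model or the Swift package.

```python
from typing import List

SAMPLE_RATE = 16_000      # Hz, as in the usage example
WINDOW_SECONDS = 10       # window length stated in the model description


def split_into_windows(samples: List[float],
                       hop_seconds: float = 10.0) -> List[List[float]]:
    """Split mono audio into 10 s windows, zero-padding the final window.

    hop_seconds is a hypothetical parameter: 10.0 gives non-overlapping
    windows, smaller values give overlapping ones.
    """
    window = SAMPLE_RATE * WINDOW_SECONDS          # 160_000 samples
    hop = int(SAMPLE_RATE * hop_seconds)
    if hop <= 0:
        raise ValueError("hop_seconds must be positive")

    windows = []
    start = 0
    while True:
        chunk = list(samples[start:start + window])
        chunk.extend([0.0] * (window - len(chunk)))  # pad the tail with silence
        windows.append(chunk)
        if start + window >= len(samples):           # this window reached the end
            break
        start += hop
    return windows
```

Each returned window has exactly 160,000 samples, so it can be passed to a model that expects a fixed 10-second input. How overlapping windows are merged afterwards is outside the scope of this sketch.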

Architecture: SincNet → BiLSTM(4 layers) → Linear(2 layers) → 7-class softmax

Output classes: non-speech, spk1, spk2, spk3, spk1+2, spk1+3, spk2+3
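To make the powerset encoding concrete, here is a minimal sketch of how one frame of 7-class probabilities could be turned into a binary VAD score and per-speaker activity scores, using the class order listed above. It marginalises the powerset distribution in the usual way; whether the Swift package decodes frames exactly like this is not stated here, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Speaker sets for the 7 powerset classes, in the order listed above.
POWERSET_CLASSES: List[Tuple[int, ...]] = [
    (),       # non-speech
    (0,),     # spk1
    (1,),     # spk2
    (2,),     # spk3
    (0, 1),   # spk1+2
    (0, 2),   # spk1+3
    (1, 2),   # spk2+3
]


def speech_probability(frame_probs: Sequence[float]) -> float:
    """VAD score: probability that at least one speaker is active."""
    return 1.0 - frame_probs[0]


def speaker_probabilities(frame_probs: Sequence[float]) -> List[float]:
    """Per-speaker activity: sum of the classes that include each speaker."""
    per_speaker = [0.0, 0.0, 0.0]
    for prob, speakers in zip(frame_probs, POWERSET_CLASSES):
        for s in speakers:
            per_speaker[s] += prob
    return per_speaker


def active_speakers(frame_probs: Sequence[float]) -> Tuple[int, ...]:
    """Hard decision: the speaker set of the most probable class."""
    best = max(range(len(POWERSET_CLASSES)), key=lambda i: frame_probs[i])
    return POWERSET_CLASSES[best]
```

For a frame such as `[0.05, 0.10, 0.05, 0.0, 0.70, 0.05, 0.05]`, the speech probability is about 0.95, the per-speaker scores are about 0.85, 0.80 and 0.10, and the hard decision is speakers 1 and 2 overlapping (indices `(0, 1)`).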

Usage (Swift / MLX)

import SpeechVAD

// Voice Activity Detection
let vad = try await PyannoteVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for seg in segments {
    print("Speech: \(seg.startTime)s - \(seg.endTime)s")
}

// Speaker Diarization (with WeSpeaker embeddings)
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}

Part of qwen3-asr-swift.

Conversion

python3 scripts/convert_pyannote.py --token YOUR_HF_TOKEN --upload

Converts the gated pyannote/segmentation-3.0 checkpoint using a custom unpickler (no pyannote.audio dependency required). Key transformations:

  • SincNet: pre-compute 80 sinc bandpass filters (40 cos + 40 sin) from 40 learned (low_hz, band_hz) parameter pairs
  • Conv1d: transpose weights [O, I, K] → [O, K, I] for MLX channels-last
  • BiLSTM: split into forward/backward stacks, sum bias_ih + bias_hh
  • Linear/classifier: kept as-is
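Two of the transformations above are simple enough to sketch in plain Python on nested lists. This is an illustration of the operations only, not the code in scripts/convert_pyannote.py; the function names are hypothetical. Summing the two LSTM biases is lossless because PyTorch adds both `bias_ih` and `bias_hh` to the same gate pre-activation.

```python
from typing import List

Tensor3 = List[List[List[float]]]


def conv1d_to_channels_last(weight: Tensor3) -> Tensor3:
    """Transpose a Conv1d weight from [O, I, K] to [O, K, I]."""
    return [
        # For each kernel position k, gather the value from every input channel.
        [[in_ch[k] for in_ch in out_ch] for k in range(len(out_ch[0]))]
        for out_ch in weight
    ]


def fuse_lstm_bias(bias_ih: List[float], bias_hh: List[float]) -> List[float]:
    """Combine PyTorch's two LSTM bias vectors into a single bias."""
    if len(bias_ih) != len(bias_hh):
        raise ValueError("bias_ih and bias_hh must have the same length")
    return [a + b for a, b in zip(bias_ih, bias_hh)]
```

With NumPy arrays the same transpose is `weight.transpose(0, 2, 1)` and the bias fusion is a plain element-wise addition.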

Weight Mapping

| PyTorch Key | MLX Key | Shape |
|---|---|---|
| sincnet.conv1d.0.filterbank.* (computed) | sincnet.conv.0.weight | [80, 251, 1] |
| sincnet.conv1d.{1,2}.weight | sincnet.conv.{1,2}.weight | [O, K, I] |
| sincnet.norm1d.{0-2}.* | sincnet.norm.{0-2}.* | varies |
| lstm.weight_ih_l{i} | lstm_fwd.layers.{i}.Wx | [512, I] |
| lstm.weight_hh_l{i} | lstm_fwd.layers.{i}.Wh | [512, 128] |
| lstm.bias_ih_l{i} + bias_hh_l{i} | lstm_fwd.layers.{i}.bias | [512] |
| lstm.*_reverse | lstm_bwd.layers.{i}.* | same |
| linear.{0,1}.* | linear.{0,1}.* | varies |
| classifier.* | classifier.* | [7, 128] |
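The renaming rules in this mapping can be expressed as a small key-translation function. The sketch below is an assumption-laden illustration rather than the conversion script itself: the name `map_key` is hypothetical, keys not covered by the mapping are passed through unchanged, and filterbank parameters return `None` because they are replaced by the pre-computed filter weight.

```python
import re
from typing import Optional

# Matches e.g. "lstm.weight_ih_l2" or "lstm.bias_hh_l0_reverse".
_LSTM_KEY = re.compile(
    r"^lstm\.(weight_ih|weight_hh|bias_ih|bias_hh)_l(\d+)(_reverse)?$"
)
_LSTM_NAMES = {
    "weight_ih": "Wx",
    "weight_hh": "Wh",
    "bias_ih": "bias",   # both bias vectors map to one key;
    "bias_hh": "bias",   # the caller is expected to sum them
}


def map_key(pt_key: str) -> Optional[str]:
    """Return the MLX key for a PyTorch key, or None for filterbank params."""
    if pt_key.startswith("sincnet.conv1d.0.filterbank."):
        return None  # superseded by the pre-computed sincnet.conv.0.weight

    match = _LSTM_KEY.match(pt_key)
    if match:
        kind, layer, reverse = match.groups()
        stack = "lstm_bwd" if reverse else "lstm_fwd"
        return f"{stack}.layers.{layer}.{_LSTM_NAMES[kind]}"

    if pt_key.startswith("sincnet.conv1d."):
        return pt_key.replace("sincnet.conv1d.", "sincnet.conv.", 1)
    if pt_key.startswith("sincnet.norm1d."):
        return pt_key.replace("sincnet.norm1d.", "sincnet.norm.", 1)

    return pt_key  # linear.* and classifier.* keep their names
```

Because `bias_ih` and `bias_hh` resolve to the same MLX key, a converter built on this function would accumulate the two tensors under that key instead of overwriting one with the other.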

License

The original pyannote segmentation model is released under the MIT License.