---
license: mit
tags:
- mlx
- voice-activity-detection
- speaker-segmentation
- speaker-diarization
- pyannote
- apple-silicon
base_model: pyannote/segmentation-3.0
library_name: mlx
pipeline_tag: voice-activity-detection
---

# Pyannote Segmentation 3.0 → MLX

MLX-compatible weights for [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0) (PyanNet), converted from the official PyTorch Lightning checkpoint with pre-computed SincNet filters.

## Model

PyanNet is a speaker segmentation model (~1.5M params) that processes 10-second audio windows and outputs 7-class powerset probabilities covering up to 3 speakers per window, with at most 2 of them active at the same time. It is used for both voice activity detection (binary) and speaker diarization (per-speaker).

**Architecture:** SincNet → BiLSTM (4 layers) → Linear (2 layers) → 7-class softmax

**Output classes:** non-speech, spk1, spk2, spk3, spk1+2, spk1+3, spk2+3

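To illustrate how the 7 powerset classes map back to a binary VAD score and to per-speaker activity, here is a minimal Python sketch. It assumes the class order listed above and uses a made-up softmax frame. The function names are hypothetical, and the Swift package may decode differently (for example by taking the arg-max class per frame).

```python
# Speakers contained in each class, in the order listed above.
# Index 0 is non-speech; speakers are numbered 0..2 here.
POWERSET = [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]

def speech_probability(frame):
    """Binary VAD score: all probability mass outside the non-speech class."""
    return 1.0 - frame[0]

def speaker_probabilities(frame, num_speakers=3):
    """Per-speaker activity: sum the classes that contain each speaker."""
    probs = [0.0] * num_speakers
    for p, speakers in zip(frame, POWERSET):
        for s in speakers:
            probs[s] += p
    return probs

# Made-up softmax output for a single frame (sums to 1).
frame = [0.10, 0.50, 0.05, 0.05, 0.20, 0.05, 0.05]
print(speech_probability(frame))
print(speaker_probabilities(frame))
```

For this frame the speech score is about 0.90, and spk1 dominates because it collects mass from the `spk1`, `spk1+2`, and `spk1+3` classes.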
## Usage (Swift / MLX)

```swift
import SpeechVAD

// `samples` is assumed to hold mono audio samples at 16 kHz.

// Voice Activity Detection
let vad = try await PyannoteVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for seg in segments {
    print("Speech: \(seg.startTime)s - \(seg.endTime)s")
}

// Speaker Diarization (with WeSpeaker embeddings)
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}
```

Part of [qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift).

## Conversion

```bash
python3 scripts/convert_pyannote.py --token YOUR_HF_TOKEN --upload
```

The script converts the gated pyannote/segmentation-3.0 checkpoint using a custom unpickler, so `pyannote.audio` does not need to be installed. Key transformations:

- **SincNet**: pre-compute 80 sinc bandpass filters (40 cos + 40 sin) from 40 learned `(low_hz, band_hz)` parameter pairs
- **Conv1d**: transpose weights `[O, I, K]` → `[O, K, I]` for MLX channels-last
- **BiLSTM**: split into forward/backward stacks, sum `bias_ih + bias_hh`
- **Linear/classifier**: kept as-is

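The SincNet and Conv1d steps above can be sketched in plain Python. This is a textbook windowed-sinc band-pass pair, not the exact formula used by pyannote or by `convert_pyannote.py`: the Hamming window, the lack of extra normalization, the cos-then-sin ordering, the helper names, and the example `(low_hz, band_hz)` values are all assumptions made for illustration. The kernel size of 251 and the 16 kHz sample rate come from the shapes and usage shown in this README.

```python
import math

def sinc_pair(low_hz, band_hz, kernel_size=251, sample_rate=16000):
    """Return (cos_filter, sin_filter) for one (low_hz, band_hz) pair."""
    f1 = low_hz / sample_rate                # lower cutoff, cycles per sample
    f2 = (low_hz + band_hz) / sample_rate    # upper cutoff, cycles per sample
    half = kernel_size // 2
    cos_f, sin_f = [], []
    for k in range(kernel_size):
        n = k - half                         # tap index centred on zero
        # Hamming window written for a zero-centred index (assumed window).
        window = 0.54 + 0.46 * math.cos(2 * math.pi * n / (kernel_size - 1))
        if n == 0:
            c, s = 2 * (f2 - f1), 0.0        # limits of both expressions at n = 0
        else:
            c = (math.sin(2 * math.pi * f2 * n) - math.sin(2 * math.pi * f1 * n)) / (math.pi * n)
            s = (math.cos(2 * math.pi * f1 * n) - math.cos(2 * math.pi * f2 * n)) / (math.pi * n)
        cos_f.append(window * c)             # symmetric (even) filter
        sin_f.append(window * s)             # antisymmetric (odd) filter
    return cos_f, sin_f

def to_channels_last(weight):
    """Transpose a Conv1d weight from [O, I, K] to [O, K, I]."""
    return [[list(col) for col in zip(*out_ch)] for out_ch in weight]

def build_filterbank(params, kernel_size=251):
    """40 (low_hz, band_hz) pairs -> 80 filters in [O, K, I] layout."""
    cos_bank, sin_bank = [], []
    for low_hz, band_hz in params:
        c, s = sinc_pair(low_hz, band_hz, kernel_size)
        cos_bank.append(c)
        sin_bank.append(s)
    # [O, I, K] with a single input channel; ordering is an assumption.
    weight = [[f] for f in cos_bank + sin_bank]
    return to_channels_last(weight)

# Made-up parameter pairs standing in for the learned values.
params = [(30.0 + 100.0 * i, 100.0) for i in range(40)]
bank = build_filterbank(params)
print(len(bank), len(bank[0]), len(bank[0][0]))  # 80 251 1
```

Pre-computing the filters once at conversion time means the runtime model only needs a regular convolution for the first layer, which matches the `[80, 251, 1]` weight shape in the mapping table.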
## Weight Mapping

| PyTorch Key | MLX Key | Shape |
|-------------|---------|-------|
| `sincnet.conv1d.0.filterbank.*` (computed) | `sincnet.conv.0.weight` | [80, 251, 1] |
| `sincnet.conv1d.{1,2}.weight` | `sincnet.conv.{1,2}.weight` | [O, K, I] |
| `sincnet.norm1d.{0-2}.*` | `sincnet.norm.{0-2}.*` | varies |
| `lstm.weight_ih_l{i}` | `lstm_fwd.layers.{i}.Wx` | [512, I] |
| `lstm.weight_hh_l{i}` | `lstm_fwd.layers.{i}.Wh` | [512, 128] |
| `lstm.bias_ih_l{i} + bias_hh_l{i}` | `lstm_fwd.layers.{i}.bias` | [512] |
| `lstm.*_reverse` | `lstm_bwd.layers.{i}.*` | same |
| `linear.{0,1}.*` | `linear.{0,1}.*` | varies |
| `classifier.*` | `classifier.*` | [7, 128] |

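The LSTM rows of the table can be read as a key-renaming pass plus a bias sum. The sketch below is one way to express that mapping in Python. It is not the code from `convert_pyannote.py`: the function name is hypothetical, and plain lists stand in for tensors. Summing the two biases is safe because PyTorch's LSTM always adds `bias_ih` and `bias_hh` together inside each gate.

```python
import re

# Matches keys such as "lstm.weight_ih_l0" or "lstm.bias_hh_l3_reverse".
LSTM_KEY = re.compile(r"^lstm\.(weight_ih|weight_hh|bias_ih|bias_hh)_l(\d+)(_reverse)?$")

def convert_lstm(state):
    """Map PyTorch LSTM entries to the MLX key layout from the table."""
    out = {}
    for key, value in state.items():
        m = LSTM_KEY.match(key)
        if m is None:
            continue  # non-LSTM keys are handled by other rules
        kind, layer, reverse = m.groups()
        stack = "lstm_bwd" if reverse else "lstm_fwd"
        prefix = f"{stack}.layers.{layer}"
        if kind == "weight_ih":
            out[f"{prefix}.Wx"] = value
        elif kind == "weight_hh":
            out[f"{prefix}.Wh"] = value
        else:
            # bias_ih and bias_hh are added elementwise into one vector.
            name = f"{prefix}.bias"
            prev = out.get(name)
            out[name] = list(value) if prev is None else [a + b for a, b in zip(prev, value)]
    return out

# Tiny made-up state dict (real tensors are far larger).
state = {
    "lstm.weight_ih_l0": [[1.0]],
    "lstm.weight_hh_l0": [[2.0]],
    "lstm.bias_ih_l0": [1.0, 2.0],
    "lstm.bias_hh_l0": [0.5, 0.5],
    "lstm.bias_ih_l0_reverse": [3.0, 4.0],
    "lstm.bias_hh_l0_reverse": [1.0, 1.0],
    "classifier.weight": [[0.0]],
}
converted = convert_lstm(state)
print(sorted(converted))
```

Splitting the bidirectional LSTM into separate forward and backward stacks lets each direction be loaded into an ordinary unidirectional LSTM layer on the MLX side.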
## License

The original pyannote segmentation model is released under the [MIT License](https://github.com/pyannote/pyannote-audio/blob/develop/LICENSE).