---
license: mit
tags:
- mlx
- voice-activity-detection
- speaker-segmentation
- speaker-diarization
- pyannote
- apple-silicon
base_model: pyannote/segmentation-3.0
library_name: mlx
pipeline_tag: voice-activity-detection
---

# Pyannote Segmentation 3.0 - MLX

MLX-compatible weights for [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0) (PyanNet), converted from the official PyTorch Lightning checkpoint with pre-computed SincNet filters.

## Model

PyanNet is a speaker segmentation model (~1.5M params) that processes 10-second audio windows and outputs 7-class powerset probabilities covering up to 3 speakers per window, with at most 2 of them active at the same time. It is used for both voice activity detection (binary speech/non-speech) and speaker diarization (per-speaker activity).

**Architecture:** SincNet → BiLSTM(4 layers) → Linear(2 layers) → 7-class softmax

**Output classes:** non-speech, spk1, spk2, spk3, spk1+2, spk1+3, spk2+3
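
To make the powerset output concrete, here is a minimal Python sketch of how one frame of 7 class probabilities could be turned into per-speaker activity and a binary speech score. It assumes the class order listed above and uses simple marginalization (summing every class that contains a speaker); the function names are hypothetical, and the Swift library may decode differently (for example by taking the argmax class).

```python
# Powerset classes in the order given in the model card:
# non-speech, spk1, spk2, spk3, spk1+2, spk1+3, spk2+3 (speakers are 0-indexed here).
POWERSET = [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]


def speaker_activity(frame_probs):
    """Marginalize 7 powerset probabilities into 3 per-speaker probabilities."""
    if len(frame_probs) != len(POWERSET):
        raise ValueError("expected 7 class probabilities")
    per_speaker = [0.0, 0.0, 0.0]
    for prob, members in zip(frame_probs, POWERSET):
        for spk in members:
            per_speaker[spk] += prob
    return per_speaker


def speech_probability(frame_probs):
    """Probability that anyone is speaking: everything except the non-speech class."""
    return 1.0 - frame_probs[0]


# Made-up probabilities for a single frame (they sum to 1).
frame = [0.10, 0.60, 0.05, 0.05, 0.15, 0.03, 0.02]
print(speaker_activity(frame))
print(speech_probability(frame))
```

In this toy frame the first speaker dominates because both its single-speaker class and its overlap classes contribute to its total.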

## Usage (Swift / MLX)

```swift
import SpeechVAD

// `samples` is assumed to hold mono audio samples at 16 kHz, loaded elsewhere.

// Voice Activity Detection
let vad = try await PyannoteVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for seg in segments {
    print("Speech: \(seg.startTime)s - \(seg.endTime)s")
}

// Speaker Diarization (with WeSpeaker embeddings)
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}
```

This model is part of [speech-swift](https://github.com/soniqo/speech-swift).
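
For intuition about what the returned segments represent, the following Python sketch groups frame-level speech probabilities into `(start, end)` times by thresholding. It is an illustration only, not the library's implementation: the function name, the `threshold` default, and the `frame_step` value are assumptions, and the real model's frame step (not stated in this card) is much finer than the 0.5 s used here.

```python
def frames_to_segments(speech_probs, frame_step, threshold=0.5):
    """Group consecutive frames with probability >= threshold into (start, end) seconds."""
    segments = []
    start = None  # index of the first frame of the currently open segment
    for i, prob in enumerate(speech_probs):
        if prob >= threshold and start is None:
            start = i
        elif prob < threshold and start is not None:
            segments.append((start * frame_step, i * frame_step))
            start = None
    if start is not None:
        # Speech continues until the end of the input.
        segments.append((start * frame_step, len(speech_probs) * frame_step))
    return segments


# Toy input: six frames, 0.5 s apart (hypothetical frame step).
probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.6]
print(frames_to_segments(probs, frame_step=0.5))  # [(0.5, 1.5), (2.0, 3.0)]
```

A production pipeline would typically also smooth the decisions, for example by dropping very short segments or bridging short gaps.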

## Conversion

```bash
python3 scripts/convert_pyannote.py --token YOUR_HF_TOKEN --upload
```

The script converts the gated pyannote/segmentation-3.0 checkpoint using a custom unpickler, so pyannote.audio does not need to be installed. Key transformations:

- **SincNet**: pre-compute 80 sinc bandpass filters (40 cos + 40 sin) from 40 learned `(low_hz, band_hz)` parameter pairs
- **Conv1d**: transpose weights `[O, I, K]` → `[O, K, I]` for the MLX channels-last layout
- **BiLSTM**: split into forward and backward stacks, and sum `bias_ih + bias_hh` into a single bias
- **Linear/classifier**: kept as-is
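
The Conv1d and BiLSTM bias steps above can be sketched in a few lines of Python. This is not the code from `convert_pyannote.py`: the function names are hypothetical, and nested lists stand in for the arrays a real script would handle with an array library. Summing the biases is safe because PyTorch adds `bias_ih` and `bias_hh` to the same gate pre-activation, so only their sum affects the output.

```python
def transpose_conv1d(weight):
    """Reorder a Conv1d weight from [O, I, K] (PyTorch) to [O, K, I] (MLX)."""
    # For each output channel, zip(*rows) swaps the input-channel and kernel axes.
    return [[list(tap) for tap in zip(*out_channel)] for out_channel in weight]


def merge_lstm_bias(bias_ih, bias_hh):
    """Sum the two PyTorch LSTM bias vectors into one bias vector."""
    if len(bias_ih) != len(bias_hh):
        raise ValueError("bias vectors must have the same length")
    return [a + b for a, b in zip(bias_ih, bias_hh)]


# Toy weight with O=1 output channel, I=2 input channels, K=3 kernel taps.
w = [[[1, 2, 3],
      [4, 5, 6]]]
print(transpose_conv1d(w))                # [[[1, 4], [2, 5], [3, 6]]]
print(merge_lstm_bias([1, 2], [10, 20]))  # [11, 22]
```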

## Weight Mapping

| PyTorch Key | MLX Key | Shape |
|-------------|---------|-------|
| `sincnet.conv1d.0.filterbank.*` (computed) | `sincnet.conv.0.weight` | [80, 251, 1] |
| `sincnet.conv1d.{1,2}.weight` | `sincnet.conv.{1,2}.weight` | [O, K, I] |
| `sincnet.norm1d.{0-2}.*` | `sincnet.norm.{0-2}.*` | varies |
| `lstm.weight_ih_l{i}` | `lstm_fwd.layers.{i}.Wx` | [512, I] |
| `lstm.weight_hh_l{i}` | `lstm_fwd.layers.{i}.Wh` | [512, 128] |
| `lstm.bias_ih_l{i} + bias_hh_l{i}` | `lstm_fwd.layers.{i}.bias` | [512] |
| `lstm.*_reverse` | `lstm_bwd.layers.{i}.*` | same |
| `linear.{0,1}.*` | `linear.{0,1}.*` | varies |
| `classifier.*` | `classifier.*` | [7, 128] |
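
The renaming rules in the table can be expressed as a small lookup function. The sketch below is a hypothetical helper, not taken from the conversion script: `map_key` is an invented name, it returns `None` for keys that need computation rather than renaming (the SincNet filterbank and the LSTM biases), and it assumes that every parameter under `sincnet.conv1d.{1,2}` is renamed the same way as `weight`.

```python
import re

# Matches LSTM weight keys such as "lstm.weight_ih_l2" or "lstm.weight_hh_l0_reverse".
_LSTM_WEIGHT = re.compile(r"^lstm\.(weight_ih|weight_hh)_l(\d+)(_reverse)?$")
_SINCNET = re.compile(r"^sincnet\.(conv1d|norm1d)\.(\d+)\.(.+)$")


def map_key(pt_key):
    """Return the MLX key for a PyTorch key, or None if the tensor must be computed."""
    m = _LSTM_WEIGHT.match(pt_key)
    if m:
        kind, layer, reverse = m.groups()
        stack = "lstm_bwd" if reverse else "lstm_fwd"
        name = "Wx" if kind == "weight_ih" else "Wh"
        return f"{stack}.layers.{layer}.{name}"
    if pt_key.startswith("lstm.bias_"):
        return None  # bias_ih and bias_hh are summed into one bias tensor
    if pt_key.startswith("sincnet.conv1d.0."):
        return None  # the filterbank is pre-computed into sincnet.conv.0.weight
    m = _SINCNET.match(pt_key)
    if m:
        module, index, rest = m.groups()
        new_module = "conv" if module == "conv1d" else "norm"
        return f"sincnet.{new_module}.{index}.{rest}"
    return pt_key  # linear.* and classifier.* keep their names


for key in ["lstm.weight_ih_l2_reverse", "sincnet.conv1d.1.weight", "classifier.weight"]:
    print(key, "->", map_key(key))
```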

## License

The original pyannote segmentation model is released under the [MIT License](https://github.com/pyannote/pyannote-audio/blob/develop/LICENSE).

---

## Links

- **Blog**: [blog.ivan.digital](https://blog.ivan.digital)
- **Library Docs**: [soniqo.audio](https://soniqo.audio)