Silero VAD v6 (MLX)

MLX-format weights for Silero VAD v6 (16 kHz branch), converted from the official silero_vad PyPI package.

This is the v6 companion to mlx-community/silero-vad, which contains the older v5 weights. Both are independent ports of different Silero release lines.

TL;DR

                            This repo                                            mlx-community/silero-vad
Silero version              v6 (latest)                                          v5 (previous)
Source                      silero_vad PyPI v6 (torch.hub snakers4/silero-vad)   onnx-community/silero-vad
Branches                    16 kHz only                                          16 kHz + 8 kHz
File size                   1.2 MB                                               2.1 MB
Layout                      vad_16k.* (mlx-audio convention)                     vad_16k.* + vad_8k.*
Parity vs upstream PyTorch  bit-exact (max|Δ| = 0.0)                             bit-exact (max|Δ| = 0.0)

For the 16 kHz branch, the two ports differ only in which Silero checkpoint they wrap; the architecture is identical (STFT → 4× Conv1d+ReLU → LSTM(128) → Conv1d → Sigmoid). Quality and per-chunk latency are essentially equivalent on long-form English meeting audio (see Quality below).

Architecture

Input:        audio (1, 576) = 64-sample context + 512-sample chunk
              + LSTM state (h: 1×128, c: 1×128)

Pre-process:  Reflection pad right (+64) → 640 samples
Learned STFT: Conv1d(1, 258, k=256, s=128) → magnitude → (1, 4, 129)
Encoder:      Conv1d(129→128) → ReLU
              Conv1d(128→64,  s=2) → ReLU
              Conv1d(64→64,   s=2) → ReLU
              Conv1d(64→128)  → ReLU                 → (1, 1, 128)
LSTM:         LSTMCell(128, 128), state carried across chunks for streaming
Decoder:      ReLU → Conv1d(128→1, k=1) → Sigmoid    → probability

Total parameters: ~309K, ~1.2 MB on disk
Streaming:        carry (h, c) across calls; per-chunk decision in <1 ms
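
The shapes in the diagram can be traced with the standard convolution length formula. The sketch below is only a sanity check on the arithmetic, not the model code: kernel size 3 for the encoder comes from the tensor inventory, while `padding=1` is an assumption chosen so that the lengths match the diagram.

```python
def conv1d_out_len(length: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a 1-D convolution (standard floor formula)."""
    return (length + 2 * padding - kernel) // stride + 1


CONTEXT, CHUNK, PAD_RIGHT = 64, 512, 64

samples = CONTEXT + CHUNK        # 576 samples per call
padded = samples + PAD_RIGHT     # 640 after the right reflection pad

# Learned STFT: kernel 256, stride 128, no extra padding.
frames = conv1d_out_len(padded, kernel=256, stride=128)
bins = 258 // 2                  # 258 conv channels -> 129 magnitude bins

# Encoder: four convolutions with strides 1, 2, 2, 1.
# ASSUMPTION: padding=1 on every encoder conv (not stated in the README).
t = frames
for stride in (1, 2, 2, 1):
    t = conv1d_out_len(t, kernel=3, stride=stride, padding=1)

print(samples, padded, frames, bins, t)   # 576 640 4 129 1
```

Under these assumptions the four STFT frames collapse to a single 128-dimensional vector per chunk, which is what the LSTM cell consumes.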

Tensor inventory

vad_16k.stft_conv.weight       [258, 256, 1]   - frozen learned DFT basis
vad_16k.conv1.weight           [128, 3, 129]   - encoder 0 (BN-fused, reparameterized)
vad_16k.conv1.bias             [128]
vad_16k.conv2.weight           [64,  3, 128]   - encoder 1 (stride 2)
vad_16k.conv2.bias             [64]
vad_16k.conv3.weight           [64,  3, 64]    - encoder 2 (stride 2)
vad_16k.conv3.bias             [64]
vad_16k.conv4.weight           [128, 3, 64]    - encoder 3
vad_16k.conv4.bias             [128]
vad_16k.lstm.Wx                [512, 128]      - input-to-hidden weights, [4·H, D], gate order i, f, g, o
vad_16k.lstm.Wh                [512, 128]      - hidden-to-hidden weights, [4·H, H], same gate order
vad_16k.lstm.bias              [512]           - fused bias_ih + bias_hh
vad_16k.final_conv.weight      [1, 1, 128]
vad_16k.final_conv.bias        [1]

All Conv1d weights are stored in MLX channels-last layout [O, K, I]. LSTM bias is the sum of PyTorch's bias_ih + bias_hh (single-tensor MLX convention).
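
The parameter count and file size quoted earlier follow directly from these shapes. A quick check, assuming float32 storage and ignoring the small safetensors header:

```python
from math import prod

# Shapes copied from the tensor inventory above.
SHAPES = {
    "vad_16k.stft_conv.weight": (258, 256, 1),
    "vad_16k.conv1.weight": (128, 3, 129),
    "vad_16k.conv1.bias": (128,),
    "vad_16k.conv2.weight": (64, 3, 128),
    "vad_16k.conv2.bias": (64,),
    "vad_16k.conv3.weight": (64, 3, 64),
    "vad_16k.conv3.bias": (64,),
    "vad_16k.conv4.weight": (128, 3, 64),
    "vad_16k.conv4.bias": (128,),
    "vad_16k.lstm.Wx": (512, 128),
    "vad_16k.lstm.Wh": (512, 128),
    "vad_16k.lstm.bias": (512,),
    "vad_16k.final_conv.weight": (1, 1, 128),
    "vad_16k.final_conv.bias": (1,),
}

total_params = sum(prod(shape) for shape in SHAPES.values())
size_mb = total_params * 4 / 1e6   # ASSUMPTION: 4 bytes per float32 value

print(total_params)                # 309121
print(f"{size_mb:.2f} MB")
```

The frozen STFT basis is included in the total, since it is stored in the file like any other tensor.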

Files

  • model.safetensors: MLX-format weights (16 kHz branch, vad_16k.* layout). The same weights serve both inference modes.
  • config.json: model metadata + branch parameters
  • convert.py: reproducible PyTorch → MLX conversion script
  • example.py: 32ms streaming inference example (per-chunk decisions; live mic / streaming use cases)
  • example_256ms.py: 256ms unified inference example (8 internal chunks per call with noisy-OR aggregation; faster wall time for offline ASR preprocessing)

Conversion

The bundled convert.py produces this repo's model.safetensors from the upstream PyPI silero_vad package:

uv pip install silero-vad safetensors numpy
python convert.py --output model.safetensors

What it does:

  1. Loads PyTorch state_dict via silero_vad.load_silero_vad()
  2. Drops the 8 kHz branch (this repo ships 16 kHz only)
  3. Transposes Conv1d weights [O, I, K] → [O, K, I] (MLX channels-last)
  4. Sums LSTM bias_ih + bias_hh → single [4H] bias
  5. Maps PyTorch keys to mlx-audio's vad_16k.* convention
  6. Saves as safetensors with metadata

LSTM gate ordering is i, f, g, o along the [4H] axis in both PyTorch and MLX, so the LSTM weight matrices pass through without any gate reordering.
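
Steps 3 and 4 can be illustrated on random stand-in tensors. This is a minimal sketch, not the bundled convert.py: the variable names are hypothetical, and the arrays are random data with the shapes listed in the tensor inventory rather than real Silero weights.

```python
import numpy as np

rng = np.random.default_rng(0)
H = D = 128

# Stand-ins for PyTorch tensors (hypothetical names, random values).
conv_w_torch = rng.standard_normal((128, 129, 3)).astype(np.float32)  # [O, I, K]
w_ih = rng.standard_normal((4 * H, D)).astype(np.float32)
w_hh = rng.standard_normal((4 * H, H)).astype(np.float32)
b_ih = rng.standard_normal(4 * H).astype(np.float32)
b_hh = rng.standard_normal(4 * H).astype(np.float32)

# Step 3: PyTorch [O, I, K] -> MLX channels-last [O, K, I].
conv_w_mlx = np.transpose(conv_w_torch, (0, 2, 1))

# Step 4: fuse the two LSTM biases into a single [4H] vector.
bias = b_ih + b_hh

# The fused bias yields the same gate pre-activations, because both
# biases are simply added inside the LSTM cell.
x = rng.standard_normal((1, D)).astype(np.float32)
h = rng.standard_normal((1, H)).astype(np.float32)
gates_two_bias = x @ w_ih.T + b_ih + h @ w_hh.T + b_hh
gates_fused = x @ w_ih.T + h @ w_hh.T + bias
assert np.allclose(gates_two_bias, gates_fused, atol=1e-4)

# Gate order i, f, g, o along the 4H axis: a plain four-way split.
i, f, g, o = np.split(gates_fused, 4, axis=-1)
print(conv_w_mlx.shape, bias.shape, i.shape)
```

Because the gate order is shared, the only LSTM-specific work is the bias sum; the weight matrices are copied as they are.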

Usage

Quick start (Python + MLX)

uv pip install mlx safetensors numpy huggingface_hub

# 32ms streaming (live mic, per-chunk decision)
uv run python example.py /path/to/audio_16k_mono.wav

# 256ms unified (offline batch / ASR preprocessing, ~1.7× faster wall time)
uv run python example_256ms.py /path/to/audio_16k_mono.wav

Both examples read weights directly from this repo via huggingface_hub. The same model.safetensors serves both modes; only the inference loop differs.

Choosing a mode

                     32ms streaming (example.py)        256ms unified (example_256ms.py)
Per-call output      1 probability per 32 ms            1 probability per 256 ms (8 internal chunks aggregated via noisy-OR)
Decision latency     32 ms                              256 ms
MLX wall throughput  1×                                 ~1.7× faster (fewer mx.eval barriers)
Best for             Live microphone, real-time gating  Offline ASR preprocessing, batch / file-based VAD
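
For reference, noisy-OR combines per-chunk probabilities as 1 - Π(1 - p_i). The sketch below assumes that standard definition; the exact aggregation code in example_256ms.py may differ in detail, and the sample window is made up for illustration.

```python
from math import prod


def noisy_or(probs):
    """Probability that at least one chunk is speech, treating chunks as independent."""
    return 1.0 - prod(1.0 - p for p in probs)


# Hypothetical window of eight 32 ms chunk probabilities (one 256 ms call).
window = [0.02, 0.01, 0.03, 0.90, 0.05, 0.02, 0.01, 0.02]
print(round(noisy_or(window), 3))
```

The result is never lower than the largest per-chunk probability, so one confident chunk is enough to mark the whole 256 ms window as speech. A threshold tuned for 32 ms output may therefore need re-tuning in this mode.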

Manual loading

import mlx.core as mx
from huggingface_hub import hf_hub_download
from safetensors.numpy import load_file

path = hf_hub_download("mlx-community/silero-vad-v6", "model.safetensors")
weights = load_file(path)
weights = {k: mx.array(v) for k, v in weights.items()}
# weights["vad_16k.stft_conv.weight"].shape == (258, 256, 1)

For the full forward pass, see example.py (≈ 80 lines, no external dependencies beyond mlx, numpy, safetensors, huggingface_hub).

Streaming protocol

Per-chunk inputs (32 ms at 16 kHz):

  • 64 context samples carried from the previous chunk's tail
  • 512 new audio samples
  • LSTM h, c state ([1, 128] each) carried across chunks

Output: scalar speech probability ∈ [0, 1]. A standard threshold of 0.5 works well on most material; tune via config.json::threshold for your use case.
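
The chunking side of this protocol can be sketched without the model. In the example below, `dummy_vad_step` is a hypothetical energy-based stand-in for the real forward pass, and two behaviours are assumptions rather than documented facts: the first chunk gets zero context, and a trailing partial chunk is dropped.

```python
import numpy as np

CONTEXT, CHUNK, SAMPLE_RATE = 64, 512, 16_000


def iter_frames(audio):
    """Yield (576,) float32 frames: 64 context samples + 512 new samples."""
    context = np.zeros(CONTEXT, dtype=np.float32)  # ASSUMPTION: zero context at start
    for start in range(0, len(audio) - CHUNK + 1, CHUNK):
        chunk = audio[start:start + CHUNK].astype(np.float32)
        yield np.concatenate([context, chunk])
        context = chunk[-CONTEXT:]                 # tail carried to the next call


def dummy_vad_step(frame, state):
    """Hypothetical stand-in for the model: energy-based score, state passed through."""
    energy = float(np.mean(frame[CONTEXT:] ** 2))
    return energy / (energy + 1e-3), state


# Synthetic input: 0.5 s of silence followed by 0.5 s of a 440 Hz tone.
t = np.arange(SAMPLE_RATE // 2) / SAMPLE_RATE
audio = np.concatenate([np.zeros(SAMPLE_RATE // 2), 0.5 * np.sin(2 * np.pi * 440 * t)])

state = (np.zeros((1, 128), np.float32), np.zeros((1, 128), np.float32))  # (h, c)
decisions = []
for frame in iter_frames(audio):
    prob, state = dummy_vad_step(frame, state)
    decisions.append(prob >= 0.5)                  # standard 0.5 threshold

print(len(decisions), decisions[0], decisions[-1])
```

With the real model, `dummy_vad_step` would be replaced by the forward pass from example.py, which also returns the updated (h, c) pair.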

Quality

Frame-level F1 against VibeVoice ASR segment-derived speech labels on a 44-minute English meeting clip (playback-eng-16k.wav, 83,190 chunks at 32 ms resolution, GT speech ratio 98.9%):

Threshold  v6 F1 (this repo)  v5 F1 (mlx-community/silero-vad)  Δ (v5 - v6)
0.30       0.8656             0.8692                            +0.004
0.40       0.8612             0.8649                            +0.004
0.50       0.8572             0.8607                            +0.003
0.60       0.8534             0.8561                            +0.003
0.70       0.8492             0.8510                            +0.002

At threshold 0.5: precision ≈ 0.998 for both, recall 0.751 (v6) vs 0.757 (v5). The two versions are essentially equivalent within measurement noise on this sample. The v5 edge of 0.002 to 0.004 F1 is too small relative to single-sample variance, GT labeling granularity (segment-level rather than word-level), and the high class imbalance to draw a generalised "v5 is better" conclusion.
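
The F1 values at threshold 0.5 can be reproduced from the quoted precision and recall. The last line is an added illustration of the class-imbalance caveat: it scores a hypothetical detector that labels every frame as speech, using only the 98.9% GT speech ratio stated above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_v6 = f1_score(0.998, 0.751)
f1_v5 = f1_score(0.998, 0.757)

# Hypothetical "always speech" baseline: recall is 1.0 and precision
# equals the ground-truth speech ratio.
f1_always_speech = f1_score(0.989, 1.0)

print(f"v6 {f1_v6:.3f}  v5 {f1_v5:.3f}  always-speech {f1_always_speech:.3f}")
```

Both results agree with the table to about three decimals. The trivial baseline scores above 0.99 on this clip, which is why small F1 differences here say little about general quality.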

A broader multi-domain quality comparison (clean / noisy / far-field / multilingual) would be needed for a definitive ranking.

Performance

Bit-exact parity with the upstream PyTorch JIT model (max|Δ| = 0.0 across 83K chunks of test audio).

32ms streaming, per-chunk single-call latency (M1 Max):

Backend                         p50 / chunk
MLX (CPU stream + mx.compile)   0.75 ms
MLX (GPU naive, no compile)     1.46 ms

On newer hardware (M5 Max), async-batched processing in CPU stream mode brings the cost down to ~0.17 ms/chunk (≈ 187× real-time), per lucasnewman's PR701 benchmark.
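
The real-time factor is the chunk duration divided by the per-chunk processing time. A small calculation using the latencies quoted above (the labels are just descriptive strings):

```python
CHUNK_SAMPLES, SAMPLE_RATE = 512, 16_000
CHUNK_MS = CHUNK_SAMPLES * 1000 / SAMPLE_RATE   # 32.0 ms of audio per chunk


def real_time_factor(latency_ms: float) -> float:
    """How many times faster than real time a given per-chunk latency is."""
    return CHUNK_MS / latency_ms


for label, latency_ms in [
    ("M1 Max, CPU stream + mx.compile", 0.75),
    ("M1 Max, GPU naive", 1.46),
    ("M5 Max, CPU stream, async-batched", 0.17),
]:
    print(f"{label}: {real_time_factor(latency_ms):.1f}x real time")
```

The rounded 0.17 ms figure gives roughly 188×, consistent with the quoted ≈ 187×.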

256ms unified vs 32ms streaming, offline VAD on 10-min English meeting (M1 Max):

Mode                                             VAD wall time   Speedup vs 32ms
32ms streaming MLX                               ~36 s           1.0×
256ms unified MLX                                ~16 s           ~2.3× faster
CoreML 256ms (FluidInference, native Apple HW)   ~12 s           ~3.0× faster

The 256ms unified mode is the right choice for offline ASR preprocessing: running eight internal 32ms chunks per outer call amortises MLX dispatch overhead by reducing the number of mx.eval barriers. CoreML 256ms remains the fastest option on Apple Silicon (ANE/BNNS-tuned), but pure-MLX 256ms closes most of the gap (~16 s vs ~12 s, down from ~36 s for 32ms streaming).
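
The call counts behind this comparison are easy to work out. The sketch assumes one outer call (and so one mx.eval barrier) per window, which is a simplification of whatever the example scripts actually do; the wall times are the approximate values from the table.

```python
from math import ceil

AUDIO_MS = 10 * 60 * 1000            # 10-minute clip

calls_32ms = ceil(AUDIO_MS / 32)     # one call per 32 ms chunk
calls_256ms = ceil(AUDIO_MS / 256)   # one call per 8-chunk window

# Approximate wall times in seconds, copied from the table above.
wall_s = {"32ms streaming MLX": 36, "256ms unified MLX": 16, "CoreML 256ms": 12}
speedup = {mode: wall_s["32ms streaming MLX"] / t for mode, t in wall_s.items()}

print(calls_32ms, calls_256ms)       # 18750 2344
for mode, factor in speedup.items():
    print(f"{mode}: {factor:.2f}x")
```

Under that assumption the unified mode issues eight times fewer outer calls, while the measured speedup is smaller because the per-chunk compute itself is unchanged.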

License

MIT (matching the upstream Silero VAD license).

Citation

@misc{silero_vad_2024,
  author = {Silero Team},
  title  = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector},
  year   = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}}
}

Acknowledgments
