Silero VAD v6 (MLX)

MLX-format weights for Silero VAD v6 (16 kHz branch), converted from the official silero_vad PyPI package.

This is the v6 companion to mlx-community/silero-vad, which contains the older v5 weights. Both are independent ports of different Silero release lines.

TL;DR

	This repo	`mlx-community/silero-vad`
Silero version	v6 (latest)	v5 (previous)
Source	`silero_vad` PyPI v6 (torch.hub `snakers4/silero-vad`)	`onnx-community/silero-vad`
Branches	16 kHz only	16 kHz + 8 kHz
File size	1.2 MB	2.1 MB
Layout	`vad_16k.*` (mlx-audio convention)	`vad_16k.*` + `vad_8k.*`
Parity vs upstream PyTorch	bit-exact (max|Δ| = 0.0)	bit-exact (max|Δ| = 0.0)

On the 16 kHz branch, the two ports differ only in which Silero checkpoint they wrap; the architecture is identical (STFT → 4× Conv1d+ReLU → LSTM(128) → Conv1d → Sigmoid). Quality and per-chunk latency are essentially equivalent on long-form English meeting audio (see Quality below).

Architecture

Input:        audio (1, 576) = 64-sample context + 512-sample chunk
              + LSTM state (h: 1×128, c: 1×128)

Pre-process:  Reflection pad right (+64) → 640 samples
Learned STFT: Conv1d(1, 258, k=256, s=128) → magnitude → (1, 4, 129)
Encoder:      Conv1d(129→128) → ReLU
              Conv1d(128→64,  s=2) → ReLU
              Conv1d(64→64,   s=2) → ReLU
              Conv1d(64→128)  → ReLU                 → (1, 1, 128)
LSTM:         LSTMCell(128, 128) — state carried across chunks for streaming
Decoder:      ReLU → Conv1d(128→1, k=1) → Sigmoid    → probability

Total parameters: ~309K, ~1.2 MB on disk
Streaming:        carry (h, c) across calls; per-chunk decision in <1 ms
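
The shape annotations above can be checked with the standard Conv1d output-length formula. The sketch below is illustrative only: `conv1d_out_len` is a hypothetical helper, and the encoder padding of 1 is an assumption chosen because it reproduces the documented shapes (the kernel size of 3 comes from the tensor inventory). Reading the 258 STFT channels as 129 real plus 129 imaginary parts is likewise an assumption consistent with the 258 → 129 magnitude step.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Standard Conv1d output length (no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1

# Input: 64 context samples + 512 new samples
n_samples = 64 + 512            # 576
n_samples += 64                 # reflection pad on the right -> 640

# Learned STFT: kernel 256, stride 128, no padding
n_frames = conv1d_out_len(n_samples, kernel=256, stride=128)

# 258 output channels -> 129 magnitude bins (assumed real/imag split)
n_bins = 258 // 2

# Encoder: four convs with kernel 3, strides 1, 2, 2, 1 (padding=1 is assumed)
t = n_frames
for stride in (1, 2, 2, 1):
    t = conv1d_out_len(t, kernel=3, stride=stride, padding=1)

print(n_samples, n_frames, n_bins, t)
```

Under these assumptions the time axis collapses to a single step, so one 576-sample input yields one feature vector for the LSTM and one probability.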

Tensor inventory

vad_16k.stft_conv.weight       [258, 256, 1]   — frozen learned DFT basis
vad_16k.conv1.weight           [128, 3, 129]   — encoder 0 (BN-fused reparameterized)
vad_16k.conv1.bias             [128]
vad_16k.conv2.weight           [64,  3, 128]   — encoder 1 (stride 2)
vad_16k.conv2.bias             [64]
vad_16k.conv3.weight           [64,  3, 64]    — encoder 2 (stride 2)
vad_16k.conv3.bias             [64]
vad_16k.conv4.weight           [128, 3, 64]    — encoder 3
vad_16k.conv4.bias             [128]
vad_16k.lstm.Wx                [512, 128]      — input-to-hidden weights, [4·H, D], gate order i,f,g,o
vad_16k.lstm.Wh                [512, 128]      — hidden-to-hidden weights, [4·H, H], same gate order
vad_16k.lstm.bias              [512]           — fused bias_ih + bias_hh
vad_16k.final_conv.weight      [1, 1, 128]
vad_16k.final_conv.bias        [1]

All Conv1d weights are stored in MLX channels-last layout [O, K, I]. LSTM bias is the sum of PyTorch's bias_ih + bias_hh (single-tensor MLX convention).
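
As a sanity check, the shapes in the inventory can be summed to reproduce the ~309K parameter figure. The size estimate assumes 4 bytes per parameter (float32 storage) and ignores the safetensors header.

```python
from math import prod

# Shapes copied from the tensor inventory above
SHAPES = {
    "vad_16k.stft_conv.weight": (258, 256, 1),
    "vad_16k.conv1.weight": (128, 3, 129),
    "vad_16k.conv1.bias": (128,),
    "vad_16k.conv2.weight": (64, 3, 128),
    "vad_16k.conv2.bias": (64,),
    "vad_16k.conv3.weight": (64, 3, 64),
    "vad_16k.conv3.bias": (64,),
    "vad_16k.conv4.weight": (128, 3, 64),
    "vad_16k.conv4.bias": (128,),
    "vad_16k.lstm.Wx": (512, 128),
    "vad_16k.lstm.Wh": (512, 128),
    "vad_16k.lstm.bias": (512,),
    "vad_16k.final_conv.weight": (1, 1, 128),
    "vad_16k.final_conv.bias": (1,),
}

total_params = sum(prod(shape) for shape in SHAPES.values())
size_mb = total_params * 4 / 1e6  # assumes float32, 4 bytes per parameter

print(f"{total_params} parameters, about {size_mb:.1f} MB")
```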

Files

model.safetensors — MLX-format weights (16 kHz branch, vad_16k.* layout). Same weights serve both inference modes.
config.json — model metadata + branch parameters
convert.py — reproducible PyTorch → MLX conversion script
example.py — 32ms streaming inference example (per-chunk decisions; live mic / streaming use cases)
example_256ms.py — 256ms unified inference example (8 internal chunks per call with noisy-OR aggregation; faster wall time for offline ASR preprocessing)

Conversion

The bundled convert.py produces this repo's model.safetensors from the upstream PyPI silero_vad package:

uv pip install silero-vad safetensors numpy
python convert.py --output model.safetensors

What it does:

Loads PyTorch state_dict via silero_vad.load_silero_vad()
Drops the 8 kHz branch (this repo ships 16 kHz only)
Transposes Conv1d weights [O, I, K] → [O, K, I] (MLX channels-last)
Sums LSTM bias_ih + bias_hh → single [4H] bias
Maps PyTorch keys to mlx-audio's vad_16k.* convention
Saves as safetensors with metadata
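
The two tensor transforms in the list above (layout transpose and bias fusion) can be sketched in NumPy as follows. This is not the contents of convert.py; the helper names `conv_to_mlx` and `fuse_lstm_bias` and the dummy arrays are hypothetical and only illustrate the shape changes.

```python
import numpy as np

def conv_to_mlx(w_torch):
    """Reorder a Conv1d weight from PyTorch [O, I, K] to MLX [O, K, I]."""
    return np.transpose(w_torch, (0, 2, 1))

def fuse_lstm_bias(bias_ih, bias_hh):
    """Collapse PyTorch's two LSTM bias vectors into a single [4H] bias."""
    return bias_ih + bias_hh

# Dummy tensor shaped like the first encoder conv in PyTorch layout
w_torch = np.arange(128 * 129 * 3, dtype=np.float32).reshape(128, 129, 3)
w_mlx = conv_to_mlx(w_torch)

# Dummy LSTM biases with 4 * H = 512 entries each
bias = fuse_lstm_bias(np.ones(512, dtype=np.float32),
                      np.full(512, 2.0, dtype=np.float32))

print(w_mlx.shape, bias.shape)
```

The transpose only permutes axes, so element `[o, i, k]` of the PyTorch tensor ends up at `[o, k, i]` in the MLX tensor and no values change.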

LSTM gate ordering is i, f, g, o along the [4H] axis — the same in both PyTorch and MLX, so weights pass through unchanged.
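
To make the gate ordering concrete, here is a minimal NumPy sketch of one LSTM step that slices the [4H] axis as i, f, g, o. It follows the standard LSTM cell equations and is not taken from example.py; the function name `lstm_step` is hypothetical, and the argument names mirror the Wx, Wh, and bias tensors in the inventory.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, Wx, Wh, bias):
    """One LSTM cell step.

    x: [1, D], h and c: [1, H], Wx: [4H, D], Wh: [4H, H], bias: [4H].
    Gates are laid out along the 4H axis in the order i, f, g, o.
    """
    H = h.shape[-1]
    gates = x @ Wx.T + h @ Wh.T + bias      # [1, 4H]
    i = sigmoid(gates[:, 0 * H:1 * H])      # input gate
    f = sigmoid(gates[:, 1 * H:2 * H])      # forget gate
    g = np.tanh(gates[:, 2 * H:3 * H])      # candidate cell update
    o = sigmoid(gates[:, 3 * H:4 * H])      # output gate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

For streaming, the returned `h_new` and `c_new` would be passed back in as `h` and `c` on the next chunk.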

Usage

Quick start (Python + MLX)

uv pip install mlx safetensors numpy huggingface_hub

# 32ms streaming (live mic, per-chunk decision)
uv run python example.py /path/to/audio_16k_mono.wav

# 256ms unified (offline batch / ASR preprocessing — ~1.7× faster wall)
uv run python example_256ms.py /path/to/audio_16k_mono.wav

Both examples read weights directly from this repo via huggingface_hub. The same model.safetensors serves both modes; only the inference loop differs.

Choosing a mode

	32ms streaming (`example.py`)	256ms unified (`example_256ms.py`)
Per-call output	1 probability per 32 ms	1 probability per 256 ms (8 internal chunks aggregated via noisy-OR)
Decision latency	32 ms	256 ms
MLX wall throughput	1×	~1.7× faster (fewer `mx.eval` barriers)
Best for	Live microphone, real-time gating	Offline ASR preprocessing, batch / file-based VAD
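
For reference, noisy-OR over a set of per-chunk probabilities is conventionally 1 − ∏(1 − pᵢ): the window is flagged as speech unless every internal chunk independently says "no speech". The sketch below assumes example_256ms.py uses this standard form; `noisy_or` is a hypothetical helper name.

```python
from math import prod

def noisy_or(probs):
    """Combine per-chunk speech probabilities into one window-level probability."""
    return 1.0 - prod(1.0 - p for p in probs)

# Eight internal 32 ms chunks, each weakly suggesting speech
window_prob = noisy_or([0.1] * 8)
print(round(window_prob, 4))
```

One consequence of this aggregation is that several weak per-chunk scores can add up to a confident window-level score, so a threshold tuned for 32 ms output may need re-tuning for 256 ms output.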

Manual loading

import mlx.core as mx
from huggingface_hub import hf_hub_download
from safetensors.numpy import load_file

path = hf_hub_download("mlx-community/silero-vad-v6", "model.safetensors")
weights = load_file(path)
weights = {k: mx.array(v) for k, v in weights.items()}
# weights["vad_16k.stft_conv.weight"].shape == (258, 256, 1)

For the full forward pass, see example.py (≈ 80 lines, no external dependencies beyond mlx, numpy, safetensors, huggingface_hub).

Streaming protocol

Per-chunk inputs (32 ms at 16 kHz):

64 context samples carried from the previous chunk's tail
512 new audio samples
LSTM h, c state ([1, 128] each) carried across chunks

Output: scalar speech probability ∈ [0, 1]. A standard threshold of 0.5 works well on most material; tune via config.json::threshold for your use case.
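
The context-carry part of the protocol can be sketched as a small generator. This is a sketch under stated assumptions, not the code in example.py: `iter_frames` is a hypothetical name, the first chunk is given an all-zero context, and a trailing partial chunk is dropped.

```python
import numpy as np

CONTEXT = 64   # samples carried over from the previous chunk's tail
CHUNK = 512    # new samples per call (32 ms at 16 kHz)

def iter_frames(audio):
    """Yield (576,) float32 frames: 64 context samples followed by 512 new samples."""
    audio = np.asarray(audio, dtype=np.float32)
    context = np.zeros(CONTEXT, dtype=np.float32)   # assumption: zero context at start
    n_full = len(audio) // CHUNK                    # assumption: drop the partial tail
    for i in range(n_full):
        chunk = audio[i * CHUNK:(i + 1) * CHUNK]
        yield np.concatenate([context, chunk])
        context = chunk[-CONTEXT:]

frames = list(iter_frames(np.arange(1200)))
print(len(frames), frames[0].shape)
```

Each yielded frame would be fed to the model together with the current LSTM state, and the returned state kept for the next frame.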

Quality

Frame-level F1 against VibeVoice ASR segment-derived speech labels on a 44-minute English meeting clip (playback-eng-16k.wav, 83,190 chunks at 32 ms resolution, GT speech ratio 98.9%):

Threshold	v6 (this repo) F1	v5 (mlx-community/silero-vad) F1	Δ
0.30	0.8656	0.8692	+0.004 v5
0.40	0.8612	0.8649	+0.004 v5
0.50	0.8572	0.8607	+0.003 v5
0.60	0.8534	0.8561	+0.003 v5
0.70	0.8492	0.8510	+0.002 v5

At threshold 0.5: precision ≈ 0.998 for both, recall 0.751 (v6) vs 0.757 (v5). The two versions are essentially equivalent within measurement noise on this sample. The v5 edge of ~0.4 percentage points of F1 is too small, relative to single-sample variance, GT labeling granularity (segment-level rather than word-level), and the high class imbalance, to support a generalised "v5 is better" conclusion.
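
The F1 values at threshold 0.5 follow from the quoted precision and recall via F1 = 2PR / (P + R). A quick check, using the rounded figures from the text (so agreement is only to about three decimal places):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_v6 = f1_score(0.998, 0.751)   # table reports 0.8572
f1_v5 = f1_score(0.998, 0.757)   # table reports 0.8607

print(f"v6: {f1_v6:.3f}, v5: {f1_v5:.3f}")
```

With precision near 1.0 for both versions, the F1 difference is driven almost entirely by recall.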

A broader multi-domain quality comparison (clean / noisy / far-field / multilingual) would be needed for a definitive ranking.

Performance

Bit-exact parity with the upstream PyTorch JIT model (max|Δ| = 0.0 across 83K chunks of test audio).

32ms streaming, per-chunk single-call latency (M1 Max):

Backend	p50 / chunk
MLX (CPU stream + `mx.compile`)	0.75 ms
MLX (GPU naive, no compile)	1.46 ms

On newer hardware (M5 Max), the per-chunk async-batched cost drops to ~0.17 ms/chunk in CPU stream mode (≈ 187× real-time), per lucasnewman's PR701 benchmark.
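
The real-time factor is the audio duration covered by one chunk divided by the time spent processing it. The sketch below uses the rounded per-chunk figures quoted above, so the result lands near the quoted factor and may not match it exactly; `real_time_factor` is a hypothetical helper name.

```python
CHUNK_MS = 32.0  # one chunk covers 512 samples at 16 kHz

def real_time_factor(ms_per_chunk):
    """How many times faster than real time a given per-chunk cost is."""
    return CHUNK_MS / ms_per_chunk

rtf_m5 = real_time_factor(0.17)   # M5 Max, async-batched CPU stream
rtf_m1 = real_time_factor(0.75)   # M1 Max, CPU stream + mx.compile, p50

print(f"M5 Max: ~{rtf_m5:.0f}x, M1 Max: ~{rtf_m1:.0f}x")
```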

256ms unified vs 32ms streaming, offline VAD on 10-min English meeting (M1 Max):

Mode	VAD wall time	Speedup vs 32ms
32ms streaming MLX	~36 s	1.0×
256ms unified MLX	~16 s	~2.3× faster
CoreML 256ms (FluidInference, native Apple HW)	~12 s	~3.0× faster

The 256ms unified mode is the right choice for offline ASR preprocessing; its eight internal 32ms chunks per outer call amortise MLX dispatch overhead through mx.eval barrier reduction. CoreML 256ms remains the absolute fastest on Apple Silicon (ANE/BNNS-tuned), but pure-MLX 256ms closes ~50% of the gap.

License

MIT (matching the upstream Silero VAD license).

Citation

@misc{silero_vad_2024,
  author = {Silero Team},
  title  = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector},
  year   = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}}
}

Acknowledgments

Silero Team — original model
Apple MLX — runtime
mlx-community/silero-vad — v5 port that established the layout convention used here
