# Silero VAD v5 – MLX

MLX-compatible weights for Silero VAD v5, converted from the official JIT model.

## Model

Silero VAD v5 is a lightweight (~309K params) voice activity detection model that processes 512-sample chunks (32ms @ 16kHz) with sub-millisecond latency. It outputs a speech probability between 0 and 1 for each chunk, with LSTM state carried across chunks for streaming operation.
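
The chunking arithmetic above can be sketched in a few lines. This is a minimal illustration (not part of the package): it splits audio into fixed 512-sample frames, zero-padding the final partial frame, which is how a streaming caller would feed the model.

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK = 512  # 512 samples @ 16 kHz = 32 ms per chunk

def iter_chunks(audio: np.ndarray):
    """Yield fixed 512-sample chunks, zero-padding the last partial chunk."""
    for start in range(0, len(audio), CHUNK):
        chunk = audio[start:start + CHUNK]
        if len(chunk) < CHUNK:
            chunk = np.pad(chunk, (0, CHUNK - len(chunk)))
        yield chunk

audio = np.zeros(SAMPLE_RATE, dtype=np.float32)  # 1 s of audio
chunks = list(iter_chunks(audio))
print(len(chunks))          # 32 chunks (16000 / 512, last one padded)
print(CHUNK / SAMPLE_RATE)  # 0.032 s per chunk
```

In a real streaming loop, each chunk would be passed to the model in order so the LSTM state carries over between calls.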

Architecture: STFT → 4×Conv1d+ReLU encoder → LSTM(128) → Conv1d decoder → sigmoid

## Usage (Swift / MLX)

```swift
import SpeechVAD

// Load model
let vad = try await SileroVADModel.fromPretrained()

// Streaming: process 512-sample chunks
let prob = vad.processChunk(samples)  // → 0.0...1.0

// Batch: detect speech segments in complete audio
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for seg in segments {
    print("Speech: \(seg.startTime)s - \(seg.endTime)s")
}
```

Part of qwen3-asr-swift.

## Conversion

```bash
python3 scripts/convert_silero_vad.py --upload
```

Converts the official Silero VAD v5 JIT model via torch.hub, transposes Conv1d weights for MLX channels-last format, sums LSTM biases (bias_ih + bias_hh), and saves as safetensors.
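
The two tensor transforms can be sketched with NumPy (placeholder arrays stand in for the real checkpoint values; the actual script may differ in detail):

```python
import numpy as np

# Placeholder torch-style tensors (real values come from the JIT checkpoint).
conv_w = np.zeros((128, 129, 3), dtype=np.float32)   # [out_ch, in_ch, kernel]
bias_ih = np.ones(512, dtype=np.float32)
bias_hh = np.ones(512, dtype=np.float32) * 2

# PyTorch Conv1d stores [out, in, kernel]; MLX expects
# channels-last [out, kernel, in], so swap the last two axes.
mlx_conv_w = conv_w.transpose(0, 2, 1)

# MLX's LSTM takes a single bias vector, so the two PyTorch biases are summed.
mlx_bias = bias_ih + bias_hh

print(mlx_conv_w.shape)  # (128, 3, 129)
print(mlx_bias[0])       # 3.0
```

Summing the biases is lossless because the LSTM gate pre-activation adds both bias terms anyway: `Wx·x + Wh·h + b_ih + b_hh` equals `Wx·x + Wh·h + (b_ih + b_hh)`.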

## Weight Mapping

| JIT Key | MLX Key | Shape |
|---|---|---|
| `_model.stft.forward_basis_buffer` | `stft.weight` | [258, 256, 1] |
| `_model.encoder.{i}.reparam_conv.weight` | `encoder.{i}.weight` | varies |
| `_model.encoder.{i}.reparam_conv.bias` | `encoder.{i}.bias` | varies |
| `_model.decoder.rnn.weight_ih` | `lstm.Wx` | [512, 128] |
| `_model.decoder.rnn.weight_hh` | `lstm.Wh` | [512, 128] |
| `_model.decoder.rnn.bias_ih + bias_hh` | `lstm.bias` | [512] |
| `_model.decoder.decoder.2.weight` | `decoder.weight` | [1, 1, 128] |
| `_model.decoder.decoder.2.bias` | `decoder.bias` | [1] |
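
The key renaming in the table can be expressed as a small rule list. This is an illustrative sketch, not the actual script; the summed-bias row is a special case handled separately during conversion, so only the one-to-one renames appear here.

```python
import re

# JIT -> MLX key rename rules, mirroring the mapping table above.
RULES = [
    (r"^_model\.stft\.forward_basis_buffer$", "stft.weight"),
    (r"^_model\.encoder\.(\d+)\.reparam_conv\.(weight|bias)$", r"encoder.\1.\2"),
    (r"^_model\.decoder\.rnn\.weight_ih$", "lstm.Wx"),
    (r"^_model\.decoder\.rnn\.weight_hh$", "lstm.Wh"),
    (r"^_model\.decoder\.decoder\.2\.(weight|bias)$", r"decoder.\1"),
]

def rename(jit_key: str) -> str:
    """Return the MLX key for a JIT key, or the key unchanged if no rule matches."""
    for pattern, repl in RULES:
        if re.match(pattern, jit_key):
            return re.sub(pattern, repl, jit_key)
    return jit_key

print(rename("_model.encoder.0.reparam_conv.weight"))  # encoder.0.weight
print(rename("_model.decoder.rnn.weight_ih"))          # lstm.Wx
```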

## License

The original Silero VAD model is released under the MIT License.
