# Silero VAD v5 → MLX
MLX-compatible weights for Silero VAD v5, converted from the official JIT model.
## Model
Silero VAD v5 is a lightweight (~309K params) voice activity detection model that processes 512-sample chunks (32ms @ 16kHz) with sub-millisecond latency. It outputs a speech probability between 0 and 1 for each chunk, with LSTM state carried across chunks for streaming operation.
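The chunking arithmetic above can be sketched in plain Python (a minimal illustration only; `frame_chunks` is a hypothetical helper, not part of this package):

```python
def frame_chunks(samples, chunk_size=512, sample_rate=16000):
    """Split audio into fixed-size chunks; any final partial chunk is
    dropped, since the model expects exactly 512 samples per call."""
    n = len(samples) // chunk_size
    chunks = [samples[i * chunk_size:(i + 1) * chunk_size] for i in range(n)]
    chunk_ms = 1000 * chunk_size / sample_rate  # 32.0 ms per chunk at 16 kHz
    return chunks, chunk_ms

# One second of audio at 16 kHz yields 31 full chunks (16000 // 512).
chunks, ms = frame_chunks([0.0] * 16000)
```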
**Architecture:** STFT → 4×Conv1d+ReLU encoder → LSTM(128) → Conv1d decoder → sigmoid
## Usage (Swift / MLX)
```swift
import SpeechVAD

// Load model
let vad = try await SileroVADModel.fromPretrained()

// Streaming: process 512-sample chunks
let prob = vad.processChunk(samples) // → 0.0...1.0

// Batch: detect speech segments in complete audio
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for seg in segments {
    print("Speech: \(seg.startTime)s - \(seg.endTime)s")
}
```
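The batch API above groups per-chunk probabilities into time-stamped segments. The core idea can be sketched in Python (a hypothetical simplification; the real detector also applies minimum-duration and padding heuristics not shown here):

```python
def detect_segments(probs, threshold=0.5, chunk_s=0.032):
    """Group consecutive chunks whose speech probability is >= threshold
    into (start, end) times in seconds. One chunk = 32 ms at 16 kHz."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * chunk_s          # speech onset
        elif p < threshold and start is not None:
            segments.append((start, i * chunk_s))  # speech offset
            start = None
    if start is not None:                # audio ended mid-speech
        segments.append((start, len(probs) * chunk_s))
    return segments
```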
Part of qwen3-asr-swift.
## Conversion
```bash
python3 scripts/convert_silero_vad.py --upload
```
Converts the official Silero VAD v5 JIT model via torch.hub, transposes Conv1d weights for MLX channels-last format, sums LSTM biases (bias_ih + bias_hh), and saves as safetensors.
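The two weight transforms can be sketched with NumPy (illustrative shapes only, not the actual script; PyTorch stores Conv1d weights as `[out, in, kernel]`, while MLX expects channels-last `[out, kernel, in]`):

```python
import numpy as np

# PyTorch Conv1d weight layout: [out_channels, in_channels, kernel_size]
torch_w = np.zeros((128, 258, 3), dtype=np.float32)

# MLX Conv1d expects channels-last: [out_channels, kernel_size, in_channels]
mlx_w = torch_w.transpose(0, 2, 1)  # shape becomes (128, 3, 258)

# LSTM biases: PyTorch keeps separate input/hidden biases, but both are
# added in the same gate pre-activation, so they fold into one vector.
bias_ih = np.ones(512, dtype=np.float32)
bias_hh = np.full(512, 2.0, dtype=np.float32)
lstm_bias = bias_ih + bias_hh       # shape (512,)
```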
### Weight Mapping
| JIT Key | MLX Key | Shape |
|---|---|---|
| `_model.stft.forward_basis_buffer` | `stft.weight` | [258, 256, 1] |
| `_model.encoder.{i}.reparam_conv.weight` | `encoder.{i}.weight` | varies |
| `_model.encoder.{i}.reparam_conv.bias` | `encoder.{i}.bias` | varies |
| `_model.decoder.rnn.weight_ih` | `lstm.Wx` | [512, 128] |
| `_model.decoder.rnn.weight_hh` | `lstm.Wh` | [512, 128] |
| `_model.decoder.rnn.bias_ih + bias_hh` | `lstm.bias` | [512] |
| `_model.decoder.decoder.2.weight` | `decoder.weight` | [1, 1, 128] |
| `_model.decoder.decoder.2.bias` | `decoder.bias` | [1] |
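The `[512, 128]` LSTM shapes follow from the standard PyTorch gate layout: a hidden size of 128 with four gates (input, forget, cell, output) stacked along the first axis. A quick check:

```python
hidden_size = 128
num_gates = 4  # input, forget, cell, output gates stacked row-wise

# weight_ih maps the 128-dim encoder features to all four gate
# pre-activations at once, hence 4 * 128 = 512 rows.
rows = num_gates * hidden_size
```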
## License
The original Silero VAD model is released under the MIT License.