# TEN-VAD → GGML
GGML-format conversion of TEN-framework/ten-vad, a lightweight Voice Activity Detection (VAD) model. As far as we know, this is the only GGML implementation of TEN-VAD.

Conversion scripts and Zig reference implementation: danielbodart/ten-vad-ggml
## Model Details
| Property | Value |
|---|---|
| Model size | 296 KB |
| Format | GGML (FP32, no quantization) |
| Weight tensors | 21 |
| Architecture | Sep Conv → LSTM → Dense → Sigmoid |
| Input | 256 audio samples (16ms @ 16kHz) |
| Output | Speech probability [0, 1] |
## Architecture

```
256 samples (16ms @ 16kHz)
  → Pre-emphasis (0.97) → STFT (Hann 768, FFT 1024) → 40-band Mel + Pitch
  → Z-normalize → Context [3 × 41]
  → SepConv2D(1→16) → MaxPool → SepConv1D(16→16) × 2 → Flatten [80]
  → LSTM(80→64) → LSTM(64→64) → Concat [128]
  → Dense(128→32, ReLU) → Dense(32→1, Sigmoid)
  → probability
```
## Parameters

### Audio Preprocessing
| Parameter | Value |
|---|---|
| Sample rate | 16,000 Hz |
| Frame size | 256 samples (16ms) |
| Pre-emphasis | 0.97 |
| FFT window | Hann, 768 samples |
| FFT size | 1024 |
| Mel bands | 40 (0–8 kHz, Slaney) |
| Features | 41 (40 mel + 1 pitch), Z-normalized |
| Context | 3 frames |
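The first preprocessing steps can be sketched in Python. This is an illustrative numpy snippet under the parameters above, not the reference implementation: the mel filterbank, pitch feature, Z-normalization, and context stacking are omitted, and the function names are our own.

```python
import numpy as np

SAMPLE_RATE = 16000
FRAME_SIZE = 256    # 16 ms hop
WINDOW_SIZE = 768   # Hann window length (spans 3 hops)
FFT_SIZE = 1024
PRE_EMPHASIS = 0.97

def pre_emphasize(buf, prev_sample=0.0):
    """y[n] = x[n] - 0.97 * x[n-1], carrying the last sample across calls."""
    out = np.empty_like(buf)
    out[0] = buf[0] - PRE_EMPHASIS * prev_sample
    out[1:] = buf[1:] - PRE_EMPHASIS * buf[:-1]
    return out, buf[-1]

def power_spectrum(buf):
    """Magnitude-squared rFFT of a Hann-windowed 768-sample buffer, zero-padded to 1024."""
    window = np.hanning(WINDOW_SIZE)
    spec = np.fft.rfft(buf * window, n=FFT_SIZE)
    return spec.real ** 2 + spec.imag ** 2

buf = np.random.default_rng(0).standard_normal(WINDOW_SIZE).astype(np.float32)
emph, last = pre_emphasize(buf)
ps = power_spectrum(emph)
print(ps.shape)  # (513,) = FFT_SIZE // 2 + 1 bins, ready for the 40-band mel filterbank
```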
### Model Hyperparameters
| Parameter | Value |
|---|---|
| Sep conv layers | 3 (2D + 1D + 1D) |
| LSTM layers | 2 (hidden dim 64) |
| LSTM layer 0 input | 80 |
| Dense layers | 128→32 (ReLU), 32→1 (Sigmoid) |
| LSTM state reset | Every 1875 frames (30 s) |
## Recommended Thresholds
| Parameter | Value |
|---|---|
| Onset (speech starts) | 0.6 |
| Offset (speech continues) | 0.5 |
| Min silence | 1000 ms |
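One way these thresholds could be applied is simple per-frame hysteresis with a silence hangover. This is a minimal sketch under that assumption; the function and constant names are illustrative, not from the reference implementation.

```python
ONSET = 0.6           # probability needed to enter speech
OFFSET = 0.5          # probability needed to stay in speech
MIN_SILENCE_MS = 1000 # hangover before leaving speech
FRAME_MS = 16         # one 256-sample frame at 16 kHz

def segment(probs):
    """Toy hysteresis segmenter: returns a per-frame in-speech flag."""
    min_silence_frames = MIN_SILENCE_MS // FRAME_MS  # 62 frames
    in_speech = False
    silence = 0
    states = []
    for p in probs:
        if not in_speech:
            if p >= ONSET:          # onset threshold starts speech
                in_speech = True
                silence = 0
        else:
            if p < OFFSET:          # below offset, count toward the hangover
                silence += 1
                if silence >= min_silence_frames:
                    in_speech = False
            else:
                silence = 0
        states.append(in_speech)
    return states

# 0.55 alone is below onset; 0.7 enters speech; 0.55 then keeps it going
print(segment([0.55, 0.7, 0.55, 0.3])[:2])  # [False, True]
```

Using separate onset/offset thresholds avoids flickering near a single cutoff, and the hangover keeps short pauses inside one speech segment.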
## GGML Binary Format

```
Magic:       0x67676d6c ("ggml")
Type:        "ten-vad"
Version:     1.0.0
Hyperparams: n_sep_conv=3, n_lstm=2, hidden=64, lstm1_in=80, lstm2_in=64,
             dense1_in=128, dense1_out=32, dense2_out=1

21 tensors (F32):
  sep_conv_{0,1,2}_{dw,pw,bias}                    × 9
  lstm_{0,1}_{ih_weight,hh_weight,ih_bias,hh_bias} × 8
  dense_{0,1}_{weight,bias}                        × 4
```

LSTM gates are stored in PyTorch order (i/f/g/o), converted from ONNX order (i/o/f/c).
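The gate reordering can be illustrated with a small numpy sketch. `onnx_to_pytorch_gates` is a hypothetical helper, not the actual conversion script; it only shows how the stacked gate blocks move.

```python
import numpy as np

HIDDEN = 64  # per-gate dimension

def onnx_to_pytorch_gates(w):
    """Reorder stacked LSTM gate blocks from ONNX (i, o, f, c) to PyTorch (i, f, g, o).

    w has shape (4 * HIDDEN, in_dim), with one HIDDEN-row block per gate.
    ONNX's c (cell candidate) gate corresponds to PyTorch's g gate.
    """
    i, o, f, c = np.split(w, 4, axis=0)
    return np.concatenate([i, f, c, o], axis=0)

w = np.arange(4 * HIDDEN * 80).reshape(4 * HIDDEN, 80)
w2 = onnx_to_pytorch_gates(w)
# the f block (ONNX rows 2*HIDDEN..3*HIDDEN) moves into the second slot
assert np.array_equal(w2[HIDDEN:2 * HIDDEN], w[2 * HIDDEN:3 * HIDDEN])
```

The same reordering applies to both `ih`/`hh` weight matrices and the bias vectors.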
## Usage

```sh
# Download
hf download danielbodart/ten-vad-ggml ten-vad-ggml.bin --local-dir .
```
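A downloaded file can be sanity-checked against the magic from the format section. This sketch assumes the header begins with the magic written as a little-endian uint32, which is the usual GGML convention; it does not attempt to parse the rest of the header.

```python
import struct
import tempfile

GGML_MAGIC = 0x67676D6C  # "ggml", as stated in the format section

def check_magic(path):
    """Read the first 4 bytes and compare against the GGML magic (little-endian assumed)."""
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
    return magic == GGML_MAGIC

# Demonstrate on a synthetic file rather than the real download:
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
tmp.write(struct.pack("<I", GGML_MAGIC))
tmp.close()
print(check_magic(tmp.name))  # True
```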
## Files

| File | Size | Description |
|---|---|---|
| ten-vad-ggml.bin | 296 KB | GGML model binary |
## Origin

Converted from the official TEN-VAD ONNX model using convert-ten-vad-to-ggml.py from the ten-vad-ggml GitHub repo. Verified against the native libten_vad.so: output speech probabilities match within a maximum delta of 0.001.

Originally developed for capsper, a push-to-talk voice dictation tool for Linux.
## License

MIT (conversion scripts). Original model: TEN-framework/ten-vad.