TEN-VAD - GGML

GGML format conversion of TEN-framework/ten-vad, a lightweight Voice Activity Detection model. This is the only GGML implementation of TEN-VAD that we're aware of.

Conversion scripts and Zig reference implementation: danielbodart/ten-vad-ggml

Model Details

| Property | Value |
| --- | --- |
| Model size | 296 KB |
| Format | GGML (FP32, no quantization) |
| Weight tensors | 21 |
| Architecture | Sep Conv → LSTM → Dense → Sigmoid |
| Input | 256 audio samples (16 ms @ 16 kHz) |
| Output | Speech probability [0, 1] |

Architecture

256 samples (16ms @ 16kHz)
  → Pre-emphasis (0.97) → STFT (Hann 768, FFT 1024) → 40-band Mel + Pitch
  → Z-normalize → Context [3 × 41]
  → SepConv2D(1→16) → MaxPool → SepConv1D(16→16) × 2 → Flatten [80]
  → LSTM(80→64) → LSTM(64→64) → Concat [128]
  → Dense(128→32, ReLU) → Dense(32→1, Sigmoid)
  → probability

Parameters

Audio Preprocessing

| Parameter | Value |
| --- | --- |
| Sample rate | 16,000 Hz |
| Frame size | 256 samples (16 ms) |
| Pre-emphasis | 0.97 |
| FFT window | Hann, 768 samples |
| FFT size | 1024 |
| Mel bands | 40 (0–8 kHz, Slaney) |
| Features | 41 (40 mel + 1 pitch), Z-normalized |
| Context | 3 frames |
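
The first two preprocessing steps (framing and pre-emphasis) can be sketched in a few lines of Python. This is a minimal illustration and not the reference implementation: it assumes the common first-order filter y[n] = x[n] - 0.97 * x[n-1], and the helper names `split_frames` and `pre_emphasize` are hypothetical. The STFT, mel, and pitch stages are not shown.

```python
SAMPLE_RATE = 16_000   # Hz, from the table above
FRAME_SIZE = 256       # samples per frame, from the table above
PRE_EMPHASIS = 0.97    # coefficient, from the table above


def frame_duration_ms(frame_size=FRAME_SIZE, sample_rate=SAMPLE_RATE):
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_size / sample_rate


def split_frames(samples, frame_size=FRAME_SIZE):
    """Split samples into complete, non-overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    last_start = len(samples) - frame_size
    return [samples[i:i + frame_size] for i in range(0, last_start + 1, frame_size)]


def pre_emphasize(frame, coeff=PRE_EMPHASIS, prev=0.0):
    """Apply y[n] = x[n] - coeff * x[n-1] (assumed filter form).

    `prev` is the last raw sample of the previous frame, so the filter
    stays continuous across frame boundaries. Returns (filtered, new_prev).
    """
    out = []
    for x in frame:
        out.append(x - coeff * prev)
        prev = x
    return out, prev


if __name__ == "__main__":
    audio = [0.0] * 600                  # placeholder audio, 600 samples
    frames = split_frames(audio)
    print(len(frames), frame_duration_ms())
```

Carrying `prev` between calls matters for streaming use: without it, every frame would start as if the previous sample were zero.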

Model Hyperparameters

| Parameter | Value |
| --- | --- |
| Sep conv layers | 3 (2D + 1D + 1D) |
| LSTM layers | 2 (hidden dim 64) |
| LSTM layer 0 input | 80 |
| Dense layers | 128→32 (ReLU), 32→1 (Sigmoid) |
| LSTM state reset | Every 1875 frames (30 s) |

Recommended Thresholds

| Parameter | Value |
| --- | --- |
| Onset (speech starts) | 0.6 |
| Offset (speech continues) | 0.5 |
| Min silence | 1000 ms |
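
One way to apply these thresholds is a small hysteresis state machine over the per-frame probabilities. The sketch below is an interpretation, not the logic shipped with the model: it assumes speech starts when the probability reaches the onset value, continues while it stays at or above the offset value, and ends only after the minimum silence has elapsed. The function name `segment_speech` is hypothetical.

```python
FRAME_MS = 16  # one 256-sample frame at 16 kHz


def segment_speech(probs, onset=0.6, offset=0.5, min_silence_ms=1000,
                   frame_ms=FRAME_MS):
    """Return half-open (start_frame, end_frame) speech segments."""
    # Ceiling division: 1000 ms / 16 ms = 62.5, so 63 frames of silence.
    min_silence_frames = -(-min_silence_ms // frame_ms)
    segments = []
    start = None   # frame index where the current segment began
    silence = 0    # consecutive below-offset frames inside a segment
    for i, p in enumerate(probs):
        if start is None:
            if p >= onset:
                start = i
                silence = 0
        elif p >= offset:
            silence = 0
        else:
            silence += 1
            if silence >= min_silence_frames:
                # The segment ends at the first frame of the silent run.
                segments.append((start, i - silence + 1))
                start = None
                silence = 0
    if start is not None:
        # Input ended mid-segment: trim any trailing silent frames.
        segments.append((start, len(probs) - silence))
    return segments


if __name__ == "__main__":
    probs = [0.1] * 5 + [0.7] * 10 + [0.55] * 5 + [0.1] * 70 + [0.9] * 4
    print(segment_speech(probs))
```

Using a lower offset than onset keeps a segment open through brief dips in probability, while the minimum silence prevents short pauses from splitting one utterance into many.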

GGML Binary Format

Magic: 0x67676d6c ("ggml")
Type:  "ten-vad"
Version: 1.0.0
Hyperparams: n_sep_conv=3, n_lstm=2, hidden=64, lstm1_in=80, lstm2_in=64,
             dense1_in=128, dense1_out=32, dense2_out=1

21 tensors (F32):
  sep_conv_{0,1,2}_{dw,pw,bias}           × 9
  lstm_{0,1}_{ih_weight,hh_weight,ih_bias,hh_bias}  × 8
  dense_{0,1}_{weight,bias}                × 4
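
The brace patterns above expand to the full list of 21 tensor names. The short sketch below generates them, which can be handy for checking that a loader found every tensor. The ordering produced here is an assumption and may not match the order on disk, and the helper names are hypothetical.

```python
from itertools import product


def expand(prefix, indices, suffixes):
    """Expand prefix_{indices}_{suffixes} into concrete tensor names."""
    return [f"{prefix}_{i}_{s}" for i, s in product(indices, suffixes)]


def expected_tensor_names():
    """All tensor names implied by the patterns in the format listing."""
    return (
        expand("sep_conv", range(3), ["dw", "pw", "bias"])
        + expand("lstm", range(2),
                 ["ih_weight", "hh_weight", "ih_bias", "hh_bias"])
        + expand("dense", range(2), ["weight", "bias"])
    )


if __name__ == "__main__":
    names = expected_tensor_names()
    print(len(names))  # 21
    for name in names:
        print(name)
```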

LSTM gates stored in PyTorch order (i/f/g/o), converted from ONNX order (i/o/f/c).
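
The gate reordering amounts to shuffling four equally sized row blocks. As a rough sketch, assuming a weight matrix (or bias vector) is held as a list of `4 * hidden_size` rows with the gates stacked along the first axis, the conversion could look like the following. The function name is hypothetical, and ONNX's cell gate `c` corresponds to PyTorch's `g`.

```python
ONNX_GATE_ORDER = "iofc"   # input, output, forget, cell
TORCH_GATE_ORDER = "ifco"  # PyTorch i/f/g/o, written with ONNX's c for g


def onnx_to_pytorch_gates(rows, hidden_size):
    """Reorder stacked gate blocks from ONNX order to PyTorch order.

    `rows` has length 4 * hidden_size; each element is one row of a
    weight matrix or one entry of a bias vector.
    """
    if len(rows) != 4 * hidden_size:
        raise ValueError("expected 4 * hidden_size rows")
    # Slice the stacked rows into one block per gate, keyed by gate letter.
    blocks = {
        gate: rows[k * hidden_size:(k + 1) * hidden_size]
        for k, gate in enumerate(ONNX_GATE_ORDER)
    }
    reordered = []
    for gate in TORCH_GATE_ORDER:
        reordered.extend(blocks[gate])
    return reordered


if __name__ == "__main__":
    onnx_rows = ["i0", "i1", "o0", "o1", "f0", "f1", "c0", "c1"]
    print(onnx_to_pytorch_gates(onnx_rows, hidden_size=2))
```

For this model the function would be called with `hidden_size=64`, so each gate block has 64 rows.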

Usage

# Download
hf download danielbodart/ten-vad-ggml ten-vad-ggml.bin --local-dir .

Files

| File | Size | Description |
| --- | --- | --- |
| ten-vad-ggml.bin | 296 KB | GGML model binary |

Origin

Converted from the official TEN-VAD ONNX model using convert-ten-vad-to-ggml.py from the ten-vad-ggml GitHub repo. Verified against the native libten_vad.so: the speech probabilities match to within a maximum delta of 0.001.

Originally developed for capsper, a push-to-talk voice dictation tool for Linux.

License

MIT (conversion scripts). Original model: TEN-framework/ten-vad.
