TEN-VAD — GGML

A GGML-format conversion of TEN-framework/ten-vad, a lightweight Voice Activity Detection (VAD) model. As far as we know, this is the only GGML implementation of TEN-VAD.

Conversion scripts and Zig reference implementation: danielbodart/ten-vad-ggml

Model Details

Property	Value
Model size	296 KB
Format	GGML (FP32, no quantization)
Weight tensors	21
Architecture	Sep Conv → LSTM → Dense → Sigmoid
Input	256 audio samples (16ms @ 16kHz)
Output	Speech probability [0, 1]

Architecture

256 samples (16ms @ 16kHz)
  → Pre-emphasis (0.97) → STFT (Hann 768, FFT 1024) → 40-band Mel + Pitch
  → Z-normalize → Context [3 × 41]
  → SepConv2D(1→16) → MaxPool → SepConv1D(16→16) × 2 → Flatten [80]
  → LSTM(80→64) → LSTM(64→64) → Concat [128]
  → Dense(128→32, ReLU) → Dense(32→1, Sigmoid)
  → probability
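
As a rough sanity check on the layer sizes above, the parameter count of the LSTM and dense layers can be worked out by hand. This sketch assumes the standard LSTM layout (four gates, separate input and recurrent biases); the helper names are illustrative and not taken from the conversion scripts.

```python
# Parameter count for the recurrent and dense layers, assuming the
# standard LSTM layout: 4 gates, with separate input and recurrent biases.

def lstm_params(input_dim, hidden_dim):
    gates = 4
    ih = gates * hidden_dim * input_dim    # input-to-hidden weights
    hh = gates * hidden_dim * hidden_dim   # hidden-to-hidden weights
    biases = 2 * gates * hidden_dim        # ih_bias + hh_bias
    return ih + hh + biases

def dense_params(in_dim, out_dim):
    return in_dim * out_dim + out_dim      # weights + bias

total = (
    lstm_params(80, 64)      # LSTM(80→64)
    + lstm_params(64, 64)    # LSTM(64→64)
    + dense_params(128, 32)  # Dense(128→32)
    + dense_params(32, 1)    # Dense(32→1)
)
size_kib = total * 4 / 1024  # FP32 = 4 bytes per parameter

print(total, round(size_kib, 2))
```

Under these assumptions the LSTM and dense layers alone come to roughly 292 KiB in FP32, which would account for most of the 296 KB file; the separable convolutions and the header make up the rest.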

Parameters

Audio Preprocessing

Parameter	Value
Sample rate	16,000 Hz
Frame size	256 samples (16ms)
Pre-emphasis	0.97
FFT window	Hann, 768 samples
FFT size	1024
Mel bands	40 (0–8 kHz, Slaney)
Features	41 (40 mel + 1 pitch), Z-normalized
Context	3 frames
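
A minimal sketch of the first preprocessing steps, pre-emphasis and framing, in plain Python. It assumes the usual first-order filter y[n] = x[n] - 0.97 * x[n-1] with a zero initial sample and non-overlapping 256-sample frames; the function names are hypothetical and the reference implementation may handle boundaries differently.

```python
SAMPLE_RATE = 16_000
FRAME_SIZE = 256
PRE_EMPHASIS = 0.97

def pre_emphasis(samples, coeff=PRE_EMPHASIS, prev=0.0):
    """First-order high-pass: y[n] = x[n] - coeff * x[n-1].

    `prev` is the last sample of the previous chunk (assumed 0.0 at start).
    """
    out = []
    for x in samples:
        out.append(x - coeff * prev)
        prev = x
    return out

def split_frames(samples, frame_size=FRAME_SIZE):
    """Split into full, non-overlapping frames; a trailing partial frame is dropped."""
    n_full = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(n_full)]

# Duration of one frame in milliseconds.
frame_ms = 1000 * FRAME_SIZE / SAMPLE_RATE
```

The STFT, mel filterbank, pitch feature, and normalization are not shown here.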

Model Hyperparameters

Parameter	Value
Sep conv layers	3 (2D + 1D + 1D)
LSTM layers	2 (hidden dim 64)
LSTM layer 0 input	80
Dense layers	128→32 (ReLU), 32→1 (Sigmoid)
LSTM state reset	Every 1875 frames (30s)
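
The reset interval follows directly from the frame size: 30 s of audio at 16 kHz is 480,000 samples, or 1875 frames of 256 samples. A small sketch of that arithmetic, with a hypothetical helper for deciding when to clear the LSTM state in a streaming loop (the exact frame on which the reference implementation resets is an assumption):

```python
SAMPLE_RATE = 16_000
FRAME_SIZE = 256
RESET_SECONDS = 30

# 30 s * 16,000 samples/s / 256 samples per frame
FRAMES_PER_RESET = RESET_SECONDS * SAMPLE_RATE // FRAME_SIZE

def should_reset(frame_index, period=FRAMES_PER_RESET):
    """True when the LSTM hidden and cell state should be zeroed before this frame."""
    return frame_index > 0 and frame_index % period == 0
```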

Recommended Thresholds

Parameter	Value
Onset (speech starts)	0.6
Offset (speech continues)	0.5
Min silence	1000 ms
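
One way to apply these thresholds is a simple hysteresis loop over the per-frame probabilities. The sketch below is an interpretation, not the capsper implementation: speech starts at the first frame with probability ≥ onset, continues while probability ≥ offset, and ends only after at least 1000 ms of consecutive frames below offset (rounded up to 63 frames of 16 ms). The function name and return format are hypothetical.

```python
import math

def segment_speech(probs, onset=0.6, offset=0.5, min_silence_ms=1000, frame_ms=16):
    """Return (start_frame, end_frame) pairs, end exclusive."""
    min_silence_frames = math.ceil(min_silence_ms / frame_ms)
    segments = []
    in_speech = False
    start = 0
    silence_run = 0  # consecutive frames below `offset` while in speech

    for i, p in enumerate(probs):
        if not in_speech:
            if p >= onset:
                in_speech = True
                start = i
                silence_run = 0
        elif p >= offset:
            silence_run = 0
        else:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # The segment ends at the first frame of the silent run.
                segments.append((start, i - silence_run + 1))
                in_speech = False
                silence_run = 0

    if in_speech:
        # Input ended mid-speech; trim any trailing silent frames.
        segments.append((start, len(probs) - silence_run))
    return segments

probs = [0.1] * 5 + [0.9] * 10 + [0.55] * 5 + [0.2] * 70 + [0.9] * 3
print(segment_speech(probs))  # [(5, 20), (90, 93)]
```

Using a lower offset than onset keeps a segment open through brief dips in probability, and the minimum-silence window prevents short pauses from splitting an utterance.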

GGML Binary Format

Magic: 0x67676d6c ("ggml")
Type:  "ten-vad"
Version: 1.0.0
Hyperparams: n_sep_conv=3, n_lstm=2, hidden=64, lstm1_in=80, lstm2_in=64,
             dense1_in=128, dense1_out=32, dense2_out=1

21 tensors (F32):
  sep_conv_{0,1,2}_{dw,pw,bias}           × 9
  lstm_{0,1}_{ih_weight,hh_weight,ih_bias,hh_bias}  × 8
  dense_{0,1}_{weight,bias}                × 4
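
For reference, the brace patterns above expand to the following 21 names. This sketch assumes the pieces are joined with underscores exactly as written; the order in which tensors appear in the file is not implied.

```python
from itertools import product

def expand(prefix, indices, suffixes):
    """Expand e.g. ("lstm", [0, 1], ["ih_weight"]) to ["lstm_0_ih_weight", ...]."""
    return [f"{prefix}_{i}_{s}" for i, s in product(indices, suffixes)]

names = (
    expand("sep_conv", range(3), ["dw", "pw", "bias"])
    + expand("lstm", range(2), ["ih_weight", "hh_weight", "ih_bias", "hh_bias"])
    + expand("dense", range(2), ["weight", "bias"])
)

for name in names:
    print(name)
```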

LSTM gates stored in PyTorch order (i/f/g/o), converted from ONNX order (i/o/f/c).
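
The reordering amounts to a block permutation along the gate axis. A minimal sketch in plain Python, assuming each weight or bias is laid out as four consecutive blocks of `hidden` rows, one per gate; the function name is hypothetical and this is not the code from convert-ten-vad-to-ggml.py. ONNX's `c` (cell candidate) corresponds to PyTorch's `g`.

```python
ONNX_GATES = ["i", "o", "f", "c"]   # input, output, forget, cell
TORCH_GATES = ["i", "f", "g", "o"]  # input, forget, cell (g), output

def reorder_gates(rows, hidden):
    """Reorder 4*hidden rows from ONNX gate order to PyTorch gate order."""
    if len(rows) != 4 * hidden:
        raise ValueError("expected 4 * hidden rows")
    # Slice out one block of `hidden` rows per ONNX gate.
    blocks = {
        gate: rows[k * hidden:(k + 1) * hidden]
        for k, gate in enumerate(ONNX_GATES)
    }
    # Reassemble as i, f, g (ONNX c), o.
    return blocks["i"] + blocks["f"] + blocks["c"] + blocks["o"]

print(reorder_gates(["i0", "i1", "o0", "o1", "f0", "f1", "c0", "c1"], hidden=2))
# ['i0', 'i1', 'f0', 'f1', 'c0', 'c1', 'o0', 'o1']
```

The same permutation applies to biases, with each "row" being a single value.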

Usage

# Download
hf download danielbodart/ten-vad-ggml ten-vad-ggml.bin --local-dir .

Files

File	Size	Description
`ten-vad-ggml.bin`	296 KB	GGML model binary

Origin

Converted from the official TEN-VAD ONNX model using convert-ten-vad-to-ggml.py from the ten-vad-ggml GitHub repo. Verified against the native libten_vad.so: speech probabilities agree to within 0.001 (max delta < 0.001).

Originally developed for capsper, a push-to-talk voice dictation tool for Linux.

License

MIT (conversion scripts). Original model: TEN-framework/ten-vad.
