Contextual VAD

contextual-vad is a small scikit-learn classifier that predicts higher-level voice-agent events from derived VAD, STT, TTS, and dialogue-state features.

It is designed to sit on top of a streaming STT + LLM + TTS pipeline:

audio VAD + streaming STT partials + assistant/TTS state + dialogue context
  -> contextual-vad
  -> event probabilities
  -> deterministic turn-taking policy

It produces probabilities for:

listening
speech_started
endpoint_candidate
turn_committed
user_resumed
interruption_started
interruption_confirmed
backchannel_detected
false_alarm

Training source: synthetic_bootstrap:n=12000:seed=7

Validation accuracy: 0.9938

Intended Use

Use this as a bootstrap policy model. Replace the synthetic bootstrap data with real call-frame logs before production use.

The model is meant to complement, not replace, acoustic VAD:

acoustic VAD provides fast speech-start/speech-stop signals
STT partials provide semantic hints about whether a thought is complete
TTS/assistant state helps distinguish real user barge-in from echo or backchannels
this model estimates the event probabilities that a state machine can turn into product behavior
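
As a rough illustration of the last point, the sketch below shows one way a deterministic policy could turn event probabilities into an action. The event names come from the list above. The `Thresholds` values, the `decide` function, and the action strings (`stop_tts`, `duck_tts`, and so on) are hypothetical and are not part of this repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cut-offs; tune them on real call logs.
    commit: float = 0.80
    interrupt: float = 0.70
    resume: float = 0.60


def decide(probs, assistant_speaking, t=Thresholds()):
    """Map a {event_name: probability} dict to a single action string."""

    def p(name):
        # Missing events count as probability 0.
        return probs.get(name, 0.0)

    if assistant_speaking:
        if p("interruption_confirmed") >= t.interrupt:
            return "stop_tts"
        if p("interruption_started") >= t.interrupt:
            return "duck_tts"
        # Backchannels, false alarms, and weak signals leave playback alone.
        return "keep_speaking"

    if p("user_resumed") >= t.resume:
        return "cancel_pending_reply"
    if p("turn_committed") >= t.commit:
        return "send_to_llm"
    # Endpoint candidates and everything else: wait for more evidence.
    return "keep_listening"
```

Keeping the thresholds in one frozen object makes the policy easy to audit and to retune without retraining the classifier.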

Features

Input features include:

VAD probability, speech duration, silence duration, and energy
STT confidence, stable transcript length, partial transcript length, and word counts
semantic hints such as continuation endings and whether required slots are filled
assistant/TTS state, including whether the assistant is speaking and estimated echo risk
dialogue context such as expected answer type
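
To make the feature groups more concrete, here is a minimal sketch of how a few of them might be derived from raw signals. Everything in it is an assumption for illustration: the feature names, the 20 ms frame length, the 0.5 speech threshold, and the `CONTINUATION_WORDS` set are hypothetical. The names the model actually expects are listed in feature_schema.json.

```python
# Hypothetical words that suggest the speaker has not finished a thought.
CONTINUATION_WORDS = {"and", "but", "or", "so", "because", "to", "the", "um", "uh"}


def trailing_silence_ms(vad_probs, frame_ms=20, threshold=0.5):
    """Milliseconds of consecutive non-speech frames at the end of the window."""
    silent_frames = 0
    for prob in reversed(vad_probs):
        if prob >= threshold:
            break
        silent_frames += 1
    return silent_frames * frame_ms


def ends_with_continuation(partial):
    """True if the last word of the partial transcript signals more to come."""
    words = partial.lower().rstrip(" .,!?").split()
    return bool(words) and words[-1] in CONTINUATION_WORDS


def derive_features(vad_probs, partial, stt_confidence, assistant_speaking):
    """Build one feature dict; the keys are illustrative, not the real schema."""
    return {
        "vad_prob": vad_probs[-1] if vad_probs else 0.0,
        "silence_ms": trailing_silence_ms(vad_probs),
        "stt_confidence": stt_confidence,
        "partial_word_count": len(partial.split()),
        "ends_with_continuation": int(ends_with_continuation(partial)),
        "assistant_speaking": int(assistant_speaking),
    }
```

For example, a partial transcript such as "I would like to" followed by a short pause would set `ends_with_continuation` to 1, which is a hint that the pause is not the end of the turn.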

Research Reference

This model is inspired by Voice Activity Projection (VAP):

Erik Ekstedt and Gabriel Skantze, "Voice Activity Projection: Self-supervised Learning of Turn-taking Events", Interspeech 2022.

Paper: https://arxiv.org/abs/2205.09812
PDF: https://arxiv.org/pdf/2205.09812
Reference implementation: https://github.com/ErikEkstedt/VoiceActivityProjection

VAP's key idea is to predict future voice activity and turn-taking events instead of relying only on current speech/non-speech detection. contextual-vad applies that design direction to a practical STT + LLM + TTS wrapper using lightweight tabular features.

Files

turn_event_model.joblib: scikit-learn pipeline.
training_data.csv: training rows used for this artifact.
feature_schema.json: feature names and defaults.
metrics.json: validation metrics.
example_payload.json: one valid inference payload.
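
The files above could be combined for a single inference roughly as follows. This is a sketch under stated assumptions, not documented usage: it assumes feature_schema.json is a flat object mapping feature names to defaults, that example_payload.json is a flat object of feature values, and that the pipeline accepts a one-row pandas DataFrame with those column names. The helper names are hypothetical. Check the actual files before relying on it.

```python
import json
from pathlib import Path


def build_row(payload, defaults):
    """Return one feature dict in schema order, filling gaps with defaults."""
    return {name: payload.get(name, default) for name, default in defaults.items()}


def score(model, rows):
    """Return {event_label: probability} for the first row in `rows`."""
    probabilities = model.predict_proba(rows)[0]
    return {str(label): float(p) for label, p in zip(model.classes_, probabilities)}


def load_and_score(model_dir="."):
    """Load the artifacts from `model_dir` and score example_payload.json."""
    # Third-party imports stay local so the helpers above need only the stdlib.
    import joblib
    import pandas as pd

    root = Path(model_dir)
    model = joblib.load(str(root / "turn_event_model.joblib"))
    # ASSUMPTION: the schema is a flat {"feature_name": default} object.
    defaults = json.loads((root / "feature_schema.json").read_text())
    payload = json.loads((root / "example_payload.json").read_text())
    row = build_row(payload, defaults)
    # ASSUMPTION: the pipeline accepts a one-row DataFrame with these columns.
    return score(model, pd.DataFrame([row]))
```

Calling `load_and_score("path/to/download")` would then return a dict keyed by the event names listed earlier. Because `joblib.load` unpickles arbitrary Python objects, only load model files from a source you trust.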

