Contextual VAD
contextual-vad is a small scikit-learn classifier that predicts higher-level voice-agent events from derived VAD, STT, TTS, and dialogue-state features.
It is designed to sit on top of a streaming STT + LLM + TTS pipeline:
audio VAD + streaming STT partials + assistant/TTS state + dialogue context
-> contextual-vad
-> event probabilities
-> deterministic turn-taking policy
It produces probabilities for:
- listening
- speech_started
- endpoint_candidate
- turn_committed
- user_resumed
- interruption_started
- interruption_confirmed
- backchannel_detected
- false_alarm
Training source: synthetic_bootstrap:n=12000:seed=7
Validation accuracy: 0.9938
Intended Use
Use this as a bootstrap policy model. Replace the synthetic bootstrap data with real call-frame logs before production use.
The model is meant to complement, not replace, acoustic VAD:
- acoustic VAD provides fast speech-start/speech-stop signals
- STT partials provide semantic hints about whether a thought is complete
- TTS/assistant state helps distinguish real user barge-in from echo or backchannels
- this model estimates the event probabilities that a state machine can turn into product behavior
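As a rough illustration of the last point, the sketch below shows one way a deterministic policy could turn event probabilities into actions. The thresholds, the action names (`stop_tts`, `duck_tts`, `send_to_llm`, and so on), and the `TurnPolicy` class are hypothetical and not part of this repository; only the event names come from the model's output labels.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical thresholds; tune them on real call logs.
COMMIT_THRESHOLD = 0.7
INTERRUPT_THRESHOLD = 0.6


@dataclass
class TurnPolicy:
    """Minimal deterministic policy over contextual-vad event probabilities."""

    assistant_speaking: bool = False

    def decide(self, probs: dict[str, float]) -> str:
        """Return one (hypothetical) action name for the current frame."""

        def p(name: str) -> float:
            # Missing events are treated as probability 0.
            return probs.get(name, 0.0)

        if self.assistant_speaking:
            # Only a confirmed interruption stops TTS; backchannels and
            # false alarms fall through to "keep_speaking".
            if p("interruption_confirmed") >= INTERRUPT_THRESHOLD:
                self.assistant_speaking = False
                return "stop_tts"
            if p("interruption_started") >= INTERRUPT_THRESHOLD:
                return "duck_tts"
            return "keep_speaking"

        if p("turn_committed") >= COMMIT_THRESHOLD:
            return "send_to_llm"
        if p("endpoint_candidate") >= COMMIT_THRESHOLD:
            return "wait_for_commit"
        return "keep_listening"
```

Keeping the policy as plain thresholded rules means product behavior stays auditable even if the classifier is retrained.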
Features
Input features include:
- VAD probability, speech duration, silence duration, and energy
- STT confidence, stable transcript length, partial transcript length, and word counts
- semantic hints such as continuation endings and whether required slots are filled
- assistant/TTS state, including whether the assistant is speaking and estimated echo risk
- dialogue context such as expected answer type
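To make the feature groups above more concrete, here is a minimal sketch of how a few of them might be derived from raw per-frame signals. The feature names, the 20 ms frame hop, the 0.5 speech threshold, and the continuation word list are all assumptions for illustration; the actual names and defaults are defined in feature_schema.json.

```python
from __future__ import annotations

FRAME_MS = 20            # assumed frame hop in milliseconds
SPEECH_THRESHOLD = 0.5   # assumed VAD probability cutoff for "speech"
# Illustrative list of words that suggest the user is not finished.
CONTINUATION_WORDS = {"and", "but", "so", "because", "or", "um", "uh"}


def trailing_run_ms(vad_probs: list[float], speech: bool) -> int:
    """Length in ms of the trailing run of speech (or silence) frames."""
    count = 0
    for prob in reversed(vad_probs):
        if (prob >= SPEECH_THRESHOLD) != speech:
            break
        count += 1
    return count * FRAME_MS


def derive_features(
    vad_probs: list[float], partial_text: str, assistant_speaking: bool
) -> dict[str, float]:
    """Build a small feature dict (hypothetical names) for one frame."""
    words = partial_text.lower().split()
    last_word = words[-1].strip(".,!?") if words else ""
    return {
        "vad_prob": vad_probs[-1] if vad_probs else 0.0,
        "silence_ms": trailing_run_ms(vad_probs, speech=False),
        "speech_ms": trailing_run_ms(vad_probs, speech=True),
        "partial_word_count": len(words),
        "ends_with_continuation": int(last_word in CONTINUATION_WORDS),
        "assistant_speaking": int(assistant_speaking),
    }
```

Here `speech_ms` measures only the current trailing run, so it is 0 once the user has gone silent; a production extractor would likely also track the duration of the preceding speech segment.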
Research Reference
This model is inspired by Voice Activity Projection (VAP):
Erik Ekstedt and Gabriel Skantze, "Voice Activity Projection: Self-supervised Learning of Turn-taking Events", Interspeech 2022.
- Paper: https://arxiv.org/abs/2205.09812
- PDF: https://arxiv.org/pdf/2205.09812
- Reference implementation: https://github.com/ErikEkstedt/VoiceActivityProjection
VAP's key idea is to predict future voice activity and turn-taking events instead of relying only on current speech/non-speech detection. contextual-vad applies that design direction to a practical STT + LLM + TTS wrapper using lightweight tabular features.
Files
- turn_event_model.joblib: scikit-learn pipeline.
- training_data.csv: training rows used for this artifact.
- feature_schema.json: feature names and defaults.
- metrics.json: validation metrics.
- example_payload.json: one valid inference payload.
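The files above could be combined for inference roughly as follows. This is a sketch under stated assumptions: it assumes feature_schema.json is a flat mapping of feature name to default value and example_payload.json is a flat mapping of feature name to value, neither of which is confirmed here. The helper names `build_row` and `event_probabilities` are hypothetical. If the pipeline selects columns by name, pass a pandas DataFrame with those column names instead of a plain list.

```python
from __future__ import annotations


def build_row(schema: dict, payload: dict) -> list:
    """Order payload values by the schema, filling gaps with defaults.

    Assumes `schema` maps feature name -> default value (an assumption
    about feature_schema.json, not a documented format).
    """
    unknown = set(payload) - set(schema)
    if unknown:
        raise KeyError(f"unknown features: {sorted(unknown)}")
    return [payload.get(name, default) for name, default in schema.items()]


def event_probabilities(model, row: list) -> dict[str, float]:
    """Map each class label to its predicted probability.

    `model` is any fitted scikit-learn classifier or pipeline exposing
    `predict_proba` and `classes_`.
    """
    probs = model.predict_proba([row])[0]
    return {str(label): float(p) for label, p in zip(model.classes_, probs)}


# Usage with the shipped artifacts (not run here):
#   import json, joblib
#   model = joblib.load("turn_event_model.joblib")
#   with open("feature_schema.json") as f:
#       schema = json.load(f)
#   with open("example_payload.json") as f:
#       payload = json.load(f)
#   print(event_probabilities(model, build_row(schema, payload)))
```

Rejecting unknown feature names early helps catch schema drift between the serving code and the trained artifact.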