---
pipeline_tag: voice-activity-detection
license: bsd-2-clause
tags:
- speech-processing
- semantic-vad
- multimodal
- video
---
# Smart Turn Multimodal

Smart Turn Multimodal extends Pipecat's audio-only Smart Turn model with a video branch, combining both modalities to predict whether a speaker has finished their turn. Visual cues such as mouth movement and gaze help disambiguate pauses that are ambiguous from audio alone.
## Links
- Blog post: Smart Turn Multimodal
- GitHub repo with training and inference code
- Original audio-only Smart Turn v3
## Model architecture
- Audio branch: Whisper Tiny encoder (8s context) with cross-attention pooling → 384-dim embedding
- Video branch: R3D-18 (Kinetics-400 pretrained) processing last 32 frames (~1s) → 256-dim embedding
- Fusion: Late fusion via concatenation + linear projection back to 384-dim (see the sketch after this list)
- Params: ~20M total
- Checkpoint: ONNX available
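For reference, here is a minimal PyTorch sketch of the late-fusion step described above. The class name `LateFusion` and the exact layer layout are illustrative, not the checkpoint's real module names; only the dimensions (384-dim audio, 256-dim video, projection back to 384) come from the list above.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Illustrative late-fusion head: concatenate the two branch
    embeddings and project back to the audio width (384)."""

    def __init__(self, audio_dim: int = 384, video_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(audio_dim + video_dim, audio_dim)

    def forward(self, audio_emb, video_emb):
        # audio_emb: (batch, 384) from the Whisper Tiny branch
        # video_emb: (batch, 256) from the R3D-18 branch
        fused = torch.cat([audio_emb, video_emb], dim=-1)  # (batch, 640)
        return self.proj(fused)                            # (batch, 384)

fused = LateFusion()(torch.randn(2, 384), torch.randn(2, 256))
print(fused.shape)  # torch.Size([2, 384])
```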
## Audio-only fallback

When video is unavailable, pass `None` for `pixel_values`. The model substitutes an all-zero video tensor internally and falls back to audio-only behavior; no other code changes are required.
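A minimal sketch of what that substitution can look like. The helper name `prepare_pixel_values`, the `(batch, channels, frames, height, width)` layout, and the 112×112 crop size are assumptions based on R3D-18's usual input format, not the model's exact internals; the 32-frame window matches the architecture description above.

```python
import torch

def prepare_pixel_values(pixel_values, batch_size=1):
    """Return a clip tensor for the video branch, substituting an
    all-zero clip when no video is available (audio-only fallback)."""
    if pixel_values is None:
        # Assumed layout: (batch, channels, frames, height, width);
        # 32 frames ~ 1 s, and 112x112 is R3D-18's usual input crop.
        return torch.zeros(batch_size, 3, 32, 112, 112)
    return pixel_values
```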
## How to use

```python
import numpy as np
from inference_multimodal import predict_endpoint

# Placeholder: mono float32 waveform (the Whisper branch expects 16 kHz)
audio_array = np.zeros(16000 * 8, dtype=np.float32)

result = predict_endpoint(audio_array, video_path="clip.mp4")
# result = {"prediction": 1, "probability": 0.92}

# Audio-only fallback
result = predict_endpoint(audio_array, video_path=None)
```
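Since an ONNX checkpoint is available, inference can also run through `onnxruntime` directly. This is a sketch only: the file name and the input names `audio_input` and `pixel_values` are assumptions, so query the session for the real ones as shown.

```python
import numpy as np
import onnxruntime as ort

# "smart-turn-multimodal.onnx" is a hypothetical file name.
session = ort.InferenceSession("smart-turn-multimodal.onnx")

# Inspect the exported graph for the real input names and shapes.
for inp in session.get_inputs():
    print(inp.name, inp.shape)

# Hypothetical input names; substitute the ones printed above.
audio = np.zeros((1, 16000 * 8), dtype=np.float32)        # 8 s of 16 kHz audio
video = np.zeros((1, 3, 32, 112, 112), dtype=np.float32)  # zeros -> audio-only mode
outputs = session.run(None, {"audio_input": audio, "pixel_values": video})
print(outputs[0])  # turn-end prediction/probability, depending on the export
```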
## Limitations
- Dataset variety: Currently trained on Meta's Casual Conversations dataset (mostly unscripted monologues). Generalization to diverse conversation styles is still being validated.
- VAD-triggered: The model runs only after VAD detects silence; it does not predict turn endings before a pause occurs.
## Thanks
Thank you to Pipecat for the original Smart Turn model and to Meta for the Casual Conversations dataset.