Smart Turn Multimodal
Smart Turn Multimodal is a multimodal extension of Pipecat's Smart Turn that combines audio and video to predict whether a speaker has finished their turn. Visual cues (mouth movement, gaze) help disambiguate pauses that are ambiguous in audio alone.
Links
- Blog post: Smart Turn Multimodal
- GitHub repo with training and inference code
- Original audio-only Smart Turn v3
Model architecture
- Audio branch: Whisper Tiny encoder (8s context) with cross-attention pooling → 384-dim embedding
- Video branch: R3D-18 (Kinetics-400 pretrained) processing last 32 frames (~1s) → 256-dim embedding
- Fusion: Late fusion via concatenation + linear projection back to 384-dim (see the sketch after this list)
- Params: ~20M total
- Checkpoint: ONNX available
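For intuition, here is a minimal sketch of the late-fusion step in PyTorch, assuming the embedding dimensions listed above. The class and layer names are illustrative, not the repo's actual modules.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Illustrative late fusion: concatenate the audio (384-dim) and video
    (256-dim) embeddings, project back to 384, then score end-of-turn."""

    def __init__(self, audio_dim: int = 384, video_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(audio_dim + video_dim, audio_dim)
        self.classifier = nn.Linear(audio_dim, 1)

    def forward(self, audio_emb: torch.Tensor, video_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([audio_emb, video_emb], dim=-1)  # (batch, 640)
        fused = self.proj(fused)                           # (batch, 384)
        return torch.sigmoid(self.classifier(fused))       # P(turn complete)

# Example: one audio embedding and one video embedding
prob = LateFusionHead()(torch.randn(1, 384), torch.randn(1, 256))
```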
Audio-only fallback
When video is unavailable, pass None for pixel_values. The model uses a zero tensor internally, falling back to audio-only behavior—no code changes required.
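Conceptually, the substitution looks something like the sketch below; the helper name and the 112×112 frame size are hypothetical, chosen to match R3D-18's expected (channels, frames, height, width) layout.

```python
import numpy as np

def pixel_values_or_zeros(pixel_values, num_frames=32, size=112):
    """Hypothetical helper: when no video is provided, feed the video branch
    a zero tensor so the model degrades gracefully to audio-only behavior."""
    if pixel_values is None:
        return np.zeros((3, num_frames, size, size), dtype=np.float32)
    return pixel_values
```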
How to use
```python
from inference_multimodal import predict_endpoint

# Multimodal inference: audio waveform plus a short video clip
result = predict_endpoint(audio_array, video_path="clip.mp4")
# result = {"prediction": 1, "probability": 0.92}

# Audio-only fallback: omit the video
result = predict_endpoint(audio_array, video_path=None)
```
Limitations
- Dataset variety: Currently trained on Meta's Casual Conversations dataset (mostly unscripted monologues). Generalization to diverse conversation styles is still being validated.
- VAD-triggered: The model runs only after VAD detects silence; it does not predict turn endings before a pause occurs (see the sketch after this list).
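As a rough illustration of how this fits into a pipeline, the sketch below queries the model only once a VAD reports enough trailing silence. The `vad` interface and the threshold are hypothetical, not part of this repo.

```python
from inference_multimodal import predict_endpoint

MIN_SILENCE_MS = 200  # hypothetical threshold; tune for your pipeline

def turn_is_complete(vad, audio_array, video_path=None) -> bool:
    """Only query the model after the VAD reports a pause; the model does not
    anticipate turn endings before silence occurs."""
    if vad.trailing_silence_ms < MIN_SILENCE_MS:
        return False
    result = predict_endpoint(audio_array, video_path=video_path)
    return result["prediction"] == 1
```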
Thanks
Thank you to Pipecat for the original Smart Turn model and to Meta for the Casual Conversations dataset.