Smart Turn Multimodal
Smart Turn Multimodal is a multimodal extension of Pipecat's Smart Turn that combines audio and video to predict whether a speaker has finished their turn. Visual cues (mouth movement, gaze) help disambiguate pauses that are ambiguous in audio alone.
Links
- Blog post: Smart Turn Multimodal
- GitHub repo with training and inference code
- Original audio-only Smart Turn v3
Model architecture
- Audio branch: Whisper Tiny encoder (8s context) with cross-attention pooling → 384-dim embedding
- Video branch: R3D-18 (Kinetics-400 pretrained) processing last 32 frames (~1s) → 256-dim embedding
- Fusion: Late fusion via concatenation + linear projection back to 384-dim (see the sketch after this list)
- Params: ~20M total
- Checkpoint: ONNX available
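For intuition, here is a minimal sketch of the late-fusion step in PyTorch, assuming the embedding dimensions listed above. The class and layer names are illustrative, not the repo's actual modules.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Illustrative late fusion: concatenate the audio (384-dim) and video
    (256-dim) embeddings, project back to 384, then score end-of-turn."""

    def __init__(self, audio_dim: int = 384, video_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(audio_dim + video_dim, audio_dim)
        self.classifier = nn.Linear(audio_dim, 1)

    def forward(self, audio_emb: torch.Tensor, video_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([audio_emb, video_emb], dim=-1)  # (batch, 640)
        fused = self.proj(fused)                           # (batch, 384)
        return torch.sigmoid(self.classifier(fused))       # P(turn complete)

# Example: one audio embedding and one video embedding
prob = LateFusionHead()(torch.randn(1, 384), torch.randn(1, 256))
```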
Audio-only fallback
When video is unavailable, pass None for pixel_values. The model uses a zero tensor internally, falling back to audio-only behavior—no code changes required.
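Conceptually, the substitution looks something like the sketch below; the helper name and the 112×112 frame size are hypothetical, chosen to match R3D-18's expected (channels, frames, height, width) layout.

```python
import numpy as np

def pixel_values_or_zeros(pixel_values, num_frames=32, size=112):
    """Hypothetical helper: when no video is provided, feed the video branch
    a zero tensor so the model degrades gracefully to audio-only behavior."""
    if pixel_values is None:
        return np.zeros((3, num_frames, size, size), dtype=np.float32)
    return pixel_values
```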
How to use
```python
from inference_multimodal import predict_endpoint

# Multimodal inference: audio waveform plus a short video clip
result = predict_endpoint(audio_array, video_path="clip.mp4")
# result = {"prediction": 1, "probability": 0.92}

# Audio-only fallback: omit the video
result = predict_endpoint(audio_array, video_path=None)
```
Limitations
- Dataset variety: Currently trained on Meta's Casual Conversations dataset (mostly unscripted monologues). Generalization to diverse conversation styles is still being validated.
- VAD-triggered: The model runs only after VAD detects silence; it does not predict turn endings before a pause occurs (see the sketch after this list).
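As a rough illustration of how this fits into a pipeline, the sketch below queries the model only once a VAD reports enough trailing silence. The `vad` interface and the threshold are hypothetical, not part of this repo.

```python
from inference_multimodal import predict_endpoint

MIN_SILENCE_MS = 200  # hypothetical threshold; tune for your pipeline

def turn_is_complete(vad, audio_array, video_path=None) -> bool:
    """Only query the model after the VAD reports a pause; the model does not
    anticipate turn endings before silence occurs."""
    if vad.trailing_silence_ms < MIN_SILENCE_MS:
        return False
    result = predict_endpoint(audio_array, video_path=video_path)
    return result["prediction"] == 1
```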
Thanks
Thank you to Pipecat for the original Smart Turn model and to Meta for the Casual Conversations dataset.