Smart Turn Multimodal

Smart Turn Multimodal is a multimodal extension of Pipecat's Smart Turn that combines audio and video to predict whether a speaker has finished their turn. Visual cues (mouth movement, gaze) help disambiguate pauses that are ambiguous in audio alone.

Model architecture

  • Audio branch: Whisper Tiny encoder (8s context) with cross-attention pooling → 384-dim embedding
  • Video branch: R3D-18 (Kinetics-400 pretrained) processing last 32 frames (~1s) → 256-dim embedding
  • Fusion: Late fusion via concatenation + linear projection back to 384-dim
  • Params: ~20M total
  • Checkpoint: ONNX available
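The late-fusion step above can be sketched in a few lines. This is a minimal NumPy illustration of "concatenation + linear projection back to 384-dim"; the random projection weights stand in for the model's learned parameters and are not the released checkpoint's values.

```python
import numpy as np

AUDIO_DIM = 384  # Whisper Tiny branch embedding (see list above)
VIDEO_DIM = 256  # R3D-18 branch embedding
FUSED_DIM = 384  # projected back to the audio width

rng = np.random.default_rng(0)
# Hypothetical projection weights; in the real model these are learned.
W = rng.standard_normal((AUDIO_DIM + VIDEO_DIM, FUSED_DIM)).astype(np.float32)

def late_fuse(audio_emb: np.ndarray, video_emb: np.ndarray) -> np.ndarray:
    """Concatenate the two branch embeddings, then project to 384 dims."""
    fused = np.concatenate([audio_emb, video_emb], axis=-1)  # (640,)
    return fused @ W                                         # (384,)

audio_emb = rng.standard_normal(AUDIO_DIM).astype(np.float32)
video_emb = rng.standard_normal(VIDEO_DIM).astype(np.float32)
out = late_fuse(audio_emb, video_emb)
print(out.shape)  # (384,)
```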

Audio-only fallback

When video is unavailable, pass None for pixel_values. The model substitutes a zero tensor internally and falls back to audio-only behavior, so no code changes are required.
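A minimal sketch of that zero-tensor substitution, assuming a 32-frame clip input; the (3, 112, 112) per-frame shape is an illustrative assumption about the video branch's input, not a documented value:

```python
import numpy as np

VIDEO_FRAMES = 32            # last ~1 s of video, per the architecture notes
FRAME_SHAPE = (3, 112, 112)  # assumed per-frame input size; illustrative only

def prepare_pixel_values(pixel_values):
    """Substitute a zero tensor when no video is supplied, so the video
    branch contributes nothing and the model degrades to audio-only."""
    if pixel_values is None:
        return np.zeros((VIDEO_FRAMES, *FRAME_SHAPE), dtype=np.float32)
    return pixel_values

x = prepare_pixel_values(None)
print(x.shape, float(x.sum()))  # (32, 3, 112, 112) 0.0
```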

How to use

from inference_multimodal import predict_endpoint

result = predict_endpoint(audio_array, video_path="clip.mp4")
# result = {"prediction": 1, "probability": 0.92}

# Audio-only fallback
result = predict_endpoint(audio_array, video_path=None)

Limitations

  • Dataset variety: Currently trained on Meta's Casual Conversations dataset (mostly unscripted monologues). Generalization to diverse conversation styles is still being validated.
  • VAD-triggered: The model runs only after a VAD detects silence; it does not predict turn endings before silence occurs.
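The VAD-triggered limitation can be made concrete with a toy driver loop. Here `is_silent` stands in for whatever VAD the surrounding pipeline provides; the gating function is a hypothetical sketch, not the pipeline's actual API:

```python
def should_run_turn_model(is_silent: bool) -> bool:
    """Gate the endpoint model on VAD output: it never fires mid-speech."""
    return is_silent

def drive(frames):
    """frames: list of (audio_chunk, is_silent) pairs from a hypothetical VAD.
    Returns the frame indices at which the turn model would be invoked."""
    triggers = []
    for i, (_chunk, is_silent) in enumerate(frames):
        if should_run_turn_model(is_silent):
            # Here the real pipeline would call predict_endpoint(...)
            triggers.append(i)
    return triggers

# speech, speech, silence, speech, silence
print(drive([(b"", False), (b"", False), (b"", True), (b"", False), (b"", True)]))
# [2, 4]
```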

Thanks

Thank you to Pipecat for the original Smart Turn model and to Meta for the Casual Conversations dataset.
