|
|
--- |
|
|
pipeline_tag: voice-activity-detection |
|
|
license: bsd-2-clause |
|
|
tags: |
|
|
- speech-processing |
|
|
- semantic-vad |
|
|
- multimodal |
|
|
- video |
|
|
--- |
|
|
# Smart Turn Multimodal |
|
|
|
|
|
**Smart Turn Multimodal** is a multimodal extension of Pipecat's Smart Turn that combines audio and video to predict whether a speaker has finished their turn. Visual cues such as mouth movement and gaze help resolve pauses that are ambiguous from audio alone. |
|
|
|
|
|
## Links |
|
|
|
|
|
* [Blog post: Smart Turn Multimodal](https://susurobo.jp/blog/smart_turn_multimodal.html) |
|
|
* [GitHub repo](https://github.com/susurobo/smart-turn-multimodal) with training and inference code |
|
|
* Original audio-only [Smart Turn v3](https://huggingface.co/pipecat-ai/smart-turn-v3) |
|
|
|
|
|
## Model architecture |
|
|
|
|
|
* **Audio branch:** Whisper Tiny encoder (8s context) with cross-attention pooling → 384-dim embedding |
|
|
* **Video branch:** R3D-18 (Kinetics-400 pretrained) processing last 32 frames (~1s) → 256-dim embedding |
|
|
* **Fusion:** Late fusion via concatenation + linear projection back to 384-dim |
|
|
* Params: ~20M total |
|
|
* Checkpoint: ONNX available |
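
The late-fusion step above can be sketched in a few lines. This is an illustrative NumPy sketch of "concatenation + linear projection back to 384-dim", not the released checkpoint's actual weights or module names; the array shapes follow the embedding sizes listed in the bullets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Embeddings produced by the two branches (values here are random stand-ins).
audio_emb = rng.standard_normal(384)  # Whisper Tiny branch -> 384-dim
video_emb = rng.standard_normal(256)  # R3D-18 branch -> 256-dim

# Hypothetical projection parameters: (384, 640) weight, 384-dim bias.
W = rng.standard_normal((384, 640)) * 0.02
b = np.zeros(384)

# Late fusion: concatenate, then project back to the audio embedding size.
fused = np.concatenate([audio_emb, video_emb])  # shape (640,)
projected = W @ fused + b                       # shape (384,)
print(projected.shape)
```

The fused 384-dim vector can then feed the same classification head as the audio-only model, which is one reason late fusion keeps the parameter count low.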
|
|
|
|
|
## Audio-only fallback |
|
|
|
|
|
When video is unavailable, pass `None` for `pixel_values`. The model internally substitutes a zero tensor for the video input, so it falls back to audio-only behavior with no code changes required. |
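
The fallback logic can be sketched as follows. `encode_video` and `video_embedding_or_zeros` are hypothetical names for illustration, not functions from the repo; the point is only that a zero vector stands in for the video branch's output when no frames are provided.

```python
import numpy as np

VIDEO_EMB_DIM = 256  # video-branch embedding size from the architecture above

def encode_video(pixel_values):
    # Placeholder for the real R3D-18 forward pass.
    return np.ones(VIDEO_EMB_DIM)

def video_embedding_or_zeros(pixel_values):
    """Return the video embedding, or zeros when no video is available."""
    if pixel_values is None:
        # Audio-only fallback: the fusion layer sees an all-zero video vector.
        return np.zeros(VIDEO_EMB_DIM)
    return encode_video(pixel_values)

emb = video_embedding_or_zeros(None)
print(emb.shape, float(emb.sum()))
```

Because the zero vector is handled inside the model, callers only ever toggle one argument rather than switching between two model variants.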
|
|
|
|
|
## How to use |
|
|
|
|
|
```python |
|
|
from inference_multimodal import predict_endpoint |
|
|
|
|
|
result = predict_endpoint(audio_array, video_path="clip.mp4") |
|
|
# result = {"prediction": 1, "probability": 0.92} |
|
|
|
|
|
# Audio-only fallback |
|
|
result = predict_endpoint(audio_array, video_path=None) |
|
|
``` |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Dataset variety:** Currently trained on Meta's [Casual Conversations dataset](https://ai.meta.com/datasets/casual-conversations-dataset/) (mostly unscripted monologues). Generalization to diverse conversation styles is still being validated. |
|
|
- **VAD-triggered:** The model is triggered by VAD-detected silence; it does not predict turn endings before a silence occurs. |
|
|
|
|
|
## Thanks |
|
|
|
|
|
Thank you to Pipecat for the original Smart Turn model and to Meta for the Casual Conversations dataset. |
|
|
|