|
|
--- |
|
|
pipeline_tag: voice-activity-detection |
|
|
license: bsd-2-clause |
|
|
tags: |
|
|
- speech-processing |
|
|
- semantic-vad |
|
|
- multimodal |
|
|
- video |
|
|
--- |
|
|
# Smart Turn Multimodal |
|
|
|
|
|
**Smart Turn Multimodal** is a multimodal extension of Pipecat's Smart Turn that combines audio and video to predict whether a speaker has finished their turn. Visual cues such as mouth movement and gaze help resolve pauses that are ambiguous from audio alone. |
|
|
|
|
|
## Links |
|
|
|
|
|
* [Blog post: Smart Turn Multimodal](https://susurobo.jp/blog/smart_turn_multimodal.html) |
|
|
* [GitHub repo](https://github.com/susurobo/smart-turn-multimodal) with training and inference code |
|
|
* Original audio-only [Smart Turn v3](https://huggingface.co/pipecat-ai/smart-turn-v3) |
|
|
|
|
|
## Model architecture |
|
|
|
|
|
* **Audio branch:** Whisper Tiny encoder (8s context) with cross-attention pooling → 384-dim embedding |
|
|
* **Video branch:** R3D-18 (Kinetics-400 pretrained) processing last 32 frames (~1s) → 256-dim embedding |
|
|
* **Fusion:** Late fusion via concatenation + linear projection back to 384-dim |
|
|
* Params: ~20M total |
|
|
* Checkpoint: ONNX available |
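
The late-fusion step above can be sketched in a few lines. This is an illustrative NumPy sketch of "concatenation + linear projection back to 384-dim", not the released checkpoint's actual weights or module names; the array shapes follow the embedding sizes listed in the bullets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Embeddings produced by the two branches (values here are random stand-ins).
audio_emb = rng.standard_normal(384)  # Whisper Tiny branch -> 384-dim
video_emb = rng.standard_normal(256)  # R3D-18 branch -> 256-dim

# Hypothetical projection parameters: (384, 640) weight, 384-dim bias.
W = rng.standard_normal((384, 640)) * 0.02
b = np.zeros(384)

# Late fusion: concatenate, then project back to the audio embedding size.
fused = np.concatenate([audio_emb, video_emb])  # shape (640,)
projected = W @ fused + b                       # shape (384,)
print(projected.shape)
```

The fused 384-dim vector can then feed the same classification head as the audio-only model, which is one reason late fusion keeps the parameter count low.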
|
|
|
|
|
## Audio-only fallback |
|
|
|
|
|
When video is unavailable, pass `None` for `pixel_values`. The model internally substitutes a zero tensor for the video input, so it falls back to audio-only behavior with no code changes required. |
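
The fallback logic can be sketched as follows. `encode_video` and `video_embedding_or_zeros` are hypothetical names for illustration, not functions from the repo; the point is only that a zero vector stands in for the video branch's output when no frames are provided.

```python
import numpy as np

VIDEO_EMB_DIM = 256  # video-branch embedding size from the architecture above

def encode_video(pixel_values):
    # Placeholder for the real R3D-18 forward pass.
    return np.ones(VIDEO_EMB_DIM)

def video_embedding_or_zeros(pixel_values):
    """Return the video embedding, or zeros when no video is available."""
    if pixel_values is None:
        # Audio-only fallback: the fusion layer sees an all-zero video vector.
        return np.zeros(VIDEO_EMB_DIM)
    return encode_video(pixel_values)

emb = video_embedding_or_zeros(None)
print(emb.shape, float(emb.sum()))
```

Because the zero vector is handled inside the model, callers only ever toggle one argument rather than switching between two model variants.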
|
|
|
|
|
## How to use |
|
|
|
|
|
```python |
|
|
from inference_multimodal import predict_endpoint |
|
|
|
|
|
result = predict_endpoint(audio_array, video_path="clip.mp4") |
|
|
# result = {"prediction": 1, "probability": 0.92} |
|
|
|
|
|
# Audio-only fallback |
|
|
result = predict_endpoint(audio_array, video_path=None) |
|
|
``` |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Dataset variety:** Currently trained on Meta's [Casual Conversations dataset](https://ai.meta.com/datasets/casual-conversations-dataset/) (mostly unscripted monologues). Generalization to diverse conversation styles is still being validated. |
|
|
- **VAD-triggered:** The model is triggered by VAD-detected silence; it does not predict turn endings before a silence occurs. |
|
|
|
|
|
## Thanks |
|
|
|
|
|
Thank you to Pipecat for the original Smart Turn model and to Meta for the Casual Conversations dataset. |
|
|
|