Is it limited to producing a single track?
#21
by BigDeeper - opened
It is not clear from the card whether it is possible to produce separate tracks for different speakers, with appropriate silences to allow the "others" to speak.
Currently this isn’t supported. All speakers are rendered into a single audio track, rather than separate tracks.
Yes, all voices are currently mixed into a single track. If you’d like to separate them, we recommend using post-processing techniques such as VAD and diarization to manually split the generated audio.
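For anyone wanting to try the post-processing route, here is a minimal sketch of the idea in plain Python. It is not part of VibeVoice and not an official recipe. It assumes the generated audio has already been loaded as a list of float samples in [-1, 1]. `detect_speech_segments` is a very simple energy-threshold VAD (real VAD models are more robust), and `split_into_tracks` expects speaker labels that would come from a separate diarization step; the function names, the frame length, and the threshold are all illustrative assumptions.

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return energies


def detect_speech_segments(samples, frame_len=400, threshold=0.01):
    """Simple energy-based VAD (assumed thresholds, not tuned).

    Returns (start, end) sample ranges where the frame RMS exceeds threshold.
    """
    segments = []
    seg_start = None
    for i, rms in enumerate(frame_rms(samples, frame_len)):
        if rms > threshold and seg_start is None:
            seg_start = i * frame_len          # speech begins
        elif rms <= threshold and seg_start is not None:
            segments.append((seg_start, i * frame_len))  # speech ends
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, len(samples)))
    return segments


def split_into_tracks(samples, labeled_segments):
    """Build one full-length track per speaker.

    labeled_segments holds (start, end, speaker) tuples; the speaker labels
    are hypothetical here and would come from a diarization model.
    Each track is silent wherever its speaker is not talking, so the
    tracks stay time-aligned with the original mix.
    """
    tracks = {}
    for start, end, speaker in labeled_segments:
        track = tracks.setdefault(speaker, [0.0] * len(samples))
        track[start:end] = samples[start:end]
    return tracks
```

Because every per-speaker track keeps the full length of the mix, the silences where the other speakers talk are preserved automatically. Overlapping speech cannot be separated this way; that would need a source-separation model.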
I noted that Layer 0/1 dimension 609 seems to be the relevant activation.

