GGUF + pure-C++ runtime in CrispASR — VibeVoice-1.5B (TTS with voice cloning)
#51
by cstr - opened
We've added VibeVoice-1.5B to CrispASR as the vibevoice-tts backend. C++ binary, GGUF — no Python.
Architecturally distinct from the Realtime-0.5B sibling: this is a single-LM autoregressive TTS model (a 28-layer Qwen2 decoder, with no separate TTS LM). Voice cloning comes from an acoustic+semantic encoder applied to a reference WAV, so the realtime model's "pre-baked KV cache" trick doesn't apply here; provide --voice-ref instead.
A few practical points:
- Speech token IDs reuse vision tokens (151654 / 151652 / 151653). Easy to confuse if you're cross-referencing with the realtime variant.
- TTS validation policy: every TTS output gets ASR-roundtripped. Peak/RMS gates are necessary but insufficient: VibeVoice-Realtime-0.5B passed peak/RMS while sounding "noisy / crackling" (issue #39); only the ASR round-trip and the per-frame cosine similarity against the reference caught it.
- 25 voicepacks shipped under cstr/ (extracted from the upstream voicepack archive).
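To make the validation policy above concrete, here is a minimal Python sketch of the three kinds of check it names: a peak/RMS gate, a per-frame cosine similarity against reference frames, and a word error rate for the ASR round-trip. This is not CrispASR's implementation; the function names and every threshold (`max_peak`, `min_rms`, `min_cos`, `max_wer`) are hypothetical, and the ASR transcript is assumed to come from whatever ASR you run on the generated WAV.

```python
import math

def peak_and_rms(samples):
    """Return (peak, rms) for a sequence of float samples."""
    if not samples:
        return 0.0, 0.0
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return peak, rms

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def per_frame_cosine(ref_frames, out_frames):
    """Cosine similarity for each pair of corresponding frames."""
    return [cosine(r, o) for r, o in zip(ref_frames, out_frames)]

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1] / max(len(ref), 1)

def validate(samples, ref_frames, out_frames, text, asr_text,
             max_peak=0.99, min_rms=0.01, min_cos=0.9, max_wer=0.1):
    """Combine all gates. All thresholds are hypothetical placeholders."""
    peak, rms = peak_and_rms(samples)
    cos = per_frame_cosine(ref_frames, out_frames)
    return {
        "peak_rms_ok": peak <= max_peak and rms >= min_rms,
        "cosine_ok": bool(cos) and min(cos) >= min_cos,
        "asr_ok": word_error_rate(text, asr_text) <= max_wer,
    }
```

The point of returning the three results separately is the one made in the bullet: an output can pass the peak/RMS gate while failing the cosine or ASR checks, so the gates should be reported individually, not collapsed into a single pass/fail.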
Pre-quantised GGUFs (MIT): cstr/vibevoice-1.5b-GGUF
./build/bin/crispasr --backend vibevoice-tts \
-m vibevoice-1.5b-q4_k.gguf \
--tts "Hello world" --voice-ref alice.wav --tts-output out.wav
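If you want to drive the command above from a script (for example, to synthesize the same text with several reference voices), a small Python wrapper could look like the sketch below. It uses only the flags shown in the command; the helper name `build_tts_command`, the default paths, and the second reference file name are assumptions for illustration.

```python
import shlex

def build_tts_command(text, voice_ref, output,
                      model="vibevoice-1.5b-q4_k.gguf",
                      binary="./build/bin/crispasr"):
    """Build the argv list for one TTS run (hypothetical helper)."""
    return [
        binary,
        "--backend", "vibevoice-tts",
        "-m", model,
        "--tts", text,
        "--voice-ref", voice_ref,
        "--tts-output", output,
    ]

# One command per reference voice; "bob.wav" is a placeholder file name.
commands = [
    build_tts_command("Hello world", ref, ref.replace(".wav", "_out.wav"))
    for ref in ["alice.wav", "bob.wav"]
]

for cmd in commands:
    # Print a shell-safe rendering; to actually run it, pass `cmd`
    # to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when the text to synthesize contains spaces or quotes.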
Realtime sibling: cstr/vibevoice-realtime-0.5b-GGUF. Companion ASR: cstr/vibevoice-asr-GGUF.