GGUF + pure-C++ runtime in CrispASR — VibeVoice-1.5B (TTS with voice cloning)

#51
by cstr - opened

We've added VibeVoice-1.5B to CrispASR as the vibevoice-tts backend. It runs from a C++ binary with GGUF weights; no Python required.

Architecturally, this model differs from the Realtime-0.5B sibling. It is a single-LM autoregressive TTS: a 28-layer Qwen2 decoder with no separate TTS LM. Voice cloning comes from an acoustic+semantic encoder applied to a reference WAV, so the realtime model's "pre-baked KV cache" trick doesn't apply here; provide --voice-ref instead.

A few practical points:

  • Speech token IDs reuse the vision token IDs (151654 / 151652 / 151653). This is easy to get wrong if you're cross-referencing with the realtime variant.
  • TTS validation policy: every TTS output gets ASR-roundtripped. Peak/RMS gates are necessary but insufficient: VibeVoice-Realtime-0.5B passed peak/RMS while sounding "noisy / crackling" (issue #39); only the ASR round-trip and the per-frame cosine similarity against the reference caught it.
  • 25 voicepacks shipped under cstr/ (extracted from the upstream voicepack archive).
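
To illustrate the validation point above, here is a minimal sketch of the two numeric checks: a peak/RMS level measurement and a per-frame cosine similarity against reference frames. This is not CrispASR's actual code. The function names (peak_abs, rms, cosine, min_frame_cosine) are hypothetical, and it assumes mono float PCM plus per-frame feature vectors that you have already extracted from both the output and a reference run. Any pass/fail thresholds would be your own choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Peak absolute amplitude of a mono float PCM buffer (0 if empty).
float peak_abs(const std::vector<float>& pcm) {
    float peak = 0.0f;
    for (float s : pcm) peak = std::max(peak, std::fabs(s));
    return peak;
}

// Root-mean-square level of the buffer (0 if empty).
float rms(const std::vector<float>& pcm) {
    if (pcm.empty()) return 0.0f;
    double acc = 0.0;
    for (float s : pcm) acc += static_cast<double>(s) * s;
    return static_cast<float>(std::sqrt(acc / pcm.size()));
}

// Cosine similarity of two equal-length frames; 0 if either frame is all zeros.
float cosine(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na  += static_cast<double>(a[i]) * a[i];
        nb  += static_cast<double>(b[i]) * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0f;
    return static_cast<float>(dot / (std::sqrt(na) * std::sqrt(nb)));
}

// Worst (lowest) per-frame cosine over the frames both sequences share.
// Returns 0 when there are no frames to compare.
float min_frame_cosine(const std::vector<std::vector<float>>& out,
                       const std::vector<std::vector<float>>& ref) {
    const std::size_t n = std::min(out.size(), ref.size());
    if (n == 0) return 0.0f;
    float worst = 1.0f;
    for (std::size_t i = 0; i < n; ++i)
        worst = std::min(worst, cosine(out[i], ref[i]));
    return worst;
}
```

A level gate only looks at peak_abs and rms, which is why crackling audio at a plausible loudness can pass it. Tracking the lowest per-frame cosine, alongside the ASR round-trip, is one way to flag output whose content has drifted from the reference.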

Pre-quantised GGUFs (MIT): cstr/vibevoice-1.5b-GGUF

./build/bin/crispasr --backend vibevoice-tts \
    -m vibevoice-1.5b-q4_k.gguf \
    --tts "Hello world" --voice-ref alice.wav --tts-output out.wav

Realtime sibling: cstr/vibevoice-realtime-0.5b-GGUF. Companion ASR: cstr/vibevoice-asr-GGUF.
