GGUF + pure-C++ runtime in CrispASR — VibeVoice-ASR (7B) on a 16 GB box
We've added VibeVoice-ASR to CrispASR as the vibevoice backend. It is a single C++ binary that loads GGUF weights: no Python, no transformers.
The implementation lives in src/vibevoice.cpp: two parallel CNN tokenizer encoders (an acoustic σ-VAE ConvNeXt encoder and a semantic encoder) are projected into LM space via FC connectors and feed a 7B Qwen2.5 autoregressive decoder. Both encoders and the Qwen LM run as ggml graphs, with KV-cached decode and flash attention.
Two non-obvious things came up during the port:
- Prompt template is prescriptive. It uses three special tokens (<|object_ref_start|> = 151646, <|box_start|> = 151648, <|object_ref_end|> = 151647), which the HF processor names audio_bos_token / audio_token / audio_eos_token. The HF processor calls apply_chat_template with the assistant header required; without it, the model goes off-task within a few tokens. Verified against the upstream microsoft/VibeVoice repo (LEARNINGS.md "VibeVoice-ASR prompt template verification").
- max_gen parameter. Earlier we ran every call up to max_gen=512 instead of stopping at EOS, which is a 50× overrun on short clips. Fixed by honouring the LM's EOS token IDs.
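The max_gen fix comes down to treating max_gen as an upper bound rather than a fixed step count. Here is a minimal sketch of that loop shape, not the code in src/vibevoice.cpp: greedy_decode and the NextTokenFn callback are hypothetical names, and the EOS ID set is assumed to be supplied by the caller (for example, read from the model's metadata).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for one LM decode step: given the tokens generated
// so far, return the next token ID (e.g. argmax over the logits).
using NextTokenFn = std::function<int32_t(const std::vector<int32_t>&)>;

// Generate at most max_gen tokens, stopping as soon as any EOS ID appears.
// The EOS token itself is not appended to the output.
std::vector<int32_t> greedy_decode(const NextTokenFn& next_token,
                                   const std::unordered_set<int32_t>& eos_ids,
                                   int max_gen) {
    assert(max_gen >= 0);
    std::vector<int32_t> out;
    for (int i = 0; i < max_gen; ++i) {
        const int32_t tok = next_token(out);
        if (eos_ids.count(tok) != 0) {
            break;  // honour EOS: no further decode steps are run
        }
        out.push_back(tok);
    }
    return out;
}
```

With a loop like this, a short clip whose transcript ends after a handful of tokens costs that many decode steps plus one, instead of always paying for 512.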
The full F16 GGUF is 14.9 GB; Q4_K is 4.5 GB and runs comfortably on a 16 GB box with the encoder weights kept in F32 for precision.
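A rough sanity check on those sizes, as a sketch rather than the converter's actual accounting. The one hard fact used is that ggml's Q4_K format stores 256 weights in a 144-byte super-block, i.e. 4.5 bits per weight. Treating the whole F16 file as 16-bit weights and 1 GB as 10^9 bytes are simplifying assumptions, and the function names are made up for illustration.

```cpp
#include <cassert>

// Bits per weight for the two storage formats compared here.
constexpr double kBitsF16 = 16.0;
// Q4_K: 144 bytes per super-block of 256 weights = 4.5 bits per weight.
constexpr double kBitsQ4K = 144.0 * 8.0 / 256.0;

// Estimated file size in GB (10^9 bytes) for n_params weights stored at
// bits_per_weight. Ignores metadata and any tensors kept at other precisions.
double gguf_size_gb(double n_params, double bits_per_weight) {
    assert(n_params >= 0.0 && bits_per_weight > 0.0);
    return n_params * bits_per_weight / 8.0 / 1e9;
}

// Inverse: how many weights fit in size_gb at bits_per_weight.
double params_from_size_gb(double size_gb, double bits_per_weight) {
    assert(size_gb >= 0.0 && bits_per_weight > 0.0);
    return size_gb * 1e9 * 8.0 / bits_per_weight;
}
```

Under these assumptions, 14.9 GB of F16 corresponds to about 7.45 B weights, and quantising all of them to Q4_K would give roughly 4.19 GB. The remaining ~0.3 GB up to the reported 4.5 GB is consistent with the encoder weights staying in F32, though the exact per-tensor mix is not derived here.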
Pre-quantised GGUFs (MIT): cstr/vibevoice-asr-GGUF
git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8
./build/bin/crispasr --backend vibevoice -m vibevoice-asr-q4_k.gguf -f audio.wav -osrt
Native diarisation, timestamps, and hotword biasing are all wired up. Companion VibeVoice TTS GGUFs are at cstr/vibevoice-realtime-0.5b-GGUF and cstr/vibevoice-1.5b-GGUF.