GGUF + pure-C++ runtime in CrispASR — VibeVoice-ASR (7B) on a 16 GB box

#25
by cstr - opened

We've added VibeVoice-ASR to CrispASR as the vibevoice backend: a single C++ binary reading GGUF weights, with no Python and no transformers dependency.

src/vibevoice.cpp — two parallel CNN tokenizer encoders (acoustic σ-VAE ConvNeXt + semantic encoder) projected to LM space via FC connectors, then a 7B Qwen2.5 autoregressive decoder. Both encoders + the Qwen LM run as ggml graphs with KV-cached decode + flash attention.
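The data flow above can be sketched in plain C++. This is a shape-level sketch only: the real src/vibevoice.cpp builds these ops as ggml graphs, LM_DIM matches Qwen2.5-7B's hidden size, but the fusion of the two projected streams (elementwise sum here) and the encoder frame dimension are assumptions to check against the source.

```cpp
#include <cstddef>
#include <vector>

// LM_DIM is Qwen2.5-7B's hidden size; encoder frame dims are placeholders,
// not the real sigma-VAE / semantic channel counts.
constexpr std::size_t LM_DIM = 3584;

// FC connector: project one encoder frame into the LM embedding space.
std::vector<float> connect(const std::vector<float>& frame,
                           const std::vector<std::vector<float>>& w) {
    std::vector<float> out(LM_DIM, 0.0f);
    for (std::size_t o = 0; o < LM_DIM; ++o)
        for (std::size_t i = 0; i < frame.size(); ++i)
            out[o] += w[o][i] * frame[i];
    return out;
}

// Fuse the two projected streams per frame. Assumption: elementwise sum;
// check src/vibevoice.cpp for the actual fusion op.
std::vector<float> fuse(const std::vector<float>& acoustic,
                        const std::vector<float>& semantic) {
    std::vector<float> out(LM_DIM);
    for (std::size_t i = 0; i < LM_DIM; ++i)
        out[i] = acoustic[i] + semantic[i];
    return out;
}
```

The fused frames then take the place of the audio placeholder tokens in the decoder prompt.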

Two non-obvious things during the port:

  1. The prompt template is prescriptive. It uses three special tokens (<|object_ref_start|> 151646, <|box_start|> 151648, <|object_ref_end|> 151647), which the HF processor names audio_bos_token / audio_token / audio_eos_token. The HF processor calls apply_chat_template with the assistant header included; without it, the model goes off-task within a few tokens. Verified against the upstream microsoft/VibeVoice repo (LEARNINGS.md "VibeVoice-ASR prompt template verification").
  2. The max_gen parameter. Earlier we ran every call all the way to max_gen=512 instead of stopping at EOS, which on short clips is roughly a 50× overrun. Fixed by honouring the LM's EOS token IDs.
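Point 1 amounts to assembling the token sequence around the audio placeholder slots. A minimal sketch, using the token IDs from the post; build_prompt and its arguments are hypothetical names, not the actual CrispASR API:

```cpp
#include <cstddef>
#include <vector>

// Token IDs from the post; names follow the HF processor.
const int AUDIO_BOS = 151646;  // <|object_ref_start|> (audio_bos_token)
const int AUDIO_TOK = 151648;  // <|box_start|> (audio_token), one per frame slot
const int AUDIO_EOS = 151647;  // <|object_ref_end|> (audio_eos_token)

std::vector<int> build_prompt(const std::vector<int>& system_toks,
                              std::size_t n_audio_frames,
                              const std::vector<int>& assistant_header) {
    std::vector<int> p = system_toks;
    p.push_back(AUDIO_BOS);
    p.insert(p.end(), n_audio_frames, AUDIO_TOK);  // slots for audio embeddings
    p.push_back(AUDIO_EOS);
    // The assistant header is required; without it the model drifts off-task.
    p.insert(p.end(), assistant_header.begin(), assistant_header.end());
    return p;
}
```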
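Point 2 is just a stop condition in the decode loop. A minimal sketch; sample_next stands in for the real per-step LM sampling, and the specific EOS IDs depend on the Qwen tokenizer config:

```cpp
#include <functional>
#include <set>
#include <vector>

// Run the decode loop up to max_gen, but stop as soon as the LM emits one of
// its EOS tokens. The earlier bug was ignoring eos_ids and always generating
// max_gen tokens.
std::vector<int> decode(const std::function<int()>& sample_next,
                        const std::set<int>& eos_ids,
                        int max_gen = 512) {
    std::vector<int> out;
    for (int i = 0; i < max_gen; ++i) {
        int tok = sample_next();
        if (eos_ids.count(tok)) break;  // honour EOS: stop, don't emit it
        out.push_back(tok);
    }
    return out;
}
```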

The full F16 GGUF is 14.9 GB; Q4_K is 4.5 GB and runs comfortably on a 16 GB box with the encoder weights kept in F32 for precision.
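The mixed-precision split can be expressed as a per-tensor quantisation policy. A sketch under assumptions: the tensor-name prefixes and the enum are hypothetical, not the actual GGUF export's naming; only the F32-encoder / Q4_K-LM split comes from the post.

```cpp
#include <string>

enum class TensorType { F32, F16, Q4_K };

// Per-tensor policy: keep encoder weights in F32 for precision, quantise the
// 7B LM weights to Q4_K. Prefixes "acoustic." / "semantic." are assumed here.
TensorType choose_type(const std::string& name) {
    if (name.rfind("acoustic.", 0) == 0 || name.rfind("semantic.", 0) == 0)
        return TensorType::F32;
    return TensorType::Q4_K;
}
```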

Pre-quantised GGUFs (MIT): cstr/vibevoice-asr-GGUF

```shell
git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8
./build/bin/crispasr --backend vibevoice -m vibevoice-asr-q4_k.gguf -f audio.wav -osrt
```

Native diarisation, timestamps, and hotword biasing all wired. Companion VibeVoice TTS GGUFs at cstr/vibevoice-realtime-0.5b-GGUF and cstr/vibevoice-1.5b-GGUF.
