VibeVoice-7B - GGUF

GGUF conversions of microsoft/VibeVoice-7B for use with the crispasr CLI from CrispStrobe/CrispASR.

VibeVoice-7B is the largest model in Microsoft's VibeVoice family: a 9.3B-parameter speech-LLM (Qwen2.5-7B decoder + dual σ-VAE encoders) with state-of-the-art ASR quality across 50+ languages.

  • 60-minute long-form audio in a single forward pass
  • Built-in speaker diarization (speaker IDs per segment)
  • Word-level timestamps
  • Hotword / context injection via --context
  • 50+ languages with automatic language detection
  • MIT licence

Update (April 2026): Now includes both ASR (encoder + LM) and TTS (σ-VAE decoder + prediction head). TTS requires ≥Q4_K for good quality; Q3_K is too aggressive for the decoder. For faster/smaller TTS, use VibeVoice-Realtime-0.5B or VibeVoice-1.5B.

Files

File                      Size      Notes
vibevoice-7b-q3_k.gguf    4.7 GB    Q3_K: ASR only (TTS quality too low)
vibevoice-7b-q4_0.gguf    5.6 GB    Q4_0: fast decode
vibevoice-7b-q4_k.gguf    5.8 GB    Q4_K: recommended default (ASR + TTS)
vibevoice-7b-q5_k.gguf    6.8 GB    Q5_K: higher quality
vibevoice-7b-q6_k.gguf    7.9 GB    Q6_K: near-lossless
vibevoice-7b-q8_0.gguf    9.8 GB    Q8_0: reference quality
vibevoice-7b-f16.gguf     17.4 GB   F16: full precision

Quick Start

# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
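cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_METAL=ON   # macOS (Metal); omit -DGGML_METAL=ON on other platforms
cmake --build build -j"$(nproc 2>/dev/null || sysctl -n hw.ncpu)"   # nproc is Linux-only; macOS falls back to sysctl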

# 2. Download the quantised GGUF
huggingface-cli download cstr/VibeVoice-7B-GGUF \
    vibevoice-7b-q4_k.gguf --local-dir .

# 3. Transcribe
./build/bin/crispasr --model vibevoice-7b-q4_k.gguf \
    --file audio.wav --backend vibevoice
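
To transcribe a whole directory, the command above can be wrapped in a small script. This is a minimal sketch, not part of CrispASR: `build_command` and `transcribe_all` are hypothetical helper names, only the flags shown in this README are used, and it assumes that `--context` takes a single string argument and that the transcript is written to stdout.

```python
import subprocess
from pathlib import Path

CRISPASR = "./build/bin/crispasr"  # binary path produced by the build step


def build_command(model, audio, context=None):
    """Assemble the crispasr invocation from the Quick Start as an argv list."""
    cmd = [CRISPASR, "--model", str(model), "--file", str(audio),
           "--backend", "vibevoice"]
    if context:
        # ASSUMPTION: --context accepts one string of hotwords / context text.
        cmd += ["--context", context]
    return cmd


def transcribe_all(model, audio_dir, context=None):
    """Run crispasr once per .wav file; returns {file name: captured stdout}."""
    results = {}
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        proc = subprocess.run(build_command(model, wav, context),
                              capture_output=True, text=True, check=True)
        # ASSUMPTION: the transcript is printed to stdout.
        results[wav.name] = proc.stdout
    return results
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.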

The backend automatically resamples audio from 16 kHz to the 24 kHz mono input the model expects.
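
To illustrate what the 16 kHz to 24 kHz step involves (a 2:3 ratio, so every two input samples become three output samples), here is a minimal sketch using linear interpolation. The resampling algorithm the backend actually uses is not stated in this README; `resample_linear` is a hypothetical name and linear interpolation is chosen only for brevity.

```python
def resample_linear(samples, src_hz=16_000, dst_hz=24_000):
    """Resample a list of float PCM samples by linear interpolation.

    ASSUMPTION: illustration only; a production resampler would normally
    apply a band-limited (windowed-sinc or polyphase) filter instead.
    """
    if not samples:
        return []
    n_out = len(samples) * dst_hz // src_hz
    out = []
    for i in range(n_out):
        pos = i * src_hz / dst_hz          # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


one_second = [0.0] * 16_000
print(len(resample_linear(one_second)))  # 24000
```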

Architecture

Component          Details
LM decoder         Qwen2.5-7B (28 layers, d=3584, 28 query / 4 KV heads, GQA)
Acoustic encoder   7-stage ConvNeXt σ-VAE, 3200× downsample
Semantic encoder   7-stage ConvNeXt σ-VAE, 3200× downsample
Connectors         FC1 → RMSNorm → FC2 (acoustic + semantic)
Prediction head    4-layer DiT with AdaLN modulation
Total parameters   ~9.3B
Input              24 kHz mono PCM
Tokenizer          Qwen2.5 BPE (152064 tokens, embedded in GGUF)
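
The figures in this table make it possible to estimate why 60 minutes of audio is tractable in one pass. The sketch below derives the encoder frame rate from the 24 kHz input and 3200× downsampling, and the KV-cache footprint from the decoder dimensions. It assumes an f16 KV cache, one decoder position per encoder frame, and rounding up of partial frames; none of these are confirmed by this README, and the function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 24_000   # model input rate (from the table)
DOWNSAMPLE = 3_200        # encoder downsampling factor (from the table)
N_LAYERS = 28             # decoder layers
D_MODEL = 3_584           # decoder hidden size
N_Q_HEADS = 28            # query heads
N_KV_HEADS = 4            # GQA: 7 query heads share each KV head
BYTES_PER_VALUE = 2       # ASSUMPTION: KV cache stored as f16


def latent_frames(seconds):
    """Frames one encoder emits for `seconds` of audio (ceil is an assumption)."""
    return math.ceil(seconds * SAMPLE_RATE_HZ / DOWNSAMPLE)


def kv_cache_bytes(n_tokens, kv_heads=N_KV_HEADS):
    """KV-cache size for n_tokens decoder positions."""
    head_dim = D_MODEL // N_Q_HEADS  # 128
    # Factor 2 covers keys and values.
    return 2 * N_LAYERS * kv_heads * head_dim * BYTES_PER_VALUE * n_tokens


frames = latent_frames(60 * 60)
print(frames)  # 27000 frames, i.e. 7.5 frames per second
print(round(kv_cache_bytes(frames) / 2**30, 2), "GiB with GQA")
print(kv_cache_bytes(frames, N_Q_HEADS) // kv_cache_bytes(frames), "x larger without GQA")
```

Under these assumptions an hour of audio maps to 27,000 frames per encoder, and sharing 4 KV heads across 28 query heads keeps the cache 7 times smaller than full multi-head attention would.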

Hardware Requirements

Quantization   RAM (approx)   Notes
Q3_K           ~6 GB          Minimum for inference
Q4_K           ~7 GB          Recommended
Q8_0           ~11 GB         High quality
F16            ~18 GB         Full precision
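
As a quick way to apply this table, the sketch below picks the highest-precision file that fits a given RAM budget. `pick_quant` is a hypothetical helper; the RAM figures are the approximate values listed above, the TTS flag reflects the note that Q3_K is ASR only, and no headroom for the OS or long-audio activations is included.

```python
# (quantization, file name, approx RAM in GB, usable for TTS)
QUANTS = [
    ("Q3_K", "vibevoice-7b-q3_k.gguf", 6, False),
    ("Q4_K", "vibevoice-7b-q4_k.gguf", 7, True),
    ("Q8_0", "vibevoice-7b-q8_0.gguf", 11, True),
    ("F16", "vibevoice-7b-f16.gguf", 18, True),
]


def pick_quant(ram_gb, need_tts=False):
    """Return the largest listed file that fits in ram_gb, or None."""
    fitting = [q for q in QUANTS
               if q[2] <= ram_gb and (q[3] or not need_tts)]
    # QUANTS is ordered by size, so the last fitting entry is the largest.
    return fitting[-1][1] if fitting else None


print(pick_quant(8))                 # vibevoice-7b-q4_k.gguf
print(pick_quant(6, need_tts=True))  # None
```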

GPU acceleration (Metal/CUDA) is strongly recommended for the 7B model. CPU-only inference is very slow (~10× slower than realtime).

Conversion

Converted from microsoft/VibeVoice-7B safetensors using the streaming memory-mapped converter:

python3 models/convert-vibevoice-stream-gguf.py \
    --input microsoft/VibeVoice-7B \
    --output vibevoice-7b-f16.gguf

# Quantize
./build/bin/crispasr-quantize vibevoice-7b-f16.gguf vibevoice-7b-q4_k.gguf q4_k

The streaming converter (convert-vibevoice-stream-gguf.py) uses memory-mapped tensor access to avoid loading the full 19 GB model into RAM.
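
The idea behind memory-mapped access can be sketched as follows. This is not the code of convert-vibevoice-stream-gguf.py, which is not shown here; it is an illustration, based on the public safetensors layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes), of reading one tensor at a time so that only the pages of the current tensor are touched. `iter_tensors` is a hypothetical name.

```python
import json
import mmap
import struct


def iter_tensors(path):
    """Yield (name, dtype, shape, raw_bytes) for each tensor, one at a time."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # First 8 bytes: little-endian u64 giving the JSON header length.
            (header_len,) = struct.unpack("<Q", mm[:8])
            header = json.loads(mm[8:8 + header_len])
            data_start = 8 + header_len
            for name, info in header.items():
                if name == "__metadata__":
                    continue  # optional free-form metadata, not a tensor
                begin, end = info["data_offsets"]  # relative to data_start
                # Slicing copies only this tensor's bytes out of the mapping.
                raw = mm[data_start + begin:data_start + end]
                yield name, info["dtype"], info["shape"], raw
```

A converter built this way can write each tensor to the output file and release it before reading the next, so peak memory is bounded by the largest single tensor rather than by the whole checkpoint.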
