How to use cstr/VibeVoice-7B-GGUF with VibeVoice:

```python
import torch
import soundfile as sf
import librosa
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference

# Load voice sample (should be 24 kHz mono)
voice, sr = sf.read("path/to/voice_sample.wav")
if voice.ndim > 1:
    voice = voice.mean(axis=1)  # downmix multi-channel audio to mono
if sr != 24000:
    # orig_sr/target_sr are keyword-only in recent librosa releases
    voice = librosa.resample(voice, orig_sr=sr, target_sr=24000)

processor = VibeVoiceProcessor.from_pretrained("cstr/VibeVoice-7B-GGUF")
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "cstr/VibeVoice-7B-GGUF", torch_dtype=torch.bfloat16
).to("cuda").eval()
model.set_ddpm_inference_steps(5)

inputs = processor(
    text=["Speaker 0: Hello!\nSpeaker 1: Hi there!"],
    voice_samples=[[voice]],
    return_tensors="pt",
)
audio = model.generate(
    **inputs, cfg_scale=1.3, tokenizer=processor.tokenizer
).speech_outputs[0]
sf.write("output.wav", audio.cpu().numpy().squeeze(), 24000)
```
VibeVoice-7B GGUF
GGUF conversions of microsoft/VibeVoice-7B for use with the crispasr CLI from CrispStrobe/CrispASR.
VibeVoice-7B is the largest model in Microsoft's VibeVoice family: a 9.3B-parameter speech-LLM (Qwen2.5-7B decoder + dual σ-VAE encoders) with state-of-the-art ASR quality across 50+ languages.
- 60-minute long-form audio in a single forward pass
- Built-in speaker diarization: speaker IDs per segment
- Word-level timestamps
- Hotword / context injection via `--context`
- 50+ languages with automatic language detection
- MIT licence
Update (April 2026): Now includes both ASR (encoder + LM) and TTS (σ-VAE decoder + prediction head). TTS requires ≥Q4_K for good quality; Q3_K is too aggressive for the decoder. For faster/smaller TTS, use VibeVoice-Realtime-0.5B or VibeVoice-1.5B.
Files
| File | Size | Notes |
|---|---|---|
| `vibevoice-7b-q3_k.gguf` | 4.7 GB | Q3_K: ASR only (TTS quality too low) |
| `vibevoice-7b-q4_0.gguf` | 5.6 GB | Q4_0: fast decode |
| `vibevoice-7b-q4_k.gguf` | 5.8 GB | Q4_K: recommended default (ASR + TTS) |
| `vibevoice-7b-q5_k.gguf` | 6.8 GB | Q5_K: higher quality |
| `vibevoice-7b-q6_k.gguf` | 7.9 GB | Q6_K: near-lossless |
| `vibevoice-7b-q8_0.gguf` | 9.8 GB | Q8_0: reference quality |
| `vibevoice-7b-f16.gguf` | 17.4 GB | F16: full precision |
Quick Start
```bash
# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_METAL=ON   # Metal flag is for macOS
cmake --build build --parallel   # portable; nproc is not available on stock macOS

# 2. Download the quantised GGUF
huggingface-cli download cstr/VibeVoice-7B-GGUF \
    vibevoice-7b-q4_k.gguf --local-dir .

# 3. Transcribe
./build/bin/crispasr --model vibevoice-7b-q4_k.gguf \
    --file audio.wav --backend vibevoice
```
Audio is automatically resampled from 16 kHz to 24 kHz by the backend.
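
For batch jobs it can be convenient to drive the transcription step from Python. The sketch below only assembles the three flags shown in the Quick Start (`--model`, `--file`, `--backend vibevoice`); the helper names `build_crispasr_command` and `transcribe` are hypothetical, and it assumes the CLI prints the transcript to stdout, which this card does not state.

```python
import subprocess


def build_crispasr_command(model, audio, binary="./build/bin/crispasr"):
    """Return the argv list for one transcription run (flags from the Quick Start)."""
    return [
        str(binary),
        "--model", str(model),
        "--file", str(audio),
        "--backend", "vibevoice",
    ]


def transcribe(model, audio, binary="./build/bin/crispasr"):
    """Run crispasr once and return its stdout.

    Assumption: the transcript is written to stdout. check=True raises
    CalledProcessError if the CLI exits with a non-zero status.
    """
    cmd = build_crispasr_command(model, audio, binary)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Example call (requires the built binary and the downloaded GGUF):
# print(transcribe("vibevoice-7b-q4_k.gguf", "audio.wav"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.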
Architecture
| Component | Details |
|---|---|
| LM decoder | Qwen2.5-7B (28 layers, d=3584, 28/4 heads, GQA) |
| Acoustic encoder | 7-stage ConvNeXt σ-VAE, 3200× downsample |
| Semantic encoder | 7-stage ConvNeXt σ-VAE, 3200× downsample |
| Connectors | FC1 → RMSNorm → FC2 (acoustic + semantic) |
| Prediction head | 4-layer DiT with AdaLN modulation |
| Total parameters | ~9.3B |
| Input | 24 kHz mono PCM |
| Tokenizer | Qwen2.5 BPE (152064 tokens, embedded in GGUF) |
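
The 24 kHz input rate and the 3200× downsampling factor in the table imply a low encoder frame rate, which helps explain how an hour of audio can fit in one pass. The arithmetic below is a rough sketch: it assumes each encoder emits exactly one frame per 3200 input samples and ignores padding and any extra prompt or text tokens. The function name `latent_frames` is illustrative only.

```python
SAMPLE_RATE_HZ = 24_000   # input rate from the architecture table
DOWNSAMPLE = 3_200        # encoder downsampling factor from the architecture table


def latent_frames(duration_s: float) -> int:
    """Whole encoder frames for duration_s seconds of audio (per encoder)."""
    samples = int(duration_s * SAMPLE_RATE_HZ)
    return samples // DOWNSAMPLE


frame_rate_hz = SAMPLE_RATE_HZ / DOWNSAMPLE   # 7.5 frames per second
frames_per_hour = latent_frames(60 * 60)      # 27000 frames for 60 minutes

print(frame_rate_hz, frames_per_hour)
```

Under these assumptions a 60-minute recording corresponds to 27,000 frames per encoder.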
Hardware Requirements
| Quantization | RAM (approx) | Notes |
|---|---|---|
| Q3_K | ~6 GB | Minimum for inference |
| Q4_K | ~7 GB | Recommended |
| Q8_0 | ~11 GB | High quality |
| F16 | ~18 GB | Full precision |
GPU acceleration (Metal/CUDA) is strongly recommended for the 7B model. CPU-only inference is very slow (~10× slower than realtime).
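
As a rough aid for choosing a file, the table can be turned into a small lookup. This is a sketch under stated assumptions: the figures are the approximate RAM values listed above, "largest quantization that fits" is a hypothetical selection policy rather than a recommendation from the authors, and Q3_K is skipped when TTS is needed because the update note says TTS requires at least Q4_K.

```python
# Approximate RAM needs in GB, copied from the hardware table.
RAM_GB = {"q3_k": 6, "q4_k": 7, "q8_0": 11, "f16": 18}


def pick_quantization(available_gb, need_tts=False):
    """Return the largest listed quantization that fits, or None if none does.

    Hypothetical policy: maximise precision within the RAM budget.
    Q3_K is excluded when need_tts is True (TTS quality too low).
    """
    candidates = [
        (ram, name)
        for name, ram in RAM_GB.items()
        if ram <= available_gb and not (need_tts and name == "q3_k")
    ]
    return max(candidates)[1] if candidates else None


print(pick_quantization(8))                  # fits Q4_K
print(pick_quantization(6, need_tts=True))   # nothing suitable
```

In practice it is worth leaving headroom beyond these figures for the operating system and audio buffers.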
Conversion
Converted from microsoft/VibeVoice-7B safetensors using the streaming memory-mapped converter:
```bash
python3 models/convert-vibevoice-stream-gguf.py \
    --input microsoft/VibeVoice-7B \
    --output vibevoice-7b-f16.gguf

# Quantize
./build/bin/crispasr-quantize vibevoice-7b-f16.gguf vibevoice-7b-q4_k.gguf q4_k
```
The streaming converter (convert-vibevoice-stream-gguf.py) uses memory-mapped tensor access to avoid loading the full 19 GB model into RAM.
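
To illustrate the general idea of memory-mapped tensor access, here is a minimal standard-library sketch that walks a single safetensors file one tensor at a time. It is not the code of `convert-vibevoice-stream-gguf.py`; it relies only on the public safetensors layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes), and the names `read_header` and `iter_tensors` are hypothetical. Sharded checkpoints and dtype decoding are left out.

```python
import json
import mmap
import struct


def read_header(path):
    """Return (header_dict, data_start) for one safetensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))   # u64, little-endian
        header = json.loads(f.read(header_len))
    return header, 8 + header_len


def iter_tensors(path):
    """Yield (name, dtype, shape, raw_bytes) one tensor at a time.

    The file is memory-mapped, so only the slice for the current tensor
    is copied into process memory.
    """
    header, data_start = read_header(path)
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for name, info in header.items():
            if name == "__metadata__":      # optional free-form metadata entry
                continue
            begin, end = info["data_offsets"]   # offsets relative to data_start
            raw = mm[data_start + begin:data_start + end]
            yield name, info["dtype"], info["shape"], raw
```

A converter built this way can write each tensor to the output file before touching the next one, so peak memory stays near the size of the largest single tensor.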
Model tree for cstr/VibeVoice-7B-GGUF
Base model: microsoft/VibeVoice-Large