How to use cstr/vibevoice-1.5b-GGUF with VibeVoice:
import librosa
import soundfile as sf
import torch
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference

# Load voice sample (should be 24kHz mono)
voice, sr = sf.read("path/to/voice_sample.wav")
if voice.ndim > 1:
    voice = voice.mean(axis=1)  # downmix to mono
if sr != 24000:
    # Recent librosa releases require keyword arguments for the sample rates
    voice = librosa.resample(voice, orig_sr=sr, target_sr=24000)

processor = VibeVoiceProcessor.from_pretrained("cstr/vibevoice-1.5b-GGUF")
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "cstr/vibevoice-1.5b-GGUF", torch_dtype=torch.bfloat16
).to("cuda").eval()
model.set_ddpm_inference_steps(5)

inputs = processor(
    text=["Speaker 0: Hello!\nSpeaker 1: Hi there!"],
    voice_samples=[[voice]],
    return_tensors="pt",
)
# Move tensor inputs to the same device as the model
inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}

audio = model.generate(**inputs, cfg_scale=1.3, tokenizer=processor.tokenizer).speech_outputs[0]
# Cast to float32 first: NumPy has no bfloat16 dtype
sf.write("output.wav", audio.float().cpu().numpy().squeeze(), 24000)
VibeVoice-1.5B GGUF
GGUF conversion of microsoft/VibeVoice-1.5B for use with CrispASR.
This is the base model (not the streaming variant). It supports voice cloning from audio samples and multi-speaker synthesis.
Model variants
| File | Quant | Size | Notes |
|---|---|---|---|
| vibevoice-1.5b-tts-f16.gguf | F16 | 5.1 GB | Full precision |
| vibevoice-1.5b-tts-q8_0.gguf | Q8_0 | 2.8 GB | Near-lossless |
| vibevoice-1.5b-tts-q4_k.gguf | Q4_K | 1.6 GB | Smallest, perfect ASR round-trip |
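As a rough aid for choosing between the files listed above, the sketch below picks the largest variant that fits a memory budget. The file names and sizes come from the table; the `headroom_gb` allowance for activations and buffers is a hypothetical placeholder, not a measured figure, and `pick_variant` is an illustrative helper rather than part of any tool.

```python
# File names and sizes (GB) are copied from the model variants table.
VARIANTS = [
    ("vibevoice-1.5b-tts-f16.gguf", "F16", 5.1),
    ("vibevoice-1.5b-tts-q8_0.gguf", "Q8_0", 2.8),
    ("vibevoice-1.5b-tts-q4_k.gguf", "Q4_K", 1.6),
]


def pick_variant(budget_gb, headroom_gb=1.0):
    """Return the largest (file, quant, size) that fits, or None.

    headroom_gb is an assumed allowance for runtime buffers; tune it
    for your own setup.
    """
    usable = budget_gb - headroom_gb
    fitting = [v for v in VARIANTS if v[2] <= usable]
    return max(fitting, key=lambda v: v[2]) if fitting else None


if __name__ == "__main__":
    for budget in (8.0, 4.0, 3.0, 2.0):
        print(budget, pick_variant(budget))
```

Actual memory use depends on the runtime and on sequence length, so treat the result as a starting point only.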
Usage
Voice cloning requires a reference audio file (WAV, 24 kHz mono):
# Voice cloning TTS
VIBEVOICE_VOICE_AUDIO=reference_voice.wav \
crispasr --tts "Hello, how are you today?" \
-m vibevoice-1.5b-tts-q4_k.gguf \
--tts-output output.wav
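Before running the command, it can help to confirm that the reference file really is 24 kHz mono. The sketch below does that with Python's standard `wave` module (which reads uncompressed PCM WAV only) and then assembles the same invocation as above without executing it. The helper names `check_reference_wav` and `build_tts_command` are illustrative; only the environment variable and flags shown in the shell example are used.

```python
import os
import wave


def check_reference_wav(path):
    """Return a list of problems; an empty list means 24 kHz mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 24000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 24000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected 1 (mono)")
    return problems


def build_tts_command(text, model_path, voice_path, output_path):
    """Return (argv, env) mirroring the shell command, without running it."""
    env = dict(os.environ, VIBEVOICE_VOICE_AUDIO=voice_path)
    argv = ["crispasr", "--tts", text, "-m", model_path, "--tts-output", output_path]
    return argv, env
```

If the check passes, the pair can be handed to `subprocess.run(argv, env=env, check=True)`, assuming `crispasr` is on the `PATH`.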
Architecture
Single-LM architecture (differs from the streaming Realtime-0.5B):
- LM: Qwen2.5-1.5B (d=1536, 28 layers, 12 heads, 2 KV heads)
- Prediction head: 4 AdaLN + SwiGLU layers (d=1536)
- Acoustic encoder: 7-stage ConvNeXt (3200x downsample from 24kHz)
- Semantic encoder: same architecture, 128-dim latent
- Decoder: 7-stage transposed ConvNeXt (3200x upsample)
- DPM-Solver++: 20-step, cosine schedule, v-prediction
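A few derived quantities follow directly from the figures in this list; the sketch below works them out. It assumes the model dimension is split evenly across attention heads, and that a partial final frame is rounded up, neither of which is stated in this card.

```python
import math

# Figures taken from the architecture list.
SAMPLE_RATE_HZ = 24_000
DOWNSAMPLE = 3_200
D_MODEL, N_HEADS, N_KV_HEADS = 1536, 12, 2

# Latent frames produced per second of audio.
frame_rate_hz = SAMPLE_RATE_HZ / DOWNSAMPLE

# Per-head width, assuming an even split of d across heads.
head_dim = D_MODEL // N_HEADS

# Query heads sharing each KV head (grouped-query attention).
queries_per_kv = N_HEADS // N_KV_HEADS


def latent_frames(seconds):
    """Approximate latent frame count for a clip (rounding up is an assumption)."""
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    print(frame_rate_hz, head_dim, queries_per_kv, latent_frames(10))
```

At 7.5 frames per second, each diffusion-sampled latent covers 3200 audio samples, which is why a small number of LM steps spans a comparatively long stretch of speech.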
The model generates speech tokens autoregressively: the LM produces
<|vision_pad|> (speech_diffusion) tokens that trigger diffusion sampling,
with <|vision_start|> / <|vision_end|> as control tokens.
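The control flow described here can be illustrated with a toy loop. This is not the CrispASR or VibeVoice implementation: it replaces the LM with a scripted token stream and the diffusion sampler with a caller-supplied stub, and it assumes the start and end tokens delimit the span in which diffusion tokens are honoured. A real decoder would also feed each sampled latent back into the LM, which is omitted.

```python
from typing import Callable, Iterable, List

# Token strings as named in the text above.
SPEECH_START = "<|vision_start|>"
SPEECH_DIFFUSION = "<|vision_pad|>"
SPEECH_END = "<|vision_end|>"


def collect_latents(tokens: Iterable[str],
                    sample_latent: Callable[[int], object]) -> List[object]:
    """Call sample_latent once per diffusion token inside a start/end span."""
    latents: List[object] = []
    in_speech = False
    for tok in tokens:
        if tok == SPEECH_START:
            in_speech = True
        elif tok == SPEECH_END:
            in_speech = False
        elif tok == SPEECH_DIFFUSION and in_speech:
            # In the real model this is where diffusion sampling would run.
            latents.append(sample_latent(len(latents)))
    return latents


if __name__ == "__main__":
    stream = [SPEECH_START, SPEECH_DIFFUSION, SPEECH_DIFFUSION, SPEECH_END]
    print(collect_latents(stream, lambda i: f"latent{i}"))
```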
Quality
| Input | Parakeet ASR |
|---|---|
| "Hello, how are you today?" | "Hello, how are you today?" |
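One way to quantify a round-trip like the one in this table is word error rate (WER) between the input text and the ASR transcript. The sketch below is a minimal, dependency-free version; the normalization (lowercasing, stripping punctuation) is an assumption and not necessarily how the result above was scored.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation so only the words are compared."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


if __name__ == "__main__":
    text = "Hello, how are you today?"
    print(word_error_rate(text, text))
```

A WER of 0 corresponds to an exact match after normalization, as in the single example shown in the table.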
Differences from Realtime-0.5B
| Feature | Realtime-0.5B | 1.5B Base |
|---|---|---|
| Architecture | 4L base + 20L TTS LM | Single 28L LM |
| Voice input | Pre-computed .pt prompts | Audio WAV files |
| Voice cloning | No (fixed presets) | Yes (from reference audio) |
| Multi-speaker | No | Yes (up to 4 speakers) |
| Streaming | Yes | No |
License
MIT (same as original model).