VibeVoice-Large-AWQ: drop-in AWQ-INT4 quantization
Drop-in replacement for rsxdalv/VibeVoice-Large. The Qwen2-7B language model is quantized to AWQ-INT4 with Marlin GEMM kernels. The audio tokenizer and diffusion head stay FP16. Single repo, single download, standard from_pretrained, no graft step.
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
import torch
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "ncoder-ai/VibeVoice-Large-AWQ",
    torch_dtype=torch.float16,
    device_map="cuda:0",
    attn_implementation="sdpa",
).eval()
processor = VibeVoiceProcessor.from_pretrained("ncoder-ai/VibeVoice-Large-AWQ")
That's it. The quantization_config in config.json tells transformers to
swap the Qwen2 linear layers for AWQ at load time; everything else is FP16.
Why AWQ over the alternatives
VibeVoice's 7B language model dominates VRAM and inference time. Quantizing only that component keeps audio quality untouched while shrinking memory and (on most GPUs) actually speeding things up because Marlin INT4 has less memory traffic.
| Metric | FP16 baseline | bnb-Q8 (FabioSarracino) | AWQ-INT4 (this) |
|---|---|---|---|
| VRAM | 17.41 GB | 10.84 GB | 8.42 GB |
| RTF (i7-14700KF, 5 steps) | 0.509 | 0.860 | 0.457 |
| RTF (i5-12600K, 7 steps) | 0.54 | 1.220 | 0.699 |
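All figures were measured on an RTX 3090; the two RTF rows differ in host CPU and step count (lower RTF is faster). On the i7-14700KF host, the Marlin INT4 kernel is fast enough that AWQ-INT4 beats FP16 on the same hardware (0.457 vs. 0.509). bnb-Q8 is the slowest option in both rows, and its per-call dispatch overhead costs more on the slower CPU (1.220 vs. 0.860).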
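For readers reproducing the RTF rows, here is a minimal sketch of the calculation. It assumes the usual definition of real-time factor (wall-clock generation time divided by the duration of the generated audio) and the 24 kHz output rate used in the usage example; the function name and the sample numbers are illustrative, not taken from the benchmark script.

```python
def real_time_factor(generation_seconds: float, num_samples: int,
                     sample_rate: int = 24000) -> float:
    """Wall-clock generation time divided by the duration of the audio produced.

    Values below 1.0 mean the model generates audio faster than real time.
    """
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return generation_seconds / audio_seconds


# Hypothetical run: 10 s of 24 kHz audio (240,000 samples) generated in 4.57 s.
rtf = real_time_factor(4.57, 240_000)
print(f"RTF = {rtf:.3f}")
```

In practice, `generation_seconds` would come from timing the `model.generate(...)` call (after synchronizing the GPU) and `num_samples` from the length of the returned waveform.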
Audio quality A/B-tested on multi-speaker scenes (4-speaker council scene + 4-speaker contemporary scene) at 7 inference steps: no audible difference from FP16.
Calibration
Calibrated on 256 chat-style prompts (mix of long-form narration, dialog
attributions, multi-speaker scripts) using auto-awq with:
- 4-bit, group_size=128, GEMM version, zero_point=True
- Marlin kernel for inference (auto-selected by AutoAWQ on Ampere+)
The audio components (acoustic_tokenizer, semantic_tokenizer, prediction_head,
acoustic_connector, semantic_connector) are excluded via
modules_to_not_convert, so they load in FP16 from the same checkpoint.
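The settings above can be written down as plain Python data. This is a sketch, not the script used to produce this checkpoint: the dict keys follow the layout AutoAWQ commonly takes as `quant_config` (an assumption here), and `stays_fp16` is a hypothetical helper that approximates exclusion by substring match on the module path.

```python
# Quantization settings from the list above (key names are assumed, see lead-in).
QUANT_CONFIG = {
    "w_bit": 4,           # 4-bit weights
    "q_group_size": 128,  # one scale/zero-point per 128 weights
    "zero_point": True,   # asymmetric quantization
    "version": "GEMM",
}

# Audio components that are excluded from quantization and load in FP16.
EXCLUDED_MODULES = [
    "acoustic_tokenizer",
    "semantic_tokenizer",
    "prediction_head",
    "acoustic_connector",
    "semantic_connector",
]


def stays_fp16(module_name: str, excluded=EXCLUDED_MODULES) -> bool:
    """Return True if any excluded name appears in the module path."""
    return any(name in module_name for name in excluded)


# Hypothetical module paths, for illustration only.
print(stays_fp16("model.prediction_head.layers.0.ffn"))
print(stays_fp16("model.language_model.layers.0.mlp.up_proj"))
```

Only the language-model linear layers fall outside the exclusion list, which is why the audio path is unaffected by quantization.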
Usage with the official VibeVoice library
pip install transformers torch accelerate autoawq soundfile
pip install git+https://github.com/microsoft/VibeVoice.git
The model is loaded the same way as the FP16 version:
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
import torch
MODEL = "ncoder-ai/VibeVoice-Large-AWQ"
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="cuda:0", attn_implementation="sdpa"
).eval()
processor = VibeVoiceProcessor.from_pretrained(MODEL)
# 7 inference steps: sweet spot for AWQ on RTX 3090 (5 steps = thinner audio)
model.set_ddpm_inference_steps(num_steps=7)
inputs = processor(
    text=["Speaker 1: Hello, this is the AWQ-quantized VibeVoice."],
    voice_samples=[["path/to/voice_sample.wav"]],
    padding=True, return_tensors="pt", return_attention_mask=True,
).to("cuda:0")
# Generate without tracking gradients
with torch.inference_mode():
    out = model.generate(
        **inputs, tokenizer=processor.tokenizer,
        cfg_scale=1.3, generation_config={"do_sample": False},
        verbose=False, refresh_negative=True,
    )
# Move to CPU as float32 and drop singleton dimensions before writing
audio = out.speech_outputs[0].cpu().float().numpy().squeeze()
import soundfile as sf
sf.write("output.wav", audio, 24000)  # 24 kHz output
Drop-in replacements
This model also works in:
- VibeVoice-FastAPI: set VIBEVOICE_MODEL_PATH=ncoder-ai/VibeVoice-Large-AWQ and start the server. No other config needed.
- VibeVoice-awq-engine: Python package that wraps this model with helpers for streaming, voice cloning, and multi-speaker scripts.
Hardware
- Required: NVIDIA GPU with compute capability ≥ 8.0 (Ampere or newer) for Marlin INT4 kernels
- Recommended: 12 GB+ VRAM
- Tested: RTX 3090 (24 GB), RTX 4070 Ti (12 GB)
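A quick pre-flight check can catch an unsupported GPU before the model download. The sketch below assumes the Marlin kernels need Ampere (compute capability 8.0), matching the "Ampere+" note in the Calibration section; the threshold is a parameter, and the helper names are illustrative.

```python
MARLIN_MIN_CAPABILITY = (8, 0)  # assumed minimum: Ampere


def meets_capability(capability, minimum=MARLIN_MIN_CAPABILITY) -> bool:
    """Compare (major, minor) compute capability tuples."""
    return tuple(capability) >= tuple(minimum)


def detect_capability():
    """Return the (major, minor) capability of GPU 0, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


cap = detect_capability()
if cap is None:
    print("No CUDA GPU detected (or PyTorch is not installed)")
elif meets_capability(cap):
    print(f"Compute capability {cap[0]}.{cap[1]}: OK")
else:
    print(f"Compute capability {cap[0]}.{cap[1]}: too old for the INT4 kernels")
```

Both tested cards pass this check: the RTX 3090 reports 8.6 and the RTX 4070 Ti reports 8.9.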
License
MIT, same as the upstream rsxdalv/VibeVoice-Large. Not affiliated with Microsoft Research.