---
license: mit
language:
- en
tags:
- text-to-speech
- tts
- vibevoice
- awq
- int4
- quantized
base_model: rsxdalv/VibeVoice-Large
base_model_relation: quantized
library_name: transformers
pipeline_tag: text-to-speech
---

# VibeVoice-Large-AWQ: drop-in AWQ-INT4 quantization

> **Drop-in replacement for [`rsxdalv/VibeVoice-Large`](https://huggingface.co/rsxdalv/VibeVoice-Large)**.
> The Qwen2-7B language model is quantized to AWQ-INT4 with Marlin GEMM kernels.
> The audio tokenizer and diffusion head stay in FP16. Single repo, single download,
> standard `from_pretrained`, no graft step.

```python
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
import torch

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "ncoder-ai/VibeVoice-Large-AWQ",
    torch_dtype=torch.float16,
    device_map="cuda:0",
    attn_implementation="sdpa",
).eval()
processor = VibeVoiceProcessor.from_pretrained("ncoder-ai/VibeVoice-Large-AWQ")
```

That's it. The `quantization_config` in `config.json` tells `transformers` to
swap the Qwen2 linear layers for AWQ at load time; everything else is FP16.

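
To make the load-time swap concrete, here is a minimal sketch of what such a `quantization_config` block might look like and how an exclusion list decides which modules stay in FP16. The field names follow the `AwqConfig` serialization used by `transformers`; the values mirror the settings described in this card rather than a dump of the actual `config.json`, and both the module names and the substring-matching rule are simplifying assumptions.

```python
# Hypothetical excerpt of the `quantization_config` block in config.json.
# Values mirror the settings described in this card; they are not copied
# from the real file.
quantization_config = {
    "quant_method": "awq",
    "bits": 4,
    "group_size": 128,
    "zero_point": True,
    "version": "gemm",
    "modules_to_not_convert": [
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "prediction_head",
        "acoustic_connector",
        "semantic_connector",
    ],
}


def stays_fp16(module_name: str, config: dict) -> bool:
    """Return True if a module is excluded from AWQ conversion.

    Simplified rule (an assumption): a module is skipped when any excluded
    key appears in its dotted name. The real loader may differ in detail.
    """
    skipped = config.get("modules_to_not_convert") or []
    return any(key in module_name for key in skipped)


# Hypothetical module names, for illustration only.
for name in (
    "model.language_model.layers.0.self_attn.q_proj",
    "model.prediction_head.layers.0.ffn.up_proj",
    "model.acoustic_tokenizer.decoder.head",
):
    label = "FP16" if stays_fp16(name, quantization_config) else "AWQ-INT4"
    print(f"{name}: {label}")
```

Only the language-model projection is reported as AWQ-INT4 here; the two audio-side modules match an excluded key and stay in FP16.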
## Why AWQ over the alternatives

VibeVoice's 7B language model dominates VRAM and inference time. Quantizing only
that component keeps audio quality untouched while shrinking memory and, on most
GPUs, speeding up inference, because the Marlin INT4 kernel moves less weight
data through memory.

| Metric | FP16 baseline | bnb-Q8 ([FabioSarracino](https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8)) | **AWQ-INT4 (this)** |
|------------------------------|--------------:|------------------:|----------------:|
| VRAM | 17.41 GB | 10.84 GB | **8.42 GB** |
| RTF (i7-14700KF, 5 steps) | 0.509 | 0.860 | **0.457** |
| RTF (i5-12600K, 7 steps) | 0.54 | 1.220 | **0.699** |

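All figures were measured on an RTX 3090; the two RTF rows differ in host CPU
and inference step count. RTF is the real-time factor, so lower is faster. In
the i7-14700KF run the Marlin INT4 kernel is fast enough that AWQ-INT4 beats
FP16 on the same hardware (0.457 vs 0.509). bnb-Q8 is the slowest option in
both runs, and its per-call dispatch overhead costs more on the slower CPU
(RTF 1.220 vs 0.860).
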
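
As a quick worked example, the table's numbers can be turned into a relative VRAM saving and an estimated wall-clock time per minute of audio. This sketch assumes the usual definition RTF = compute time / audio duration, which the card does not spell out; the helper names are illustrative.

```python
# Values copied from the table above (RTX 3090).
vram_gb = {"fp16": 17.41, "bnb_q8": 10.84, "awq_int4": 8.42}
rtf_i7_5_steps = {"fp16": 0.509, "bnb_q8": 0.860, "awq_int4": 0.457}


def percent_saved(baseline: float, value: float) -> float:
    """Relative reduction versus the baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


def generation_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock time for a clip, assuming RTF = compute time / audio time."""
    return rtf * audio_seconds


vram_saving = percent_saved(vram_gb["fp16"], vram_gb["awq_int4"])
print(f"VRAM saved vs FP16: {vram_saving:.1f}%")

for name, rtf in rtf_i7_5_steps.items():
    seconds = generation_seconds(rtf, 60.0)
    print(f"{name}: {seconds:.1f} s to generate 60 s of audio")
```

Under that assumption, any RTF below 1.0 means audio is generated faster than it plays back.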
Audio quality was A/B-tested on multi-speaker scenes (a 4-speaker council scene
and a 4-speaker contemporary scene) at 7 inference steps, with no audible
difference from FP16.

## Calibration

Calibrated on 256 chat-style prompts (a mix of long-form narration, dialog
attributions, and multi-speaker scripts) using AutoAWQ with:

- 4-bit weights, `group_size=128`, GEMM version, `zero_point=True`
- Marlin kernel for inference (auto-selected by AutoAWQ on Ampere+)

The audio components (`acoustic_tokenizer`, `semantic_tokenizer`, `prediction_head`,
`acoustic_connector`, `semantic_connector`) are excluded via
`modules_to_not_convert`, so they load in FP16 from the same checkpoint.

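
To illustrate what "4-bit, group size 128, with a zero point" means numerically, the sketch below quantizes a toy weight row group by group and measures the reconstruction error. It is plain round-to-nearest quantization in pure Python, not AWQ's activation-aware scale search and not the packed layout the Marlin kernel reads; the weight values are random stand-ins.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights.

    Returns (codes, scale, zero_point) with integer codes in [0, 2**bits - 1].
    """
    qmax = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = max(hi - lo, 1e-5) / qmax  # guard against a constant group
    # The zero point is clamped so that it also fits in `bits` bits.
    zero_point = min(qmax, max(0, -round(lo / scale)))
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


random.seed(0)
group_size = 128
row = [random.gauss(0.0, 0.02) for _ in range(4 * group_size)]  # toy weight row

restored = []
for start in range(0, len(row), group_size):
    codes, scale, zp = quantize_group(row[start:start + group_size])
    restored.extend(dequantize_group(codes, scale, zp))

max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"max abs reconstruction error: {max_err:.5f}")
```

Each group of 128 weights gets its own scale and zero point, so an outlier only widens the quantization step inside its own group.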
## Usage with the official VibeVoice library

```bash
pip install transformers torch accelerate autoawq soundfile
pip install git+https://github.com/microsoft/VibeVoice.git
```

The model is loaded the same way as the FP16 version:

```python
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
import soundfile as sf
import torch

MODEL = "ncoder-ai/VibeVoice-Large-AWQ"

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="cuda:0", attn_implementation="sdpa"
).eval()
processor = VibeVoiceProcessor.from_pretrained(MODEL)

# 7 inference steps: sweet spot for AWQ on RTX 3090 (5 steps gives thinner audio)
model.set_ddpm_inference_steps(num_steps=7)

inputs = processor(
    text=["Speaker 1: Hello, this is the AWQ-quantized VibeVoice."],
    voice_samples=[["path/to/voice_sample.wav"]],
    padding=True, return_tensors="pt", return_attention_mask=True,
).to("cuda:0")

with torch.inference_mode():
    out = model.generate(
        **inputs, tokenizer=processor.tokenizer,
        cfg_scale=1.3, generation_config={"do_sample": False},
        verbose=False, refresh_negative=True,
    )

# Move the waveform to the CPU as float32 and write it at 24 kHz
audio = out.speech_outputs[0].cpu().float().numpy().squeeze()
sf.write("output.wav", audio, 24000)
```

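
For multi-speaker input, the `text` argument can be assembled from a list of turns. The helper below is a hypothetical convenience, not part of the VibeVoice library; it assumes the script format is one `Speaker N: text` line per turn, matching the single-speaker example above.

```python
def build_script(turns):
    """Format (speaker_id, text) turns as one 'Speaker N: text' line each."""
    lines = []
    for speaker_id, text in turns:
        # Collapse internal whitespace so each turn stays on a single line.
        cleaned = " ".join(text.split())
        if not cleaned:
            continue  # skip empty turns
        lines.append(f"Speaker {speaker_id}: {cleaned}")
    return "\n".join(lines)


script = build_script([
    (1, "Welcome to the council."),
    (2, "Thank you.\nShall we begin?"),
    (1, "   "),
])
print(script)
```

The resulting string would be passed as `text=[script]`, with one reference clip per speaker in `voice_samples`; the assumption that clips are matched to speakers in order should be checked against the processor before relying on it.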
## Drop-in replacements

This model also works in:

- **[VibeVoice-FastAPI](https://github.com/ncoder-ai/VibeVoice-FastAPI)**: set `VIBEVOICE_MODEL_PATH=ncoder-ai/VibeVoice-Large-AWQ` and start the server. No other config needed.
- **[VibeVoice-awq-engine](https://github.com/ncoder-ai/VibeVoice-awq-engine)**: Python package that wraps this model with helpers for streaming, voice cloning, and multi-speaker scripts.

## Hardware

- **Required**: NVIDIA GPU with compute capability ≥ 7.5 (Turing or newer); the Marlin INT4 kernels are auto-selected on Ampere or newer
- **Recommended**: 12 GB+ VRAM
- **Tested**: RTX 3090 (24 GB), RTX 4070 Ti (12 GB)

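
A quick way to check a machine against these requirements is to read the compute capability from PyTorch. This is a minimal sketch: the function names are illustrative, the 7.5 threshold is the one stated above, and the check degrades to `None` when PyTorch or CUDA is not available.

```python
def meets_minimum_capability(capability, minimum=(7, 5)):
    """Compare a (major, minor) compute capability against the stated minimum."""
    return tuple(capability) >= tuple(minimum)


def check_local_gpu():
    """Describe the first CUDA device, or return None without PyTorch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    capability = torch.cuda.get_device_capability(0)
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return {
        "capability": capability,
        "meets_minimum": meets_minimum_capability(capability),
        "vram_gib": round(total_gib, 1),
    }


print(check_local_gpu())
```

For reference, Turing cards report `(7, 5)`, the RTX 3090 reports `(8, 6)`, and the RTX 4070 Ti reports `(8, 9)`.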
## License

MIT, same as the upstream `rsxdalv/VibeVoice-Large`. Not affiliated with Microsoft Research.