	---
	license: mit
	language:
	- en
	tags:
	- text-to-speech
	- tts
	- vibevoice
	- awq
	- int4
	- quantized
	base_model: rsxdalv/VibeVoice-Large
	base_model_relation: quantized
	library_name: transformers
	pipeline_tag: text-to-speech
	---

	# VibeVoice-Large-AWQ — drop-in AWQ-INT4 quantization

	> Drop-in replacement for [`rsxdalv/VibeVoice-Large`](https://huggingface.co/rsxdalv/VibeVoice-Large).
	> The Qwen2-7B language model is quantized to AWQ-INT4 and runs on Marlin GEMM kernels.
	> The audio tokenizers and diffusion head stay in FP16. Single repo, single download,
	> standard `from_pretrained`, no graft step.

	```python
	from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
	from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
	import torch

	model = VibeVoiceForConditionalGenerationInference.from_pretrained(
	    "ncoder-ai/VibeVoice-Large-AWQ",
	    torch_dtype=torch.float16,
	    device_map="cuda:0",
	    attn_implementation="sdpa",
	).eval()
	processor = VibeVoiceProcessor.from_pretrained("ncoder-ai/VibeVoice-Large-AWQ")
	```

	That's it. The `quantization_config` in `config.json` tells `transformers` to
	swap the Qwen2 linear layers for AWQ-quantized layers at load time; everything
	else loads in FP16.
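As a rough illustration of that load-time decision, the sketch below checks which layers a `quantization_config` would leave in FP16. The field names follow `transformers`' `AwqConfig`; the values mirror the settings described in this card rather than the repo's actual `config.json`, and the module names and substring-matching rule are assumptions made for the example.

```python
# Illustrative config fragment: values mirror this card's description,
# not necessarily the exact contents of the repo's config.json.
EXAMPLE_CONFIG = {
    "quantization_config": {
        "quant_method": "awq",
        "bits": 4,
        "group_size": 128,
        "zero_point": True,
        "version": "gemm",
        "modules_to_not_convert": [
            "acoustic_tokenizer",
            "semantic_tokenizer",
            "prediction_head",
            "acoustic_connector",
            "semantic_connector",
        ],
    }
}


def is_quantized(module_name: str, config: dict) -> bool:
    """Return True if a linear layer with this name would be swapped for AWQ.

    Assumption: a layer is skipped when any entry of modules_to_not_convert
    appears as a substring of its dotted module name.
    """
    qcfg = config.get("quantization_config")
    if not qcfg or qcfg.get("quant_method") != "awq":
        return False
    skipped = qcfg.get("modules_to_not_convert") or []
    return not any(key in module_name for key in skipped)


# Hypothetical module names, for illustration only.
for name in (
    "model.language_model.layers.0.self_attn.q_proj",
    "model.prediction_head.layers.0.ffn.gate_proj",
):
    precision = "AWQ-INT4" if is_quantized(name, EXAMPLE_CONFIG) else "FP16"
    print(f"{name} -> {precision}")
```

To see the real settings, open the `config.json` shipped in this repo.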

	## Why AWQ over the alternatives

	VibeVoice's 7B language model dominates both VRAM and inference time. Quantizing only
	that component leaves the audio path at full precision while shrinking memory and (on
	most GPUs) speeding up generation, because the Marlin INT4 kernels read far less weight
	data than FP16.

	| Metric | FP16 baseline | bnb-Q8 ([FabioSarracino](https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8)) | AWQ-INT4 (this) |
	|------------------------------|--------------:|------------------:|----------------:|
	| VRAM | 17.41 GB | 10.84 GB | 8.42 GB |
	| RTF (i7-14700KF, 5 steps) | 0.509 | 0.860 | 0.457 |
	| RTF (i5-12600K, 7 steps) | 0.54 | 1.220 | 0.699 |

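	Both RTF rows (real-time factor: generation time divided by audio duration, lower is
	better) were measured on an RTX 3090; they differ in host CPU and step count. The
	Marlin INT4 kernel is fast enough that AWQ-INT4 beats FP16 with the i7-14700KF
	(0.457 vs 0.509), though it trails FP16 with the slower i5-12600K (0.699 vs 0.54).
	bnb-Q8 is slowest in both rows, at about 1.7x and 2.3x the FP16 RTF, because its
	per-call dispatch overhead costs more on the slower CPU.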
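The relative figures can be recomputed directly from the table. The short sketch below does only that arithmetic; the dictionary keys and helper names are made up for the example, and the numbers are copied from the table above.

```python
# Figures copied from the benchmark table (all measured on an RTX 3090).
VRAM_GB = {"fp16": 17.41, "bnb_q8": 10.84, "awq_int4": 8.42}
RTF = {
    "i7-14700KF, 5 steps": {"fp16": 0.509, "bnb_q8": 0.860, "awq_int4": 0.457},
    "i5-12600K, 7 steps": {"fp16": 0.54, "bnb_q8": 1.220, "awq_int4": 0.699},
}


def vram_saving(variant: str) -> float:
    """Fraction of FP16 VRAM saved by the given variant."""
    return 1.0 - VRAM_GB[variant] / VRAM_GB["fp16"]


def rtf_ratio(row: str, variant: str) -> float:
    """RTF of the variant relative to FP16 in the same row (below 1 is faster)."""
    return RTF[row][variant] / RTF[row]["fp16"]


for variant in ("bnb_q8", "awq_int4"):
    print(f"{variant}: {vram_saving(variant):.1%} less VRAM than FP16")
    for row in RTF:
        print(f"  {row}: {rtf_ratio(row, variant):.2f}x the FP16 RTF")
```

A ratio below 1 means the variant generates audio faster than FP16 in that row; a ratio above 1 means it is slower.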

	Audio quality was A/B-tested on two multi-speaker scenes (a 4-speaker council scene
	and a 4-speaker contemporary scene) at 7 inference steps, with no audible difference
	from FP16.

	## Calibration

	Calibrated on 256 chat-style prompts (a mix of long-form narration, dialog
	attributions, and multi-speaker scripts) using AutoAWQ (`autoawq`) with:
	- 4-bit, group_size=128, GEMM version, zero_point=True
	- Marlin kernel for inference (auto-selected by AutoAWQ on Ampere+)

	The audio components (acoustic_tokenizer, semantic_tokenizer, prediction_head,
	acoustic_connector, semantic_connector) are excluded via
	`modules_to_not_convert`, so they load in FP16 from the same checkpoint.
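To make the "4-bit, group_size=128, zero_point=True" settings concrete, here is a minimal pure-Python sketch of group-wise asymmetric quantization. It illustrates only the storage grid (one scale and one zero point per group of 128 weights); it does not reproduce AWQ's activation-aware scale search or the calibration run described above, and the function names are invented for the example.

```python
import random


def quantize_group(weights, bits=4):
    """Quantize one group to unsigned integers with a scale and zero point."""
    qmax = 2 ** bits - 1  # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax
    if scale == 0.0:
        scale = 1.0  # constant group: any scale works
    zero = max(0, min(qmax, round(-lo / scale)))
    codes = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups and quantize each independently."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]  # synthetic weights
groups = quantize_row(row)
restored = [w for group in groups for w in dequantize_group(*group)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
max_scale = max(scale for _, scale, _ in groups)
print(f"{len(groups)} groups, max abs error {max_err:.5f}, largest step {max_scale:.5f}")
```

Smaller groups track local weight ranges more closely at the cost of storing more scales and zero points; 128 is the value used for this checkpoint.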

	## Usage with the official VibeVoice library

	```bash
	pip install transformers torch accelerate autoawq soundfile
	pip install git+https://github.com/microsoft/VibeVoice.git
	```

	The model is loaded the same way as the FP16 version:

	```python
	from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
	from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
	import soundfile as sf
	import torch

	MODEL = "ncoder-ai/VibeVoice-Large-AWQ"
	model = VibeVoiceForConditionalGenerationInference.from_pretrained(
	    MODEL, torch_dtype=torch.float16, device_map="cuda:0", attn_implementation="sdpa"
	).eval()
	processor = VibeVoiceProcessor.from_pretrained(MODEL)

	# 7 inference steps: sweet spot for AWQ on RTX 3090 (5 steps gives thinner audio)
	model.set_ddpm_inference_steps(num_steps=7)

	# One script and one reference voice sample per speaker in the script
	inputs = processor(
	    text=["Speaker 1: Hello, this is the AWQ-quantized VibeVoice."],
	    voice_samples=[["path/to/voice_sample.wav"]],
	    padding=True, return_tensors="pt", return_attention_mask=True,
	).to("cuda:0")

	with torch.inference_mode():
	    out = model.generate(
	        **inputs, tokenizer=processor.tokenizer,
	        cfg_scale=1.3, generation_config={"do_sample": False},
	        verbose=False, refresh_negative=True,
	    )

	# Move the waveform to the CPU and write it as a 24 kHz WAV file
	audio = out.speech_outputs[0].cpu().float().numpy().squeeze()
	sf.write("output.wav", audio, 24000)
	```
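To compare your own setup against the benchmark table, you can time a generation call and divide by the length of the audio it produced. The helper below is a minimal sketch of that measurement, not necessarily how the published numbers were taken; `real_time_factor` and `fake_generate` are hypothetical names, and the stand-in generator exists only so the example runs without a GPU.

```python
import time

SAMPLE_RATE = 24000  # matches the rate passed to sf.write above


def real_time_factor(generate_fn, sample_rate=SAMPLE_RATE):
    """Time generate_fn() and return (rtf, audio_seconds, wall_seconds).

    generate_fn must return a 1-D sequence of audio samples.
    RTF = wall-clock time / audio duration, so lower is better.
    """
    start = time.perf_counter()
    audio = generate_fn()
    wall_seconds = time.perf_counter() - start
    audio_seconds = len(audio) / sample_rate
    if audio_seconds == 0:
        raise ValueError("generate_fn returned no audio")
    return wall_seconds / audio_seconds, audio_seconds, wall_seconds


def fake_generate():
    """Stand-in for the real model call: sleeps, then returns 1 s of silence."""
    time.sleep(0.05)
    return [0.0] * SAMPLE_RATE


rtf, audio_s, wall_s = real_time_factor(fake_generate)
print(f"RTF {rtf:.3f} ({wall_s:.2f} s to generate {audio_s:.2f} s of audio)")
```

With the real model, pass a function that runs `model.generate(...)` and returns the squeezed NumPy waveform; including the `.cpu()` conversion inside the timed function ensures the GPU work has finished before the clock stops.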

	## Drop-in replacements

	This model also works in:

	- [VibeVoice-FastAPI](https://github.com/ncoder-ai/VibeVoice-FastAPI) — set `VIBEVOICE_MODEL_PATH=ncoder-ai/VibeVoice-Large-AWQ` and start the server. No other config needed.
	- [VibeVoice-awq-engine](https://github.com/ncoder-ai/VibeVoice-awq-engine) — Python package that wraps this model with helpers for streaming, voice cloning, and multi-speaker scripts.

	## Hardware

	- Required: NVIDIA GPU with compute capability ≥ 7.5 (Turing or newer) for the AWQ INT4 kernels; the Marlin path needs Ampere (compute capability 8.0) or newer
	- Recommended: 12 GB+ VRAM
	- Tested: RTX 3090 (24 GB), RTX 4070 Ti (12 GB)
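A quick way to check a machine against these requirements is to read the GPU's compute capability. The sketch below assumes the 7.5 floor from the list above and the Ampere (8.0) cutoff for Marlin mentioned in the Calibration section; `kernel_tier` and its return labels are hypothetical, and the PyTorch calls run only if `torch` and a CUDA device are present.

```python
def kernel_tier(capability):
    """Classify a (major, minor) compute capability.

    Thresholds are assumptions taken from this card: 7.5 as the minimum,
    8.0 (Ampere) and newer for the Marlin kernels.
    """
    major, minor = capability
    if (major, minor) >= (8, 0):
        return "marlin"
    if (major, minor) >= (7, 5):
        return "awq-gemm"
    return "unsupported"


# The two tested cards: RTX 3090 is 8.6, RTX 4070 Ti is 8.9.
for name, cap in (("RTX 3090", (8, 6)), ("RTX 4070 Ti", (8, 9))):
    print(name, "->", kernel_tier(cap))

try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Local GPU: capability {cap}, {vram_gb:.1f} GB -> {kernel_tier(cap)}")
```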

	## License

	MIT — same as the upstream `rsxdalv/VibeVoice-Large`. Not affiliated with Microsoft Research.