	---
	license: mit
	language:
	- en
	tags:
	- text-to-speech
	- tts
	- vibevoice
	- awq
	- int4
	- quantized
	base_model: rsxdalv/VibeVoice-Large
	base_model_relation: quantized
	library_name: transformers
	pipeline_tag: text-to-speech
	---

	# Use [`ncoder-ai/VibeVoice-Large-AWQ`](https://huggingface.co/ncoder-ai/VibeVoice-Large-AWQ) instead

	This repo holds only the AWQ-INT4 Qwen2 LLM weights. It exists so that the
	AWQ-quantized LLM can be combined by hand with a custom VibeVoice base
	(e.g. a fork, a fine-tune, or a different audio stack).

	You almost certainly want the unified drop-in instead: [`ncoder-ai/VibeVoice-Large-AWQ`](https://huggingface.co/ncoder-ai/VibeVoice-Large-AWQ).

	That repo bundles the same AWQ-INT4 LLM with the FP16 audio components in one
	checkpoint, so a single `from_pretrained()` call loads it directly with no
	manual graft step. Speed, VRAM use (~8.4 GB), and audio quality are the same.

	```python
	from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
	import torch

	model = VibeVoiceForConditionalGenerationInference.from_pretrained(
	    "ncoder-ai/VibeVoice-Large-AWQ",
	    torch_dtype=torch.float16,
	    device_map="cuda:0",
	).eval()
	```

	---

	## If you really need the LLM-only weights

	This path is for advanced users who are hand-grafting the AWQ Qwen2 LLM into a
	custom base. You provide the FP16 audio stack (acoustic tokenizer, diffusion
	head, connectors); this repo provides only the quantized language model.

	```python
	import gc

	import torch
	from awq import AutoAWQForCausalLM
	from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference

	# Load your custom FP16 base (audio stack plus the FP16 LLM)
	model = VibeVoiceForConditionalGenerationInference.from_pretrained(
	    "rsxdalv/VibeVoice-Large",
	    torch_dtype=torch.float16,
	    device_map="cuda:0",
	).eval()

	# Drop the FP16 LLM and release its VRAM before loading the quantized one
	del model.model.language_model
	gc.collect()
	torch.cuda.empty_cache()

	# Load the AWQ-INT4 Qwen2 weights (unfused layers)
	awq = AutoAWQForCausalLM.from_quantized(
	    "ncoder-ai/VibeVoice-Large-AWQ-INT4",
	    device_map={"": 0},
	    safetensors=True,
	    fuse_layers=False,
	)

	# Graft the inner Qwen2 model in place of the FP16 LLM, then drop the wrapper
	model.model.language_model = awq.model.model
	del awq
	gc.collect()
	torch.cuda.empty_cache()
	```
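
One quick way to confirm the graft took effect is to count layer classes under the language model. The helper below is a minimal sketch: `count_layer_types` is a hypothetical name, and it relies only on the object exposing a `.modules()` iterator, as `torch.nn.Module` does. With AutoAWQ's GEMM kernels the quantized linear layers are expected to show up under a class name such as `WQLinear_GEMM`, but the exact name depends on the AutoAWQ version and is an assumption here.

```python
from collections import Counter


def count_layer_types(module):
    """Count submodules by class name; works on any object exposing .modules()."""
    return Counter(type(m).__name__ for m in module.modules())


# Usage after the graft (commented out because it needs the loaded model):
# counts = count_layer_types(model.model.language_model)
# print(counts.most_common(5))
```

If plain `Linear` layers still dominate the count for the language model, the FP16 LLM was probably not replaced.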

	Quantization recipe: AutoAWQ, 4-bit, group_size=128, GEMM (Marlin) version,
	zero_point=True. Calibration: 250 samples (200 prose + 50 wikitext), 512-token
	max length.
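
To make the recipe concrete, the sketch below writes the settings with AutoAWQ's `quant_config` key names and illustrates asymmetric (zero-point) 4-bit quantization of a single group of 128 weights in pure Python. It shows only the group-wise rounding step; AWQ itself also rescales weights using activation statistics from the calibration samples, which is omitted here. The helpers `quantize_group` and `dequantize_group` are hypothetical names, not AutoAWQ functions, and the storage estimate assumes one FP16 scale and one 4-bit zero point per group.

```python
# Recipe settings, using AutoAWQ's quant_config key names.
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}


def quantize_group(weights, w_bit=4):
    """Asymmetric quantization of one group: returns (ints, scale, zero)."""
    qmax = 2 ** w_bit - 1                       # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0             # fall back to 1.0 for a constant group
    zero = max(0, min(qmax, round(-lo / scale)))  # integer zero point in [0, qmax]
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map the stored integers back to approximate float weights."""
    return [(v - zero) * scale for v in q]


# Approximate storage cost per weight: 4-bit value plus the per-group
# FP16 scale and 4-bit zero point (assumed layout) amortized over the group.
effective_bits = quant_config["w_bit"] + (16 + quant_config["w_bit"]) / quant_config["q_group_size"]
print(effective_bits)  # 4.15625
```

A smaller group size would reduce rounding error per group at the cost of more scale and zero-point overhead.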

	## License

	MIT — same as upstream `rsxdalv/VibeVoice-Large`.