# VibeVoice Quantization Guide

Successfully quantized the VibeVoice 7B model to both 4-bit and 8-bit versions using bitsandbytes!
## Model Sizes
| Model Version | Size | Memory Usage | Quality |
|---|---|---|---|
| Original (fp16/bf16) | 18GB | ~18GB VRAM | Best |
| 8-bit Quantized | 9.9GB | ~10.6GB VRAM | Excellent |
| 4-bit Quantized (nf4) | 6.2GB | ~6.6GB VRAM | Very Good |
## How to Use Pre-Quantized Models
### 1. Loading 4-bit Model

```python
import torch

from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Load the pre-quantized 4-bit model
model_path = "/path/to/VibeVoice-Large-4bit"

processor = VibeVoiceProcessor.from_pretrained(model_path)
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_path,
    device_map='cuda',
    torch_dtype=torch.bfloat16,
)
```
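To confirm the memory savings after loading, you can print the model's footprint. This is a minimal sketch that assumes the VibeVoice inference class inherits from `transformers.PreTrainedModel`, which provides `get_memory_footprint()`:

```python
# Rough sanity check of how much memory the quantized weights occupy.
# Assumes the model class inherits from transformers.PreTrainedModel.
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Approximate model memory footprint: {footprint_gb:.1f} GB")
```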
### 2. Loading 8-bit Model

```python
# Same code as above, just point to the 8-bit model
model_path = "/path/to/VibeVoice-Large-8bit"
# ... the rest is the same
```
## Creating Your Own Quantized Models
Use the provided script to quantize models:
```bash
# 4-bit quantization (nf4)
python quantize_and_save_vibevoice.py \
    --model_path /path/to/original/model \
    --output_dir /path/to/output/4bit \
    --bits 4 \
    --test

# 8-bit quantization
python quantize_and_save_vibevoice.py \
    --model_path /path/to/original/model \
    --output_dir /path/to/output/8bit \
    --bits 8 \
    --test
```
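The command-line flags above are the supported interface. For reference, here is a rough sketch of the underlying approach, assuming the script quantizes through `transformers`' `BitsAndBytesConfig` and then serializes the result with `save_pretrained()` (recent `transformers`/`bitsandbytes` releases can save 4-bit weights); the actual internals of `quantize_and_save_vibevoice.py` may differ:

```python
import torch
from transformers import BitsAndBytesConfig

from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Hypothetical sketch of 4-bit (nf4) quantization; not the actual script.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "/path/to/original/model",
    quantization_config=bnb_config,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
processor = VibeVoiceProcessor.from_pretrained("/path/to/original/model")

# Persist the quantized weights and processor so they can be reloaded
# later without re-quantizing (this is what makes the model "pre-quantized").
model.save_pretrained("/path/to/output/4bit")
processor.save_pretrained("/path/to/output/4bit")
```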
## Benefits
- Pre-quantized models load faster - No on-the-fly quantization needed
- Lower VRAM requirements - 4-bit uses only ~6.6GB vs 18GB
- Shareable - Upload the quantized folder to share with others
- Quality preserved - nf4 quantization maintains excellent output quality
## Distribution
To share quantized models:
- Upload the entire quantized model directory (e.g., `VibeVoice-Large-4bit/`)
- Include the `quantization_config.json` file (it is created automatically)
- Users can load the model directly without any quantization setup
## Performance Notes
- 4-bit (nf4): Best for memory-constrained systems, minimal quality loss
- 8-bit: Better quality than 4-bit, still significant memory savings
- Both versions maintain the same generation speed as the original
- Flash Attention 2 is supported in all quantized versions (see the loading sketch below)
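If Flash Attention 2 is installed (`pip install flash-attn`), it can usually be requested at load time through the standard `transformers` keyword argument. This is a sketch under the assumption that the VibeVoice loader forwards `attn_implementation` to the underlying `transformers` machinery:

```python
import torch

from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference

# Sketch: request Flash Attention 2 when loading a quantized checkpoint.
# Assumes flash-attn is installed and attn_implementation is forwarded.
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "/path/to/VibeVoice-Large-4bit",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```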
## Troubleshooting
If loading fails:
- Ensure you have `bitsandbytes` installed: `pip install bitsandbytes` (see the quick check below)
- Make sure you're running on a CUDA-capable GPU
- Check that all model files are present in the directory
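A quick way to verify the first two points (a minimal standalone snippet, not part of the repo):

```python
# Environment check: is bitsandbytes importable, and is a CUDA GPU visible?
import torch

try:
    import bitsandbytes as bnb
    print(f"bitsandbytes {bnb.__version__} found")
except ImportError:
    print("bitsandbytes is not installed: pip install bitsandbytes")

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```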
## Files Created
Each quantized model directory contains:
- `model.safetensors.*` - Quantized model weights
- `config.json` - Model configuration with quantization settings
- `quantization_config.json` - Specific quantization parameters
- `processor/` - Audio processor files
- `load_quantized_Xbit.py` - Example loading script