# VibeVoice Quantization Guide

Successfully quantized the VibeVoice 7B model to both 4-bit and 8-bit versions using bitsandbytes!
## Model Sizes
| Model Version | Size | Memory Usage | Quality |
|---|---|---|---|
| Original (fp16/bf16) | 18GB | ~18GB VRAM | Best |
| 8-bit Quantized | 9.9GB | ~10.6GB VRAM | Excellent |
| 4-bit Quantized (nf4) | 6.2GB | ~6.6GB VRAM | Very Good |
## How to Use Pre-Quantized Models
### 1. Loading 4-bit Model

```python
import torch

from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Load the pre-quantized 4-bit model
model_path = "/path/to/VibeVoice-Large-4bit"

processor = VibeVoiceProcessor.from_pretrained(model_path)
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_path,
    device_map='cuda',
    torch_dtype=torch.bfloat16,
)
```
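To confirm the memory savings after loading, you can print the model's footprint. This is a minimal sketch that assumes the VibeVoice inference class inherits from `transformers.PreTrainedModel`, which provides `get_memory_footprint()`:

```python
# Rough sanity check of how much memory the quantized weights occupy.
# Assumes the model class inherits from transformers.PreTrainedModel.
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Approximate model memory footprint: {footprint_gb:.1f} GB")
```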
### 2. Loading 8-bit Model

```python
# Same code as above, just point to the 8-bit model
model_path = "/path/to/VibeVoice-Large-8bit"
# ... the rest is the same
```
## Creating Your Own Quantized Models
Use the provided script to quantize models:
```bash
# 4-bit quantization (nf4)
python quantize_and_save_vibevoice.py \
    --model_path /path/to/original/model \
    --output_dir /path/to/output/4bit \
    --bits 4 \
    --test

# 8-bit quantization
python quantize_and_save_vibevoice.py \
    --model_path /path/to/original/model \
    --output_dir /path/to/output/8bit \
    --bits 8 \
    --test
```
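The command-line flags above are the supported interface. For reference, here is a rough sketch of the underlying approach, assuming the script quantizes through `transformers`' `BitsAndBytesConfig` and then serializes the result with `save_pretrained()` (recent `transformers`/`bitsandbytes` releases can save 4-bit weights); the actual internals of `quantize_and_save_vibevoice.py` may differ:

```python
import torch
from transformers import BitsAndBytesConfig

from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Hypothetical sketch of 4-bit (nf4) quantization; not the actual script.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "/path/to/original/model",
    quantization_config=bnb_config,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
processor = VibeVoiceProcessor.from_pretrained("/path/to/original/model")

# Persist the quantized weights and processor so they can be reloaded
# later without re-quantizing (this is what makes the model "pre-quantized").
model.save_pretrained("/path/to/output/4bit")
processor.save_pretrained("/path/to/output/4bit")
```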
## Benefits
- Pre-quantized models load faster - No on-the-fly quantization needed
- Lower VRAM requirements - 4-bit uses only ~6.6GB vs 18GB
- Shareable - Upload the quantized folder to share with others
- Quality preserved - nf4 quantization maintains excellent output quality
## Distribution
To share quantized models:
- Upload the entire quantized model directory (e.g., `VibeVoice-Large-4bit/`)
- Include the `quantization_config.json` file (it is created automatically)
- Users can load the model directly without any quantization setup
## Performance Notes
- 4-bit (nf4): Best for memory-constrained systems, minimal quality loss
- 8-bit: Better quality than 4-bit, still significant memory savings
- Both versions maintain the same generation speed as the original
- Flash Attention 2 is supported in all quantized versions (see the loading sketch below)
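If Flash Attention 2 is installed (`pip install flash-attn`), it can usually be requested at load time through the standard `transformers` keyword argument. This is a sketch under the assumption that the VibeVoice loader forwards `attn_implementation` to the underlying `transformers` machinery:

```python
import torch

from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference

# Sketch: request Flash Attention 2 when loading a quantized checkpoint.
# Assumes flash-attn is installed and attn_implementation is forwarded.
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "/path/to/VibeVoice-Large-4bit",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```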
## Troubleshooting
If loading fails:
- Ensure you have `bitsandbytes` installed: `pip install bitsandbytes` (see the quick check below)
- Make sure you're running on a CUDA-capable GPU
- Check that all model files are present in the directory
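A quick way to verify the first two points (a minimal standalone snippet, not part of the repo):

```python
# Environment check: is bitsandbytes importable, and is a CUDA GPU visible?
import torch

try:
    import bitsandbytes as bnb
    print(f"bitsandbytes {bnb.__version__} found")
except ImportError:
    print("bitsandbytes is not installed: pip install bitsandbytes")

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```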
## Files Created
Each quantized model directory contains:
- `model.safetensors.*` - Quantized model weights
- `config.json` - Model configuration with quantization settings
- `quantization_config.json` - Specific quantization parameters
- `processor/` - Audio processor files
- `load_quantized_Xbit.py` - Example loading script