metadata
license: mit
language:
  - en
  - zh
tags:
  - text-to-speech
  - tts
  - speech-synthesis
  - vibevoice
  - bitsandbytes
  - 4bit
  - quantized
library_name: transformers
base_model: vibevoice/VibeVoice-7B
pipeline_tag: text-to-speech

VibeVoice 7B - 4-bit Quantized (bitsandbytes NF4)

This is a 4-bit version of VibeVoice 7B, quantized with bitsandbytes NF4.

Model Details

| Property | Value |
| --- | --- |
| Base Model | vibevoice/VibeVoice-7B |
| Quantization | bitsandbytes NF4 (4-bit) |
| VRAM Usage | ~6.2 GB |
| Model Size | ~6.2 GB on disk |
| Sample Rate | 24 kHz |

VRAM Comparison

| Mode | VRAM | Reduction |
| --- | --- | --- |
| Full bfloat16 | ~17 GB | baseline |
| ao-int8 | ~9.4 GB | 45% |
| bnb-4bit | ~6.2 GB | 64% |
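
The reduction column follows directly from the VRAM figures. As a quick check, the short sketch below recomputes it from the numbers in the table; the function name `reduction_percent` is only an illustrative helper, not part of any library.

```python
# VRAM figures in GB, copied from the comparison table above.
BASELINE_GB = 17.0
vram_gb = {"ao-int8": 9.4, "bnb-4bit": 6.2}


def reduction_percent(baseline: float, value: float) -> float:
    """Relative saving versus the baseline, in percent."""
    return (baseline - value) / baseline * 100


for mode, gb in vram_gb.items():
    print(f"{mode}: {reduction_percent(BASELINE_GB, gb):.1f}% less VRAM than bfloat16")
```

Rounded to whole percentages, the results agree with the 45% and 64% listed in the table.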

Quick Start

Installation

pip install transformers bitsandbytes torch torchaudio
pip install git+https://github.com/vibevoice-community/VibeVoice.git
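
After installing, it can help to confirm that every package is importable before loading several gigabytes of weights. The following is a minimal sketch using only the standard library; the module list mirrors the imports used in this README, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Import names (not pip distribution names) used by the examples in this README.
REQUIRED_MODULES = ["torch", "torchaudio", "transformers", "bitsandbytes", "vibevoice"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only checks that the packages are installed. It does not verify that a CUDA GPU is visible or that bitsandbytes was built for your CUDA version.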

Usage

import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Load quantized model
model_id = "marksverdhai/vibevoice-7b-bnb-4bit"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_id,
    device_map={"": 0},  # Load on GPU 0
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
processor = VibeVoiceProcessor.from_pretrained(model_id)

model.eval()
model.set_ddpm_inference_steps(num_steps=10)

# Generate speech
text = "Speaker 1: Hello! This is VibeVoice, a state-of-the-art text-to-speech model."

inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=None,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        generation_config={"do_sample": False},
        verbose=False,
        is_prefill=False,
    )

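# Get audio; cast to float32 in case the model returns bfloat16,
# which torchaudio.save does not accept
audio = outputs.speech_outputs[0].squeeze().float().cpu()
sample_rate = 24000

# Save to file (torchaudio.save expects a [channels, samples] tensor)
import torchaudio
torchaudio.save("output.wav", audio.unsqueeze(0), sample_rate)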

Voice Cloning

# With voice reference
inputs = processor(
    text=["Speaker 1: Hello, I can clone any voice!"],
    voice_samples=[["path/to/reference.wav"]],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        is_prefill=True,  # Enable voice cloning
    )

Quality Verification

This model was tested with Whisper transcription to verify output quality:

| Test Sentence | WER |
| --- | --- |
| "Hello, this is a test." | 0% |
| "The quick brown fox jumps over the lazy dog." | 0% |
| "Good morning, how are you today?" | 0% |
| "Machine learning is transforming technology." | 0% |
| "Please remember to save your work frequently." | 0% |

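All five test sentences were transcribed with 0% Word Error Rate (WER), matching the full-precision model on this test. WER measures intelligibility only; it does not capture prosody or voice similarity.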
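
For readers who want to repeat this check on their own sentences, here is a minimal sketch of a WER computation: word-level edit distance divided by the number of reference words. The normalization (lowercasing and stripping ASCII punctuation) is an assumption; this README does not state which normalization the original test used, and the Whisper transcription step is left out.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, split on whitespace (assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# `transcript` stands in for the text returned by a speech recognizer.
transcript = "hello this is a test"
print(f"WER: {word_error_rate('Hello, this is a test.', transcript):.0%}")
```

With this normalization, differences in punctuation or capitalization between the prompt and the transcript do not count as errors.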

Quantization Details

This model uses bitsandbytes NF4 quantization:

  • NF4 (NormalFloat4): Optimized 4-bit data type for neural network weights
  • Double Quantization: Nested quantization for additional memory savings
  • Compute dtype: bfloat16 for computations
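
To make the idea behind NF4 concrete, the sketch below quantizes one small block of weights in pure Python: each weight is divided by the block's absolute maximum and replaced by the index of the nearest of 16 code values. The code values are the NF4 levels rounded to four decimals and should be treated as illustrative. The real bitsandbytes kernels work on larger blocks, pack two 4-bit codes per byte, and run on the GPU, none of which is modelled here.

```python
# Approximate NF4 code values (rounded to 4 decimals); illustrative only.
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def quantize_block(weights):
    """Map one block of floats to (absmax scale, list of 4-bit code indices)."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    indices = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / absmax))
        for w in weights
    ]
    return absmax, indices


def dequantize_block(absmax, indices):
    """Recover approximate weights from the scale and the code indices."""
    return [NF4_LEVELS[i] * absmax for i in indices]


block = [0.02, -0.15, 0.31, -0.07, 0.0, 0.12, -0.31, 0.05]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)

for original, approx in zip(block, restored):
    print(f"{original:+.3f} -> {approx:+.3f}")
```

The levels are spaced more densely near zero, where most weights of a trained network lie, so small weights are reconstructed more precisely than a uniform 4-bit grid would allow.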

The quantization is applied to the Qwen2 LLM backbone while preserving full precision for:

  • Audio tokenizers (semantic and acoustic)
  • Diffusion head

Limitations

  • Requires a CUDA GPU with bitsandbytes support
  • Inference is about 1.3x slower than full precision
  • Model load time is longer (~65 s vs ~24 s for full precision)

Citation

@misc{vibevoice2024,
  title={VibeVoice: Emotion-Aware Text-to-Speech},
  author={VibeVoice Team},
  year={2024},
  url={https://github.com/vibevoice-community/VibeVoice}
}

License

MIT