license: mit
language:
  - en
  - zh
tags:
  - text-to-speech
  - tts
  - speech-synthesis
  - vibevoice
  - bitsandbytes
  - 4bit
  - quantized
library_name: transformers
base_model: vibevoice/VibeVoice-7B
pipeline_tag: text-to-speech

VibeVoice 7B - 4-bit Quantized (bitsandbytes NF4)

This is a 4-bit quantized version of VibeVoice 7B using bitsandbytes NF4 quantization.

Model Details

Property	Value
Base Model	vibevoice/VibeVoice-7B
Quantization	bitsandbytes NF4 (4-bit)
VRAM Usage	~6.2 GB
Model Size	~6.2 GB on disk
Sample Rate	24 kHz

VRAM Comparison

Mode	VRAM	Reduction
Full bfloat16	~17 GB	baseline
ao-int8	~9.4 GB	45%
bnb-4bit	~6.2 GB	64%
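
The Reduction column follows from the VRAM figures: each value is one minus the ratio to the bfloat16 baseline. A minimal sketch of that arithmetic, using only the numbers from the table above (the function name `vram_reduction` is illustrative, not part of any library):

```python
def vram_reduction(baseline_gb: float, quantized_gb: float) -> float:
    """Return the VRAM saving relative to the baseline, in percent."""
    return (1.0 - quantized_gb / baseline_gb) * 100.0


BASELINE_GB = 17.0  # full bfloat16, from the table above
modes = {"ao-int8": 9.4, "bnb-4bit": 6.2}

# Round to whole percent, as the table does.
reductions = {
    name: round(vram_reduction(BASELINE_GB, gb)) for name, gb in modes.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% less VRAM than bfloat16")
```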

Quick Start

Installation

pip install transformers bitsandbytes torch torchaudio
pip install git+https://github.com/vibevoice-community/VibeVoice.git

Usage

import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Load quantized model
model_id = "marksverdhai/vibevoice-7b-bnb-4bit"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_id,
    device_map={"": 0},  # Load on GPU 0
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
processor = VibeVoiceProcessor.from_pretrained(model_id)

model.eval()
model.set_ddpm_inference_steps(num_steps=10)

# Generate speech
text = "Speaker 1: Hello! This is VibeVoice, a state-of-the-art text-to-speech model."

inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=None,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        generation_config={"do_sample": False},
        verbose=False,
        is_prefill=False,
    )

# Get audio (cast to float32, which torchaudio.save handles reliably)
audio = outputs.speech_outputs[0].squeeze().float().cpu()
sample_rate = 24000

# Save to file
import torchaudio
torchaudio.save("output.wav", audio.unsqueeze(0), sample_rate)

Voice Cloning

# With voice reference
inputs = processor(
    text=["Speaker 1: Hello, I can clone any voice!"],
    voice_samples=[["path/to/reference.wav"]],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        is_prefill=True,  # Enable voice cloning
    )

Quality Verification

This model was tested with Whisper transcription to verify output quality:

Test Sentence	WER
"Hello, this is a test."	0%
"The quick brown fox jumps over the lazy dog."	0%
"Good morning, how are you today?"	0%
"Machine learning is transforming technology."	0%
"Please remember to save your work frequently."	0%

All five test sentences achieved a 0% word error rate (WER), matching the full-precision model on the same sentences.
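
For reference, WER is the word-level edit distance between the reference text and the transcript, divided by the number of reference words. The card does not say which Whisper model or text normalization was used, so the sketch below is only one plausible way to score a transcript: the `normalize` step (lowercasing, dropping punctuation) is an assumption, and the transcript is expected to come from whatever speech recognizer you run on the generated audio.

```python
import re


def normalize(text: str) -> list:
    """Lowercase, drop punctuation, split into words (assumed normalization)."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# A transcript that differs only in case and punctuation scores 0.0.
print(word_error_rate("Hello, this is a test.", "hello this is a test"))
```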

Quantization Details

This model uses bitsandbytes NF4 quantization:

NF4 (NormalFloat4): Optimized 4-bit data type for neural network weights
Double Quantization: Nested quantization for additional memory savings
Compute dtype: bfloat16 for computations
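
To make these terms concrete, here is a pure-Python sketch of the idea behind blockwise NF4 quantization. It is an illustration, not the bitsandbytes implementation: each block of weights is scaled by its absolute maximum, and each scaled weight is replaced by the index of the nearest of 16 fixed levels. The level values are the NF4 code book published with QLoRA, rounded to four decimals. The block size of 64 and the 256-block second-level grouping used in the bit count are the QLoRA paper's settings, taken here as assumptions about this checkpoint.

```python
# NF4 code book (QLoRA), rounded to 4 decimals for readability.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(weights):
    """Map one block of floats to (absmax scale, list of 4-bit indices)."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    indices = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / scale))
        for w in weights
    ]
    return scale, indices


def dequantize_block(scale, indices):
    """Recover approximate weights from the scale and the 4-bit indices."""
    return [NF4_LEVELS[i] * scale for i in indices]


scale, indices = quantize_block([0.5, -1.0, 0.0, 0.25])
restored = dequantize_block(scale, indices)

# Storage cost per weight, assuming 64-weight blocks:
# plain: one 32-bit scale per block
# double quantization: 8-bit scales plus one 32-bit scale per 256 blocks
bits_plain = 4 + 32 / 64
bits_double = 4 + 8 / 64 + 32 / (64 * 256)
print(indices, restored, bits_plain, round(bits_double, 3))
```

Double quantization saves memory by storing the per-block scales themselves in 8 bits, which under these assumptions cuts the overhead from 0.5 to about 0.127 bits per weight.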

Quantization is applied to the Qwen2 LLM backbone only. The following components are kept in full precision:

Audio tokenizers (semantic and acoustic)
Diffusion head
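
One way to check this split on a loaded model is to count which linear layers were replaced by bitsandbytes' `Linear4bit` class. The helper below is a sketch under assumptions: it relies only on `named_modules()` and on class names, it does not assume any particular VibeVoice submodule names, and `summarize_linear_layers` is a hypothetical helper rather than part of any library.

```python
from collections import Counter


def summarize_linear_layers(model, depth=2):
    """Count Linear vs Linear4bit layers, grouped by module-name prefix.

    `model` is any object exposing named_modules(), e.g. a torch.nn.Module.
    `depth` controls how many dotted name components form the group key.
    """
    counts = Counter()
    for name, module in model.named_modules():
        kind = type(module).__name__
        if kind in ("Linear", "Linear4bit"):
            prefix = ".".join(name.split(".")[:depth])
            counts[(prefix, kind)] += 1
    return counts
```

After loading the model as in Usage, `for key, n in sorted(summarize_linear_layers(model).items()): print(key, n)` lists each group. Quantized groups should show `Linear4bit` entries, and groups kept in full precision should show plain `Linear` entries.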

Limitations

Requires CUDA GPU with bitsandbytes support
Inference is roughly 1.3x slower than full precision
Model loading takes longer (~65 s vs ~24 s)
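
Because of the first limitation, it can help to check the environment before loading several gigabytes of weights. The following is a minimal preflight sketch. It only confirms that `torch` and `bitsandbytes` are installed and that PyTorch sees a CUDA device, and it does not prove that bitsandbytes supports your specific GPU. The function name `check_environment` is illustrative.

```python
import importlib.util


def check_environment():
    """Report whether the packages and GPU this model needs are present."""
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "bitsandbytes_installed": importlib.util.find_spec("bitsandbytes") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check works without torch

        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'MISSING'}")
```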

Citation

@misc{vibevoice2024,
  title={VibeVoice: Emotion-Aware Text-to-Speech},
  author={VibeVoice Team},
  year={2024},
  url={https://github.com/vibevoice-community/VibeVoice}
}

License

MIT