---
language:
- en
- zh
license: mit
tags:
- audio tokenizer
library_name: transformers
pipeline_tag: feature-extraction
---

# 🚨 _Note: This is a draft model card. Actual model links can be found in [this collection](https://huggingface.co/collections/bezzam/vibevoice)._

# VibeVoice-SemanticTokenizer

VibeVoice is a framework for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses key challenges in traditional Text-to-Speech (TTS) systems, particularly scalability, speaker consistency, and natural turn-taking.

A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers preserve audio fidelity while making long sequences much cheaper to process. VibeVoice employs a next-token diffusion framework: a Large Language Model (LLM) models the textual context and dialogue flow, and a diffusion head generates high-fidelity acoustic details.
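To make the 7.5 Hz figure concrete, the sketch below works out how many raw samples one latent frame summarizes and how many frames one minute of audio produces. It assumes 24 kHz input, which is the sampling rate used in this card's usage example; the helper names `samples_per_frame` and `frames_for_duration` are illustrative and not part of any library.

```python
import math

FRAME_RATE_HZ = 7.5        # latent frames per second, as stated above
SAMPLING_RATE_HZ = 24_000  # assumption: 24 kHz input audio


def samples_per_frame(sampling_rate_hz: int, frame_rate_hz: float) -> int:
    """Number of raw audio samples summarized by one latent frame."""
    return round(sampling_rate_hz / frame_rate_hz)


def frames_for_duration(duration_s: float, frame_rate_hz: float) -> int:
    """Number of latent frames needed to cover `duration_s` seconds."""
    return math.ceil(duration_s * frame_rate_hz)


one_minute_samples = SAMPLING_RATE_HZ * 60
print(samples_per_frame(SAMPLING_RATE_HZ, FRAME_RATE_HZ))  # 3200
print(one_minute_samples)                                  # 1440000
print(frames_for_duration(60, FRAME_RATE_HZ))              # 450
```

Under these assumptions, one minute of audio shrinks from 1,440,000 samples to 450 latent frames, which is what keeps long sequences tractable for the LLM.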

The model can synthesize speech up to **90 minutes** long with up to **4 distinct speakers**, surpassing the typical 1-2 speaker limits of many prior models.

➡️ **Technical Report:** [VibeVoice Technical Report](https://arxiv.org/abs/2508.19205)

➡️ **Project Page:** [microsoft/VibeVoice](https://microsoft.github.io/VibeVoice)

# Models

🚨 _Note: This is a draft model card. Actual model links can be found in [this collection](https://huggingface.co/collections/bezzam/vibevoice)._

| Model | Context Length | Generation Length | Weight |
|-------|----------------|-------------------|--------|
| VibeVoice-1.5B | 64K | ~90 min | [HF link](https://huggingface.co/microsoft/VibeVoice-1.5B) |
| VibeVoice-7B | 32K | ~45 min | [HF link](https://huggingface.co/microsoft/VibeVoice-7B) |
| VibeVoice-AcousticTokenizer | - | - | [HF link](https://huggingface.co/microsoft/VibeVoice-AcousticTokenizer) |
| VibeVoice-SemanticTokenizer | - | - | This model |
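As a rough sanity check on the table, the sketch below compares each model's generation length, expressed in 7.5 Hz speech frames, with its context length. It rests on two assumptions that this card does not state: that "64K" and "32K" mean multiples of 1024 tokens, and that each speech frame occupies one position in the LLM context. The remaining positions would then be available for text and other inputs.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in this card
K = 1024             # assumption: "K" in the table means 1024 tokens

# Values copied from the table above.
MODELS = {
    "VibeVoice-1.5B": {"context_tokens": 64 * K, "generation_min": 90},
    "VibeVoice-7B": {"context_tokens": 32 * K, "generation_min": 45},
}


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Latent frames needed for `minutes` of audio at the given frame rate."""
    return int(minutes * 60 * frame_rate_hz)


for name, spec in MODELS.items():
    frames = speech_frames(spec["generation_min"])
    # Assumption: one context position per speech frame.
    share = frames / spec["context_tokens"]
    print(f"{name}: {frames} frames, {share:.0%} of the context window")
```

Under these assumptions both models would spend roughly the same share of their context on speech frames, which is consistent with the generation length halving when the context length halves.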

# Usage

The example below encodes an audio clip to extract its semantic features:

```python
import torch
from transformers import AutoFeatureExtractor, VibeVoiceSemanticTokenizerModel
from transformers.audio_utils import load_audio_librosa


model_id = "bezzam/VibeVoice-SemanticTokenizer"
sampling_rate = 24000

# load audio, resampled to the rate the tokenizer expects
audio = load_audio_librosa(
    "https://hf.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav",
    sampling_rate=sampling_rate,
)

# load model
device = "cuda" if torch.cuda.is_available() else "cpu"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = VibeVoiceSemanticTokenizerModel.from_pretrained(
    model_id,
    device_map=device,
).eval()

# preprocess audio: pad to a multiple of 3200 samples
# (24000 Hz / 7.5 Hz = 3200 samples per latent frame)
inputs = feature_extractor(
    audio,
    sampling_rate=sampling_rate,
    padding=True,
    pad_to_multiple_of=3200,
    return_attention_mask=False,
    return_tensors="pt",
).to(device)
print("Input audio shape:", inputs.input_features.shape)
# Input audio shape: torch.Size([1, 1, 224000])

# encode without tracking gradients (inference only)
with torch.no_grad():
    encoded_outputs = model.encode(inputs.input_features)
print("Latent shape:", encoded_outputs.latents.shape)
# Latent shape: torch.Size([1, 70, 128])
```
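The printed shapes are consistent with one latent frame per 3200 input samples (224000 / 3200 = 70). The sketch below reproduces that bookkeeping in plain Python so the expected latent length can be estimated before running the model. `padded_length` and `expected_latent_frames` are hypothetical helpers, not part of `transformers`, and the unpadded length of 222,500 samples is an illustrative value rather than the actual length of the sample clip.

```python
HOP_SAMPLES = 3200  # matches pad_to_multiple_of in the usage example


def padded_length(num_samples: int, multiple: int = HOP_SAMPLES) -> int:
    """Round a sample count up to the next multiple of `multiple`."""
    return (num_samples + multiple - 1) // multiple * multiple


def expected_latent_frames(num_samples: int, hop: int = HOP_SAMPLES) -> int:
    """Latent frames expected after padding, assuming one frame per hop."""
    return padded_length(num_samples, hop) // hop


raw_samples = 222_500  # hypothetical unpadded clip length
print(padded_length(raw_samples))           # 224000
print(expected_latent_frames(raw_samples))  # 70
```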