---
language:
- en
- zh
license: mit
tags:
- audio tokenizer
library_name: transformers
pipeline_tag: feature-extraction
---
# 🚨 _Note: This is a draft model card. Actual model links can be found in [this collection](https://huggingface.co/collections/bezzam/vibevoice)._
# VibeVoice-SemanticTokenizer
VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.

The model can synthesize speech up to **90 minutes** long with up to **4 distinct speakers**, surpassing the typical 1-2 speaker limits of many prior models.
➡️ **Technical Report:** [VibeVoice Technical Report](https://arxiv.org/abs/2508.19205)

➡️ **Project Page:** [microsoft/VibeVoice](https://microsoft.github.io/VibeVoice)
# Models
| Model | Context Length | Generation Length | Weight |
|-------|----------------|----------|----------|
| VibeVoice-1.5B | 64K | ~90 min | [HF link](https://huggingface.co/microsoft/VibeVoice-1.5B) |
| VibeVoice-7B | 32K | ~45 min | [HF link](https://huggingface.co/microsoft/VibeVoice-7B) |
| VibeVoice-AcousticTokenizer | - | - | [HF link](https://huggingface.co/microsoft/VibeVoice-AcousticTokenizer) |
| VibeVoice-SemanticTokenizer | - | - | This model |
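The generation lengths in the table can be related to the 7.5 Hz frame rate with a little arithmetic. The sketch below is a back-of-the-envelope estimate, not the authors' accounting: it assumes that "64K" and "32K" mean 64 × 1024 and 32 × 1024 tokens, and that each speech frame occupies one position in the context window. The helper `speech_frames` is a hypothetical name introduced for illustration.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in this card


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


# (context length in tokens, generation length in minutes) from the table above.
# Assumption: "64K" / "32K" are read as 64 * 1024 / 32 * 1024 tokens.
MODELS = {
    "VibeVoice-1.5B": (64 * 1024, 90),
    "VibeVoice-7B": (32 * 1024, 45),
}

for name, (context_tokens, minutes) in MODELS.items():
    frames = speech_frames(minutes)
    share = frames / context_tokens  # assumes one context position per frame
    print(f"{name}: {frames} speech frames, {share:.0%} of the context window")
# VibeVoice-1.5B: 40500 speech frames, 62% of the context window
# VibeVoice-7B: 20250 speech frames, 62% of the context window
```

Under these assumptions, the speech frames alone fit within the stated context length for both models, leaving the remainder for the text script and any other tokens.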
# Usage
The example below encodes an audio file to extract semantic features:
```python
import torch
from transformers import AutoFeatureExtractor, VibeVoiceSemanticTokenizerModel
from transformers.audio_utils import load_audio_librosa

model_id = "bezzam/VibeVoice-SemanticTokenizer"
sampling_rate = 24000

# load audio
audio = load_audio_librosa(
    "https://hf.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav",
    sampling_rate=sampling_rate,
)

# load model
device = "cuda" if torch.cuda.is_available() else "cpu"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = VibeVoiceSemanticTokenizerModel.from_pretrained(
    model_id,
    device_map=device,
).eval()

# preprocess audio
inputs = feature_extractor(
    audio,
    sampling_rate=sampling_rate,
    padding=True,
    pad_to_multiple_of=3200,
    return_attention_mask=False,
    return_tensors="pt",
).to(device)
print("Input audio shape:", inputs.input_features.shape)
# Input audio shape: torch.Size([1, 1, 224000])

# encode
with torch.no_grad():
    encoded_outputs = model.encode(inputs.input_features)
print("Latent shape:", encoded_outputs.latents.shape)
# Latent shape: torch.Size([1, 70, 128])
```
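The shapes printed above are consistent with the 7.5 Hz frame rate: at 24 kHz, one latent frame spans 24000 / 7.5 = 3200 samples, which is the `pad_to_multiple_of` value, and 224000 / 3200 = 70 frames. The following sketch reproduces that arithmetic in plain Python. It assumes the encoder emits exactly one frame per 3200 padded samples with no edge effects; `padded_length` and `expected_latent_frames` are hypothetical helpers, not part of `transformers`.

```python
import math

SAMPLING_RATE = 24_000  # Hz, as in the example above
FRAME_RATE_HZ = 7.5     # tokenizer frame rate stated in this card

# Samples covered by one latent frame: 24000 / 7.5 = 3200,
# matching pad_to_multiple_of in the example above.
SAMPLES_PER_FRAME = int(SAMPLING_RATE / FRAME_RATE_HZ)


def padded_length(num_samples: int, multiple: int = SAMPLES_PER_FRAME) -> int:
    """Round a sample count up to the next multiple of `multiple`."""
    return math.ceil(num_samples / multiple) * multiple


def expected_latent_frames(num_samples: int) -> int:
    """Estimated number of latent frames for a clip of `num_samples` samples.

    Assumption: one frame per SAMPLES_PER_FRAME padded samples.
    """
    return padded_length(num_samples) // SAMPLES_PER_FRAME


print(SAMPLES_PER_FRAME)                # 3200
print(expected_latent_frames(224_000))  # 70
```

A check like this can help catch a mismatched sampling rate or padding setting before running the model, since the latent length should scale with the padded input length.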