	---
	language:
	- en
	- zh
	license: mit
	tags:
	- audio tokenizer
	library_name: transformers
	pipeline_tag: feature-extraction
	---

	# 🚨 _Note: This is a draft model card. Actual model links can be found in [this collection](https://huggingface.co/collections/bezzam/vibevoice)._

	# VibeVoice-SemanticTokenizer

	VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

	A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers preserve audio fidelity while making long sequences much cheaper to process. VibeVoice employs a next-token diffusion framework: a Large Language Model (LLM) models textual context and dialogue flow, and a diffusion head generates high-fidelity acoustic details.

	The model can synthesize speech up to 90 minutes long with up to 4 distinct speakers, surpassing the typical 1-2 speaker limits of many prior models.
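
To make the effect of the 7.5 Hz frame rate concrete, the sketch below counts how many tokenizer frames cover a given duration. The function name is made up for this card, and the 50 Hz comparison rate is an arbitrary illustrative value, not a figure from the VibeVoice report.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in the description above


def frames_for_duration(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames that cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


for minutes in (1, 45, 90):
    low = frames_for_duration(minutes)
    # 50 Hz is a hypothetical comparison rate, chosen only for illustration
    high = frames_for_duration(minutes, frame_rate_hz=50.0)
    print(f"{minutes:>2} min: {low:>6,} frames at 7.5 Hz vs {high:>7,} at 50 Hz")
```

At 7.5 Hz, 90 minutes of audio corresponds to 40,500 frames per tokenizer stream.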

	➡️ Technical Report: [VibeVoice Technical Report](https://arxiv.org/abs/2508.19205)

	➡️ Project Page: [microsoft/VibeVoice](https://microsoft.github.io/VibeVoice)


	# Models

	| Model | Context Length | Generation Length | Weight |
	|-------|----------------|-------------------|--------|
	| VibeVoice-1.5B | 64K | ~90 min | [HF link](https://huggingface.co/microsoft/VibeVoice-1.5B) |
	| VibeVoice-7B | 32K | ~45 min | [HF link](https://huggingface.co/microsoft/VibeVoice-7B) |
	| VibeVoice-AcousticTokenizer | - | - | [HF link](https://huggingface.co/microsoft/VibeVoice-AcousticTokenizer) |
	| VibeVoice-SemanticTokenizer | - | - | This model |


	# Usage

	The example below encodes an audio clip to extract semantic features:

	```python
	import torch
	from transformers import AutoFeatureExtractor, VibeVoiceSemanticTokenizerModel
	from transformers.audio_utils import load_audio_librosa


	model_id = "bezzam/VibeVoice-SemanticTokenizer"
	sampling_rate = 24000

	# load audio, resampled to the rate used in this example
	audio = load_audio_librosa(
	    "https://hf.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav",
	    sampling_rate=sampling_rate,
	)

	# load model in inference mode on GPU if available, otherwise CPU
	device = "cuda" if torch.cuda.is_available() else "cpu"
	feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
	model = VibeVoiceSemanticTokenizerModel.from_pretrained(
	    model_id,
	    device_map=device,
	).eval()

	# preprocess audio: pad to a multiple of 3200 samples
	# (24000 Hz / 7.5 Hz = 3200 samples per latent frame)
	inputs = feature_extractor(
	    audio,
	    sampling_rate=sampling_rate,
	    padding=True,
	    pad_to_multiple_of=3200,
	    return_attention_mask=False,
	    return_tensors="pt",
	).to(device)
	print("Input audio shape:", inputs.input_features.shape)
	# Input audio shape: torch.Size([1, 1, 224000])

	# encode without tracking gradients
	with torch.no_grad():
	    encoded_outputs = model.encode(inputs.input_features)
	print("Latent shape:", encoded_outputs.latents.shape)
	# Latent shape: torch.Size([1, 70, 128])
	```
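
The shapes printed above are consistent with one latent frame per 3200 input samples (24000 Hz / 7.5 Hz), since 224000 / 3200 = 70. The sketch below reproduces that arithmetic in plain Python so the padded length and the number of latent frames can be estimated before running the model. The helper names and the unpadded length of 222,500 samples are hypothetical, and the one-frame-per-3200-samples relation is an assumption inferred from the example output rather than a documented guarantee.

```python
SAMPLING_RATE = 24_000  # sampling rate used in the example above
FRAME_RATE_HZ = 7.5     # tokenizer frame rate from the model description
SAMPLES_PER_FRAME = int(SAMPLING_RATE / FRAME_RATE_HZ)  # 3200


def padded_length(num_samples: int, multiple: int = SAMPLES_PER_FRAME) -> int:
    """Round `num_samples` up to the next multiple of `multiple`."""
    return -(-num_samples // multiple) * multiple  # integer ceiling division


def expected_frames(num_samples: int, multiple: int = SAMPLES_PER_FRAME) -> int:
    """Latent frames expected after padding, assuming one frame per `multiple` samples."""
    return padded_length(num_samples, multiple) // multiple


raw_samples = 222_500  # hypothetical unpadded clip length
print(padded_length(raw_samples))    # 224000
print(expected_frames(raw_samples))  # 70
```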