Initial commit

Files changed (5)

README.md +176 -0
config.json +36 -0
figs/tokenizer_comparison.png +0 -0
model.safetensors +3 -0
preprocessor_config.json +11 -0

README.md ADDED

	@@ -0,0 +1,176 @@

+---
+license: mit
+tags:
+- audio tokenizer
+library_name: transformers
+pipeline_tag: feature-extraction
+---
+# VibeVoice Acoustic Tokenizer
+VibeVoice is a framework for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly scalability, speaker consistency, and natural turn-taking.
+A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz, i.e. one latent frame per 3200 samples of 24 kHz audio. This low rate preserves audio fidelity while keeping long sequences short enough to process efficiently. VibeVoice employs a next-token diffusion framework: a Large Language Model (LLM) models textual context and dialogue flow, and a diffusion head generates high-fidelity acoustic details.
+The speech tokenizer is a key component for both VibeVoice [TTS](https://huggingface.co/microsoft/VibeVoice-1.5B) and [ASR](https://huggingface.co/microsoft/VibeVoice-ASR).
+➡️ **Technical Report:** [VibeVoice Technical Report](https://arxiv.org/abs/2508.19205)
+➡️ **Project Page:** [microsoft/VibeVoice](https://microsoft.github.io/VibeVoice)
+<p align="left">
+  <img src="figs/tokenizer_comparison.png" alt="Tokenizer Comparison" height="250px">
+</p>
+# Models
+| Model | Context Length | Audio Length | Weights |
+|-------|----------------|----------|----------|
+| VibeVoice-Realtime-0.5B | 8K | ~10 min | [HF link](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) |
+| VibeVoice-1.5B | 64K | ~90 min | [HF link](https://huggingface.co/microsoft/VibeVoice-1.5B) |
+| VibeVoice-ASR | 64K | ~60 min | [HF link](https://huggingface.co/microsoft/VibeVoice-ASR) |
+| VibeVoice-AcousticTokenizer | - | - | This model |
+# Usage
+## Setup
+Until the VibeVoice acoustic tokenizer is part of an official Transformers release, install Transformers from source:
+```bash
+pip install git+https://github.com/huggingface/transformers.git
+```
+## Example
+<details>
+  <summary>Encoding and decoding</summary>
+```python
+import torch
+from scipy.io import wavfile
+from transformers import AutoFeatureExtractor, VibeVoiceAcousticTokenizerModel
+from transformers.audio_utils import load_audio_librosa
+model_id = "microsoft/VibeVoice-AcousticTokenizer"
+# load model
+feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
+model = VibeVoiceAcousticTokenizerModel.from_pretrained(model_id, device_map="auto")
+print("Model loaded on device:", model.device)
+print("Model dtype:", model.dtype)
+# load audio
+audio = load_audio_librosa(
+    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav",
+    sampling_rate=feature_extractor.sampling_rate,
+)
+# preprocess audio
+inputs = feature_extractor(
+    audio,
+    sampling_rate=feature_extractor.sampling_rate,
+    pad_to_multiple_of=3200,
+).to(model.device, model.dtype)
+print("Input audio shape:", inputs.input_values.shape)
+# Input audio shape: torch.Size([1, 1, 224000])
+with torch.no_grad():
+    # set VAE sampling to False for deterministic output
+    encoded_outputs = model.encode(inputs.input_values, sample=False)
+    print("Latent shape:", encoded_outputs.latents.shape)
+    # Latent shape: torch.Size([1, 70, 64])
+    decoded_outputs = model.decode(**encoded_outputs)
+    print("Reconstructed audio shape:", decoded_outputs.audio.shape)
+    # Reconstructed audio shape: torch.Size([1, 1, 224000])
+# Save audio
+output_fp = "vibevoice_acoustic_tokenizer_reconstructed.wav"
+wavfile.write(output_fp, feature_extractor.sampling_rate, decoded_outputs.audio.squeeze().float().cpu().numpy())
+print(f"Reconstructed audio saved to: {output_fp}")
+```
+</details>
+**Original audio**
+<audio controls>
+  <source src="https://hf.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav" type="audio/wav">
+</audio>
+**Encoded/decoded audio**
+<audio controls>
+  <source src="https://hf.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/vibevoice_acoustic_tokenizer_reconstructed.wav" type="audio/wav">
+</audio>
+<details>
+  <summary>Streaming</summary>
+For streaming ASR or TTS, cached states must be carried from one chunk to the next. Pass `use_cache=True` when encoding or decoding, and pass the returned `padding_cache` back in on the next call:
+```python
+import torch
+from scipy.io import wavfile
+from transformers import AutoFeatureExtractor, VibeVoiceAcousticTokenizerModel
+from transformers.audio_utils import load_audio_librosa
+model_id = "microsoft/VibeVoice-AcousticTokenizer"
+# load model
+feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
+model = VibeVoiceAcousticTokenizerModel.from_pretrained(model_id, device_map="auto")
+print("Model loaded on device:", model.device)
+print("Model dtype:", model.dtype)
+# load audio
+audio = load_audio_librosa(
+    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav",
+    sampling_rate=feature_extractor.sampling_rate,
+)
+# preprocess audio
+inputs = feature_extractor(
+    audio,
+    sampling_rate=feature_extractor.sampling_rate,
+    pad_to_multiple_of=3200,
+).to(model.device, model.dtype)
+print("Input audio shape:", inputs.input_values.shape)
+# Input audio shape: torch.Size([1, 1, 224000])
+# caches are initialized on the first pass
+encoder_cache = None
+decoder_cache = None
+with torch.no_grad():
+    # set VAE sampling to False for deterministic output
+    encoded_outputs = model.encode(inputs.input_values, sample=False, padding_cache=encoder_cache, use_cache=True)
+    print("Latent shape:", encoded_outputs.latents.shape)
+    # Latent shape: torch.Size([1, 70, 64])
+    decoded_outputs = model.decode(encoded_outputs.latents, padding_cache=decoder_cache, use_cache=True)
+    print("Reconstructed audio shape:", decoded_outputs.audio.shape)
+    # Reconstructed audio shape: torch.Size([1, 1, 224000])
+    # `padding_cache` can be extracted from the outputs for subsequent passes
+    encoder_cache = encoded_outputs.padding_cache
+    print("Number of cached encoder layers:", len(encoder_cache.per_layer_in_channels))
+    # Number of cached encoder layers: 34
+    decoder_cache = decoded_outputs.padding_cache
+    print("Number of cached decoder layers:", len(decoder_cache.per_layer_in_channels))
+    # Number of cached decoder layers: 34
+# Save audio
+output_fp = "vibevoice_acoustic_tokenizer_reconstructed.wav"
+wavfile.write(output_fp, feature_extractor.sampling_rate, decoded_outputs.audio.squeeze().float().cpu().numpy())
+print(f"Reconstructed audio saved to: {output_fp}")
+```
+</details>
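
The streaming snippet above runs a single pass over the whole clip. For real chunked streaming, the waveform has to be cut into pieces first. A minimal sketch of that slicing, assuming chunk boundaries should fall on multiples of 3200 samples (the `pad_to_multiple_of` value used above); `chunk_bounds` is a hypothetical helper, not part of Transformers:

```python
def chunk_bounds(num_samples, frames_per_chunk, hop=3200):
    """Yield (start, end) sample indices covering num_samples in hop-aligned chunks."""
    if num_samples % hop != 0:
        raise ValueError("pad the audio to a multiple of `hop` first")
    step = frames_per_chunk * hop
    for start in range(0, num_samples, step):
        # The final chunk may be shorter than `step`, but stays hop-aligned.
        yield start, min(start + step, num_samples)


# 224000 samples (the padded example clip), 15 latent frames (2 s) per chunk.
bounds = list(chunk_bounds(224_000, frames_per_chunk=15))
print(len(bounds), bounds[0], bounds[-1])  # 5 (0, 48000) (192000, 224000)
```

Each slice `inputs.input_values[..., start:end]` could then be passed to `model.encode(..., padding_cache=encoder_cache, use_cache=True)`, updating `encoder_cache` from the returned `padding_cache` after every call, as in the README pattern. Whether chunked output matches the single-pass output exactly is not stated in the source.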

config.json ADDED

	@@ -0,0 +1,36 @@

+{
+  "architectures": [
+    "VibeVoiceAcousticTokenizerModel"
+  ],
+  "channels": 1,
+  "depths": [
+    3,
+    3,
+    3,
+    3,
+    3,
+    3,
+    8
+  ],
+  "downsampling_ratios": [
+    2,
+    2,
+    4,
+    5,
+    5,
+    8
+  ],
+  "dtype": "bfloat16",
+  "ffn_expansion": 4,
+  "hidden_act": "gelu",
+  "hidden_size": 64,
+  "initializer_range": 0.01,
+  "kernel_size": 7,
+  "layer_scale_init_value": 1e-06,
+  "model_type": "vibevoice_acoustic_tokenizer",
+  "num_filters": 32,
+  "rms_norm_eps": 1e-05,
+  "transformers_version": "5.0.1.dev0",
+  "vae_std": 0.625,
+  "weight_init_value": 0.01
+}
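
The `downsampling_ratios` above tie directly to the 7.5 Hz frame rate quoted in the README. A minimal sketch of the arithmetic, assuming the encoder's total stride is the product of these ratios (this agrees with `pad_to_multiple_of=3200` and the shapes printed in the README example); the helper names are illustrative and not part of Transformers:

```python
import math

# Values copied from config.json and preprocessor_config.json in this commit.
DOWNSAMPLING_RATIOS = [2, 2, 4, 5, 5, 8]
SAMPLING_RATE = 24_000  # Hz


def hop_length(ratios):
    """Samples per latent frame, assuming the per-stage strides simply multiply."""
    return math.prod(ratios)


def padded_length(num_samples, multiple):
    """Smallest length >= num_samples that is a multiple of `multiple`."""
    return -(-num_samples // multiple) * multiple


hop = hop_length(DOWNSAMPLING_RATIOS)
frame_rate = SAMPLING_RATE / hop
num_frames = padded_length(224_000, hop) // hop

print(hop, frame_rate, num_frames)  # 3200 7.5 70
```

The 70 frames match the `Latent shape: torch.Size([1, 70, 64])` comment in the README, where the last dimension is `hidden_size`.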

figs/tokenizer_comparison.png ADDED

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3acc2dcc75c6b18dffdc74e9ec7a79ea3849ccf69323499fd9bf54209e531a6a
+size 1374847314
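
The pointer's `size` field gives a rough parameter count. A back-of-the-envelope sketch, assuming every tensor is stored in `bfloat16` (2 bytes per value, per the `dtype` in `config.json`) and ignoring the small safetensors header:

```python
SIZE_BYTES = 1_374_847_314  # `size` from the Git LFS pointer
BYTES_PER_PARAM = 2         # bfloat16, assumed for all tensors

approx_params = SIZE_BYTES // BYTES_PER_PARAM
print(f"~{approx_params / 1e6:.0f}M parameters")  # ~687M parameters
```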

preprocessor_config.json ADDED

	@@ -0,0 +1,11 @@

+{
+  "eps": 1e-06,
+  "feature_extractor_type": "VibeVoiceAcousticTokenizerFeatureExtractor",
+  "feature_size": 1,
+  "normalize_audio": true,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "return_attention_mask": true,
+  "sampling_rate": 24000,
+  "target_dB_FS": -25
+}
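
With `normalize_audio` enabled, the feature extractor presumably rescales each waveform toward the `target_dB_FS` level before encoding. The sketch below shows one common way to do this (an RMS-based gain); it is an assumption about the behavior, not the extractor's actual code, and `normalize_rms` is a hypothetical helper:

```python
import math

TARGET_DB_FS = -25  # `target_dB_FS` from preprocessor_config.json
EPS = 1e-6          # `eps` from preprocessor_config.json


def normalize_rms(samples, target_db_fs=TARGET_DB_FS, eps=EPS):
    """Scale samples so their RMS level is approximately target_db_fs dBFS."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    # eps keeps the gain finite for (near-)silent input.
    gain = 10 ** (target_db_fs / 20) / (rms + eps)
    return [x * gain for x in samples]


# One second of a 440 Hz sine at 24 kHz with peak amplitude 0.5.
wave = [0.5 * math.sin(2 * math.pi * 440 * n / 24_000) for n in range(24_000)]
out = normalize_rms(wave)
out_rms = math.sqrt(sum(x * x for x in out) / len(out))
print(f"{20 * math.log10(out_rms):.2f} dBFS")  # -25.00 dBFS
```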