frontierai committed on · Commit f0643be · verified · 1 Parent(s): e792181

Initial commit
README.md ADDED
---
license: mit
tags:
- audio tokenizer
library_name: transformers
pipeline_tag: feature-extraction
---

# VibeVoice Acoustic Tokenizer

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.

The speech tokenizer is a key component of both VibeVoice [TTS](https://huggingface.co/microsoft/VibeVoice-1.5B) and [ASR](https://huggingface.co/microsoft/VibeVoice-ASR).

➡️ **Technical Report:** [VibeVoice Technical Report](https://arxiv.org/abs/2508.19205)

➡️ **Project Page:** [microsoft/VibeVoice](https://microsoft.github.io/VibeVoice)

<p align="left">
  <img src="figs/tokenizer_comparison.png" alt="Tokenizer Comparison" height="250px">
</p>
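As a quick sanity check, the 7.5 Hz frame rate follows directly from the tokenizer configuration: the product of the `downsampling_ratios` listed in this repository's `config.json` (2·2·4·5·5·8 = 3200) divides the 24 kHz sampling rate:

```python
# Derive the tokenizer frame rate from the values in this repo's config files.
sampling_rate = 24_000                    # "sampling_rate" in preprocessor_config.json
downsampling_ratios = [2, 2, 4, 5, 5, 8]  # "downsampling_ratios" in config.json

total_downsampling = 1
for r in downsampling_ratios:
    total_downsampling *= r
print(total_downsampling)                 # 3200 samples per latent frame

frame_rate = sampling_rate / total_downsampling
print(frame_rate)                         # 7.5 Hz
```

This is also why the examples below pad audio to a multiple of 3200 samples: each latent frame consumes exactly 3200 input samples.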

# Models

| Model | Context Length | Max Audio Length | Weights |
|-------|----------------|------------------|---------|
| VibeVoice-Realtime-0.5B | 8K | ~10 min | [HF link](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) |
| VibeVoice-1.5B | 64K | ~90 min | [HF link](https://huggingface.co/microsoft/VibeVoice-1.5B) |
| VibeVoice-ASR | 64K | ~60 min | [HF link](https://huggingface.co/microsoft/VibeVoice-ASR) |
| VibeVoice-AcousticTokenizer | - | - | This model |

# Usage

## Setup

Until the VibeVoice acoustic tokenizer is part of an official Transformers release, install Transformers from source:

```bash
pip install git+https://github.com/huggingface/transformers.git
```

## Example

<details>
<summary>Encoding and decoding</summary>

```python
import torch
from scipy.io import wavfile

from transformers import AutoFeatureExtractor, VibeVoiceAcousticTokenizerModel
from transformers.audio_utils import load_audio_librosa


model_id = "microsoft/VibeVoice-AcousticTokenizer"

# load model
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = VibeVoiceAcousticTokenizerModel.from_pretrained(model_id, device_map="auto")
print("Model loaded on device:", model.device)
print("Model dtype:", model.dtype)

# load audio
audio = load_audio_librosa(
    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav",
    sampling_rate=feature_extractor.sampling_rate,
)

# preprocess audio
inputs = feature_extractor(
    audio,
    sampling_rate=feature_extractor.sampling_rate,
    pad_to_multiple_of=3200,
).to(model.device, model.dtype)
print("Input audio shape:", inputs.input_values.shape)
# Input audio shape: torch.Size([1, 1, 224000])

with torch.no_grad():
    # set VAE sampling to False for deterministic output
    encoded_outputs = model.encode(inputs.input_values, sample=False)
    print("Latent shape:", encoded_outputs.latents.shape)
    # Latent shape: torch.Size([1, 70, 64])

    decoded_outputs = model.decode(**encoded_outputs)
    print("Reconstructed audio shape:", decoded_outputs.audio.shape)
    # Reconstructed audio shape: torch.Size([1, 1, 224000])

# Save audio
output_fp = "vibevoice_acoustic_tokenizer_reconstructed.wav"
wavfile.write(output_fp, feature_extractor.sampling_rate, decoded_outputs.audio.squeeze().float().cpu().numpy())
print(f"Reconstructed audio saved to: {output_fp}")
```

</details>

**Original audio**
<audio controls>
  <source src="https://hf.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav" type="audio/wav">
</audio>

**Encoded/decoded audio**
<audio controls>
  <source src="https://hf.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/vibevoice_acoustic_tokenizer_reconstructed.wav" type="audio/wav">
</audio>

<details>
<summary>Streaming</summary>

For streaming ASR or TTS, where cached states need to be tracked, the `use_cache` parameter can be used when encoding or decoding audio:

```python
import torch
from scipy.io import wavfile

from transformers import AutoFeatureExtractor, VibeVoiceAcousticTokenizerModel
from transformers.audio_utils import load_audio_librosa


model_id = "microsoft/VibeVoice-AcousticTokenizer"

# load model
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = VibeVoiceAcousticTokenizerModel.from_pretrained(model_id, device_map="auto")
print("Model loaded on device:", model.device)
print("Model dtype:", model.dtype)

# load audio
audio = load_audio_librosa(
    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav",
    sampling_rate=feature_extractor.sampling_rate,
)

# preprocess audio
inputs = feature_extractor(
    audio,
    sampling_rate=feature_extractor.sampling_rate,
    pad_to_multiple_of=3200,
).to(model.device, model.dtype)
print("Input audio shape:", inputs.input_values.shape)
# Input audio shape: torch.Size([1, 1, 224000])

# caches will be initialized after a first pass
encoder_cache = None
decoder_cache = None
with torch.no_grad():
    # set VAE sampling to False for deterministic output
    encoded_outputs = model.encode(inputs.input_values, sample=False, padding_cache=encoder_cache, use_cache=True)
    print("Latent shape:", encoded_outputs.latents.shape)
    # Latent shape: torch.Size([1, 70, 64])

    decoded_outputs = model.decode(encoded_outputs.latents, padding_cache=decoder_cache, use_cache=True)
    print("Reconstructed audio shape:", decoded_outputs.audio.shape)
    # Reconstructed audio shape: torch.Size([1, 1, 224000])

# `padding_cache` can be extracted from the outputs for subsequent passes
encoder_cache = encoded_outputs.padding_cache
print("Number of cached encoder layers:", len(encoder_cache.per_layer_in_channels))
# Number of cached encoder layers: 34
decoder_cache = decoded_outputs.padding_cache
print("Number of cached decoder layers:", len(decoder_cache.per_layer_in_channels))
# Number of cached decoder layers: 34

# Save audio
output_fp = "vibevoice_acoustic_tokenizer_reconstructed.wav"
wavfile.write(output_fp, feature_extractor.sampling_rate, decoded_outputs.audio.squeeze().float().cpu().numpy())
print(f"Reconstructed audio saved to: {output_fp}")
```

</details>
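In a real streaming loop, each chunk fed to `encode`/`decode` should cover a whole number of latent frames (1 frame = 3200 samples at 24 kHz). Below is a minimal sketch of the chunk-planning arithmetic only; it needs no model, and the 10-frames-per-chunk size is an illustrative assumption, not a library constant:

```python
# Sketch: plan streaming chunks so each chunk is a whole number of
# tokenizer frames (1 latent frame = 3200 input samples at 24 kHz).
SAMPLES_PER_FRAME = 3200
CHUNK_FRAMES = 10  # illustrative: ~1.33 s of audio per chunk

def plan_chunks(num_samples: int, chunk_frames: int = CHUNK_FRAMES):
    """Yield (start, end) sample ranges, each a multiple of the frame size."""
    chunk_samples = chunk_frames * SAMPLES_PER_FRAME
    for start in range(0, num_samples, chunk_samples):
        yield start, min(start + chunk_samples, num_samples)

# Same padded length as the example above: 224000 samples -> 70 latent frames.
chunks = list(plan_chunks(224_000))
print(len(chunks))  # 7 chunks
print(sum(e - s for s, e in chunks) // SAMPLES_PER_FRAME)  # 70 frames total
```

Each planned slice of `inputs.input_values` would then be passed through `model.encode(..., padding_cache=encoder_cache, use_cache=True)` as in the streaming example, carrying the returned `padding_cache` forward between chunks.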
config.json ADDED
{
  "architectures": [
    "VibeVoiceAcousticTokenizerModel"
  ],
  "channels": 1,
  "depths": [
    3,
    3,
    3,
    3,
    3,
    3,
    8
  ],
  "downsampling_ratios": [
    2,
    2,
    4,
    5,
    5,
    8
  ],
  "dtype": "bfloat16",
  "ffn_expansion": 4,
  "hidden_act": "gelu",
  "hidden_size": 64,
  "initializer_range": 0.01,
  "kernel_size": 7,
  "layer_scale_init_value": 1e-06,
  "model_type": "vibevoice_acoustic_tokenizer",
  "num_filters": 32,
  "rms_norm_eps": 1e-05,
  "transformers_version": "5.0.1.dev0",
  "vae_std": 0.625,
  "weight_init_value": 0.01
}
figs/tokenizer_comparison.png ADDED
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:3acc2dcc75c6b18dffdc74e9ec7a79ea3849ccf69323499fd9bf54209e531a6a
size 1374847314
preprocessor_config.json ADDED
{
  "eps": 1e-06,
  "feature_extractor_type": "VibeVoiceAcousticTokenizerFeatureExtractor",
  "feature_size": 1,
  "normalize_audio": true,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 24000,
  "target_dB_FS": -25
}
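The `normalize_audio` and `target_dB_FS` entries indicate the feature extractor rescales incoming audio to a target RMS level of -25 dBFS before encoding. Here is a minimal sketch of that kind of loudness normalization, as an illustration of the concept rather than the extractor's exact implementation (the function name is hypothetical):

```python
import numpy as np

def normalize_to_dbfs(audio: np.ndarray, target_db_fs: float = -25.0, eps: float = 1e-6) -> np.ndarray:
    """Rescale audio so its RMS level sits at target_db_fs (dB relative to full scale)."""
    rms = np.sqrt(np.mean(audio**2))
    target_rms = 10.0 ** (target_db_fs / 20.0)  # -25 dBFS -> ~0.056 linear
    return audio * (target_rms / (rms + eps))

# a quiet 440 Hz tone at 24 kHz
t = np.arange(24000) / 24000.0
audio = 0.01 * np.sin(2 * np.pi * 440 * t)
normalized = normalize_to_dbfs(audio)
rms_db = 20 * np.log10(np.sqrt(np.mean(normalized**2)))
print(round(rms_db, 2))  # ~ -25.0
```

The `eps` term (mirroring the `eps` field above) guards against division by zero on silent input.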