## VibeVoice-ASR (Transformers-compatible version)

[![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/microsoft/VibeVoice)
[![Deploy on Foundry](https://img.shields.io/badge/Deploy-on_Foundry-blue?logo=microsoft)](https://huggingface.co/docs/microsoft-azure/foundry/guides/vibevoice-asr)
[![Live Playground](https://img.shields.io/badge/Live-Playground-green?logo=gradio)](https://aka.ms/vibevoice-asr)
[![Technical Report](https://img.shields.io/badge/arXiv-2601.18184-b31b1b?logo=arxiv)](https://arxiv.org/pdf/2601.18184)

**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords** and over **50 languages**.

➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)<br>
➡️ **Deploy:** [Deploy VibeVoice ASR on Microsoft Foundry: Long-Form Transcription, 50+ Languages Supported & Speaker Diarization](https://huggingface.co/docs/microsoft-azure/foundry/guides/vibevoice-asr)<br>
➡️ **Report:** [VibeVoice-ASR Technical Report](https://arxiv.org/pdf/2601.18184)<br>

<p align="left">
VibeVoice-ASR is available as of v5.3.0 of Transformers!

```bash
pip install "transformers>=5.3.0"
```

### Loading model

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id)
```

### Speaker-timestamped transcription

A notable feature of VibeVoice ASR is its ability to transcribe multi-speaker content, denoting who spoke and when.

The example below transcribes the following audio.

<audio controls>
<source src="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav" type="audio/wav">
</audio>

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id)
print(f"Model loaded on {model.device} with dtype {model.dtype}")

# Prepare inputs using `apply_transcription_request`
inputs = processor.apply_transcription_request(
    audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
).to(model.device, model.dtype)

# Apply model
output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")
print(transcription)

"""
...
TRANSCRIPTION ONLY
============================================================
Hello everyone and welcome to the Vibe Voice podcast. I'm your host, Alex, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Sam here to talk about it with me. Thanks so much for having me, Alex. And you're absolutely right. This question always brings out some seriously strong feelings. Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the finals, six championships. That kind of perfection is just incredible. Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it.
"""
```

The VibeVoice ASR model is trained to generate a string that resembles a JSON structure. The flag `return_format="parsed"` attempts to return the generated output as a list of dicts, while `return_format="transcription_only"` attempts to extract only the transcribed text. If either fails, the generated output is returned as-is.
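
To make the distinction concrete, here is an illustrative sketch (not the processor's actual implementation) of what the two return formats amount to, using a sample generated string in the model's JSON-like segment format:

```python
import json

# A sample generated string in the model's JSON-like segment format.
generated = '[{"Start":0.0,"End":7.56,"Speaker":0,"Content":"VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker conversational audio."}]'

# Roughly what `return_format="parsed"` yields: a list of segment dicts.
segments = json.loads(generated)

# Roughly what `return_format="transcription_only"` yields: only the spoken content.
transcript = " ".join(seg["Content"] for seg in segments)

print(segments[0]["Speaker"], segments[0]["Start"], segments[0]["End"])
print(transcript)
```

If parsing fails, the decoder simply falls back to returning the raw string.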

### Providing context

It is also possible to provide context, which can be useful when certain words, such as proper nouns, would otherwise be transcribed incorrectly.

Below we transcribe an audio clip in which the speaker (with a German accent) talks about VibeVoice, comparing the results with and without the context "About VibeVoice".

<audio controls>
<source src="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav" type="audio/wav">
</audio>

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id)
print(f"Model loaded on {model.device} with dtype {model.dtype}")

# Without context
inputs = processor.apply_transcription_request(
    audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
).to(model.device, model.dtype)
output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")
print(f"WITHOUT CONTEXT: {transcription}")

# With context
inputs = processor.apply_transcription_request(
    audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
    prompt="About VibeVoice",
).to(model.device, model.dtype)
output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")
print(f"WITH CONTEXT   : {transcription}")

"""
WITHOUT CONTEXT: Revevoices is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio.
WITH CONTEXT   : VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker, conversational audio.
"""
```

### Batch inference

Batch inference is possible by passing a list of audio inputs and, optionally, a list of prompts of equal length.

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
audio = [
    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
]
prompts = ["About VibeVoice", None]

...

generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")

print(transcription)
```

### Adjusting the tokenizer chunk (e.g. if out of memory)

A key feature of VibeVoice ASR is that it can transcribe up to 60 minutes of continuous audio. This is done by chunking the audio into 60-second segments (1,440,000 samples at 24 kHz) and caching the convolution states between segments.

If 60-second chunks are too large for your device, the `tokenizer_chunk_size` argument passed to `generate` can be reduced. *Note that it should be a multiple of the hop length (3200 for the original acoustic tokenizer).*
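
As a quick sanity check on these numbers (plain arithmetic, no model required):

```python
SAMPLE_RATE = 24_000  # Hz, the acoustic tokenizer's sampling rate
HOP_LENGTH = 3_200    # samples per frame for the original acoustic tokenizer

default_chunk = 60 * SAMPLE_RATE        # one 60-second segment
assert default_chunk == 1_440_000       # matches the documented default
assert default_chunk % HOP_LENGTH == 0  # a whole number of frames (450)

smaller_chunk = 64_000                  # a reduced chunk for low-memory devices
assert smaller_chunk % HOP_LENGTH == 0  # still a whole number of frames (20)

print(default_chunk // HOP_LENGTH, smaller_chunk // HOP_LENGTH)
```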

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

tokenizer_chunk_size = 64000  # default is 1440000 (60s @ 24kHz)
model_id = "microsoft/VibeVoice-ASR-HF"
audio = [
    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
]
prompts = ["About VibeVoice", None]

...

output_ids = model.generate(**inputs, tokenizer_chunk_size=tokenizer_chunk_size)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")
print(transcription)
```

### Chat template

VibeVoice ASR also accepts chat-template inputs (`apply_transcription_request` is in fact a convenience wrapper around `apply_chat_template`):

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)

...

chat_template = [
    ...
            {"type": "text", "text": "About VibeVoice"},
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
            },
        ],
    }
    ...
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
            },
        ],
    }
]

...

output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")
print(transcription)
```

### Training

VibeVoice ASR can be trained with the loss returned by the model.

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)

...

chat_template = [
    ...
            {"type": "text", "text": "VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker, conversational audio."},
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
            },
        ],
    }
    ...
            {"type": "text", "text": "Hello everyone and welcome to the VibeVoice podcast. I'm your host, Alex, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Sam here to talk about it with me. Thanks so much for having me, Alex. And you're absolutely right. This question always brings out some seriously strong feelings. Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the finals, six championships. That kind of perfection is just incredible. Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it."},
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
            },
        ],
    }
]

inputs = processor.apply_chat_template(
    ...
)

loss = model(**inputs).loss
print("Loss:", loss.item())
loss.backward()
```

### Torch compile

The model can be compiled for faster inference/training.

```python
import time

import torch
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

...

chat_template = [
    ...
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
            },
        ],
    }
]

...

print(f"Average time with compile: {compile_time:.4f}s")

speedup = no_compile_time / compile_time
print(f"\nSpeedup: {speedup:.2f}x")
```

### Pipeline usage

The model can be used through a pipeline, but you will have to define your own methods for parsing the raw output.

```python
from transformers import pipeline

model_id = "microsoft/VibeVoice-ASR-HF"
pipe = pipeline("any-to-any", model=model_id, device_map="auto")

chat_template = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "About VibeVoice"},
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
            },
        ],
    }
]

...

print(outputs)

"""
============================================================
RAW PIPELINE OUTPUT
============================================================
[{'input_text': [{'role': 'user', 'content': [{'type': 'text', 'text': 'About VibeVoice'}, {'type': 'audio', 'path': 'https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav'}]}], 'generated_text': 'assistant\n[{"Start":0.0,"End":7.56,"Speaker":0,"Content":"VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker conversational audio."}]\n'}]
"""
```
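
One illustrative way to turn the raw `generated_text` into segments (a hand-rolled sketch; `parse_generated_text` is not part of the library):

```python
import json

def parse_generated_text(generated_text: str):
    """Strip the leading role tag, then parse the JSON-like segment list."""
    payload = generated_text.strip()
    if payload.startswith("assistant"):
        payload = payload[len("assistant"):].lstrip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return payload  # fall back to the raw string

raw = 'assistant\n[{"Start":0.0,"End":7.56,"Speaker":0,"Content":"VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker conversational audio."}]\n'
segments = parse_generated_text(raw)
print(segments[0]["Content"])
```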

## Evaluation

Below are results from the [technical report](https://arxiv.org/pdf/2601.18184).

<p align="center">
<img src="figures/DER.jpg" alt="DER" width="70%">
<img src="figures/cpWER.jpg" alt="cpWER" width="70%">
<img src="figures/tcpWER.jpg" alt="tcpWER" width="70%">
</p>

### Open ASR Leaderboard

On the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard), the following results were obtained:

| Dataset                | WER (%)   |
| ---------------------- | --------- |
| ami_test               | 17.20     |
| earnings22_test        | 13.17     |
| gigaspeech_test        | 9.67      |
| librispeech_test.clean | 2.20      |
| librispeech_test.other | 5.51      |
| spgispeech_test        | 3.80      |
| tedlium_test           | 2.57      |
| voxpopuli_test         | 8.01      |
| **Average**            | **7.77**  |
| **RTFx**               | **51.80** |
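
The reported average is the unweighted mean of the eight dataset WERs, which can be checked directly:

```python
wers = {
    "ami_test": 17.20,
    "earnings22_test": 13.17,
    "gigaspeech_test": 9.67,
    "librispeech_test.clean": 2.20,
    "librispeech_test.other": 5.51,
    "spgispeech_test": 3.80,
    "tedlium_test": 2.57,
    "voxpopuli_test": 8.01,
}

average = sum(wers.values()) / len(wers)
print(f"Average WER: {average:.2f}%")  # Average WER: 7.77%
```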

## Language Distribution

<p align="center">
<img src="figures/language_distribution_horizontal.png" alt="Language Distribution" width="80%">
</p>

## License

This project is licensed under the MIT License.

## Contact

This project was conducted by members of Microsoft Research. We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at VibeVoice@microsoft.com.

If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.
 