Update README.md
#4
by sd983527 - opened
README.md
CHANGED

## VibeVoice-ASR (Transformers-compatible version)

[](https://github.com/microsoft/VibeVoice)
[](https://huggingface.co/docs/microsoft-azure/foundry/guides/vibevoice-asr)
[](https://aka.ms/vibevoice-asr)
[](https://arxiv.org/pdf/2601.18184)

**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords** and over **50 languages**.

➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)<br>
➡️ **Deploy:** [Deploy VibeVoice ASR on Microsoft Foundry: Long-Form Transcription, +50 Languages Supported & Speaker Diarization](https://huggingface.co/docs/microsoft-azure/foundry/guides/vibevoice-asr)<br>
➡️ **Report:** [VibeVoice-ASR Technical Report](https://arxiv.org/pdf/2601.18184)<br>

<p align="left">

VibeVoice-ASR is available as of v5.3.0 of Transformers!

```bash
pip install "transformers>=5.3.0"
```

### Loading model

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id)
```

### Speaker-timestamped transcription

A notable feature of VibeVoice ASR is its ability to transcribe multi-speaker content, denoting who spoke and when.

The example below transcribes the following audio.

<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav" type="audio/wav">
</audio>

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
# ...
print(f"Model loaded on {model.device} with dtype {model.dtype}")

# Prepare inputs using `apply_transcription_request`
inputs = processor.apply_transcription_request(
    audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
).to(model.device, model.dtype)

# Apply model
# ...

"""
...
TRANSCRIPTION ONLY
============================================================
Hello everyone and welcome to the Vibe Voice podcast. I'm your host, Alex, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Sam here to talk about it with me. Thanks so much for having me, Alex. And you're absolutely right. This question always brings out some seriously strong feelings. Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the finals, six championships. That kind of perfection is just incredible. Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it.
"""
```

The VibeVoice ASR model is trained to generate a string that resembles a JSON structure. The flag `return_format="parsed"` tries to return the generated output as a list of dicts, while `return_format="transcription_only"` tries to extract only the transcribed text. If either fails, the generated output is returned as-is.
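This fallback behavior can be sketched as follows (a hypothetical stand-alone helper for illustration, not the library's actual implementation):

```python
import json

def parse_vibevoice_output(generated: str):
    """Best-effort parse of the JSON-like string the model generates:
    return a list of dicts on success, or the raw string on failure."""
    try:
        parsed = json.loads(generated)
        if isinstance(parsed, list):
            return parsed
    except json.JSONDecodeError:
        pass
    return generated

raw = '[{"Start": 0.0, "End": 7.56, "Speaker": 0, "Content": "Hello."}]'
segments = parse_vibevoice_output(raw)
print(segments[0]["Speaker"])  # 0

# Malformed output falls back to the raw string
print(parse_vibevoice_output("not json"))  # not json
```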

### Providing context

It is also possible to provide context. This can be useful if certain words cannot be transcribed correctly, such as proper nouns.

Below we transcribe an audio where the speaker (with a German accent) talks about VibeVoice, comparing with and without the context "About VibeVoice".

<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav" type="audio/wav">
</audio>

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
# ...
print(f"Model loaded on {model.device} with dtype {model.dtype}")

# Without context
inputs = processor.apply_transcription_request(
    audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
).to(model.device, model.dtype)
output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
# ...
print(f"WITHOUT CONTEXT: {transcription}")

# With context
inputs = processor.apply_transcription_request(
    audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
    prompt="About VibeVoice",
).to(model.device, model.dtype)
output_ids = model.generate(**inputs)
# ...
print(f"WITH CONTEXT : {transcription}")

"""
WITHOUT CONTEXT: Revevoices is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio.
WITH CONTEXT : VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker, conversational audio.
"""
```

### Batch inference

Batch inference is possible by passing a list of audio files and (if provided) a list of prompts of equal length.

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
audio = [
    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
]
prompts = ["About VibeVoice", None]
# ...
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")

print(transcription)
```

### Adjusting tokenizer chunk (e.g. if out-of-memory)

A key feature of VibeVoice ASR is that it can transcribe up to 60 minutes of continuous audio. This is done by chunking the audio into 60-second segments (1440000 samples at 24 kHz) and caching the convolution states between segments.

However, if 60-second chunks are too large for your device, the `tokenizer_chunk_size` argument passed to `generate` can be adjusted. *Note that it should be a multiple of the hop length (3200 for the original acoustic tokenizer).*

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

tokenizer_chunk_size = 64000  # default is 1440000 (60s @ 24kHz)
model_id = "microsoft/VibeVoice-ASR-HF"
audio = [
    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
]
prompts = ["About VibeVoice", None]
# ...
output_ids = model.generate(**inputs, tokenizer_chunk_size=tokenizer_chunk_size)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")
print(transcription)
```

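The multiple-of-hop-length constraint can be satisfied numerically, e.g. by rounding a target duration down to the nearest valid chunk size (the helper name is hypothetical):

```python
HOP_LENGTH = 3200     # hop length of the original acoustic tokenizer
SAMPLE_RATE = 24_000  # samples per second

def chunk_size_for_seconds(seconds: float) -> int:
    """Round a target duration down to the nearest valid chunk size,
    i.e. a multiple of the tokenizer hop length."""
    samples = int(seconds * SAMPLE_RATE)
    return (samples // HOP_LENGTH) * HOP_LENGTH

print(chunk_size_for_seconds(60))   # 1440000 (the default, 60 s)
print(chunk_size_for_seconds(2.7))  # 64000 (the smaller value used above)
```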
### Chat template

VibeVoice ASR also accepts chat template inputs (`apply_transcription_request` is a convenience wrapper around `apply_chat_template`):

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
# ...
chat_template = [
    {
        # ...
        "content": [
            {"type": "text", "text": "About VibeVoice"},
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
            },
        ],
    },
    {
        # ...
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
            },
        ],
    },
]
# ...
output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
transcription = processor.decode(generated_ids, return_format="transcription_only")
print(transcription)
```

### Training

VibeVoice ASR can be trained using the loss returned by the model.

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
# ...
chat_template = [
    {
        # ...
        "content": [
            {"type": "text", "text": "VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker, conversational audio."},
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
            },
        ],
    },
    {
        # ...
        "content": [
            {"type": "text", "text": "Hello everyone and welcome to the VibeVoice podcast. I'm your host, Alex, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Sam here to talk about it with me. Thanks so much for having me, Alex. And you're absolutely right. This question always brings out some seriously strong feelings. Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the finals, six championships. That kind of perfection is just incredible. Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it."},
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
            },
        ],
    },
]
# ...
inputs = processor.apply_chat_template(
    # ...
)
loss = model(**inputs).loss
print("Loss:", loss.item())
loss.backward()
```

### Torch compile

The model can be compiled for faster inference and training.

```python
import time
import torch
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

# ...
chat_template = [
    {
        # ...
        "content": [
            # ...
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
            },
        ],
    },
]
# ...
print(f"Average time with compile: {compile_time:.4f}s")

speedup = no_compile_time / compile_time
print(f"\nSpeedup: {speedup:.2f}x")
```

### Pipeline usage

The model can be used through a pipeline, but you will have to define your own methods for parsing the raw output.

```python
from transformers import pipeline

model_id = "microsoft/VibeVoice-ASR-HF"
pipe = pipeline("any-to-any", model=model_id, device_map="auto")
# ...
chat_template = [
    {
        # ...
        "content": [
            {"type": "text", "text": "About VibeVoice"},
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
            },
        ],
    },
]
# ...
print(outputs)

"""
============================================================
RAW PIPELINE OUTPUT
============================================================
[{'input_text': [{'role': 'user', 'content': [{'type': 'text', 'text': 'About VibeVoice'}, {'type': 'audio', 'path': 'https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav'}]}], 'generated_text': 'assistant\n[{"Start":0.0,"End":7.56,"Speaker":0,"Content":"VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker conversational audio."}]\n'}]
"""
```
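A minimal parsing sketch for the raw pipeline output, assuming `generated_text` keeps the `assistant\n<json>` shape shown above (the helper name is hypothetical):

```python
import json

def parse_generated_text(generated_text: str):
    """Strip the leading role tag and parse the JSON segment list."""
    payload = generated_text.strip()
    if payload.startswith("assistant"):
        payload = payload[len("assistant"):].strip()
    return json.loads(payload)

raw = 'assistant\n[{"Start":0.0,"End":7.56,"Speaker":0,"Content":"VibeVoice is this novel framework."}]\n'
segments = parse_generated_text(raw)
print(segments[0]["Start"], segments[0]["End"])  # 0.0 7.56
```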

## Evaluation

Below are results from the [technical report](https://arxiv.org/pdf/2601.18184).

<p align="center">
  <img src="figures/DER.jpg" alt="DER" width="70%">
  <img src="figures/cpWER.jpg" alt="cpWER" width="70%">
  <img src="figures/tcpWER.jpg" alt="tcpWER" width="70%">
</p>

### Open ASR Leaderboard

On the [Open ASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard), the following results were obtained:

| Dataset                | WER (%)   |
| ---------------------- | --------- |
| ami_test               | 17.20     |
| earnings22_test        | 13.17     |
| gigaspeech_test        | 9.67      |
| librispeech_test.clean | 2.20      |
| librispeech_test.other | 5.51      |
| spgispeech_test        | 3.80      |
| tedlium_test           | 2.57      |
| voxpopuli_test         | 8.01      |
| **Average**            | **7.77**  |
| **RTFx**               | **51.80** |
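As a quick sanity check, the reported average is the unweighted mean of the per-dataset WERs above:

```python
wer = {
    "ami_test": 17.20,
    "earnings22_test": 13.17,
    "gigaspeech_test": 9.67,
    "librispeech_test.clean": 2.20,
    "librispeech_test.other": 5.51,
    "spgispeech_test": 3.80,
    "tedlium_test": 2.57,
    "voxpopuli_test": 8.01,
}
average = sum(wer.values()) / len(wer)
print(f"{average:.2f}")  # 7.77
```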
## Language Distribution

<p align="center">
  <img src="figures/language_distribution_horizontal.png" alt="Language Distribution" width="80%">
</p>

## License

This project is licensed under the MIT License.

## Contact

This project was conducted by members of Microsoft Research. We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at VibeVoice@microsoft.com.

If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.