---
language:
- en   # English
- zh   # Chinese
- es   # Spanish
- pt   # Portuguese
- de   # German
- ja   # Japanese
- ko   # Korean
- fr   # French
- ru   # Russian
- id   # Indonesian
- sv   # Swedish
- it   # Italian
- he   # Hebrew
- nl   # Dutch
- pl   # Polish
- no   # Norwegian
- tr   # Turkish
- th   # Thai
- ar   # Arabic
- hu   # Hungarian
- ca   # Catalan
- cs   # Czech
- da   # Danish
- fa   # Persian
- af   # Afrikaans
- hi   # Hindi
- fi   # Finnish
- et   # Estonian
- aa   # Afar
- el   # Greek
- ro   # Romanian
- vi   # Vietnamese
- bg   # Bulgarian
- is   # Icelandic
- sl   # Slovenian
- sk   # Slovak
- lt   # Lithuanian
- sw   # Swahili
- uk   # Ukrainian
- kl   # Kalaallisut
- lv   # Latvian
- hr   # Croatian
- ne   # Nepali
- sr   # Serbian
- tl   # Filipino (ISO 639-1; common engineering alias: fil)
- yi   # Yiddish
- ms   # Malay
- ur   # Urdu
- mn   # Mongolian
- hy   # Armenian
- jv   # Javanese
license: mit
pipeline_tag: automatic-speech-recognition
tags:
- ASR
- Transcription
- Diarization
- Speech-to-Text
library_name: transformers
---


## VibeVoice-ASR
[![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/microsoft/VibeVoice)
[![Live Playground](https://img.shields.io/badge/Live-Playground-green?logo=gradio)](https://aka.ms/vibevoice-asr)
[![Technical Report](https://img.shields.io/badge/arXiv-2601.18184-b31b1b?logo=arxiv)](https://arxiv.org/pdf/2601.18184)

**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords** and over **50 languages**.

➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)<br>
➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)<br>
➡️ **Report:** [VibeVoice-ASR Technical Report](https://arxiv.org/pdf/2601.18184)<br>
➡️ **Finetuning:** [Finetuning](https://github.com/microsoft/VibeVoice/blob/main/finetuning-asr/README.md)<br>
➡️ **vLLM:** [vLLM-VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-vllm-asr.md)<br>

<p align="left">
  <img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px">
</p>


## 🔥 Key Features


- **🕒 60-minute Single-Pass Processing**:
  Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice-ASR accepts up to **60 minutes** of continuous audio within a 64K-token context window. This keeps speaker tracking consistent and the transcript semantically coherent across the entire hour.

- **👤 Customized Hotwords**:
  Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.

- **📝 Rich Transcription (Who, When, What)**:
  The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
  
- **🌍 Multilingual & Code-Switching Support**:
  It supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Language distribution can be found [here](#language-distribution).
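
To make the "Who, When, What" structure concrete, here is a minimal Python sketch of how a downstream application might represent and consume such a transcript. The `Segment` class, its field names, and the helper functions are hypothetical and chosen for illustration only; they are not the model's actual output schema or API, which is documented in the GitHub repository.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript entry. Hypothetical schema, not the model's real output."""
    speaker: str  # who
    start: float  # when: start time in seconds
    end: float    # when: end time in seconds
    text: str     # what


def talk_time(segments):
    """Sum the speaking time in seconds for each speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def render(segments):
    """Format segments in chronological order, one line per segment."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{seg.start:07.2f}-{seg.end:07.2f}] {seg.speaker}: {seg.text}"
        for seg in ordered
    )


# Made-up sample data for illustration only.
EXAMPLE = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome back to the show."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks, glad to be here."),
    Segment("Speaker 1", 9.0, 12.0, "Let's get started."),
]

if __name__ == "__main__":
    print(render(EXAMPLE))
    print(talk_time(EXAMPLE))
```

Keeping speaker, timing, and text in one record means tasks such as per-speaker talk time or subtitle export need no separate alignment step between a diarization result and a transcript.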



## Evaluation
<p align="center">
  <img src="figures/DER.jpg" alt="DER" width="70%">
  <img src="figures/cpWER.jpg" alt="cpWER" width="70%">
  <img src="figures/tcpWER.jpg" alt="tcpWER" width="70%">
</p>

## Installation and Usage

Please refer to the [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation) for installation and usage instructions.

## Language Distribution
<p align="center">
  <img src="figures/language_distribution_horizontal.png" alt="Language Distribution" width="80%">
</p>

## License
This project is licensed under the MIT License.

## Contact
This project was conducted by members of Microsoft Research. We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at VibeVoice@microsoft.com.
If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.