File size: 4,345 Bytes
203029c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
---
language:
- en   # English
- zh   # Chinese
- es   # Spanish
- pt   # Portuguese
- de   # German
- ja   # Japanese
- ko   # Korean
- fr   # French
- ru   # Russian
- id   # Indonesian
- sv   # Swedish
- it   # Italian
- he   # Hebrew
- nl   # Dutch
- pl   # Polish
- no   # Norwegian
- tr   # Turkish
- th   # Thai
- ar   # Arabic
- hu   # Hungarian
- ca   # Catalan
- cs   # Czech
- da   # Danish
- fa   # Persian
- af   # Afrikaans
- hi   # Hindi
- fi   # Finnish
- et   # Estonian
- aa   # Afar
- el   # Greek
- ro   # Romanian
- vi   # Vietnamese
- bg   # Bulgarian
- is   # Icelandic
- sl   # Slovenian
- sk   # Slovak
- lt   # Lithuanian
- sw   # Swahili
- uk   # Ukrainian
- kl   # Kalaallisut
- lv   # Latvian
- hr   # Croatian
- ne   # Nepali
- sr   # Serbian
- tl   # Filipino (ISO 639-1; 常见工程别名: fil)
- yi   # Yiddish
- ms   # Malay
- ur   # Urdu
- mn   # Mongolian
- hy   # Armenian
- jv   # Javanese
license: mit
pipeline_tag: automatic-speech-recognition
tags:
- ASR
- Transcriptoin
- Diarization
- Speech-to-Text
library_name: transformers
---


## VibeVoice-ASR
[![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/microsoft/VibeVoice)
[![Live Playground](https://img.shields.io/badge/Live-Playground-green?logo=gradio)](https://aka.ms/vibevoice-asr)
[![Technical Report](https://img.shields.io/badge/arXiv-2601.18184-b31b1b?logo=arxiv)](https://arxiv.org/pdf/2601.18184)

**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords** and over **50 languages**.

➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)<br>
➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)<br>
➡️ **Report:** [VibeVoice-ASR Technical Report](https://arxiv.org/pdf/2601.18184)<br>
➡️ **Finetuning:** [Finetuning](https://github.com/microsoft/VibeVoice/blob/main/finetuning-asr/README.md)<br>
➡️ **vLLM:** [vLLM-VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-vllm-asr.md)<br>

<p align="left">
  <img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px">
</p>


## 🔥 Key Features


- **🕒 60-minute Single-Pass Processing**:
  Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice ASR accepts up to **60 minutes** of continuous audio input within 64K token length. This ensures consistent speaker tracking and semantic coherence across the entire hour.

- **👤 Customized Hotwords**:
  Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.

- **📝 Rich Transcription (Who, When, What)**:
  The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
  
- **🌍 Multilingual & Code-Switching Support**:
  It supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Language distribution can be found [here](#language-distribution).



## Evaluation
<p align="center">
  <img src="figures/DER.jpg" alt="DER" width="70%">
  <img src="figures/cpWER.jpg" alt="cpWER" width="70%">
  <img src="figures/tcpWER.jpg" alt="tcpWER" width="70%">
</p>

## Installation and Usage

Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation).

## Language Distribution
<p align="center">
  <img src="figures/language_distribution_horizontal.png" alt="Language Distribution" width="80%">
</p>

## License
This project is licensed under the MIT License.

## Contact
This project was conducted by members of Microsoft Research. We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at VibeVoice@microsoft.com.
If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.