| | --- |
| | language: |
| | - en |
| | - zh |
| | - es |
| | - pt |
| | - de |
| | - ja |
| | - ko |
| | - fr |
| | - ru |
| | - id |
| | - sv |
| | - it |
| | - he |
| | - nl |
| | - pl |
| | - no |
| | - tr |
| | - th |
| | - ar |
| | - hu |
| | - ca |
| | - cs |
| | - da |
| | - fa |
| | - af |
| | - hi |
| | - fi |
| | - et |
| | - aa |
| | - el |
| | - ro |
| | - vi |
| | - bg |
| | - is |
| | - sl |
| | - sk |
| | - lt |
| | - sw |
| | - uk |
| | - kl |
| | - lv |
| | - hr |
| | - ne |
| | - sr |
| | - tl |
| | - yi |
| | - ms |
| | - ur |
| | - mn |
| | - hy |
| | - jv |
| | license: mit |
| | pipeline_tag: automatic-speech-recognition |
| | tags: |
| | - ASR |
| | - Transcription |
| | - Diarization |
| | - Speech-to-Text |
| | library_name: transformers |
| | --- |
| | |
| |
|
| | ## VibeVoice-ASR |
| | [](https://github.com/microsoft/VibeVoice) |
| | [](https://aka.ms/vibevoice-asr) |
| | [](https://arxiv.org/pdf/2601.18184) |
| |
|
| | **VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords** and over **50 languages**. |
| |
|
| | ➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)<br> |
| | ➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)<br> |
| | ➡️ **Report:** [VibeVoice-ASR Technical Report](https://arxiv.org/pdf/2601.18184)<br> |
| | ➡️ **Finetuning:** [Finetuning](https://github.com/microsoft/VibeVoice/blob/main/finetuning-asr/README.md)<br> |
| | ➡️ **vLLM:** [vLLM-VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-vllm-asr.md)<br> |
| |
|
| | <p align="left"> |
| | <img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px"> |
| | </p> |
| |
|
| |
|
| | ## 🔥 Key Features |
| |
|
| |
|
| | - **🕒 60-minute Single-Pass Processing**: |
| | Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice ASR accepts up to **60 minutes** of continuous audio input within 64K token length. This ensures consistent speaker tracking and semantic coherence across the entire hour. |
| |
|
| | - **👤 Customized Hotwords**: |
| | Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content. |
| |
|
| | - **📝 Rich Transcription (Who, When, What)**: |
| | The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*. |
| | |
| | - **🌍 Multilingual & Code-Switching Support**: |
| | It supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Language distribution can be found [here](#language-distribution). |
| |
|
| |
|
| |
|
| | ## Evaluation |
| | <p align="center"> |
| | <img src="figures/DER.jpg" alt="DER" width="70%"> |
| | <img src="figures/cpWER.jpg" alt="cpWER" width="70%"> |
| | <img src="figures/tcpWER.jpg" alt="tcpWER" width="70%"> |
| | </p> |
| |
|
| | ## Installation and Usage |
| |
|
| | Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation). |
| |
|
| | ## Language Distribution |
| | <p align="center"> |
| | <img src="figures/language_distribution_horizontal.png" alt="Language Distribution" width="80%"> |
| | </p> |
| |
|
| | ## License |
| | This project is licensed under the MIT License. |
| |
|
| | ## Contact |
| | This project was conducted by members of Microsoft Research. We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at VibeVoice@microsoft.com. |
| | If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations. |