microsoft/VibeVoice-ASR

@@ -1,7 +1,56 @@
 ---
 language:
-- en
-- zh
 license: mit
 pipeline_tag: automatic-speech-recognition
 tags:
@@ -15,12 +64,17 @@ library_name: transformers
 ## VibeVoice-ASR
 [![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/microsoft/VibeVoice)
-[![Live Playground](https://img.shields.io/badge/Live-Playground-green?logo=gradio)](https://aka.ms/vibevoice-asr)
-**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords**.
 ➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)<br>
 ➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)
 <p align="left">
   <img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px">
@@ -29,6 +83,7 @@ library_name: transformers
 ## 🔥 Key Features
 - **🕒 60-minute Single-Pass Processing**:
   Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice-ASR accepts up to **60 minutes** of continuous audio input within a 64K-token context. This ensures consistent speaker tracking and semantic coherence across the entire hour.
@@ -37,7 +92,9 @@ library_name: transformers
 - **📝 Rich Transcription (Who, When, What)**:
   The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
@@ -52,6 +109,7 @@ library_name: transformers
 Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation).
 ## License
 This project is licensed under the MIT License.

 ---
 language:
+- en   # English
+- zh   # Chinese
+- es   # Spanish
+- pt   # Portuguese
+- de   # German
+- ja   # Japanese
+- ko   # Korean
+- fr   # French
+- ru   # Russian
+- id   # Indonesian
+- sv   # Swedish
+- it   # Italian
+- he   # Hebrew
+- nl   # Dutch
+- pl   # Polish
+- no   # Norwegian
+- tr   # Turkish
+- th   # Thai
+- ar   # Arabic
+- hu   # Hungarian
+- ca   # Catalan
+- cs   # Czech
+- da   # Danish
+- fa   # Persian
+- af   # Afrikaans
+- hi   # Hindi
+- fi   # Finnish
+- et   # Estonian
+- aa   # Afar
+- el   # Greek
+- ro   # Romanian
+- vi   # Vietnamese
+- bg   # Bulgarian
+- is   # Icelandic
+- sl   # Slovenian
+- sk   # Slovak
+- lt   # Lithuanian
+- sw   # Swahili
+- uk   # Ukrainian
+- kl   # Kalaallisut
+- lv   # Latvian
+- hr   # Croatian
+- ne   # Nepali
+- sr   # Serbian
+- tl   # Filipino (ISO 639-1; common engineering alias: fil)
+- yi   # Yiddish
+- ms   # Malay
+- ur   # Urdu
+- mn   # Mongolian
+- hy   # Armenian
+- jv   # Javanese
 license: mit
 pipeline_tag: automatic-speech-recognition
 tags:
 ## VibeVoice-ASR
 [![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/microsoft/VibeVoice)
+[![Live Playground](https://img.shields.io/badge/Live-Playground-green?logo=gradio)](https://aka.ms/vibevoice-asr)
+[Technical Report](https://arxiv.org/pdf/2601.18184)
+[Finetuning](https://github.com/microsoft/VibeVoice/blob/main/finetuning-asr/README.md)
+**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords** and over **50 languages**.
 ➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)<br>
 ➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)
+➡️ **Report:** [VibeVoice-ASR Technical Report](https://arxiv.org/pdf/2601.18184)
+➡️ **Finetuning:** [Finetuning](https://github.com/microsoft/VibeVoice/blob/main/finetuning-asr/README.md)
+➡️ **vLLM:** [vLLM-VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-vllm-asr.md)
 <p align="left">
   <img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px">
 ## 🔥 Key Features
 - **🕒 60-minute Single-Pass Processing**:
   Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice-ASR accepts up to **60 minutes** of continuous audio input within a 64K-token context. This ensures consistent speaker tracking and semantic coherence across the entire hour.
 - **📝 Rich Transcription (Who, When, What)**:
   The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
+- **🌍 Multilingual & Code-Switching Support**:
+  The model supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. See [Language distribution](#language-distribution) for details.
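
The 60-minute, 64K-token figure in the feature list implies a tight per-second token budget. The sketch below works that budget out as plain arithmetic. It assumes that "64K" means 65,536 tokens and that the whole hour of audio must fit inside that window; both are assumptions made for illustration, not figures stated in the model card, and the names `CONTEXT_TOKENS` and `tokens_per_second` are hypothetical.

```python
# Assumption: "64K" is read as 64 * 1024 = 65,536 tokens.
CONTEXT_TOKENS = 64 * 1024
# 60 minutes of audio, expressed in seconds.
AUDIO_SECONDS = 60 * 60


def tokens_per_second(context_tokens: int, audio_seconds: int) -> float:
    """Average number of tokens available per second of audio."""
    return context_tokens / audio_seconds


budget = tokens_per_second(CONTEXT_TOKENS, AUDIO_SECONDS)
print(f"{budget:.1f} tokens per second of audio")  # prints "18.2 tokens per second of audio"
```

Under these assumptions, roughly 18 tokens are available for each second of audio, and any tokens spent on the generated transcript would come out of the same window.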
 Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation).
 ## License
 This project is licensed under the MIT License.
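
The updated `language:` front matter in this card lists 51 codes, which matches the "over 50 languages" claim. As a minimal sketch of how such a list could be checked, the helper below pulls the codes out of YAML-style front matter using only the standard library. The function name `parse_language_codes` and the shortened `sample` text are hypothetical; a real check would more likely use a YAML parser on the full card.

```python
def parse_language_codes(front_matter: str) -> list[str]:
    """Collect the items of the top-level `language:` list in YAML-like front matter."""
    codes = []
    in_language_block = False
    for raw_line in front_matter.splitlines():
        # Drop trailing comments such as "# English".
        line = raw_line.split("#", 1)[0].rstrip()
        if not line.strip():
            continue
        if line.startswith("language:"):
            in_language_block = True
            continue
        if in_language_block:
            stripped = line.strip()
            if stripped.startswith("- "):
                codes.append(stripped[2:].strip())
            else:
                # The next top-level key (e.g. "license:") ends the list.
                break
    return codes


# Shortened, hypothetical sample in the same shape as the card's front matter.
sample = """\
language:
- en   # English
- zh   # Chinese
- tl   # Filipino
license: mit
"""

codes = parse_language_codes(sample)
print(codes, len(codes))
```

Running the same helper over the full front matter of this card should yield 51 entries.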