Update README.md
README.md
CHANGED
|
@@ -1,7 +1,56 @@
|
|
| 1 |
---
|
| 2 |
language:
|
| 3 |
-
- en
|
| 4 |
-
- zh
|
| 5 |
license: mit
|
| 6 |
pipeline_tag: automatic-speech-recognition
|
| 7 |
tags:
|
|
@@ -15,12 +64,17 @@ library_name: transformers
|
|
| 15 |
|
| 16 |
## VibeVoice-ASR
|
| 17 |
[](https://github.com/microsoft/VibeVoice)
|
| 18 |
-
[](https://aka.ms/vibevoice-asr)
|
| 19 |
|
| 20 |
-
**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords**.
|
| 21 |
|
| 22 |
➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)<br>
|
| 23 |
➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)
|
| 24 |
|
| 25 |
<p align="left">
|
| 26 |
<img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px">
|
|
@@ -29,6 +83,7 @@ library_name: transformers
|
|
| 29 |
|
| 30 |
## 🔥 Key Features
|
| 31 |
|
|
|
|
| 32 |
- **🕒 60-minute Single-Pass Processing**:
|
| 33 |
Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice-ASR accepts up to **60 minutes** of continuous audio input within a 64K-token context length. This ensures consistent speaker tracking and semantic coherence across the entire hour.
|
| 34 |
|
|
@@ -37,7 +92,9 @@ library_name: transformers
|
|
| 37 |
|
| 38 |
- **📝 Rich Transcription (Who, When, What)**:
|
| 39 |
The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
|
| 40 |
-
|
| 41 |
|
| 42 |
|
| 43 |
|
|
@@ -52,6 +109,7 @@ library_name: transformers
|
|
| 52 |
|
| 53 |
Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation).
|
| 54 |
|
|
|
|
| 55 |
## License
|
| 56 |
This project is licensed under the MIT License.
|
| 57 |
|
|
|
|
| 1 |
---
|
| 2 |
language:
|
| 3 |
+
- en # English
|
| 4 |
+
- zh # Chinese
|
| 5 |
+
- es # Spanish
|
| 6 |
+
- pt # Portuguese
|
| 7 |
+
- de # German
|
| 8 |
+
- ja # Japanese
|
| 9 |
+
- ko # Korean
|
| 10 |
+
- fr # French
|
| 11 |
+
- ru # Russian
|
| 12 |
+
- id # Indonesian
|
| 13 |
+
- sv # Swedish
|
| 14 |
+
- it # Italian
|
| 15 |
+
- he # Hebrew
|
| 16 |
+
- nl # Dutch
|
| 17 |
+
- pl # Polish
|
| 18 |
+
- no # Norwegian
|
| 19 |
+
- tr # Turkish
|
| 20 |
+
- th # Thai
|
| 21 |
+
- ar # Arabic
|
| 22 |
+
- hu # Hungarian
|
| 23 |
+
- ca # Catalan
|
| 24 |
+
- cs # Czech
|
| 25 |
+
- da # Danish
|
| 26 |
+
- fa # Persian
|
| 27 |
+
- af # Afrikaans
|
| 28 |
+
- hi # Hindi
|
| 29 |
+
- fi # Finnish
|
| 30 |
+
- et # Estonian
|
| 31 |
+
- aa # Afar
|
| 32 |
+
- el # Greek
|
| 33 |
+
- ro # Romanian
|
| 34 |
+
- vi # Vietnamese
|
| 35 |
+
- bg # Bulgarian
|
| 36 |
+
- is # Icelandic
|
| 37 |
+
- sl # Slovenian
|
| 38 |
+
- sk # Slovak
|
| 39 |
+
- lt # Lithuanian
|
| 40 |
+
- sw # Swahili
|
| 41 |
+
- uk # Ukrainian
|
| 42 |
+
- kl # Kalaallisut
|
| 43 |
+
- lv # Latvian
|
| 44 |
+
- hr # Croatian
|
| 45 |
+
- ne # Nepali
|
| 46 |
+
- sr # Serbian
|
| 47 |
+
- tl # Filipino (ISO 639-1; common engineering alias: fil)
|
| 48 |
+
- yi # Yiddish
|
| 49 |
+
- ms # Malay
|
| 50 |
+
- ur # Urdu
|
| 51 |
+
- mn # Mongolian
|
| 52 |
+
- hy # Armenian
|
| 53 |
+
- jv # Javanese
|
| 54 |
license: mit
|
| 55 |
pipeline_tag: automatic-speech-recognition
|
| 56 |
tags:
|
|
|
|
| 64 |
|
| 65 |
## VibeVoice-ASR
|
| 66 |
[](https://github.com/microsoft/VibeVoice)
|
| 67 |
+
[](https://aka.ms/vibevoice-asr)
|
| 68 |
+
[]
|
| 69 |
+
[]
|
| 70 |
|
| 71 |
+
**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords** and over **50 languages**.
|
| 72 |
|
| 73 |
➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)<br>
|
| 74 |
➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)
|
| 75 |
+
➡️ **Report:** [VibeVoice-ASR Technical Report](https://arxiv.org/pdf/2601.18184)
|
| 76 |
+
➡️ **Finetuning:** [Finetuning](https://github.com/microsoft/VibeVoice/blob/main/finetuning-asr/README.md)
|
| 77 |
+
➡️ **vLLM:** [vLLM-VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-vllm-asr.md)
|
| 78 |
|
| 79 |
<p align="left">
|
| 80 |
<img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px">
|
|
|
|
| 83 |
|
| 84 |
## 🔥 Key Features
|
| 85 |
|
| 86 |
+
|
| 87 |
- **🕒 60-minute Single-Pass Processing**:
|
| 88 |
Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice-ASR accepts up to **60 minutes** of continuous audio input within a 64K-token context length. This ensures consistent speaker tracking and semantic coherence across the entire hour.
|
| 89 |
|
|
|
|
| 92 |
|
| 93 |
- **📝 Rich Transcription (Who, When, What)**:
|
| 94 |
The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
|
| 95 |
+
|
| 96 |
+
- **🌍 Multilingual & Code-Switching Support**:
|
| 97 |
+
It supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Language distribution can be found [here](#language-distribution).
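
As a back-of-the-envelope illustration of the 60-minute / 64K-token figures in the feature list above, the sketch below computes the average number of context tokens available per second of audio. It assumes that "64K" means 64 × 1024 tokens and that the whole hour has to fit in one window; the README does not spell out either point, and the helper name `tokens_per_second_budget` is hypothetical, not part of the VibeVoice code.

```python
# Rough token-budget arithmetic for single-pass long-form ASR.
# Assumption: "64K" means 64 * 1024 tokens; the README does not say
# whether it means 65,536 or 64,000.
CONTEXT_TOKENS = 64 * 1024
MAX_AUDIO_SECONDS = 60 * 60  # 60 minutes, as stated in the README


def tokens_per_second_budget(context_tokens: int, audio_seconds: int) -> float:
    """Average number of context tokens available per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return context_tokens / audio_seconds


budget = tokens_per_second_budget(CONTEXT_TOKENS, MAX_AUDIO_SECONDS)
print(f"{budget:.1f} tokens per second of audio")  # 18.2
```

Under these assumptions the budget is about 18 tokens per second of audio, which gives a sense of how compact the audio representation must be for an hour of speech to fit in a single pass.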
|
| 98 |
|
| 99 |
|
| 100 |
|
|
|
|
| 109 |
|
| 110 |
Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation).
|
| 111 |
|
| 112 |
+
|
| 113 |
## License
|
| 114 |
This project is licensed under the MIT License.
|
| 115 |
|