frontierai committed
Commit 53552c2 · verified · 1 Parent(s): bfa0c6f

Update README.md

Files changed (1)
  1. README.md +63 -5
README.md CHANGED
@@ -1,7 +1,56 @@
 ---
 language:
-- en
-- zh
+- en # English
+- zh # Chinese
+- es # Spanish
+- pt # Portuguese
+- de # German
+- ja # Japanese
+- ko # Korean
+- fr # French
+- ru # Russian
+- id # Indonesian
+- sv # Swedish
+- it # Italian
+- he # Hebrew
+- nl # Dutch
+- pl # Polish
+- no # Norwegian
+- tr # Turkish
+- th # Thai
+- ar # Arabic
+- hu # Hungarian
+- ca # Catalan
+- cs # Czech
+- da # Danish
+- fa # Persian
+- af # Afrikaans
+- hi # Hindi
+- fi # Finnish
+- et # Estonian
+- aa # Afar
+- el # Greek
+- ro # Romanian
+- vi # Vietnamese
+- bg # Bulgarian
+- is # Icelandic
+- sl # Slovenian
+- sk # Slovak
+- lt # Lithuanian
+- sw # Swahili
+- uk # Ukrainian
+- kl # Kalaallisut
+- lv # Latvian
+- hr # Croatian
+- ne # Nepali
+- sr # Serbian
+- tl # Filipino (ISO 639-1; common engineering alias: fil)
+- yi # Yiddish
+- ms # Malay
+- ur # Urdu
+- mn # Mongolian
+- hy # Armenian
+- jv # Javanese
 license: mit
 pipeline_tag: automatic-speech-recognition
 tags:
@@ -15,12 +64,17 @@ library_name: transformers
 
 ## VibeVoice-ASR
 [![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/microsoft/VibeVoice)
-[![Live Playground](https://img.shields.io/badge/Live-Playground-green?logo=gradio)](https://aka.ms/vibevoice-asr)
+[![Live Playground](https://img.shields.io/badge/Live-Playground-green?logo=gradio)](https://aka.ms/vibevoice-asr)
+[![Technical Report](https://arxiv.org/pdf/2601.18184)]
+[![Finetuning](https://github.com/microsoft/VibeVoice/blob/main/finetuning-asr/README.md)]
 
-**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords**.
+**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords** and over **50 languages**.
 
 ➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)<br>
 ➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)
+➡️ **Report:** [VibeVoice-ASR Technical Report](https://arxiv.org/pdf/2601.18184)
+➡️ **Finetuning:** [Finetuning](https://github.com/microsoft/VibeVoice/blob/main/finetuning-asr/README.md)
+➡️ **vLLM:** [vLLM-VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-vllm-asr.md)
 
 <p align="left">
 <img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px">
@@ -29,6 +83,7 @@ library_name: transformers
 
 ## 🔥 Key Features
 
+
 - **🕒 60-minute Single-Pass Processing**:
 Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice ASR accepts up to **60 minutes** of continuous audio input within 64K token length. This ensures consistent speaker tracking and semantic coherence across the entire hour.
 
@@ -37,7 +92,9 @@ library_name: transformers
 
 - **📝 Rich Transcription (Who, When, What)**:
 The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
-
+
+- **🌍 Multilingual & Code-Switching Support**:
+It supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Language distribution can be found [here](#language-distribution).
 
 
 
@@ -52,6 +109,7 @@ library_name: transformers
 
 Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation).
 
+
 ## License
 This project is licensed under the MIT License.
 
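The new front matter lists 51 ISO 639-1 codes, which is consistent with the README's claim of "over 50 languages", and the feature list says an hour of audio fits in a 64K-token window. A quick, dependency-free sanity check of both numbers (the codes below are copied from the diff above; treating the 64K budget as covering the full hour is an assumption, not something the diff states):

```python
# Sanity-check the language metadata and the token budget from this commit.
# Codes copied verbatim from the new YAML front matter (51 entries).
codes = (
    "en zh es pt de ja ko fr ru id sv it he nl pl no tr th ar hu "
    "ca cs da fa af hi fi et aa el ro vi bg is sl sk lt sw uk kl "
    "lv hr ne sr tl yi ms ur mn hy jv"
).split()

assert len(codes) == 51                      # matches "over 50 languages"
assert len(set(codes)) == len(codes)         # no duplicate entries
assert all(len(c) == 2 and c.islower() for c in codes)  # ISO 639-1 shape

# Rough rate implied by "60 minutes within 64K token length"
# (assuming the whole budget is spent on the hour of audio):
tokens_per_second = 64_000 / 3_600
print(f"{len(codes)} languages; ~{tokens_per_second:.1f} tokens/sec budget")
```

Note that `no` (Norwegian) is a legal ISO 639-1 code but is parsed as a boolean by some YAML 1.1 loaders, so quoting it can be safer in hand-edited front matter.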