YaoyaoChang committed on
Commit 76324d4 · 1 Parent(s): e49f300

update README

Files changed (4)
  1. README.md +13 -6
  2. figures/DER.jpg +0 -0
  3. figures/cpWER.jpg +0 -0
  4. figures/tcpWER.jpg +0 -0
README.md CHANGED
@@ -14,15 +14,15 @@ library_name: transformers
 
 
 ## VibeVoice-ASR
+[![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/microsoft/VibeVoice)
 [![Live Playground](https://img.shields.io/badge/Live-Playground-green?logo=gradio)](https://aka.ms/vibevoice-asr)
 
-
-**VibeVoice-ASR** is the latest addition to the **VibeVoice** family. While the original VibeVoice / VibeVoice-Realtime focused on expressive TTS, **VibeVoice-ASR** focuses on understanding long-form speech with high precision and rich metadata.
-
-It is a unified speech-to-text model designed to handle **1-hour long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **User-Customized Context**.
+**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords**.
 
 ➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)
 
+➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)
+
 <p align="left">
 <img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px">
 </p>
@@ -39,11 +39,18 @@ It is a unified speech-to-text model designed to handle **1-hour long-form audio
 - **📝 Rich Transcription (Who, When, What)**:
 The model performs ASR, Diarization, and Timestamping simultaneously. The output is a structured sequence indicating *who* said *what* at *which time*.
 
-[Try it here.](https://aka.ms/vibevoice-asr)
+
+
+## Evaluation
+<p align="center">
+<img src="figures/DER.jpg" alt="DER" width="50%">
+<img src="figures/cpWER.jpg" alt="cpWER" width="50%">
+<img src="figures/tcpWER.jpg" alt="tcpWER" width="50%">
+</p>
 
 ## Installation and Usage
 
-Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation)
+Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation).
 
 ## License
 This project is licensed under the MIT License.
figures/DER.jpg ADDED
figures/cpWER.jpg ADDED
figures/tcpWER.jpg ADDED