YaoyaoChang committed · Commit 76324d4 · 1 Parent(s): e49f300
update README

Files changed:
- README.md (+13, -6)
- figures/DER.jpg (+0, -0)
- figures/cpWER.jpg (+0, -0)
- figures/tcpWER.jpg (+0, -0)
README.md CHANGED

@@ -14,15 +14,15 @@ library_name: transformers
 
 
 ## VibeVoice-ASR
+[](https://github.com/microsoft/VibeVoice)
 [](https://aka.ms/vibevoice-asr)
 
-
-**VibeVoice-ASR** is the latest addition to the **VibeVoice** family. While the original VibeVoice / VibeVoice-Realtime focused on expressive TTS, **VibeVoice-ASR** focuses on understanding long-form speech with high precision and rich metadata.
-
-It is a unified speech-to-text model designed to handle **1-hour long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **User-Customized Context**.
+**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords**.
 
 ➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)
 
+➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)
+
 <p align="left">
 <img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px">
 </p>

@@ -39,11 +39,18 @@ It is a unified speech-to-text model designed to handle **1-hour long-form audio
 - **📝 Rich Transcription (Who, When, What)**:
 The model performs ASR, Diarization, and Timestamping simultaneously. The output is a structured sequence indicating *who* said *what* at *which time*.
 
-
+
+
+## Evaluation
+<p align="center">
+<img src="figures/DER.jpg" alt="DER" width="50%">
+<img src="figures/cpWER.jpg" alt="cpWER" width="50%">
+<img src="figures/tcpWER.jpg" alt="tcpWER" width="50%">
+</p>
 
 ## Installation and Usage
 
-Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation)
+Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation).
 
 ## License
 This project is licensed under the MIT License.
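The "Who (Speaker), When (Timestamps), and What (Content)" wording in the updated README implies transcripts that interleave speaker labels, time spans, and text. Neither this commit nor the README excerpt specifies the serialization, so the sketch below invents a placeholder format (`[spk0] <0.00-3.20> ...`) purely to illustrate consuming such structured output; the model's actual output format is documented in the linked GitHub repo.

```python
import re
from dataclasses import dataclass

# Placeholder serialization, for illustration only -- the real
# VibeVoice-ASR output format is specified in the GitHub repo.
# Example line: "[spk0] <0.00-3.20> welcome to the meeting"
SEGMENT = re.compile(
    r"\[(?P<spk>spk\d+)\]\s*"
    r"<(?P<start>\d+(?:\.\d+)?)-(?P<end>\d+(?:\.\d+)?)>\s*"
    r"(?P<text>.*)"
)

@dataclass
class Segment:
    speaker: str   # "who"
    start: float   # "when", in seconds
    end: float
    text: str      # "what"

def parse_transcript(raw: str) -> list[Segment]:
    """Turn a structured who/when/what sequence into typed segments."""
    segments = []
    for line in raw.splitlines():
        m = SEGMENT.match(line.strip())
        if m:
            segments.append(
                Segment(m["spk"], float(m["start"]), float(m["end"]), m["text"])
            )
    return segments

demo = (
    "[spk0] <0.00-3.20> welcome to the meeting\n"
    "[spk1] <3.50-6.10> thanks, glad to be here"
)
for seg in parse_transcript(demo):
    print(f"{seg.speaker} {seg.start:6.2f}-{seg.end:6.2f}  {seg.text}")
```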
figures/DER.jpg ADDED

figures/cpWER.jpg ADDED

figures/tcpWER.jpg ADDED
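The newly added figures report DER (diarization error rate: missed speech, false alarms, and speaker confusion as a fraction of scored time), cpWER, and tcpWER. For background, cpWER (concatenated minimum-permutation WER) concatenates each speaker's utterances and takes the minimum WER over all pairings of hypothesis speakers with reference speakers; tcpWER additionally constrains the word alignment with timing. Below is a minimal pure-Python sketch of cpWER's standard definition, for illustration only; it is not the evaluation code behind these figures.

```python
from itertools import permutations

def _word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # delete a reference word
                dp[j - 1] + 1,                      # insert a hypothesis word
                prev + (ref[i - 1] != hyp[j - 1]),  # substitute or match
            )
            prev = cur
    return dp[-1]

def cp_wer(ref_by_spk: dict[str, str], hyp_by_spk: dict[str, str]) -> float:
    """cpWER: concatenate each speaker's utterances, then take the minimum
    aggregate WER over all pairings of hypothesis speakers with reference
    speakers. Brute-force O(n!) over speakers, which is fine for the
    handful of speakers typical of meeting audio."""
    refs = [t.split() for t in ref_by_spk.values()]
    hyps = [t.split() for t in hyp_by_spk.values()]
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))   # pad with empty transcripts so
    hyps += [[]] * (n - len(hyps))   # every speaker gets a partner
    total_ref_words = sum(len(r) for r in refs)
    best = min(
        sum(_word_errors(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / max(total_ref_words, 1)

# Speaker labels differ but per-speaker content matches, so cpWER is 0.0.
ref = {"A": "hello there everyone", "B": "good morning"}
hyp = {"spk1": "good morning", "spk0": "hello there everyone"}
print(cp_wer(ref, hyp))
```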