YaoyaoChang committed · Commit 76324d4 · 1 Parent(s): e49f300
update README

Files changed:
- README.md (+13, -6)
- figures/DER.jpg (+0, -0)
- figures/cpWER.jpg (+0, -0)
- figures/tcpWER.jpg (+0, -0)
README.md CHANGED

@@ -14,15 +14,15 @@ library_name: transformers
 
 
 ## VibeVoice-ASR
+[](https://github.com/microsoft/VibeVoice)
 [](https://aka.ms/vibevoice-asr)
 
-
-**VibeVoice-ASR** is the latest addition to the **VibeVoice** family. While the original VibeVoice / VibeVoice-Realtime focused on expressive TTS, **VibeVoice-ASR** focuses on understanding long-form speech with high precision and rich metadata.
-
-It is a unified speech-to-text model designed to handle **1-hour long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **User-Customized Context**.
+**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords**.
 
 ➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)
 
+➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)
+
 <p align="left">
 <img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px">
 </p>

@@ -39,11 +39,18 @@ It is a unified speech-to-text model designed to handle **1-hour long-form audio
 - **📝 Rich Transcription (Who, When, What)**:
 The model performs ASR, Diarization, and Timestamping simultaneously. The output is a structured sequence indicating *who* said *what* at *which time*.
 
-
+
+
+## Evaluation
+<p align="center">
+<img src="figures/DER.jpg" alt="DER" width="50%">
+<img src="figures/cpWER.jpg" alt="cpWER" width="50%">
+<img src="figures/tcpWER.jpg" alt="tcpWER" width="50%">
+</p>
 
 ## Installation and Usage
 
-Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation)
+Please refer to [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation).
 
 ## License
 This project is licensed under the MIT License.
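The "Who (Speaker), When (Timestamps), and What (Content)" wording in the updated README implies transcripts that interleave speaker labels, time spans, and text. Neither this commit nor the README excerpt specifies the serialization, so the sketch below invents a placeholder format (`[spk0] <0.00-3.20> ...`) purely to illustrate consuming such structured output; the model's actual output format is documented in the linked GitHub repo.

```python
import re
from dataclasses import dataclass

# Placeholder serialization, for illustration only -- the real
# VibeVoice-ASR output format is specified in the GitHub repo.
# Example line: "[spk0] <0.00-3.20> welcome to the meeting"
SEGMENT = re.compile(
    r"\[(?P<spk>spk\d+)\]\s*"
    r"<(?P<start>\d+(?:\.\d+)?)-(?P<end>\d+(?:\.\d+)?)>\s*"
    r"(?P<text>.*)"
)

@dataclass
class Segment:
    speaker: str   # "who"
    start: float   # "when", in seconds
    end: float
    text: str      # "what"

def parse_transcript(raw: str) -> list[Segment]:
    """Turn a structured who/when/what sequence into typed segments."""
    segments = []
    for line in raw.splitlines():
        m = SEGMENT.match(line.strip())
        if m:
            segments.append(
                Segment(m["spk"], float(m["start"]), float(m["end"]), m["text"])
            )
    return segments

demo = (
    "[spk0] <0.00-3.20> welcome to the meeting\n"
    "[spk1] <3.50-6.10> thanks, glad to be here"
)
for seg in parse_transcript(demo):
    print(f"{seg.speaker} {seg.start:6.2f}-{seg.end:6.2f}  {seg.text}")
```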
figures/DER.jpg ADDED

figures/cpWER.jpg ADDED

figures/tcpWER.jpg ADDED
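The newly added figures report DER (diarization error rate: missed speech, false alarms, and speaker confusion as a fraction of scored time), cpWER, and tcpWER. For background, cpWER (concatenated minimum-permutation WER) concatenates each speaker's utterances and takes the minimum WER over all pairings of hypothesis speakers with reference speakers; tcpWER additionally constrains the word alignment with timing. Below is a minimal pure-Python sketch of cpWER's standard definition, for illustration only; it is not the evaluation code behind these figures.

```python
from itertools import permutations

def _word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # delete a reference word
                dp[j - 1] + 1,                      # insert a hypothesis word
                prev + (ref[i - 1] != hyp[j - 1]),  # substitute or match
            )
            prev = cur
    return dp[-1]

def cp_wer(ref_by_spk: dict[str, str], hyp_by_spk: dict[str, str]) -> float:
    """cpWER: concatenate each speaker's utterances, then take the minimum
    aggregate WER over all pairings of hypothesis speakers with reference
    speakers. Brute-force O(n!) over speakers, which is fine for the
    handful of speakers typical of meeting audio."""
    refs = [t.split() for t in ref_by_spk.values()]
    hyps = [t.split() for t in hyp_by_spk.values()]
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))   # pad with empty transcripts so
    hyps += [[]] * (n - len(hyps))   # every speaker gets a partner
    total_ref_words = sum(len(r) for r in refs)
    best = min(
        sum(_word_errors(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / max(total_ref_words, 1)

# Speaker labels differ but per-speaker content matches, so cpWER is 0.0.
ref = {"A": "hello there everyone", "B": "good morning"}
hyp = {"spk1": "good morning", "spk0": "hello there everyone"}
print(cp_wer(ref, hyp))
```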