---
language:
- en
- zh
license: mit
pipeline_tag: automatic-speech-recognition
tags:
- ASR
- Transcription
- Diarization
- Speech-to-Text
library_name: transformers
---
## VibeVoice-ASR
**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords**.
➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)<br>
➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)
<p align="left">
<img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px">
</p>
## 🔥 Key Features
- **🕒 60-minute Single-Pass Processing**:
Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice-ASR accepts up to **60 minutes** of continuous audio within a 64K-token context window. This keeps speaker tracking and semantics consistent across the entire hour.
- **👤 Customized Hotwords**:
Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.
- **📝 Rich Transcription (Who, When, What)**:
The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
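To make the "Who, When, What" structure concrete, here is a minimal sketch of how such a transcript could be represented and rendered downstream. The `Segment` schema, the `render` layout, and the `tokens_per_second` helper are hypothetical illustrations, not the model's actual output format or API; the token-rate figure additionally assumes that "64K" means 64,000 tokens and that the whole window is spent on audio, neither of which is stated here.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript entry (hypothetical schema, not the model's real output)."""
    speaker: str  # who
    start: float  # when: start time in seconds
    end: float    # when: end time in seconds
    text: str     # what


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.cc, working in integer centiseconds to avoid '60.00'."""
    total_cs = round(seconds * 100)
    minutes, cs = divmod(total_cs, 6000)
    return f"{minutes:02d}:{cs // 100:02d}.{cs % 100:02d}"


def render(segments: list[Segment]) -> str:
    """Sort segments by start time and print one 'when / who / what' line each."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        span = f"[{format_timestamp(seg.start)} - {format_timestamp(seg.end)}]"
        lines.append(f"{span} {seg.speaker}: {seg.text}")
    return "\n".join(lines)


def tokens_per_second(context_tokens: int = 64_000, audio_seconds: int = 60 * 60) -> float:
    """Average token budget per second of audio (assumes all tokens go to audio)."""
    return context_tokens / audio_seconds


if __name__ == "__main__":
    demo = [
        Segment("Speaker 2", 75.5, 80.0, "Sure."),
        Segment("Speaker 1", 0.0, 3.25, "Hello."),
    ]
    print(render(demo))
    print(f"{tokens_per_second():.1f} tokens per second of audio")
```

Under these assumptions the budget works out to under 20 tokens per second of audio, which gives a rough sense of how compact the audio representation has to be for an hour to fit in one pass.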
## Evaluation
<p align="center">
<img src="figures/DER.jpg" alt="DER" width="70%">
<img src="figures/cpWER.jpg" alt="cpWER" width="70%">
<img src="figures/tcpWER.jpg" alt="tcpWER" width="70%">
</p>
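For readers unfamiliar with the metrics in these figures: cpWER (concatenated minimum-permutation word error rate) concatenates each speaker's words, then scores the speaker assignment that yields the fewest word errors, so it penalizes both recognition and speaker-attribution mistakes. The sketch below illustrates that definition only. It is not the scoring pipeline used for the results above, it skips text normalization, and its brute-force search over permutations is practical only for a handful of speakers.

```python
from itertools import permutations


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cp_wer(ref_by_speaker: dict[str, str], hyp_by_speaker: dict[str, str]) -> float:
    """Illustrative cpWER: each value is one speaker's concatenated transcript."""
    refs = [text.split() for text in ref_by_speaker.values()]
    hyps = [text.split() for text in hyp_by_speaker.values()]

    # Pad with empty speakers so both sides have the same number of streams.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference transcript is empty")

    # Try every speaker mapping and keep the one with the fewest word errors.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


if __name__ == "__main__":
    reference = {"A": "hello there", "B": "good morning everyone"}
    hypothesis = {"spk0": "good morning everyone", "spk1": "hello their"}
    print(cp_wer(reference, hypothesis))
```

In the demo, the hypothesis speaker labels differ from the reference labels, but the best mapping pairs them correctly, leaving a single substituted word out of five reference words.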
## Installation and Usage
Please refer to the [GitHub README](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md#installation).
## License
This project is licensed under the MIT License.
## Contact
This project was conducted by members of Microsoft Research. We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at VibeVoice@microsoft.com.
If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.