arxiv:2601.18184

VIBEVOICE-ASR Technical Report

Published on Jan 26

· Submitted by

Furu Wei on Jan 27

Microsoft

Upvote

Authors:

Zhiliang Peng ,

Yaoyao Chang ,

Yingbo Hao ,

Wenhui Wang ,

Weijiang Xu ,

Zehua Wang ,

Zewen Chi ,

Abstract

VibeVoice-ASR is a unified end-to-end speech understanding framework that processes long-form audio in a single pass while supporting multilingual, code-switching, and domain-specific context injection.

AI-generated summary

This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent advancements in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASRsupports single-pass processing for up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized conetxt, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.

View arXiv page View PDF Add to collection

Community

frontierai

Paper submitter about 16 hours ago

VibeVoice-ASR is a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.18184 in a dataset README.md to link it from this page.

Spaces citing this paper 3

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.