YaoyaoChang committed · Commit 2d0b945 · Parent(s): 4c769d9

update README
README.md CHANGED
## 🔥 Key Features

- **60-minute Single-Pass Processing**:
  Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice ASR accepts up to **60 minutes** of continuous audio input within a 64K token length. This ensures consistent speaker tracking and semantic coherence across the entire hour.

- **Customized Hotwords**:
  Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.

- **Rich Transcription (Who, When, What)**:
  The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
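The rich transcription described above can be pictured as a list of (speaker, start, end, text) segments. The `Segment` class and `format_transcript` helper below are a hypothetical illustration of that structure, not VibeVoice's actual output schema or API:

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized transcript segment: who spoke, when, and what was said."""
    speaker: str  # speaker label, e.g. "Speaker 1" (illustrative)
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # recognized text for this segment


def format_transcript(segments):
    """Render segments as human-readable 'who said what, and when' lines."""
    return "\n".join(
        f"[{s.start:07.2f}-{s.end:07.2f}] {s.speaker}: {s.text}"
        for s in segments
    )


segments = [
    Segment("Speaker 1", 0.00, 4.20, "Welcome to the show."),
    Segment("Speaker 2", 4.20, 9.75, "Thanks for having me."),
]
print(format_transcript(segments))
```

A structure like this makes downstream tasks (subtitle export, per-speaker word counts, searching by time range) straightforward, which is why joint ASR + diarization + timestamping output is useful compared with a flat transcript.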