microsoft/VibeVoice-1.5B

weiruan

#11

by Gao8 - opened Aug 27, 2025

base: refs/heads/main ← from: refs/pr/11

This PR is in draft mode

Files changed (1)

README.md +5 -6

@@ -1,12 +1,11 @@
 ---
+license: mit
 language:
 - en
 - zh
-license: mit
 pipeline_tag: text-to-speech
 tags:
 - Podcast
-library_name: transformers
 ---
 ## VibeVoice: A Frontier Open-Source Text-to-Speech Model
@@ -27,7 +26,7 @@ The model can synthesize speech up to **90 minutes** long with up to **4 distinc
   <img src="figures/Fig1.png" alt="VibeVoice Overview" height="250px">
 </p>
-## Training Details
+## Training details
 Transformer-based Large Language Model (LLM) integrated with specialized acoustic and semantic tokenizers and a diffusion-based decoding head.
 - LLM: [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) for this release.
 - Tokenizers:
@@ -43,9 +42,9 @@ Transformer-based Large Language Model (LLM) integrated with specialized acousti
 ## Models
 | Model | Context Length | Generation Length |  Weight |
 |-------|----------------|----------|----------|
-| VibeVoice-0.5B-Streaming | - | - | [HF link](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) |
+| VibeVoice-0.5B-Streaming | - | - | On the way |
 | VibeVoice-1.5B | 64K | ~90 min | You are here. |
-| VibeVoice-Large| 32K | ~45 min | Disabled |
+| VibeVoice-7B-Preview| 32K | ~45 min | [HF link](https://huggingface.co/WestZhang/VibeVoice-Large-pt) |
 ## Installation and Usage
@@ -53,7 +52,7 @@ Please refer to [GitHub README](https://github.com/microsoft/VibeVoice?tab=readm
 ## Responsible Usage
 ### Direct intended uses
-The VibeVoice model is limited to research purpose use exploring highly realistic audio dialogue generation detailed in the [tech report](https://arxiv.org/pdf/2508.19205).
+The VibeVoice model is limited to research purpose use exploring highly realistic audio dialogue generation detailed in the [tech report](https://github.com/microsoft/VibeVoice/blob/main/report/TechnicalReport.pdf).
 ### Out-of-scope uses
 Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by MIT License. Use to generate any text transcript. Furthermore, this release is not intended or licensed for any of the following scenarios: