weiruan
#11
by Gao8 - opened
README.md
CHANGED
@@ -1,12 +1,11 @@
 ---
+license: mit
 language:
 - en
 - zh
-license: mit
 pipeline_tag: text-to-speech
 tags:
 - Podcast
-library_name: transformers
 ---
 
 ## VibeVoice: A Frontier Open-Source Text-to-Speech Model
@@ -27,7 +26,7 @@ The model can synthesize speech up to **90 minutes** long with up to **4 distinc
 <img src="figures/Fig1.png" alt="VibeVoice Overview" height="250px">
 </p>
 
-## Training
+## Training details
 Transformer-based Large Language Model (LLM) integrated with specialized acoustic and semantic tokenizers and a diffusion-based decoding head.
 - LLM: [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) for this release.
 - Tokenizers:
@@ -43,9 +42,9 @@ Transformer-based Large Language Model (LLM) integrated with specialized acousti
 ## Models
 | Model | Context Length | Generation Length | Weight |
 |-------|----------------|----------|----------|
-| VibeVoice-0.5B-Streaming | - | - |
+| VibeVoice-0.5B-Streaming | - | - | On the way |
 | VibeVoice-1.5B | 64K | ~90 min | You are here. |
-| VibeVoice-
+| VibeVoice-7B-Preview| 32K | ~45 min | [HF link](https://huggingface.co/WestZhang/VibeVoice-Large-pt) |
 
 ## Installation and Usage
 
@@ -53,7 +52,7 @@ Please refer to [GitHub README](https://github.com/microsoft/VibeVoice?tab=readm
 
 ## Responsible Usage
 ### Direct intended uses
-The VibeVoice model is limited to research purpose use exploring highly realistic audio dialogue generation detailed in the [tech report](https://
+The VibeVoice model is limited to research purpose use exploring highly realistic audio dialogue generation detailed in the [tech report](https://github.com/microsoft/VibeVoice/blob/main/report/TechnicalReport.pdf).
 
 ### Out-of-scope uses
 Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by MIT License. Use to generate any text transcript. Furthermore, this release is not intended or licensed for any of the following scenarios: