docs: add README — layout, RAI, attribution
README.md
ADDED
@@ -0,0 +1,108 @@
---
license: mit
language:
- en
- zh
tags:
- text-to-speech
- speech-recognition
- tts
- asr
- voice-cloning
- long-form
- multi-speaker
- streaming
- mirror
pipeline_tag: text-to-speech
library_name: transformers
---

# VibeVoice: AEmotion Studio Mirror

This repository is a **MAESTRO-curated mirror** of Microsoft's [VibeVoice](https://github.com/microsoft/VibeVoice) family, with the long-form TTS inference code restored from the [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) fork. All weights, code, and assets remain under the upstream **MIT License**.

The mirror exists so that MAESTRO's downloader can fetch each variant (and its dependencies) from a single, predictably laid-out repo using `allow_patterns` filtering, instead of pulling from three separate Microsoft Hugging Face repos plus GitHub.
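
A minimal sketch of a filtered download of this kind, assuming the `huggingface_hub` package and its `snapshot_download` function. The helper names (`variant_patterns`, `is_selected`, `fetch_variant`) are hypothetical and are not MAESTRO's actual downloader; the repo id and folder names are taken from this README. The library import is deferred so the pattern helpers work without it installed.

```python
from fnmatch import fnmatch

REPO_ID = "AEmotionStudio/vibevoice-models"
VARIANTS = ("tts-1.5b", "asr-7b", "realtime-0.5b")


def variant_patterns(variant: str) -> list[str]:
    """Return glob patterns selecting one variant's folder (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    # fnmatch-style globbing: '*' also spans '/' and so covers subfolders
    # such as realtime-0.5b/voices/.
    return [f"{variant}/*"]


def is_selected(path: str, variant: str) -> bool:
    """Check locally whether a repo path would be kept for `variant`."""
    return any(fnmatch(path, pattern) for pattern in variant_patterns(variant))


def fetch_variant(variant: str, local_dir: str) -> str:
    """Download only the files for `variant`; returns the local folder path."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=variant_patterns(variant),
        local_dir=local_dir,
    )
```

Calling `fetch_variant("realtime-0.5b", "./models")` would, under these assumptions, pull the 0.5B weights together with the `voices/` presets and skip the other two variants.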

## Layout

```
vibevoice-models/
├── tts-1.5b/                 ← microsoft/VibeVoice-1.5B (5.4 GB)
│   ├── config.json
│   ├── preprocessor_config.json
│   ├── model-0000{1..3}-of-00003.safetensors
│   └── …
├── asr-7b/                   ← microsoft/VibeVoice-ASR (17.4 GB, legacy/research variant)
│   ├── config.json
│   ├── model-0000{1..8}-of-00008.safetensors
│   └── …
└── realtime-0.5b/            ← microsoft/VibeVoice-Realtime-0.5B (2.0 GB + 100 MB voices)
    ├── config.json
    ├── preprocessor_config.json
    ├── model.safetensors
    └── voices/               ← 25 baked-in voice presets (KV-cache .pt files, NOT model weights)
        ├── en-Carter_man.pt
        ├── en-Frank_man.pt
        └── … (23 more, grouped by language: de/en/fr/in/it/jp/kr/nl/pl/pt/sp)
```

### About `realtime-0.5b/voices/*.pt`

These are pickled KV-cache dictionaries: the output of running each voice prompt through the full model once, saved with `torch.save`. They are **not flat tensor maps**, so they cannot be converted to safetensors and must be loaded with `torch.load(..., weights_only=False)`. Each file is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
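
As a rough sketch of the loading step described above, assuming PyTorch is installed: the helper name `load_voice_preset` is hypothetical, and the internal structure of the returned object is not documented in this README, so inspect it before relying on any particular keys.

```python
from pathlib import Path


def load_voice_preset(path: str, device: str = "cpu"):
    """Load one pickled KV-cache voice preset (hypothetical helper)."""
    preset = Path(path)
    if preset.suffix != ".pt":
        raise ValueError(f"expected a .pt preset, got: {preset.name}")

    # Imported lazily so the path check above works without PyTorch installed.
    import torch

    # weights_only=False runs the full pickle machinery, which can execute
    # arbitrary code. Only load presets from a source you trust.
    return torch.load(preset, map_location=device, weights_only=False)
```

For example, `load_voice_preset("realtime-0.5b/voices/en-Carter_man.pt")` would return whatever object was originally passed to `torch.save`.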

### About `asr-7b/`

This is the legacy research variant of VibeVoice ASR, the one that runs cleanly on `transformers>=4.51.3,<5.0.0`. Microsoft also publishes a `microsoft/VibeVoice-ASR-HF` repo with the cleaner `apply_transcription_request` API, but that variant requires `transformers>=5.3.0`, which is not yet compatible with the rest of MAESTRO's model stack.
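
The version window above can be checked before loading the model. The following is a standard-library sketch only; the function names are hypothetical, and the parser deliberately handles just the leading numeric release segment rather than the full PEP 440 grammar.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple[int, int, int]:
    """Turn '4.51.3' or '4.52.0.dev0' into a (major, minor, patch) tuple."""
    parts = []
    for piece in v.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(match.group()))
    parts = (parts + [0, 0, 0])[:3]  # pad short versions, drop extra segments
    return (parts[0], parts[1], parts[2])


def supports_legacy_asr(transformers_version: str) -> bool:
    """True if the version falls in the >=4.51.3,<5.0.0 window quoted above."""
    return (4, 51, 3) <= parse_release(transformers_version) < (5, 0, 0)


def installed_transformers_ok() -> bool:
    """Check the locally installed transformers, if there is one."""
    try:
        return supports_legacy_asr(version("transformers"))
    except PackageNotFoundError:
        return False
```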

## Variant capabilities

| Variant | Task | Languages | Max length | Notes |
|---|---|---|---|---|
| `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags |
| `asr-7b` | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
| `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
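
To illustrate the `Speaker N:` script tags mentioned for `tts-1.5b`, here is a small validation sketch. The exact tag syntax (one `Speaker N: text` turn per line) and the helper name `parse_script` are assumptions for illustration; check the upstream inference code for the format it actually accepts.

```python
import re

MAX_SPEAKERS = 4  # limit stated in the table above
_TURN = re.compile(r"^Speaker (\d+):\s*(.+)$")


def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a script into (speaker_id, text) turns and enforce the speaker cap."""
    turns = []
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between turns
        match = _TURN.match(line)
        if match is None:
            raise ValueError(f"line lacks a 'Speaker N:' tag: {line!r}")
        turns.append((int(match.group(1)), match.group(2)))

    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers found, at most {MAX_SPEAKERS} allowed")
    return turns
```

Running such a check before synthesis catches untagged lines and scripts with too many speakers early, before a long generation run starts.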

## Responsible use (verbatim from upstream model cards)

> **VibeVoice is limited to research-purpose use exploring highly realistic audio dialogue generation.**
>
> The following are explicitly out of scope:
> - Voice impersonation without explicit, recorded consent
> - Disinformation or impersonation
> - Real-time or low-latency voice conversion for live deep-fakes
> - Generation in unsupported languages (non-English, non-Chinese)
> - Generation of background ambience, Foley, or music
> - Circumventing the watermark or audible disclaimer
>
> **We do not recommend using VibeVoice in commercial or real-world applications without further testing and development.** This model is intended for research and development purposes only.

To mitigate misuse, Microsoft has:
- Embedded an **audible disclaimer** ("This segment was generated by AI") in TTS outputs.
- Added an **imperceptible perceptual watermark** to all generated audio.
- Logged inference requests (hashed) for abuse-pattern detection.

These mitigations are baked into the released weights and are preserved in this mirror.

## Attribution

| Component | Source | License |
|---|---|---|
| Model weights | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
| Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
| Inference code (TTS-1.5B) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) (Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05) | MIT |

## Citation

```bibtex
@misc{peng2025vibevoicetechnicalreport,
  title         = {VibeVoice Technical Report},
  author        = {Zhiliang Peng and Jianwei Yu and Wenhui Wang and Yaoyao Chang and
                   Yutao Sun and Li Dong and Yi Zhu and Weijiang Xu and Hangbo Bao and
                   Zehua Wang and Shaohan Huang and Yan Xia and Furu Wei},
  year          = {2025},
  eprint        = {2508.19205},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2508.19205}
}
```