AEmotionStudio/vibevoice-models

+---
+license: mit
+language:
+  - en
+  - zh
+tags:
+  - text-to-speech
+  - speech-recognition
+  - tts
+  - asr
+  - voice-cloning
+  - long-form
+  - multi-speaker
+  - streaming
+  - mirror
+pipeline_tag: text-to-speech
+library_name: transformers
+---
+# VibeVoice — AEmotion Studio Mirror
+This repository is a **MAESTRO-curated mirror** of Microsoft's [VibeVoice](https://github.com/microsoft/VibeVoice) family, with the long-form-TTS inference code restored from the [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) fork. All weights, code, and assets remain under the upstream **MIT License**.
+It exists so MAESTRO's downloader can fetch each variant (and its dependencies) from a single, predictably laid-out repo using `allow_patterns` filtering, instead of pulling from three separate Microsoft Hugging Face repos plus GitHub.
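As a rough sketch of the per-variant fetch described above, the snippet below builds an `allow_patterns` list for one top-level folder and passes it to `huggingface_hub.snapshot_download`. The repo id `AEmotionStudio/vibevoice-models` is taken from this page's header. The helper names `variant_patterns` and `fetch_variant` are hypothetical and do not describe MAESTRO's actual downloader.

```python
from fnmatch import fnmatch

# Top-level folders listed in this README's layout.
VARIANTS = ("tts-1.5b", "asr-7b", "realtime-0.5b")


def variant_patterns(variant: str) -> list[str]:
    """Return glob patterns that select only one variant's folder."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return [f"{variant}/*"]


def selects(path: str, patterns: list[str]) -> bool:
    """Local check of which repo paths a pattern list would keep."""
    return any(fnmatch(path, pattern) for pattern in patterns)


def fetch_variant(variant: str, repo_id: str = "AEmotionStudio/vibevoice-models") -> str:
    """Download one variant and return the local snapshot directory."""
    # Imported lazily so the pattern helpers work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, allow_patterns=variant_patterns(variant))
```

Because `*` in these glob patterns also matches `/`, a pattern such as `realtime-0.5b/*` covers the nested `voices/` folder as well.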
+## Layout
+```
+vibevoice-models/
+├── tts-1.5b/           ← microsoft/VibeVoice-1.5B  (5.4 GB)
+│   ├── config.json
+│   ├── preprocessor_config.json
+│   ├── model-0000{1..3}-of-00003.safetensors
+│   └── …
+├── asr-7b/             ← microsoft/VibeVoice-ASR    (17.4 GB, legacy/research variant)
+│   ├── config.json
+│   ├── model-0000{1..8}-of-00008.safetensors
+│   └── …
+└── realtime-0.5b/      ← microsoft/VibeVoice-Realtime-0.5B  (2.0 GB + 100 MB voices)
+    ├── config.json
+    ├── preprocessor_config.json
+    ├── model.safetensors
+    └── voices/         ← 25 baked-in voice presets (KV-cache .pt files, NOT model weights)
+        ├── en-Carter_man.pt
+        ├── en-Frank_man.pt
+        └── … (23 more, grouped by language: de/en/fr/in/it/jp/kr/nl/pl/pt/sp)
+```
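The brace ranges in the tree (for example `model-0000{1..3}-of-00003.safetensors`) are shorthand for numbered shards. A minimal sketch that expands them, assuming the zero-padded five-digit naming shown above. `shard_names` is a hypothetical helper, not part of any upstream tooling.

```python
def shard_names(total: int) -> list[str]:
    """Expand the shard shorthand into explicit file names."""
    if total < 1:
        raise ValueError("total must be at least 1")
    return [f"model-{i:05d}-of-{total:05d}.safetensors" for i in range(1, total + 1)]


# Shard count for tts-1.5b as listed in the layout tree.
for name in shard_names(3):
    print(name)
```

Comparing this list against the files on disk is one way to check that a download is complete.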
+### About `realtime-0.5b/voices/*.pt`
+These are pickled KV-cache dictionaries: the output of running each voice prompt through the full model once, saved with `torch.save`. They are **not flat tensor maps**, so they cannot be converted to safetensors and must be loaded with `torch.load(..., weights_only=False)`. Each file is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
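Because `weights_only=False` runs the full unpickler, which can execute arbitrary code, it is worth restricting which files get loaded. The sketch below parses a preset filename and only loads `.pt` files that resolve inside the `voices/` directory. The `<lang>-<Name>_<gender>.pt` naming is inferred from the two example files in the tree and is an assumption. `parse_preset` and `load_preset` are hypothetical helpers.

```python
from pathlib import Path


def parse_preset(filename: str) -> dict[str, str]:
    """Split a preset name such as 'en-Carter_man.pt' into its parts."""
    stem = Path(filename).stem             # 'en-Carter_man'
    lang, _, rest = stem.partition("-")    # 'en', 'Carter_man'
    name, _, gender = rest.rpartition("_")
    if not (lang and name and gender):
        raise ValueError(f"unexpected preset name: {filename!r}")
    return {"lang": lang, "name": name, "gender": gender}


def load_preset(voices_dir: str, filename: str):
    """Load one preset, refusing paths that leave voices_dir."""
    root = Path(voices_dir).resolve()
    path = (root / filename).resolve()
    if path.suffix != ".pt" or root not in path.parents:
        raise ValueError(f"refusing to load {filename!r}")
    # Imported lazily so parse_preset works without torch installed.
    import torch

    # weights_only=False unpickles arbitrary objects, so only load trusted files.
    return torch.load(path, map_location="cpu", weights_only=False)
```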
+### About `asr-7b/`
+This is the legacy research variant of VibeVoice ASR, the one that runs cleanly on `transformers>=4.51.3,<5.0.0`. Microsoft also publishes a `microsoft/VibeVoice-ASR-HF` repo with the cleaner `apply_transcription_request` API, but that variant requires `transformers>=5.3.0`, which is not yet compatible with the rest of MAESTRO's model stack.
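The version window above can be checked before loading the model. This is a simplified sketch that compares only the numeric release segment. The function names are hypothetical, and for full PEP 440 handling `packaging.specifiers.SpecifierSet` is the more robust choice.

```python
import re


def parse_release(version: str) -> tuple[int, ...]:
    """Parse the leading numeric release, e.g. '4.51.3' -> (4, 51, 3)."""
    parts: list[int] = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
        if match.end() != len(piece):  # suffix such as 'rc1' ends the release segment
            break
    # Pad to three components so (5, 0) compares equal to (5, 0, 0).
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_legacy_asr(version: str) -> bool:
    """True when version falls inside >=4.51.3,<5.0.0."""
    return (4, 51, 3) <= parse_release(version) < (5, 0, 0)
```

In practice the argument would come from `transformers.__version__` or `importlib.metadata.version("transformers")`.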
+## Variant capabilities
+| Variant | Task | Languages | Max length | Notes |
+|---|---|---|---|---|
+| `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags |
+| `asr-7b` | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
+| `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
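The `Speaker N:` tags noted for `tts-1.5b` can be assembled programmatically. The sketch below renders a list of turns and enforces the four-speaker limit from the table. The 1-based numbering and the one-turn-per-line layout are assumptions that this README does not confirm, and `build_script` is a hypothetical helper.

```python
MAX_SPEAKERS = 4  # limit stated in the capabilities table


def build_script(turns: list[tuple[int, str]]) -> str:
    """Render (speaker_number, text) turns as 'Speaker N: text' lines."""
    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers exceeds the limit of {MAX_SPEAKERS}")
    if any(speaker < 1 for speaker in speakers):
        raise ValueError("speaker numbers are assumed to start at 1")
    return "\n".join(f"Speaker {speaker}: {text.strip()}" for speaker, text in turns)


print(build_script([(1, "Welcome to the show."), (2, "Glad to be here.")]))
```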
+## Responsible use (verbatim from upstream model cards)
+> **VibeVoice is limited to research-purpose use exploring highly realistic audio dialogue generation.**
+>
+> The following are explicitly out of scope:
+> - Voice impersonation without explicit, recorded consent
+> - Disinformation or impersonation
+> - Real-time or low-latency voice conversion for live deep-fakes
+> - Generation in unsupported languages (non-English, non-Chinese)
+> - Generation of background ambience, Foley, or music
+> - Circumventing the watermark or audible disclaimer
+>
+> **We do not recommend using VibeVoice in commercial or real-world applications without further testing and development.** This model is intended for research and development purposes only.
+To mitigate misuse, Microsoft has:
+- Embedded an **audible disclaimer** ("This segment was generated by AI") in TTS outputs.
+- Added an **imperceptible perceptual watermark** to all generated audio.
+- Logged inference requests (hashed) for abuse-pattern detection.
+Of these, the audible disclaimer and the watermark are baked into the released weights and are preserved in this mirror.
+## Attribution
+| Component | Source | License |
+|---|---|---|
+| Model weights | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
+| Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
+| Inference code (TTS-1.5B) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) — Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05 | MIT |
+## Citation
+```bibtex
+@misc{peng2025vibevoicetechnicalreport,
+  title  = {VibeVoice Technical Report},
+  author = {Zhiliang Peng and Jianwei Yu and Wenhui Wang and Yaoyao Chang and
+            Yutao Sun and Li Dong and Yi Zhu and Weijiang Xu and Hangbo Bao and
+            Zehua Wang and Shaohan Huang and Yan Xia and Furu Wei},
+  year   = {2025},
+  eprint = {2508.19205},
+  archivePrefix = {arXiv},
+  primaryClass  = {cs.CL},
+  url    = {https://arxiv.org/abs/2508.19205}
+}
+```