---
license: mit
language:
- en
- zh
tags:
- text-to-speech
- speech-recognition
- tts
- asr
- voice-cloning
- long-form
- multi-speaker
- streaming
- mirror
pipeline_tag: text-to-speech
library_name: transformers
---

# VibeVoice — AEmotion Studio Mirror

This repository is a **MAESTRO-curated mirror** of Microsoft's [VibeVoice](https://github.com/microsoft/VibeVoice) family, with the long-form-TTS inference code restored from the [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) fork. All weights, code, and assets remain under the upstream **MIT License**.

It exists so MAESTRO's downloader can fetch each variant (and its dependencies) from a single, predictably laid-out repo with `allow_patterns` filtering, instead of spreading across three separate Microsoft HF repos plus GitHub.

## Layout

```
vibevoice-models/
├── tts-1.5b/        ← microsoft/VibeVoice-1.5B (5.4 GB, 64K ctx, ~90 min max output)
│   ├── config.json
│   ├── preprocessor_config.json
│   ├── model-0000{1..3}-of-00003.safetensors
│   └── …
├── tts-large/       ← aoi-ot/VibeVoice-Large (17.6 GB, 32K ctx, ~45 min max output, premium 7B/9B backbone)
│   ├── config.json
│   ├── preprocessor_config.json
│   ├── configuration.json
│   ├── model-000{01..10}-of-00010.safetensors
│   └── …
├── asr-7b/          ← microsoft/VibeVoice-ASR (17.4 GB, legacy/research variant)
│   ├── config.json
│   ├── model-0000{1..8}-of-00008.safetensors
│   └── …
└── realtime-0.5b/   ← microsoft/VibeVoice-Realtime-0.5B (2.0 GB + 100 MB voices)
    ├── config.json
    ├── preprocessor_config.json
    ├── model.safetensors
    └── voices/      ← 25 baked-in voice presets (KV-cache .pt files, NOT model weights)
        ├── en-Carter_man.pt
        ├── en-Frank_man.pt
        ├── … (23 more, grouped by language: de/en/fr/in/it/jp/kr/nl/pl/pt/sp)
```

### About `tts-large/`

This is the premium 7B/9B variant of long-form multi-speaker TTS.
Microsoft originally published it at `microsoft/VibeVoice-Large` with the full MIT license, then **removed the repo on 2025-09-05** along with the demo scripts (the same RAI cleanup that removed `modeling_vibevoice_inference.py`). The MIT license remains in force on the released weights — this mirror sources from [`aoi-ot/VibeVoice-Large`](https://huggingface.co/aoi-ot/VibeVoice-Large), a community preserve uploaded on 2025-09-04 (one day before Microsoft's pull) that retains the full original release.

**Differences from `tts-1.5b/`:**

- Same `Speaker N:` script format, same voice-cloning workflow (pass per-speaker reference clips)
- Larger backbone → marginally better prosody and speaker consistency, at the cost of ~3× weight size and ~3× VRAM
- **Shorter** max output: 32K context = ~45 min vs the 1.5B's 64K context = ~90 min

The 1.5B is the better choice for long podcasts; the Large is the better choice for short, premium-quality dialog.

For users on smaller cards (12 GB / 16 GB), MAESTRO's runner auto-quantizes `tts-large/` to **NF4** via bitsandbytes at runtime (~8 GB working set, well-tested across the HF ecosystem). No separate pre-quantized variant is needed — the runtime path produces identical quality to a pre-quantized mirror, with fewer compatibility issues across `transformers` upgrades.

### About `realtime-0.5b/voices/*.pt`

These are pickled KV-cache dictionaries (the output of running each voice prompt through the full model once, then `torch.save`'d). They are **not flat tensor maps**, so they cannot be converted to safetensors — they must be loaded with `torch.load(..., weights_only=False)`. Each is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.

### About `asr-7b/`

This is the legacy research variant of VibeVoice ASR — the one that runs cleanly on `transformers>=4.51.3,<5.0.0`.
Microsoft also publishes a `microsoft/VibeVoice-ASR-HF` repo with the cleaner `apply_transcription_request` API, but that variant requires `transformers>=5.3.0`, which is not yet compatible with the rest of MAESTRO's model stack.

## Variant capabilities

| Variant | Task | Languages | Max length | Notes |
|---|---|---|---|---|
| `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags + voice cloning from per-speaker reference clips |
| `tts-large` | Text → speech | EN, ZH (multi-speaker) | ~45 min | Same workflow as 1.5B, premium 7B/9B backbone, higher prosody quality. Auto-quantizes to NF4 at runtime on smaller cards (~8 GB working set). |
| `asr-7b` | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
| `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |

## Responsible use (verbatim from upstream model cards)

> **VibeVoice is limited to research-purpose use exploring highly realistic audio dialogue generation.**
>
> The following are explicitly out of scope:
> - Voice impersonation without explicit, recorded consent
> - Disinformation or impersonation
> - Real-time or low-latency voice conversion for live deep-fakes
> - Generation in unsupported languages (non-English, non-Chinese)
> - Generation of background ambience, Foley, or music
> - Circumventing the watermark or audible disclaimer
>
> **We do not recommend using VibeVoice in commercial or real-world applications without further testing and development.** This model is intended for research and development purposes only.

To mitigate misuse, Microsoft has:

- Embedded an **audible disclaimer** ("This segment was generated by AI") in TTS outputs.
- Added an **imperceptible perceptual watermark** to all generated audio.
- Logged inference requests (hashed) for abuse-pattern detection.
These mitigations are baked into the released weights and are preserved in this mirror.

## Attribution

| Component | Source | License |
|---|---|---|
| Model weights — 1.5B / ASR / Realtime | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
| Model weights — 7B Large | [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large) — community preserve of the now-removed `microsoft/VibeVoice-Large` (uploaded 2025-09-04, one day before Microsoft's pull) | MIT (preserved) |
| Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
| Inference code (TTS variants + ASR + Realtime) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) — Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05 | MIT |

## Citation

```bibtex
@misc{peng2025vibevoicetechnicalreport,
  title         = {VibeVoice Technical Report},
  author        = {Zhiliang Peng and Jianwei Yu and Wenhui Wang and Yaoyao Chang and Yutao Sun and Li Dong and Yi Zhu and Weijiang Xu and Hangbo Bao and Zehua Wang and Shaohan Huang and Yan Xia and Furu Wei},
  year          = {2025},
  eprint        = {2508.19205},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2508.19205}
}
```
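## Fetching a single variant

The single-repo layout exists so that one variant can be pulled with `allow_patterns` filtering rather than downloading the whole tree. A minimal sketch using `huggingface_hub.snapshot_download` — the repo id `your-org/vibevoice-models`, the `VARIANTS` map, and the helper names are illustrative assumptions, not part of the official tooling:

```python
# Sketch: download one VibeVoice variant subfolder via allow_patterns.
# "your-org/vibevoice-models" is a placeholder repo id, not this mirror's real id.

VARIANTS = {
    # Subfolder prefixes from the Layout section above; huggingface_hub matches
    # allow_patterns with fnmatch, so "tts-1.5b/*" also covers nested files.
    "tts-1.5b": ["tts-1.5b/*"],
    "tts-large": ["tts-large/*"],
    "asr-7b": ["asr-7b/*"],
    "realtime-0.5b": ["realtime-0.5b/*"],  # includes realtime-0.5b/voices/*.pt
}


def patterns_for(variant: str) -> list[str]:
    """Return the allow_patterns list for one variant subfolder."""
    try:
        return VARIANTS[variant]
    except KeyError:
        raise ValueError(f"unknown variant: {variant!r}") from None


def fetch_variant(repo_id: str, variant: str) -> str:
    """Download only the chosen variant; returns the local snapshot path."""
    # Imported lazily so patterns_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, allow_patterns=patterns_for(variant))
```

With this, `fetch_variant("your-org/vibevoice-models", "tts-1.5b")` pulls roughly 5.4 GB instead of the full ~42 GB tree.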