---
license: mit
language:
- en
- zh
tags:
- text-to-speech
- speech-recognition
- tts
- asr
- voice-cloning
- long-form
- multi-speaker
- streaming
- mirror
pipeline_tag: text-to-speech
library_name: transformers
---
# VibeVoice — AEmotion Studio Mirror

This repository is a **MAESTRO-curated mirror** of Microsoft's [VibeVoice](https://github.com/microsoft/VibeVoice) family, with the long-form-TTS inference code restored from the [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) fork. All weights, code, and assets remain under the upstream **MIT License**.

It exists so MAESTRO's downloader can fetch each variant (and its dependencies) from a single, predictably laid-out repo with `allow_patterns` filtering, instead of spreading across three separate Microsoft HF repos plus GitHub.

## Layout
```
vibevoice-models/
├── tts-1.5b/          ← microsoft/VibeVoice-1.5B (5.4 GB, 64K ctx, ~90 min max output)
│   ├── config.json
│   ├── preprocessor_config.json
│   ├── model-0000{1..3}-of-00003.safetensors
│   └── …
├── tts-large/         ← aoi-ot/VibeVoice-Large (17.6 GB, 32K ctx, ~45 min max output, premium 7B/9B backbone)
│   ├── config.json
│   ├── preprocessor_config.json
│   ├── configuration.json
│   ├── model-000{01..10}-of-00010.safetensors
│   └── …
├── asr-7b/            ← microsoft/VibeVoice-ASR (17.4 GB, legacy/research variant)
│   ├── config.json
│   ├── model-0000{1..8}-of-00008.safetensors
│   └── …
└── realtime-0.5b/     ← microsoft/VibeVoice-Realtime-0.5B (2.0 GB + 100 MB voices)
    ├── config.json
    ├── preprocessor_config.json
    ├── model.safetensors
    └── voices/        ← 25 baked-in voice presets (KV-cache .pt files, NOT model weights)
        ├── en-Carter_man.pt
        ├── en-Frank_man.pt
        └── … (23 more, grouped by language: de/en/fr/in/it/jp/kr/nl/pl/pt/sp)
```
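Because each variant lives in its own subtree, a single-variant fetch only needs glob patterns scoped to that directory. A minimal sketch of building such `allow_patterns` (the helper name is hypothetical; the real MAESTRO downloader may differ):

```python
def variant_patterns(variant: str) -> list[str]:
    """Glob patterns that select one variant's subtree from this mirror."""
    known = {"tts-1.5b", "tts-large", "asr-7b", "realtime-0.5b"}
    if variant not in known:
        raise ValueError(f"unknown variant: {variant}")
    return [f"{variant}/*"]

# With huggingface_hub installed, the fetch would look like (not run here):
#   snapshot_download(repo_id="<this-mirror>", allow_patterns=variant_patterns("tts-1.5b"))
print(variant_patterns("realtime-0.5b"))
```

The same pattern list works for both `snapshot_download` and the `huggingface-cli download` `--include` flag, which is what makes the single-repo layout convenient.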
### About `tts-large/`

This is the premium 7B/9B variant for long-form multi-speaker TTS. Microsoft originally published it at `microsoft/VibeVoice-Large` under the full MIT license, then **removed the repo on 2025-09-05** along with the demo scripts (the same RAI cleanup that removed `modeling_vibevoice_inference.py`). The MIT license remains in force on the released weights — this mirror sources from [`aoi-ot/VibeVoice-Large`](https://huggingface.co/aoi-ot/VibeVoice-Large), a community preserve uploaded on 2025-09-04 (one day before Microsoft's pull) that retains the full original release.
**Differences from `tts-1.5b/`:**

- Same `Speaker N:` script format and the same voice-cloning workflow (pass per-speaker reference clips)
- Larger backbone → marginally better prosody and speaker consistency, at the cost of ~3× weight size and ~3× VRAM
- **Shorter** max output: 32K context = ~45 min, vs the 1.5B's 64K context = ~90 min. The 1.5B is the better choice for long podcasts; the Large is the better choice for short, premium-quality dialog.
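For reference, the shared `Speaker N:` script format looks like this (illustrative dialog, not taken from the upstream docs):

```
Speaker 1: Welcome back — today we're talking about long-form synthesis.
Speaker 2: Happy to be here. Shall we start with the 1.5B model?
Speaker 1: Let's do it.
```

Each `Speaker N:` tag maps to the reference clip supplied for that speaker when voice cloning.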
For users on smaller cards (12 GB / 16 GB), MAESTRO's runner auto-quantizes `tts-large/` to **NF4** via bitsandbytes at runtime (~8 GB working set, well-tested across the HF ecosystem). No separate pre-quantized variant is needed — the runtime path produces identical quality to a pre-quantized mirror, with fewer compatibility issues across `transformers` upgrades.
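As a rough sketch, the runtime NF4 path corresponds to the standard bitsandbytes 4-bit settings widely used across the HF ecosystem (the exact flags MAESTRO's runner passes are an assumption; shown as a plain dict so the sketch stands alone):

```python
# Assumed standard bitsandbytes 4-bit (NF4) settings; MAESTRO's actual
# runner flags are not documented in this mirror.
nf4_settings = {
    "load_in_4bit": True,                  # store weights in 4-bit
    "bnb_4bit_quant_type": "nf4",          # NormalFloat4 data type
    "bnb_4bit_use_double_quant": True,     # also quantize the quantization constants
    "bnb_4bit_compute_dtype": "bfloat16",  # matmuls still run in bf16
}
# With transformers installed this corresponds to
# BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", ...)
# passed as quantization_config= when loading the model.
print(nf4_settings["bnb_4bit_quant_type"])
```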
### About `realtime-0.5b/voices/*.pt`

These are pickled KV-cache dictionaries (the output of running each voice prompt through the full model once, then `torch.save`'d). They are **not flat tensor maps**, so they cannot be converted to safetensors — they must be loaded with `torch.load(..., weights_only=False)`. Each is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
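A stdlib-only sketch of why these presets need full pickle loading (plain lists stand in for tensors so the example runs without torch; `torch.save`/`torch.load` use pickle underneath):

```python
import io
import pickle

# Toy stand-in for a voice preset: a nested KV-cache dict. The real files
# hold torch tensors, but the shape of the problem is the same.
preset = {
    "kv_cache": [{"k": [0.1, 0.2], "v": [0.3, 0.4]}],
    "sample_rate": 24000,
}

buf = io.BytesIO()
pickle.dump(preset, buf)     # analogous to torch.save(preset, path)
buf.seek(0)
restored = pickle.load(buf)  # analogous to torch.load(path, weights_only=False)

# safetensors stores a flat {name: tensor} map; a nested dict of cache
# entries has no such flattening, which is why pickle is required here.
assert restored == preset
```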
### About `asr-7b/`

This is the legacy research variant of VibeVoice ASR — the one that runs cleanly on `transformers>=4.51.3,<5.0.0`. Microsoft also publishes a `microsoft/VibeVoice-ASR-HF` repo with the cleaner `apply_transcription_request` API, but that variant requires `transformers>=5.3.0`, which is not yet compatible with the rest of MAESTRO's model stack.
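Expressed as a requirements pin (illustrative; the actual dependency file MAESTRO uses is not part of this mirror):

```
transformers>=4.51.3,<5.0.0
```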
## Variant capabilities

| Variant | Task | Languages | Max length | Notes |
|---|---|---|---|---|
| `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags + voice cloning from per-speaker reference clips |
| `tts-large` | Text → speech | EN, ZH (multi-speaker) | ~45 min | Same workflow as 1.5B, premium 7B/9B backbone, higher prosody quality. Auto-quantizes to NF4 at runtime on smaller cards (~8 GB working set). |
| `asr-7b` | Speech → text | 50+ languages, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
| `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
## Responsible use (verbatim from upstream model cards)

> **VibeVoice is limited to research-purpose use exploring highly realistic audio dialogue generation.**
>
> The following are explicitly out of scope:
> - Voice impersonation without explicit, recorded consent
> - Disinformation or impersonation
> - Real-time or low-latency voice conversion for live deep-fakes
> - Generation in unsupported languages (non-English, non-Chinese)
> - Generation of background ambience, Foley, or music
> - Circumventing the watermark or audible disclaimer
>
> **We do not recommend using VibeVoice in commercial or real-world applications without further testing and development.** This model is intended for research and development purposes only.

To mitigate misuse, Microsoft has:

- Embedded an **audible disclaimer** ("This segment was generated by AI") in TTS outputs.
- Added an **imperceptible perceptual watermark** to all generated audio.
- Logged inference requests (hashed) for abuse-pattern detection.

These mitigations are baked into the released weights and are preserved in this mirror.
## Attribution

| Component | Source | License |
|---|---|---|
| Model weights — 1.5B / ASR / Realtime | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
| Model weights — 7B Large | [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large) — community preserve of the now-removed `microsoft/VibeVoice-Large` (uploaded 2025-09-04, one day before Microsoft's pull) | MIT (preserved) |
| Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
| Inference code (TTS variants + ASR + Realtime) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) — Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05 | MIT |
## Citation

```bibtex
@misc{peng2025vibevoicetechnicalreport,
  title         = {VibeVoice Technical Report},
  author        = {Zhiliang Peng and Jianwei Yu and Wenhui Wang and Yaoyao Chang and
                   Yutao Sun and Li Dong and Yi Zhu and Weijiang Xu and Hangbo Bao and
                   Zehua Wang and Shaohan Huang and Yan Xia and Furu Wei},
  year          = {2025},
  eprint        = {2508.19205},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2508.19205}
}
```