AEmotionStudio/vibevoice-models

@@ -27,11 +27,17 @@ It exists so MAESTRO's downloader can fetch each variant (and its dependencies)
 ```
 vibevoice-models/
-├── tts-1.5b/           ← microsoft/VibeVoice-1.5B  (5.4 GB)
 │   ├── config.json
 │   ├── preprocessor_config.json
 │   ├── model-0000{1..3}-of-00003.safetensors
 │   └── …
 ├── asr-7b/             ← microsoft/VibeVoice-ASR    (17.4 GB, legacy/research variant)
 │   ├── config.json
 │   ├── model-0000{1..8}-of-00008.safetensors
@@ -46,6 +52,15 @@ vibevoice-models/
         ├── … (23 more, grouped by language: de/en/fr/in/it/jp/kr/nl/pl/pt/sp)
 ```
 ### About `realtime-0.5b/voices/*.pt`
 These are pickled KV-cache dictionaries (output of running each voice prompt through the full model once, then `torch.save`'d). They are **not flat tensor maps** so they cannot be converted to safetensors — they must be loaded with `torch.load(..., weights_only=False)`. Each is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
@@ -58,7 +73,8 @@ This is the legacy research variant of VibeVoice ASR — the one that runs clean
 | Variant | Task | Languages | Max length | Notes |
 |---|---|---|---|---|
-| `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags |
 | `asr-7b` | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
 | `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
@@ -87,9 +103,10 @@ These mitigations are baked into the released weights and are preserved in this
 | Component | Source | License |
 |---|---|---|
-| Model weights | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
 | Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
-| Inference code (TTS-1.5B) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) — Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05 | MIT |
 ## Citation

 ```
 vibevoice-models/
+├── tts-1.5b/           ← microsoft/VibeVoice-1.5B  (5.4 GB, 64K ctx, ~90 min max output)
 │   ├── config.json
 │   ├── preprocessor_config.json
 │   ├── model-0000{1..3}-of-00003.safetensors
 │   └── …
+├── tts-large/          ← aoi-ot/VibeVoice-Large    (17.6 GB, 32K ctx, ~45 min max output, premium 7B/9B backbone)
+│   ├── config.json
+│   ├── preprocessor_config.json
+│   ├── configuration.json
+│   ├── model-0000{1..10}-of-00010.safetensors
+│   └── …
 ├── asr-7b/             ← microsoft/VibeVoice-ASR    (17.4 GB, legacy/research variant)
 │   ├── config.json
 │   ├── model-0000{1..8}-of-00008.safetensors
         ├── … (23 more, grouped by language: de/en/fr/in/it/jp/kr/nl/pl/pt/sp)
 ```
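
Because each variant lives in its own top-level folder, a client can fetch just one of them. The following is a minimal sketch of that idea, not MAESTRO's actual downloader: the repo id is assumed from the page header, the helper names `variant_patterns`, `would_fetch`, and `download_variant` are hypothetical, and it does not resolve the dependencies mentioned above. It relies on `huggingface_hub.snapshot_download` and its `allow_patterns` filter.

```python
from fnmatch import fnmatch

# Assumption: repo id taken from the page header, not confirmed by the README text.
REPO_ID = "AEmotionStudio/vibevoice-models"

# Top-level folders named in the layout above.
VARIANTS = ("tts-1.5b", "tts-large", "asr-7b", "realtime-0.5b")


def variant_patterns(variant):
    """Return glob patterns that select every file under one variant folder."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return [f"{variant}/*"]


def would_fetch(repo_path, variant):
    """Check whether a repo-relative path is selected by the variant's patterns."""
    return any(fnmatch(repo_path, pattern) for pattern in variant_patterns(variant))


def download_variant(variant, local_dir="models"):
    """Download a single variant folder (hypothetical helper, needs network access)."""
    # Imported lazily so the pattern helpers work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=variant_patterns(variant),
        local_dir=local_dir,
    )
```

Calling `download_variant("tts-1.5b")` would then place the files under `models/tts-1.5b/`. Since `*` in these patterns also matches `/`, nested files such as the presets under `realtime-0.5b/voices/` are selected as well.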
+### About `tts-large/`
+This is the premium 7B/9B variant of the long-form multi-speaker TTS model. Microsoft originally published it at `microsoft/VibeVoice-Large` under the MIT license, then **removed the repo on 2025-09-05** along with the demo scripts (the same RAI cleanup that removed `modeling_vibevoice_inference.py`). The MIT license remains in force on the released weights. This mirror sources from [`aoi-ot/VibeVoice-Large`](https://huggingface.co/aoi-ot/VibeVoice-Large), a community-preserved copy uploaded on 2025-09-04 (one day before Microsoft's pull) that retains the full original release.
+**Differences from `tts-1.5b/`:**
+- Same `Speaker N:` script format, same voice-cloning workflow (pass per-speaker reference clips)
+- Larger backbone → marginally better prosody and speaker consistency at the cost of ~3× weight size and ~3× VRAM
+- **Shorter** max output: 32K context = ~45 min vs the 1.5B's 64K context = ~90 min. The 1.5B is the better choice for long podcasts; the Large is the better choice for short premium-quality dialog.
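
The trade-off in the list above can be expressed as a small selection rule. This is an illustrative sketch only: the approximate output limits (~90 min and ~45 min) come from the comparison, while the function name `pick_tts_variant` and the `prefer_quality` flag are hypothetical.

```python
# Approximate maximum output per variant, as stated in the comparison above.
MAX_MINUTES = {
    "tts-1.5b": 90,   # 64K context
    "tts-large": 45,  # 32K context
}


def pick_tts_variant(target_minutes, prefer_quality=True):
    """Pick a TTS variant folder for a script of roughly target_minutes."""
    fits = [name for name, limit in MAX_MINUTES.items() if target_minutes <= limit]
    if not fits:
        raise ValueError(
            f"{target_minutes} min exceeds the ~{max(MAX_MINUTES.values())} min limit; "
            "split the script into several generations"
        )
    # The Large backbone is only an option for shorter outputs.
    if prefer_quality and "tts-large" in fits:
        return "tts-large"
    return "tts-1.5b"
```

For example, a 30-minute dialog resolves to `tts-large` by default, while a 60-minute podcast can only be served by `tts-1.5b`.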
 ### About `realtime-0.5b/voices/*.pt`
 These are pickled KV-cache dictionaries (output of running each voice prompt through the full model once, then `torch.save`'d). They are **not flat tensor maps** so they cannot be converted to safetensors — they must be loaded with `torch.load(..., weights_only=False)`. Each is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
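
A minimal sketch of loading one of these presets, assuming PyTorch is installed. The helper name `load_voice_preset` and the path in the usage note are hypothetical, and the structure of the returned object is not documented here, so the sketch returns it without inspecting it. Note that `weights_only=False` lets the pickle run arbitrary code on load, so it should only be used on files from a trusted source.

```python
from pathlib import Path


def load_voice_preset(path):
    """Load a pickled voice preset (.pt) onto the CPU and return it unchanged."""
    path = Path(path)
    if path.suffix != ".pt":
        raise ValueError(f"expected a .pt preset, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(path)

    # Imported lazily so the path checks above work without PyTorch installed.
    import torch

    # weights_only=False is required because the file is not a flat tensor map.
    # It unpickles arbitrary Python objects: only load presets you trust.
    return torch.load(path, map_location="cpu", weights_only=False)
```

Usage would look like `preset = load_voice_preset("realtime-0.5b/voices/<preset>.pt")`, where `<preset>` stands for one of the 25 shipped files.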
 | Variant | Task | Languages | Max length | Notes |
 |---|---|---|---|---|
+| `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags + voice cloning from per-speaker reference clips |
+| `tts-large` | Text → speech | EN, ZH (multi-speaker) | ~45 min | Same workflow as 1.5B, premium 7B/9B backbone, higher prosody quality |
 | `asr-7b` | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
 | `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
 | Component | Source | License |
 |---|---|---|
+| Model weights — 1.5B / ASR / Realtime | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
+| Model weights — 7B Large | [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large) — community preserve of the now-removed `microsoft/VibeVoice-Large` (uploaded 2025-09-04, one day before Microsoft's pull) | MIT (preserved) |
 | Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
+| Inference code (TTS variants + ASR + Realtime) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) — Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05 | MIT |
 ## Citation