docs: README — add tts-large/ subfolder + aoi-ot attribution
README.md (changed)
@@ -27,11 +27,17 @@ It exists so MAESTRO's downloader can fetch each variant (and its dependencies)
 
 ```
 vibevoice-models/
-├── tts-1.5b/ ← microsoft/VibeVoice-1.5B (5.4 GB)
+├── tts-1.5b/ ← microsoft/VibeVoice-1.5B (5.4 GB, 64K ctx, ~90 min max output)
 │ ├── config.json
 │ ├── preprocessor_config.json
 │ ├── model-0000{1..3}-of-00003.safetensors
 │ └── …
+├── tts-large/ ← aoi-ot/VibeVoice-Large (17.6 GB, 32K ctx, ~45 min max output, premium 7B/9B backbone)
+│ ├── config.json
+│ ├── preprocessor_config.json
+│ ├── configuration.json
+│ ├── model-0000{1..10}-of-00010.safetensors
+│ └── …
 ├── asr-7b/ ← microsoft/VibeVoice-ASR (17.4 GB, legacy/research variant)
 │ ├── config.json
 │ ├── model-0000{1..8}-of-00008.safetensors
@@ -46,6 +52,15 @@ vibevoice-models/
 ├── … (23 more, grouped by language: de/en/fr/in/it/jp/kr/nl/pl/pt/sp)
 ```
 
+### About `tts-large/`
+
+This is the premium 7B/9B variant of long-form multi-speaker TTS. Microsoft originally published it at `microsoft/VibeVoice-Large` with full MIT license, then **removed the repo on 2025-09-05** along with the demo scripts (same RAI cleanup that removed `modeling_vibevoice_inference.py`). The MIT license remains in force on the released weights — this mirror sources from [`aoi-ot/VibeVoice-Large`](https://huggingface.co/aoi-ot/VibeVoice-Large), a community preserve uploaded on 2025-09-04 (one day before Microsoft's pull) that retains the full original release.
+
+**Differences from `tts-1.5b/`:**
+- Same `Speaker N:` script format, same voice-cloning workflow (pass per-speaker reference clips)
+- Larger backbone → marginally better prosody and speaker consistency at the cost of ~3× weight size and ~3× VRAM
+- **Shorter** max output: 32K context = ~45 min vs the 1.5B's 64K context = ~90 min. The 1.5B is the better choice for long podcasts; the Large is the better choice for short premium-quality dialog.
+
 ### About `realtime-0.5b/voices/*.pt`
 
 These are pickled KV-cache dictionaries (output of running each voice prompt through the full model once, then `torch.save`'d). They are **not flat tensor maps** so they cannot be converted to safetensors — they must be loaded with `torch.load(..., weights_only=False)`. Each is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
@@ -58,7 +73,8 @@ This is the legacy research variant of VibeVoice ASR — the one that runs clean
 
 | Variant | Task | Languages | Max length | Notes |
 |---|---|---|---|---|
-| `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags |
+| `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags + voice cloning from per-speaker reference clips |
+| `tts-large` | Text → speech | EN, ZH (multi-speaker) | ~45 min | Same workflow as 1.5B, premium 7B/9B backbone, higher prosody quality |
 | `asr-7b` | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
 | `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
 
@@ -87,9 +103,10 @@ These mitigations are baked into the released weights and are preserved in this
 
 | Component | Source | License |
 |---|---|---|
-| Model weights | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
+| Model weights — 1.5B / ASR / Realtime | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
+| Model weights — 7B Large | [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large) — community preserve of the now-removed `microsoft/VibeVoice-Large` (uploaded 2025-09-04, one day before Microsoft's pull) | MIT (preserved) |
 | Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
-| Inference code (TTS
+| Inference code (TTS variants + ASR + Realtime) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) — Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05 | MIT |
 
 ## Citation
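
The `tts-large/` notes added in this diff pair a 64K context with ~90 min of output and a 32K context with ~45 min, which works out to the same number of minutes per context token for both variants. The sketch below encodes those two figures and picks a variant for a target duration. It assumes that "K" means 1024 tokens and that the README's approximate limits can be treated as hard caps; `TtsVariant` and `pick_variant` are hypothetical names for illustration, not part of any VibeVoice API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsVariant:
    name: str
    context_tokens: int   # assumption: "32K" / "64K" read as multiples of 1024
    max_minutes: float    # approximate figure quoted in the README

    @property
    def minutes_per_1k_tokens(self) -> float:
        """Audio minutes that one block of 1024 context tokens accounts for."""
        return self.max_minutes / (self.context_tokens / 1024)


# Ordered by preference: the README recommends the Large variant for short,
# premium-quality dialog and the 1.5B variant for long podcasts.
VARIANTS = (
    TtsVariant("tts-large", 32 * 1024, 45.0),
    TtsVariant("tts-1.5b", 64 * 1024, 90.0),
)


def pick_variant(target_minutes: float) -> str:
    """Return the first (most preferred) variant whose limit covers the target."""
    for variant in VARIANTS:
        if target_minutes <= variant.max_minutes:
            return variant.name
    raise ValueError(
        f"{target_minutes} min exceeds every variant's stated maximum; "
        "split the script into shorter parts"
    )
```

For example, `pick_variant(30)` selects `tts-large`, while `pick_variant(60)` falls through to `tts-1.5b` because 60 minutes is past the Large variant's stated limit.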
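
The `realtime-0.5b/voices/*.pt` paragraph in this diff says the presets must be loaded with `torch.load(..., weights_only=False)`. A minimal sketch of doing that and inspecting the result follows. The internal layout of the preset dictionaries is not described in the README, so `describe` simply reports whatever nesting it finds; both function names are hypothetical, and `map_location="cpu"` is an added assumption to avoid requiring a GPU for inspection.

```python
from typing import Any, List


def load_voice_preset(path: str) -> Any:
    """Load one pickled voice preset onto the CPU.

    weights_only=False enables full unpickling, which can execute arbitrary
    code, so only call this on files from a source you trust.
    """
    import torch  # imported lazily so describe() works without PyTorch

    return torch.load(path, map_location="cpu", weights_only=False)


def describe(obj: Any, depth: int = 0, max_depth: int = 3) -> List[str]:
    """Return indented lines summarising a nested preset structure."""
    pad = "  " * depth
    shape = getattr(obj, "shape", None)
    if shape is not None:
        # Tensors (or anything tensor-like) are reported by type and shape.
        return [f"{pad}{type(obj).__name__} shape={tuple(shape)}"]
    if depth >= max_depth:
        return [f"{pad}{type(obj).__name__} ..."]
    if isinstance(obj, dict):
        lines: List[str] = []
        for key, value in obj.items():
            lines.append(f"{pad}{key!r}:")
            lines.extend(describe(value, depth + 1, max_depth))
        return lines
    if isinstance(obj, (list, tuple)):
        lines = [f"{pad}{type(obj).__name__} len={len(obj)}"]
        for item in obj[:2]:  # show only the first two items to keep output short
            lines.extend(describe(item, depth + 1, max_depth))
        return lines
    return [f"{pad}{type(obj).__name__}"]
```

A typical call would be `print("\n".join(describe(load_voice_preset(p))))`, where `p` points at one of the `.pt` files under `realtime-0.5b/voices/`.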
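
The variants table in this diff states that the TTS variants accept up to 4 speakers via `Speaker N:` script tags. The sketch below checks a script against that limit before it is handed to a model. It assumes each turn sits on its own line and starts with `Speaker <integer>:`; the README does not spell out the exact parsing rules of the inference code, so `parse_script` and `MAX_SPEAKERS` are illustrative names only.

```python
import re
from typing import List, Tuple

MAX_SPEAKERS = 4  # limit stated in the variants table
_TAG = re.compile(r"^Speaker\s+(\d+):\s*(.*)$")


def parse_script(script: str) -> List[Tuple[int, str]]:
    """Split a script into (speaker_number, text) turns and enforce the limit."""
    turns: List[Tuple[int, str]] = []
    for lineno, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines between turns are ignored
        match = _TAG.match(line)
        if match is None:
            raise ValueError(
                f"line {lineno}: expected 'Speaker N: text', got {line!r}"
            )
        turns.append((int(match.group(1)), match.group(2)))

    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} distinct speakers found; at most {MAX_SPEAKERS} are supported"
        )
    return turns
```

Failing early on an untagged line or a fifth speaker is a design choice for this sketch; a real pipeline might prefer to warn and continue.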