docs: README — add tts-large-fp8/ section + comparison table
README.md
CHANGED
|
@@ -61,6 +61,23 @@ This is the premium 7B/9B variant of long-form multi-speaker TTS. Microsoft orig
|
|
| 61 |
- Larger backbone → marginally better prosody and speaker consistency at the cost of ~3× weight size and ~3× VRAM
|
| 62 |
- **Shorter** max output: 32K context = ~45 min vs the 1.5B's 64K context = ~90 min. The 1.5B is the better choice for long podcasts; the Large is the better choice for short premium-quality dialog.
|
| 63 |
| 64 |
### About `realtime-0.5b/voices/*.pt`
|
| 65 |
|
| 66 |
These are pickled KV-cache dictionaries (output of running each voice prompt through the full model once, then `torch.save`'d). They are **not flat tensor maps** so they cannot be converted to safetensors — they must be loaded with `torch.load(..., weights_only=False)`. Each is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
|
|
@@ -75,6 +92,7 @@ This is the legacy research variant of VibeVoice ASR — the one that runs clean
|
|
| 75 |
|---|---|---|---|---|
|
| 76 |
| `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags + voice cloning from per-speaker reference clips |
|
| 77 |
| `tts-large` | Text → speech | EN, ZH (multi-speaker) | ~45 min | Same workflow as 1.5B, premium 7B/9B backbone, higher prosody quality |
|
|
|
|
| 78 |
| `asr-7b` | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
|
| 79 |
| `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
|
| 80 |
|
|
@@ -105,6 +123,7 @@ These mitigations are baked into the released weights and are preserved in this
|
|
| 105 |
|---|---|---|
|
| 106 |
| Model weights — 1.5B / ASR / Realtime | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
|
| 107 |
| Model weights — 7B Large | [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large) — community preserve of the now-removed `microsoft/VibeVoice-Large` (uploaded 2025-09-04, one day before Microsoft's pull) | MIT (preserved) |
|
|
|
|
| 108 |
| Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
|
| 109 |
| Inference code (TTS variants + ASR + Realtime) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) — Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05 | MIT |
|
| 110 |
|
|
|
|
| 61 |
- Larger backbone → marginally better prosody and speaker consistency at the cost of ~3× weight size and ~3× VRAM
|
| 62 |
- **Shorter** max output: 32K context = ~45 min vs the 1.5B's 64K context = ~90 min. The 1.5B is the better choice for long podcasts; the Large is the better choice for short premium-quality dialog.
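The two context/duration pairs above imply the same token budget per minute of audio. The sketch below works through that arithmetic. It assumes that "32K" and "64K" mean 32,768 and 65,536 tokens and that duration scales linearly with context length. Both are assumptions: the script text and voice prompts presumably share the same context window, so real limits may be lower. The function name is hypothetical.

```python
# Rough duration estimate from context length (illustrative arithmetic only).
# Assumption: "64K context = ~90 min" means 65,536 tokens cover roughly
# 90 minutes of audio, and the relationship is linear.

REFERENCE_CONTEXT_TOKENS = 64 * 1024  # the 1.5B model's context, per the README
REFERENCE_MINUTES = 90.0              # approximate figure from the README

TOKENS_PER_MINUTE = REFERENCE_CONTEXT_TOKENS / REFERENCE_MINUTES


def estimate_max_minutes(context_tokens: int) -> float:
    """Return the approximate maximum audio length for a given context size."""
    if context_tokens <= 0:
        raise ValueError("context_tokens must be positive")
    return context_tokens / TOKENS_PER_MINUTE


if __name__ == "__main__":
    for name, ctx in [("tts-1.5b", 64 * 1024), ("tts-large", 32 * 1024)]:
        print(f"{name}: ~{estimate_max_minutes(ctx):.0f} min")
```

Under these assumptions the budget is roughly 730 tokens per minute, which is why halving the context halves the maximum output length.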
|
| 63 |
|
+### About `tts-large-fp8/`: pre-quantized FP8 build
+
+This is a derivative of `tts-large/` with the LM backbone (the Qwen2.5-7B layers, which hold the bulk of the parameter mass at ~14 GB in bf16) pre-quantized to **Float8WeightOnly** via `torchao.quantization`. The diffusion head and the acoustic/semantic tokenizers (~1.5 GB combined) are **left at bf16** because they are numerically sensitive, and quantizing modules that small would save very little.
+
+| | `tts-large/` (bf16) | `tts-large-fp8/` (this build) |
+|---|---|---|
+| Disk size | ~17.6 GB | **~9 GB** |
+| Working set on GPU | ~20 GB | **~10 GB** |
+| Min GPU | 12 GB (with NF4 runtime cast) | **8 GB native** |
+| Recommended GPU | 24 GB | **12 GB** |
+| Quality | Native | Near-native (FP8 LM weights round-trip cleanly through bf16 compute) |
+| Load time | Faster (no per-tensor cast) | Slightly slower (saved-config reconstruction) |
+
+**How it was made:** loaded `aoi-ot/VibeVoice-Large` via `from_pretrained` with `TorchAoConfig(quant_type=Float8WeightOnlyConfig(), modules_to_not_convert=["diffusion_head", "acoustic_tokenizer", "semantic_tokenizer", "acoustic_connector", "semantic_connector"])`, then called `save_pretrained()`. The saved `config.json` carries the quantization spec, so loaders just call `from_pretrained(...)` and the FP8 layers are reconstructed automatically, with no special loading path required.
+
+**When to pick this over `tts-large/`:** if your GPU has 12-16 GB of VRAM and you don't want to rely on the runtime auto-quantization fallback. Same workflow, smaller download, and near-native quality on most material.
+
|
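The disk-size row can be sanity-checked from the component figures quoted in the section text. This is a back-of-the-envelope sketch only. It assumes FP8 stores one byte per weight where bf16 stores two, and it ignores quantization scales and file metadata. The ~14 GB and ~1.5 GB inputs are the README's own approximate figures, and the function name is hypothetical.

```python
# Back-of-the-envelope size estimate for the FP8 build (illustrative only).
# Assumptions: bf16 = 2 bytes per weight, FP8 = 1 byte per weight;
# quantization scales and file metadata are ignored.

BF16_BYTES_PER_WEIGHT = 2
FP8_BYTES_PER_WEIGHT = 1


def estimate_fp8_build_gb(lm_backbone_bf16_gb: float, kept_bf16_gb: float) -> float:
    """Estimate total size when only the LM backbone is stored as FP8."""
    lm_fp8_gb = lm_backbone_bf16_gb * FP8_BYTES_PER_WEIGHT / BF16_BYTES_PER_WEIGHT
    return lm_fp8_gb + kept_bf16_gb


if __name__ == "__main__":
    # ~14 GB LM backbone and ~1.5 GB of modules left at bf16, per the README.
    print(f"~{estimate_fp8_build_gb(14.0, 1.5):.1f} GB")
```

With those inputs the estimate comes to about 8.5 GB, in line with the ~9 GB disk size in the table. The same component figures sum to about 15.5 GB at bf16, below the table's ~17.6 GB, so they are best read as rough approximations.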
| 81 |
### About `realtime-0.5b/voices/*.pt`
|
| 82 |
|
| 83 |
These are pickled KV-cache dictionaries (output of running each voice prompt through the full model once, then `torch.save`'d). They are **not flat tensor maps** so they cannot be converted to safetensors — they must be loaded with `torch.load(..., weights_only=False)`. Each is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
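A minimal loading sketch for the presets described above. Passing `weights_only=False` makes `torch.load` run the full pickle machinery, which can execute arbitrary code, so only load `.pt` files from a source you trust. The helper names and the duck-typed tensor check are hypothetical, and the exact layout of the preset dictionaries is not documented here.

```python
from collections.abc import Mapping


def looks_like_tensor(obj) -> bool:
    """Duck-typed check: treat anything with .shape and .dtype as a tensor."""
    return hasattr(obj, "shape") and hasattr(obj, "dtype")


def is_flat_tensor_map(obj) -> bool:
    """True only for a flat {str: tensor} mapping, which is the shape of data
    safetensors is designed to store."""
    return isinstance(obj, Mapping) and all(
        isinstance(key, str) and looks_like_tensor(value)
        for key, value in obj.items()
    )


def load_voice_preset(path: str):
    """Load a pickled voice preset onto the CPU.

    torch is imported lazily so the helpers above work without it installed.
    """
    import torch

    # weights_only=False allows arbitrary pickled objects: trusted files only.
    return torch.load(path, map_location="cpu", weights_only=False)


# Example (hypothetical file name; substitute one of the shipped presets):
#   preset = load_voice_preset("realtime-0.5b/voices/<name>.pt")
#   print(type(preset), is_flat_tensor_map(preset))
```

If `is_flat_tensor_map` returns `False` for a preset, that is consistent with the statement above that these files are not flat tensor maps.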
|
|
|
|
| 92 |
|---|---|---|---|---|
|
| 93 |
| `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags + voice cloning from per-speaker reference clips |
|
| 94 |
| `tts-large` | Text → speech | EN, ZH (multi-speaker) | ~45 min | Same workflow as 1.5B, premium 7B/9B backbone, higher prosody quality |
|
| 95 |
+
| `tts-large-fp8` | Text → speech | EN, ZH (multi-speaker) | ~45 min | Pre-quantized FP8 build of `tts-large` — same workflow, fits 12 GB cards |
|
| 96 |
| `asr-7b` | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
|
| 97 |
| `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
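The `Speaker N:` script format mentioned in the table can be produced with a small helper. This is a sketch under assumptions: one `Speaker N: text` line per turn, 1-based speaker numbers, and a limit of 4 speakers taken from the table. The function name is hypothetical, and the exact syntax expected by the inference code may differ.

```python
# Build a multi-speaker script in "Speaker N: text" form (illustrative only).
# Assumptions: 1-based speaker numbers and one line per turn.

MAX_SPEAKERS = 4  # "Up to 4 speakers", per the table


def build_script(turns):
    """Turn (speaker_number, text) pairs into a newline-separated script."""
    lines = []
    for speaker, text in turns:
        if not 1 <= speaker <= MAX_SPEAKERS:
            raise ValueError(f"speaker must be in 1..{MAX_SPEAKERS}, got {speaker}")
        # Collapse internal whitespace so each turn stays on a single line.
        cleaned = " ".join(text.split())
        if cleaned:
            lines.append(f"Speaker {speaker}: {cleaned}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_script([
        (1, "Welcome back to the show."),
        (2, "Thanks for having me."),
    ]))
```

Validating the speaker number up front keeps an out-of-range tag from reaching the model, where the failure would be harder to trace.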
|
| 98 |
|
|
|
|
| 123 |
|---|---|---|
|
| 124 |
| Model weights — 1.5B / ASR / Realtime | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
|
| 125 |
| Model weights — 7B Large | [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large) — community preserve of the now-removed `microsoft/VibeVoice-Large` (uploaded 2025-09-04, one day before Microsoft's pull) | MIT (preserved) |
|
| 126 |
+
| Model weights — 7B Large FP8 | Derivative of `aoi-ot/VibeVoice-Large` — pre-quantized via `torchao.quantization.Float8WeightOnlyConfig` (LM backbone only) | MIT (derivative) |
|
| 127 |
| Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
|
| 128 |
| Inference code (TTS variants + ASR + Realtime) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) — Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05 | MIT |
|
| 129 |
|