Instructions to use AEmotionStudio/vibevoice-models with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use AEmotionStudio/vibevoice-models with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-to-speech", model="AEmotionStudio/vibevoice-models")

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("AEmotionStudio/vibevoice-models", dtype="auto")
```
- Notebooks
- Google Colab
- Kaggle
docs: tts-large-fp8 — correct VRAM requirements (16+ GB), document no-spillover constraint
README.md
@@ -67,16 +67,18 @@ This is a derivative of `tts-large/` with the LM backbone (Qwen2.5-7B layers —
 | | `tts-large/` (bf16) | `tts-large-fp8/` (this) |
 |---|---|---|
-| Disk size | ~17.6 GB | **~
-| Working set on GPU | ~20 GB | **~
-| Min GPU | 12 GB (with NF4 runtime cast) | **
-| Recommended GPU | 24 GB | **
+| Disk size | ~17.6 GB | **~11 GB** |
+| Working set on GPU | ~20 GB | **~13 GB** (10 GB FP8 + 1.5 GB bf16 + 1.5 GB activations/KV) |
+| Min GPU | 12 GB (with NF4 runtime cast) | **14 GB** |
+| Recommended GPU | 24 GB | **16 GB** |
 | Quality | Native | Near-native (FP8 LM weights round-trip cleanly through bf16 compute) |
 | Load time | Faster (no per-tensor cast) | Slightly slower (saved-config reconstruction) |

 **How it was made:** loaded `aoi-ot/VibeVoice-Large` via `from_pretrained` with `TorchAoConfig(quant_type=Float8WeightOnlyConfig(), modules_to_not_convert=["diffusion_head", "acoustic_tokenizer", "semantic_tokenizer", "acoustic_connector", "semantic_connector"])`, then `save_pretrained()`. The saved `config.json` carries the quantization spec, so loaders just call `from_pretrained(...)` and the FP8 layers are reconstructed automatically — no special path required.

-**
+**Important — no CPU/disk spillover:** `transformers + torchao` cannot reload this checkpoint via `device_map="auto"` (the meta-init path applies the state_dict before reconstructing the FP8 tensor subclass, hitting `AttributeError: ..._weight_qdata is neither a parameter, buffer, nor extra state`). Loaders MUST use `device_map="cuda"` (eager init), so accelerate-style spillover to CPU/disk is unavailable — the full ~13 GB working set has to fit on the GPU.
+
+**When to pick this over `tts-large/`:** if your GPU has **16+ GB VRAM** and you want a smaller download that loads at native speed. **On 12 GB cards, use `tts-large/` instead** — it auto-quantizes to NF4 at runtime (~8 GB working set, well-tested), accepts CPU/disk offload if needed, and produces the same quality.

 ### About `realtime-0.5b/voices/*.pt`
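
The selection rule and the no-spillover constraint described in this change can be sketched as a small helper. This is a minimal illustration, not code from the repository: `pick_variant`, `load_kwargs`, and `detect_vram_gb` are hypothetical names, the 14 GB threshold is the table's "Min GPU" figure (cards sold as 16 GB usually report slightly less than 16 GiB), and passing `subfolder` to `from_pretrained` is an assumption about how a loader would address the two directories.

```python
"""Sketch: choose tts-large/ or tts-large-fp8/ from available VRAM (hypothetical helpers)."""

# Figures copied from the README table in this commit, in GB.
FP8_WORKING_SET_GB = 10.0 + 1.5 + 1.5  # FP8 weights + bf16 modules + activations/KV
FP8_MIN_GPU_GB = 14.0  # "Min GPU" row for tts-large-fp8/


def pick_variant(vram_gb: float) -> str:
    """Return the subfolder to use for a GPU with this much memory."""
    # Below the FP8 minimum, fall back to tts-large/, which the README says
    # quantizes to NF4 at runtime and can offload to CPU/disk.
    return "tts-large-fp8" if vram_gb >= FP8_MIN_GPU_GB else "tts-large"


def load_kwargs(variant: str) -> dict:
    """Build keyword arguments for from_pretrained (the subfolder layout is assumed)."""
    if variant == "tts-large-fp8":
        # The README reports that device_map="auto" fails for this checkpoint,
        # so the whole model is placed on the GPU.
        return {"subfolder": variant, "device_map": "cuda"}
    # The bf16 checkpoint accepts accelerate-style placement and offload.
    return {"subfolder": variant, "device_map": "auto"}


def detect_vram_gb() -> float:
    """Total memory of GPU 0 in GiB, or 0.0 when torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


if __name__ == "__main__":
    variant = pick_variant(detect_vram_gb())
    print(variant, load_kwargs(variant))
```

The returned dictionary is meant to be unpacked into whichever `from_pretrained` call the loader already uses. The model class is left out on purpose because this change does not name one.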