AEmotionStudio/vibevoice-models

@@ -67,16 +67,18 @@ This is a derivative of `tts-large/` with the LM backbone (Qwen2.5-7B layers —
 | | `tts-large/` (bf16) | `tts-large-fp8/` (this) |
 |---|---|---|
-| Disk size | ~17.6 GB | **~9 GB** |
-| Working set on GPU | ~20 GB | **~10 GB** |
-| Min GPU | 12 GB (with NF4 runtime cast) | **8 GB native** |
-| Recommended GPU | 24 GB | **12 GB** |
 | Quality | Native | Near-native (FP8 LM weights round-trip cleanly through bf16 compute) |
 | Load time | Faster (no per-tensor cast) | Slightly slower (saved-config reconstruction) |
 **How it was made:** loaded `aoi-ot/VibeVoice-Large` via `from_pretrained` with `TorchAoConfig(quant_type=Float8WeightOnlyConfig(), modules_to_not_convert=["diffusion_head", "acoustic_tokenizer", "semantic_tokenizer", "acoustic_connector", "semantic_connector"])`, then `save_pretrained()`. The saved `config.json` carries the quantization spec, so loaders just call `from_pretrained(...)` and the FP8 layers are reconstructed automatically — no special path required.
-**When to pick this over `tts-large/`:** if your GPU has 12-16 GB VRAM and you don't want to rely on the runtime auto-quantization fallback. Same workflow, smaller download, no quality regression on most material.
 ### About `realtime-0.5b/voices/*.pt`

 | | `tts-large/` (bf16) | `tts-large-fp8/` (this) |
 |---|---|---|
+| Disk size | ~17.6 GB | **~11 GB** |
+| Working set on GPU | ~20 GB | **~13 GB** (10 GB FP8 + 1.5 GB bf16 + 1.5 GB activations/KV) |
+| Min GPU | 12 GB (with NF4 runtime cast) | **14 GB** |
+| Recommended GPU | 24 GB | **16 GB** |
 | Quality | Native | Near-native (FP8 LM weights round-trip cleanly through bf16 compute) |
 | Load time | Faster (no per-tensor cast) | Slightly slower (saved-config reconstruction) |
 **How it was made:** loaded `aoi-ot/VibeVoice-Large` via `from_pretrained` with `TorchAoConfig(quant_type=Float8WeightOnlyConfig(), modules_to_not_convert=["diffusion_head", "acoustic_tokenizer", "semantic_tokenizer", "acoustic_connector", "semantic_connector"])`, then `save_pretrained()`. The saved `config.json` carries the quantization spec, so loaders just call `from_pretrained(...)` and the FP8 layers are reconstructed automatically — no special path required.
+**Important: no CPU/disk spillover.** `transformers + torchao` cannot reload this checkpoint with `device_map="auto"`: the meta-init path applies the state_dict before reconstructing the FP8 tensor subclass and fails with `AttributeError: ..._weight_qdata is neither a parameter, buffer, nor extra state`. Loaders MUST use `device_map="cuda"` (eager init), so accelerate-style spillover to CPU/disk is unavailable and the full ~13 GB working set has to fit on the GPU.
+**When to pick this over `tts-large/`:** if your GPU has **16+ GB VRAM** and you want a smaller download that loads at native speed. **On 12 GB cards, use `tts-large/` instead** — it auto-quantizes to NF4 at runtime (~8 GB working set, well-tested), accepts CPU/disk offload if needed, and produces the same quality.
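The sizing guidance above can be sketched as a small helper. This is an illustrative sketch only, not part of the repository: the names `fp8_working_set_gb` and `pick_variant` are hypothetical, and the thresholds are simply the figures quoted in the table and in the "when to pick" note.

```python
# Hypothetical helper, not part of vibevoice-models. It only encodes the
# figures quoted in the README so the sizing rule is explicit.


def fp8_working_set_gb(fp8_weights=10.0, bf16_modules=1.5, activations_kv=1.5):
    """Approximate GPU working set for tts-large-fp8/, in GB.

    Defaults are the README's breakdown: FP8 LM weights, the modules kept
    in bf16, and activations plus KV cache.
    """
    return fp8_weights + bf16_modules + activations_kv


def pick_variant(vram_gb):
    """Suggest a folder for a card with `vram_gb` GB of VRAM.

    Assumed thresholds, taken from the README text:
      16 GB or more      -> "tts-large-fp8/" (whole working set must fit on
                            the GPU, since device_map="cuda" rules out offload)
      12 GB up to 16 GB  -> "tts-large/" (NF4 runtime cast, offload possible)
      below 12 GB        -> None (neither row of the table covers this case)
    """
    if vram_gb >= 16:
        return "tts-large-fp8/"
    if vram_gb >= 12:
        return "tts-large/"
    return None
```

For example, `pick_variant(12)` returns `"tts-large/"`. The VRAM figure itself could come from `torch.cuda.mem_get_info()`, which reports free and total bytes for the current device. On 24 GB cards either folder fits, so the helper's preference for the FP8 variant there is just one reasonable choice.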
 ### About `realtime-0.5b/voices/*.pt`