Commit ff31a39 (verified) by AEmotionStudio · 1 Parent(s): 676879b

docs: tts-large-fp8 — correct VRAM requirements (16+ GB), document no-spillover constraint

  1. README.md +7 -5
README.md CHANGED
@@ -67,16 +67,18 @@ This is a derivative of `tts-large/` with the LM backbone (Qwen2.5-7B layers —
 
  | | `tts-large/` (bf16) | `tts-large-fp8/` (this) |
  |---|---|---|
- | Disk size | ~17.6 GB | **~9 GB** |
- | Working set on GPU | ~20 GB | **~10 GB** |
- | Min GPU | 12 GB (with NF4 runtime cast) | **8 GB native** |
- | Recommended GPU | 24 GB | **12 GB** |
+ | Disk size | ~17.6 GB | **~11 GB** |
+ | Working set on GPU | ~20 GB | **~13 GB** (10 GB FP8 + 1.5 GB bf16 + 1.5 GB activations/KV) |
+ | Min GPU | 12 GB (with NF4 runtime cast) | **14 GB** |
+ | Recommended GPU | 24 GB | **16 GB** |
  | Quality | Native | Near-native (FP8 LM weights round-trip cleanly through bf16 compute) |
  | Load time | Faster (no per-tensor cast) | Slightly slower (saved-config reconstruction) |
 
  **How it was made:** loaded `aoi-ot/VibeVoice-Large` via `from_pretrained` with `TorchAoConfig(quant_type=Float8WeightOnlyConfig(), modules_to_not_convert=["diffusion_head", "acoustic_tokenizer", "semantic_tokenizer", "acoustic_connector", "semantic_connector"])`, then `save_pretrained()`. The saved `config.json` carries the quantization spec, so loaders just call `from_pretrained(...)` and the FP8 layers are reconstructed automatically — no special path required.
 
- **When to pick this over `tts-large/`:** if your GPU has 12-16 GB VRAM and you don't want to rely on the runtime auto-quantization fallback. Same workflow, smaller download, no quality regression on most material.
+ **Important: no CPU/disk spillover.** `transformers + torchao` cannot reload this checkpoint via `device_map="auto"` (the meta-init path applies the state_dict before reconstructing the FP8 tensor subclass, hitting `AttributeError: ..._weight_qdata is neither a parameter, buffer, nor extra state`). Loaders MUST use `device_map="cuda"` (eager init), so accelerate-style spillover to CPU/disk is unavailable; the full ~13 GB working set has to fit on the GPU.
+
+ **When to pick this over `tts-large/`:** if your GPU has **16+ GB VRAM** and you want a smaller download that loads at native speed. **On 12 GB cards, use `tts-large/` instead** — it auto-quantizes to NF4 at runtime (~8 GB working set, well-tested), accepts CPU/disk offload if needed, and produces the same quality.
 
  ### About `realtime-0.5b/voices/*.pt`
 
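The sizing guidance introduced by this change can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`fp8_working_set_gb`, `pick_variant`, `load_kwargs`) are hypothetical and not part of the repository, the 16 GB cut-off follows the "When to pick this" paragraph rather than the 14 GB minimum, and `device_map="auto"` for `tts-large/` is an assumed reading of "accepts CPU/disk offload". All figures are copied from the updated table.

```python
# Approximate figures taken from the updated README table (in GB).
FP8_WEIGHTS_GB = 10.0       # FP8 LM backbone
BF16_MODULES_GB = 1.5       # modules left unconverted in bf16
ACTIVATIONS_KV_GB = 1.5     # activations and KV cache
FP8_RECOMMENDED_GB = 16.0   # "16+ GB VRAM" guidance for tts-large-fp8
BF16_MIN_GB = 12.0          # tts-large with NF4 runtime cast


def fp8_working_set_gb() -> float:
    """Sum the working-set components listed for tts-large-fp8."""
    return FP8_WEIGHTS_GB + BF16_MODULES_GB + ACTIVATIONS_KV_GB


def pick_variant(total_vram_gb: float) -> str:
    """Hypothetical policy: prefer FP8 at 16+ GB, otherwise tts-large."""
    if total_vram_gb >= FP8_RECOMMENDED_GB:
        return "tts-large-fp8"
    if total_vram_gb >= BF16_MIN_GB:
        return "tts-large"
    raise ValueError(f"{total_vram_gb} GB is below the documented minimum")


def load_kwargs(variant: str) -> dict:
    """Keyword arguments one might pass to from_pretrained(...)."""
    if variant == "tts-large-fp8":
        # Required per the README: eager init, no CPU/disk spillover.
        return {"device_map": "cuda"}
    # Assumption: tts-large can use accelerate-style placement.
    return {"device_map": "auto"}


if __name__ == "__main__":
    variant = pick_variant(16)
    print(variant, load_kwargs(variant), fp8_working_set_gb())
```

In practice the total VRAM could be read from `torch.cuda.get_device_properties(0).total_memory` (reported in bytes) before calling `pick_variant`. Cards marketed as 16 GB report slightly under 16 GiB, so the unit conversion matters at the threshold.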