Update README.md
---
license: cc-by-nc-4.0
language:
- en
- zh
pipeline_tag: text-to-speech
tags:
- text-to-speech
- zero-shot-tts
- waveform-generation
- comfyui
- safetensors
- fp32
- bf16
- wavtts
- voice-cloning
- reference-audio
datasets:
- amphion/Emilia-Dataset
---

# WavTTS 16k Safetensors for ComfyUI

Safetensors conversion of the official **WavTTS** model for use with **WavTTS-ComfyUI**.

This release converts the original PyTorch checkpoint into the Safetensors format for safer loading, improved compatibility with ComfyUI workflows, and reduced storage overhead. No retraining, finetuning, quantization, or architectural modifications have been performed.

## Model Introduction

WavTTS is a zero-shot text-to-speech model that directly generates raw audio waveforms from text using reference-audio prompting.

Unlike token-based TTS systems that generate intermediate acoustic representations, WavTTS models speech directly in the waveform domain, enabling highly natural speech synthesis while preserving speaker characteristics from short reference samples.

### Links

- **ComfyUI Node:** https://github.com/Saganaki22/WavTTS-ComfyUI
- **Official Repository:** https://github.com/cwx-worst-one/WavTTS
- **Research Paper:** https://arxiv.org/abs/2606.03455

## Usage

This checkpoint is intended for use with:

https://github.com/Saganaki22/WavTTS-ComfyUI

Place the model inside:

```text
ComfyUI/models/wavtts/
```

Example:

```text
ComfyUI/models/wavtts/
├── wavtts-fp32.safetensors
└── wavtts-mixed-bf16.safetensors
```

Restart ComfyUI and load the checkpoint using the **WavTTS Load Model** node.
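
Before restarting, it can help to confirm that the files landed in the expected folder. The following is a minimal sketch, assuming the default ComfyUI directory layout shown above. `list_wavtts_checkpoints` is a hypothetical helper written for this README, not part of WavTTS-ComfyUI, and the node itself may discover files differently.

```python
from pathlib import Path


def list_wavtts_checkpoints(comfyui_root: str) -> list[str]:
    """Return the names of .safetensors files under <comfyui_root>/models/wavtts/.

    Hypothetical helper: the folder name follows this README, not the node's code.
    """
    model_dir = Path(comfyui_root) / "models" / "wavtts"
    if not model_dir.is_dir():
        # Folder missing: nothing has been installed yet.
        return []
    return sorted(p.name for p in model_dir.glob("*.safetensors"))
```

Calling `list_wavtts_checkpoints("ComfyUI")` from the directory that contains your ComfyUI install should list both files if they were copied correctly.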

## Model Summary

| Item | Value |
|--------|--------|
| Model | WavTTS |
| Format | Safetensors |
| Task | Zero-Shot Text-to-Speech |
| Sample Rate | 16 kHz |
| Architecture | Direct Waveform Generation |
| Conditioning | Reference Audio + Reference Transcript |
| Intended Platform | ComfyUI |
| Languages | English, Chinese |
| Training Dataset | Emilia Dataset |
| License | CC BY-NC 4.0 |

## Included Variants

| File | Description |
|--------|--------|
| wavtts-fp32.safetensors | Clean FP32 inference checkpoint |
| wavtts-mixed-bf16.safetensors | Mixed BF16 checkpoint optimized for lower VRAM usage |

The original `.pt` checkpoint contains training-related state that is unnecessary for inference.

The Safetensors files in this release store inference weights only, so they are smaller than the original checkpoint and can be loaded without executing pickled code.
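
To see what a checkpoint actually stores, you can read its header without loading any tensor data. A Safetensors file starts with an 8-byte little-endian length, followed by a JSON header that maps tensor names to their dtype, shape, and byte offsets. The sketch below uses only the standard library; the function names are illustrative and not part of any WavTTS tooling.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path: str) -> dict:
    """Parse the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def dtype_counts(path: str) -> Counter:
    """Count tensors per dtype string, for example "F32" or "BF16"."""
    return Counter(entry["dtype"] for entry in read_safetensors_header(path).values())
```

Running `dtype_counts` on the mixed checkpoint is one way to check how many tensors were kept in FP32 versus BF16.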

## Intended Use

This model is intended for:

- Zero-shot text-to-speech
- Voice continuation
- Reference-audio-conditioned generation
- Voice adaptation from short prompts
- ComfyUI speech workflows
- Local offline TTS generation

WavTTS requires a transcript of the reference audio.

## Precision Notes

### FP32

Recommended for maximum stability.

### Mixed BF16

Recommended for reduced VRAM usage while preserving numerically sensitive tensors in FP32.

No model weights have been retrained or modified beyond precision conversion.
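
For a sense of how much precision conversion changes a value: BF16 keeps the 8-bit exponent of FP32 but only 7 mantissa bits, so each converted weight is rounded to roughly two to three significant decimal digits. The sketch below emulates the conversion with the standard library. It assumes round-to-nearest-even and finite inputs; the conversion procedure actually used for this release is not documented here.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Round a finite float to BF16 and return its 16-bit pattern.

    Assumes round-to-nearest-even; NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that discarding the low 16 bits rounds to nearest, ties to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def bf16_roundtrip(x: float) -> float:
    """Return the value x takes after an FP32 -> BF16 -> FP32 round trip."""
    return struct.unpack("<f", struct.pack("<I", float_to_bf16_bits(x) << 16))[0]
```

Values such as `1.0` survive the round trip exactly, while `0.1` comes back as `0.10009765625`. Differences of this size are the kind the mixed checkpoint avoids for tensors kept in FP32.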

## Architecture

WavTTS performs direct waveform modeling rather than generating intermediate acoustic tokens.

### Inputs

- Reference audio
- Reference transcript
- Target text

### Output

- 16 kHz synthesized waveform

The model takes speaker characteristics directly from the reference audio and generates speech for the target text in the reference voice style.
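
If you handle the output outside ComfyUI, the 16 kHz rate has to be written into the file header, or players will use the wrong speed. The following is a minimal sketch using Python's standard `wave` module. It assumes the waveform is available as mono float samples in the range -1.0 to 1.0, which is an assumption about your pipeline rather than a documented WavTTS output format; `save_wav_16k` is a hypothetical helper.

```python
import struct
import wave

SAMPLE_RATE = 16_000  # output rate stated in this README


def save_wav_16k(path: str, samples: list[float]) -> float:
    """Write mono float samples as 16-bit PCM at 16 kHz and return the duration in seconds."""
    frames = bytearray()
    for sample in samples:
        # Clip to [-1.0, 1.0] (assumed input range), then scale to signed 16-bit.
        clipped = max(-1.0, min(1.0, sample))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(frames))
    return len(samples) / SAMPLE_RATE
```

At this rate, one second of audio is 16,000 samples, so a 10-second clip is 160,000 samples.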

## Limitations

- Requires a transcript for the reference audio.
- Voice similarity is not guaranteed.
- Long generations may require chunking.
- Audio quality depends heavily on reference quality.
- Commercial use is not permitted under the upstream CC BY-NC 4.0 license.
- Numerical differences may occur between original and converted checkpoints.
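
For the chunking point above, one simple approach is to split the target text at sentence boundaries and group sentences up to a length budget, then generate each chunk separately. The sketch below is illustrative only: `chunk_text` and the `max_chars` default are assumptions, not values taken from WavTTS or the ComfyUI node, and the split is naive (it will also break on decimals such as "3.14").

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Group sentences into chunks of at most max_chars characters where possible."""
    # Split after sentence-ending punctuation, both ASCII and CJK full-width.
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    sentences = [p.strip() for p in parts if p.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_chars is kept whole, not split further.
    return chunks
```

Each chunk can then be synthesized with the same reference audio and transcript and the resulting waveforms concatenated.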

## Attribution

### Original Authors

**WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling**

Official repository:

https://github.com/cwx-worst-one/WavTTS

### ComfyUI Integration

https://github.com/Saganaki22/WavTTS-ComfyUI

### Dataset

https://huggingface.co/datasets/amphion/Emilia-Dataset

## License

This Safetensors conversion inherits the licensing terms of the original WavTTS release.

The original model weights are licensed under **CC BY-NC 4.0**.

Please review the upstream model card and license terms before redistribution or commercial use.

## Citation

```bibtex
@article{chen2026wavtts,
  title={WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling},
  author={TODO},
  journal={TODO},
  year={2026}
}
```