drbaph committed on
Commit 8f2118c · verified · 1 Parent(s): 8a137b7

Update README.md

Files changed (1)
  1. README.md +172 -0
README.md CHANGED
@@ -1,3 +1,175 @@
1
  ---
2
  license: cc-by-nc-4.0
3
  ---
1
  ---
2
  license: cc-by-nc-4.0
3
+ language:
4
+ - en
5
+ - zh
6
+ pipeline_tag: text-to-speech
7
+ tags:
8
+ - text-to-speech
9
+ - zero-shot-tts
10
+ - waveform-generation
11
+ - comfyui
12
+ - safetensors
13
+ - fp32
14
+ - bf16
15
+ - wavtts
16
+ - voice-cloning
17
+ - reference-audio
18
+ datasets:
19
+ - amphion/Emilia-Dataset
20
  ---
21
+
22
+ # WavTTS 16k Safetensors for ComfyUI
23
+
24
+ Safetensors conversion of the official **WavTTS** model for use with **WavTTS-ComfyUI**.
25
+
26
+ ![Screenshot 2026-06-04 015737](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/vv1P0WoTgUfWi6MFaYJoH.png)
27
+
28
+ This release converts the original PyTorch checkpoint into the Safetensors format for safer loading, better compatibility with ComfyUI workflows, and reduced storage overhead. Apart from the precision conversion in the mixed BF16 variant, no retraining, fine-tuning, quantization, or architectural modification has been performed.
29
+
30
+ ## Model Introduction
31
+
32
+ WavTTS is a zero-shot text-to-speech model that directly generates raw audio waveforms from text using reference-audio prompting.
33
+
34
+ Unlike token-based TTS systems that generate intermediate acoustic representations, WavTTS models speech directly in the waveform domain, enabling highly natural speech synthesis while preserving speaker characteristics from short reference samples.
35
+
36
+ ### Links
37
+
38
+ - **ComfyUI Node:** https://github.com/Saganaki22/WavTTS-ComfyUI
39
+ - **Official Repository:** https://github.com/cwx-worst-one/WavTTS
40
+ - **Research Paper:** https://arxiv.org/abs/2606.03455
41
+
42
+ ## Usage
43
+
44
+ This checkpoint is intended for use with:
45
+
46
+ https://github.com/Saganaki22/WavTTS-ComfyUI
47
+
48
+ Place the model inside:
49
+
50
+ ```text
51
+ ComfyUI/models/wavtts/
52
+ ```
53
+
54
+ Example:
55
+
56
+ ```text
57
+ ComfyUI/models/wavtts/
58
+ ├── wavtts-fp32.safetensors
59
+ └── wavtts-mixed-bf16.safetensors
60
+ ```
61
+
62
+ Restart ComfyUI and load the checkpoint using the **WavTTS Load Model** node.
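Before restarting ComfyUI, the layout described above can be checked with a short script. This is a minimal sketch, not part of WavTTS-ComfyUI: the helper name `find_missing_checkpoints` and the relative `ComfyUI` root are assumptions made for illustration, and the file names are the ones from the example tree.

```python
from pathlib import Path

# File names taken from the example layout above. Only the variant you
# actually plan to load needs to be present.
EXPECTED_FILES = ("wavtts-fp32.safetensors", "wavtts-mixed-bf16.safetensors")


def find_missing_checkpoints(comfyui_root, expected=EXPECTED_FILES):
    """Return the expected checkpoint names that are absent from models/wavtts/."""
    model_dir = Path(comfyui_root) / "models" / "wavtts"
    return [name for name in expected if not (model_dir / name).is_file()]


if __name__ == "__main__":
    # Assumption: the script is run from the directory that contains ComfyUI/.
    missing = find_missing_checkpoints("ComfyUI")
    if missing:
        print("Not found in ComfyUI/models/wavtts/:", ", ".join(missing))
    else:
        print("All listed checkpoints are in place.")
```

The function only inspects file names; it does not validate the contents of the checkpoints.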
63
+
64
+ ## Model Summary
65
+
66
+ | Item | Value |
67
+ |--------|--------|
68
+ | Model | WavTTS |
69
+ | Format | Safetensors |
70
+ | Task | Zero-Shot Text-to-Speech |
71
+ | Sample Rate | 16 kHz |
72
+ | Architecture | Direct Waveform Generation |
73
+ | Conditioning | Reference Audio + Reference Transcript |
74
+ | Intended Platform | ComfyUI |
75
+ | Languages | English, Chinese |
76
+ | Training Dataset | Emilia Dataset |
77
+ | License | CC BY-NC 4.0 |
78
+
79
+ ## Included Variants
80
+
81
+ | File | Description |
82
+ |--------|--------|
83
+ | wavtts-fp32.safetensors | Clean FP32 inference checkpoint |
84
+ | wavtts-mixed-bf16.safetensors | Mixed BF16 checkpoint optimized for lower VRAM usage |
85
+
86
+ The original `.pt` checkpoint contains training-related state that is unnecessary for inference.
87
+
88
+ The Safetensors releases store inference weights only, which makes the files smaller. Because the format does not rely on pickle, loading a file cannot execute arbitrary code.
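One way to see what a converted file contains, without loading any weights, is to read its header. The sketch below relies only on the published Safetensors layout (an 8-byte little-endian header length followed by a JSON header) and the Python standard library. The function names are hypothetical, and this is not how WavTTS-ComfyUI itself loads the model.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Parse only the JSON header of a .safetensors file; tensor data is not read."""
    with open(path, "rb") as f:
        # The file starts with an unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def count_dtypes(header):
    """Count tensors per dtype string (for example "F32" or "BF16")."""
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )
```

Calling `count_dtypes(read_safetensors_header(path))` on each variant should show how many tensors are stored in each precision; no particular counts are claimed here.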
89
+
90
+ ## Intended Use
91
+
92
+ This model is intended for:
93
+
94
+ - Zero-shot text-to-speech
95
+ - Voice continuation
96
+ - Reference-audio conditioned generation
97
+ - Voice adaptation from short prompts
98
+ - ComfyUI speech workflows
99
+ - Local offline TTS generation
100
+
101
+ A transcript of the reference audio is required by WavTTS.
102
+
103
+ ## Precision Notes
104
+
105
+ ### FP32
106
+
107
+ Recommended when numerical stability matters most.
108
+
109
+ ### Mixed BF16
110
+
111
+ Recommended when VRAM is limited. Weights are stored in BF16, except for numerically sensitive tensors, which stay in FP32.
112
+
113
+ No model weights have been retrained or modified beyond precision conversion.
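To give a feel for what precision conversion does to an individual weight, the sketch below rounds a float32 value to BF16 and back using only the standard library. BF16 keeps the float32 sign and exponent but only 7 mantissa bits, so the low 16 bits are rounded away. The function name is hypothetical, the rounding mode (nearest, ties to even) is an assumption about the conversion, and NaN inputs are not handled.

```python
import struct


def bf16_round_trip(x):
    """Round a finite float to bfloat16 precision and return it as a Python float."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Round to nearest, ties to even, on the 16 low bits that BF16 discards.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    (rounded,) = struct.unpack("<f", struct.pack("<I", bits))
    return rounded


print(bf16_round_trip(1.0))  # 1.0 (exactly representable)
print(bf16_round_trip(0.1))  # 0.10009765625
```

Values such as 1.0 survive unchanged, while 0.1 moves by roughly one part in a thousand.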
114
+
115
+ ## Architecture
116
+
117
+ WavTTS performs direct waveform modeling rather than generating intermediate acoustic tokens.
118
+
119
+ ### Inputs
120
+
121
+ - Reference audio
122
+ - Reference transcript
123
+ - Target text
124
+
125
+ ### Output
126
+
127
+ - 16 kHz synthesized waveform
128
+
129
+ The model learns speaker characteristics directly from reference audio and generates speech matching the target text in the reference voice style.
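Because the output is a 16 kHz waveform, the number of samples maps directly to duration, and saving a result needs nothing beyond the standard library. The sketch below assumes mono float samples in the range [-1.0, 1.0]. The helper names are hypothetical, a sine tone stands in for model output, and in practice the ComfyUI workflow handles audio output itself.

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000  # WavTTS output rate stated above


def duration_seconds(num_samples):
    """Length in seconds of a mono waveform with num_samples samples."""
    return num_samples / SAMPLE_RATE


def write_wav_16k(path, samples):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM WAV at 16 kHz."""
    frames = b"".join(
        # Clamp to the valid range, then scale to signed 16-bit integers.
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in samples
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)


if __name__ == "__main__":
    # Stand-in signal: one second of a 440 Hz tone instead of synthesized speech.
    tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
    write_wav_16k("example_16k.wav", tone)
    print(duration_seconds(len(tone)))  # 1.0
```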
130
+
131
+ ## Limitations
132
+
133
+ - Requires a transcript for the reference audio.
134
+ - Voice similarity is not guaranteed.
135
+ - Long generations may require chunking.
136
+ - Audio quality depends heavily on reference quality.
137
+ - The upstream CC BY-NC 4.0 license restricts commercial use.
138
+ - Outputs may differ numerically from those of the original checkpoint, particularly with the mixed BF16 variant.
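For the chunking limitation above, one simple approach is to split the target text at sentence boundaries and group whole sentences up to a length budget. The following is a minimal sketch, not the method used by WavTTS or WavTTS-ComfyUI: the function name and the `max_chars=200` default are arbitrary assumptions, and it needs Python 3.7 or later because it splits on empty matches after Chinese punctuation.

```python
import re

# English sentence enders must be followed by whitespace; Chinese ones need none.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*")


def chunk_text(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters."""
    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # The budget would be exceeded, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence, and sentences are rejoined with a space, which is a simplification for Chinese text. Each chunk would then be synthesized separately with the same reference audio and transcript.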
139
+
140
+ ## Attribution
141
+
142
+ ### Original Authors
143
+
144
+ **WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling**
145
+
146
+ Official repository:
147
+
148
+ https://github.com/cwx-worst-one/WavTTS
149
+
150
+ ### ComfyUI Integration
151
+
152
+ https://github.com/Saganaki22/WavTTS-ComfyUI
153
+
154
+ ### Dataset
155
+
156
+ https://huggingface.co/datasets/amphion/Emilia-Dataset
157
+
158
+ ## License
159
+
160
+ This Safetensors conversion inherits the licensing terms of the original WavTTS release.
161
+
162
+ The original model weights are licensed under **CC BY-NC 4.0**.
163
+
164
+ Please review the upstream model card and license terms before redistribution or commercial use.
165
+
166
+ ## Citation
167
+
168
+ ```bibtex
169
+ @article{chen2026wavtts,
170
+ title={WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling},
171
+ author={TODO},
172
+ journal={TODO},
173
+ year={2026}
174
+ }
175
+ ```