---
license: cc-by-nc-4.0
language:
- en
- zh
pipeline_tag: text-to-speech
tags:
- text-to-speech
- zero-shot-tts
- waveform-generation
- comfyui
- safetensors
- fp32
- bf16
- wavtts
- voice-cloning
- reference-audio
datasets:
- amphion/Emilia-Dataset
---

# WavTTS 16k Safetensors for ComfyUI

Safetensors conversion of the official **WavTTS** model for use with **WavTTS-ComfyUI**.

This release converts the original PyTorch checkpoint into the Safetensors format for safer loading, improved compatibility with ComfyUI workflows, and reduced storage overhead.

## Model Introduction

WavTTS is a zero-shot text-to-speech model that directly generates raw audio waveforms from text using reference-audio prompting.

Unlike token-based TTS systems that generate intermediate acoustic representations, WavTTS models speech directly in the waveform domain, enabling highly natural speech synthesis while preserving speaker characteristics from short reference samples.

### Links

- **ComfyUI Node:** https://github.com/Saganaki22/WavTTS-ComfyUI
- **Official Repository:** https://github.com/cwx-worst-one/WavTTS
- **Research Paper:** https://arxiv.org/abs/2606.03455

## Usage

This checkpoint is intended for use with:

https://github.com/Saganaki22/WavTTS-ComfyUI

Place the model inside:

```text
ComfyUI/models/wavtts/
```

Example:

```text
ComfyUI/models/wavtts/
├── wavtts-fp32.safetensors
└── wavtts-mixed-bf16.safetensors
```

Restart ComfyUI and load the checkpoint using the **WavTTS Load Model** node.

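Before restarting ComfyUI, it can help to confirm that the files landed in the expected folder. The following is a minimal sketch using only the Python standard library; the helper name `find_wavtts_checkpoints` and the `"ComfyUI"` root path are illustrative assumptions, not part of WavTTS-ComfyUI.

```python
from pathlib import Path

# File names as listed in this model card.
EXPECTED_FILES = ("wavtts-fp32.safetensors", "wavtts-mixed-bf16.safetensors")


def find_wavtts_checkpoints(comfyui_root):
    """Map each expected file name to True if it exists in models/wavtts/."""
    model_dir = Path(comfyui_root) / "models" / "wavtts"
    return {name: (model_dir / name).is_file() for name in EXPECTED_FILES}


if __name__ == "__main__":
    # "ComfyUI" is a placeholder; point this at your own install directory.
    for name, present in find_wavtts_checkpoints("ComfyUI").items():
        print(f"{name}: {'found' if present else 'missing'}")
```

Only one of the two variants is needed at a time, so a single `missing` entry is not necessarily a problem.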
## Model Summary

| Item | Value |
|--------|--------|
| Model | WavTTS |
| Format | Safetensors |
| Task | Zero-Shot Text-to-Speech |
| Sample Rate | 16 kHz |
| Architecture | Direct Waveform Generation |
| Conditioning | Reference Audio + Reference Transcript |
| Intended Platform | ComfyUI |
| Languages | English, Chinese |
| Training Dataset | Emilia Dataset |
| License | CC BY-NC 4.0 |

## Included Variants

| File | Description |
|--------|--------|
| wavtts-fp32.safetensors | Clean FP32 inference checkpoint |
| wavtts-mixed-bf16.safetensors | Mixed BF16 checkpoint optimized for lower VRAM usage |

The original `.pt` checkpoint contains training-related state that is unnecessary for inference.

The Safetensors files store inference weights only, which makes them smaller. They are also safer to load, because the Safetensors format holds raw tensor data and does not rely on Python pickle the way `.pt` files do.

## Intended Use

This model is intended for:

- Zero-shot text-to-speech
- Voice continuation
- Reference-audio conditioned generation
- Voice adaptation from short prompts
- ComfyUI speech workflows
- Local offline TTS generation

WavTTS requires a transcript of the reference audio.

## Precision Notes

### FP32

Recommended for maximum numerical stability.

### Mixed BF16

Recommended for reduced VRAM usage. Numerically sensitive tensors are kept in FP32.

No model weights have been retrained or modified beyond precision conversion.

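To see which tensors a given file stores in which precision, the Safetensors header can be read without loading any weights. A Safetensors file starts with an 8-byte little-endian length, followed by a JSON header that records each tensor's `dtype`, `shape`, and `data_offsets`. The sketch below uses only the standard library; the function names are illustrative, and the tensor names you will see depend on the checkpoint itself.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Parse the JSON header of a safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def bytes_per_dtype(header):
    """Sum stored bytes per dtype using each tensor's data_offsets."""
    totals = Counter()
    for info in header.values():
        begin, end = info["data_offsets"]
        totals[info["dtype"]] += end - begin
    return dict(totals)


# Example usage (the path is a placeholder for your local copy):
# header = read_safetensors_header("ComfyUI/models/wavtts/wavtts-mixed-bf16.safetensors")
# print(bytes_per_dtype(header))
```

For the mixed BF16 file, the result should list both `BF16` and `F32` entries if some tensors were kept in FP32 as described above.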
## Architecture

WavTTS performs direct waveform modeling rather than generating intermediate acoustic tokens.

### Inputs

- Reference audio
- Reference transcript
- Target text

### Output

- 16 kHz synthesized waveform

The model learns speaker characteristics directly from reference audio and generates speech matching the target text in the reference voice style.

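The input and output contract above can be pictured as a small data structure. This is a hypothetical sketch for illustration only: `WavTTSRequest` and `duration_seconds` are invented names and do not correspond to the WavTTS or WavTTS-ComfyUI API. The only values taken from this card are the three inputs and the 16 kHz output rate.

```python
from dataclasses import dataclass

SAMPLE_RATE_HZ = 16_000  # output sample rate stated in this model card


@dataclass(frozen=True)
class WavTTSRequest:
    """Hypothetical container for the three inputs; not part of any WavTTS API."""

    reference_audio_path: str
    reference_transcript: str
    target_text: str

    def __post_init__(self):
        # The reference transcript is mandatory, so reject blank values early.
        if not self.reference_transcript.strip():
            raise ValueError("reference_transcript is required")
        if not self.target_text.strip():
            raise ValueError("target_text must not be empty")


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE_HZ):
    """Length in seconds of a mono waveform with num_samples samples."""
    return num_samples / sample_rate
```

At 16 kHz, a generated waveform of 48,000 samples corresponds to 3 seconds of audio.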
## Limitations

- Requires a transcript for the reference audio.
- Voice similarity is not guaranteed.
- Long generations may require chunking.
- Audio quality depends heavily on reference quality.
- The upstream CC BY-NC 4.0 license restricts commercial use.
- Numerical differences may occur between original and converted checkpoints.

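One way to handle the chunking limitation is to split the target text at sentence boundaries and synthesize each piece separately. The sketch below is an assumption about how that could be done, not the method used by WavTTS-ComfyUI; the `chunk_text` name and the 200-character default are illustrative. It recognizes both English and Chinese sentence-ending punctuation.

```python
import re

# A sentence is the shortest run of text ending in sentence punctuation
# (English or Chinese) or at the end of the input, plus trailing whitespace.
# Caveat: this simple rule also splits inside decimals such as "3.14".
_SENTENCE = re.compile(r".+?(?:[.!?。！？]+|$)\s*", flags=re.S)


def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for sentence in _SENTENCE.findall(text):
        if current.strip() and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Each chunk can then be generated with the same reference audio and transcript, and the resulting waveforms concatenated in order.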
## Attribution

### Original Authors

**WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling**

Official repository:

https://github.com/cwx-worst-one/WavTTS

### ComfyUI Integration

https://github.com/Saganaki22/WavTTS-ComfyUI

### Dataset

https://huggingface.co/datasets/amphion/Emilia-Dataset

## License

This Safetensors conversion inherits the licensing terms of the original WavTTS release.

The original model weights are licensed under **CC BY-NC 4.0**.

Please review the upstream model card and license terms before redistribution or commercial use.

## Citation

```bibtex
@article{chen2026wavtts,
  title={WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling},
  author={TODO},
  journal={TODO},
  year={2026}
}
```