---
license: cc-by-nc-4.0
language:
- en
- zh
pipeline_tag: text-to-speech
tags:
- text-to-speech
- zero-shot-tts
- waveform-generation
- comfyui
- safetensors
- fp32
- bf16
- wavtts
- voice-cloning
- reference-audio
datasets:
- amphion/Emilia-Dataset
---
# WavTTS 16k Safetensors for ComfyUI
Safetensors conversion of the official **WavTTS** model for use with **WavTTS-ComfyUI**.
![Screenshot 2026-06-04 015737](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/vv1P0WoTgUfWi6MFaYJoH.png)
This release converts the original PyTorch checkpoint into the Safetensors format for safer loading, improved compatibility with ComfyUI workflows, and reduced storage overhead.
## Model Introduction
WavTTS is a zero-shot text-to-speech model that directly generates raw audio waveforms from text using reference-audio prompting.
Unlike token-based TTS systems that generate intermediate acoustic representations, WavTTS models speech directly in the waveform domain, enabling highly natural speech synthesis while preserving speaker characteristics from short reference samples.
### Links
- **ComfyUI Node:** https://github.com/Saganaki22/WavTTS-ComfyUI
- **Official Repository:** https://github.com/cwx-worst-one/WavTTS
- **Research Paper:** https://arxiv.org/abs/2606.03455
## Usage
This checkpoint is intended for use with:
https://github.com/Saganaki22/WavTTS-ComfyUI
Place the model inside:
```text
ComfyUI/models/wavtts/
```
Example:
```text
ComfyUI/models/wavtts/
├── wavtts-fp32.safetensors
└── wavtts-mixed-bf16.safetensors
```
Restart ComfyUI and load the checkpoint using the **WavTTS Load Model** node.
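As a quick sanity check before restarting ComfyUI, a short script can confirm the files are in the expected folder. This is a minimal sketch: the directory and file names come from the example layout above, while `find_wavtts_checkpoints` is a hypothetical helper and not part of WavTTS-ComfyUI.

```python
from pathlib import Path

# File names taken from the example layout above.
EXPECTED_FILES = ("wavtts-fp32.safetensors", "wavtts-mixed-bf16.safetensors")


def find_wavtts_checkpoints(comfyui_root):
    """Return {file name: size in bytes} for the expected checkpoints that exist.

    Hypothetical helper for illustration; not part of WavTTS-ComfyUI.
    """
    model_dir = Path(comfyui_root) / "models" / "wavtts"
    found = {}
    for name in EXPECTED_FILES:
        path = model_dir / name
        if path.is_file():
            found[name] = path.stat().st_size
    return found


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains ComfyUI/.
    checkpoints = find_wavtts_checkpoints("ComfyUI")
    if not checkpoints:
        print("No WavTTS checkpoints found in ComfyUI/models/wavtts/")
    for name, size in checkpoints.items():
        print(f"{name}: {size / 1e6:.1f} MB")
```

Only one of the two variants needs to be present; the script simply reports whichever files it finds.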
## Model Summary
| Item | Value |
|--------|--------|
| Model | WavTTS |
| Format | Safetensors |
| Task | Zero-Shot Text-to-Speech |
| Sample Rate | 16 kHz |
| Architecture | Direct Waveform Generation |
| Conditioning | Reference Audio + Reference Transcript |
| Intended Platform | ComfyUI |
| Languages | English, Chinese |
| Training Dataset | Emilia Dataset |
| License | CC BY-NC 4.0 |
## Included Variants
| File | Description |
|--------|--------|
| wavtts-fp32.safetensors | Clean FP32 inference checkpoint |
| wavtts-mixed-bf16.safetensors | Mixed BF16 checkpoint optimized for lower VRAM usage |
The original `.pt` checkpoint contains training-related state that is unnecessary for inference.
Safetensors releases store inference weights only, resulting in smaller file sizes and safer loading behavior.
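The idea of dropping training state can be sketched in plain Python. The checkpoint layout here is an assumption: the wrapper keys `model` and `state_dict` are common PyTorch conventions, not confirmed details of the WavTTS checkpoint, and `extract_inference_weights` is a hypothetical helper. In a real conversion the input would come from `torch.load` and the result would be written with `safetensors.torch.save_file`.

```python
# Wrapper keys that often hold the weights in a training checkpoint.
# Assumed for illustration; not confirmed for the WavTTS checkpoint.
WEIGHT_KEYS = ("model", "state_dict")


def extract_inference_weights(checkpoint):
    """Return only the weight mapping from a training checkpoint dict."""
    for key in WEIGHT_KEYS:
        if isinstance(checkpoint.get(key), dict):
            return dict(checkpoint[key])
    # No known wrapper key: assume the checkpoint is already a flat state dict.
    return dict(checkpoint)


# Toy checkpoint with lists standing in for tensors.
checkpoint = {
    "model": {"encoder.weight": [0.1, 0.2], "decoder.bias": [0.0]},
    "optimizer": {"step": 1000},  # training-only state, dropped below
    "epoch": 12,                  # training-only state, dropped below
}
weights = extract_inference_weights(checkpoint)
print(sorted(weights))  # ['decoder.bias', 'encoder.weight']
```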
## Intended Use
This model is intended for:
- Zero-shot text-to-speech
- Voice continuation
- Reference-audio conditioned generation
- Voice adaptation from short prompts
- ComfyUI speech workflows
- Local offline TTS generation
WavTTS requires a transcript of the reference audio.
## Precision Notes
### FP32
Recommended for maximum stability.
### Mixed BF16
Recommended when VRAM is limited. Weights are stored in BF16 except for numerically sensitive tensors, which are kept in FP32.
No model weights have been retrained or modified beyond precision conversion.
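To see which precision each tensor in a checkpoint uses, the Safetensors header can be read without loading any weights. The sketch below relies only on the published Safetensors file layout (an 8-byte little-endian header length followed by a JSON header); `summarize_dtypes` is a hypothetical helper, not part of WavTTS or WavTTS-ComfyUI.

```python
import json
import struct
from collections import Counter


def summarize_dtypes(path):
    """Count tensors per dtype by reading only the safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )
```

Calling `summarize_dtypes` on `wavtts-mixed-bf16.safetensors` would be expected to report both `BF16` and `F32` entries, and only `F32` for the FP32 file, though the exact counts are not stated here.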
## Architecture
WavTTS performs direct waveform modeling rather than generating intermediate acoustic tokens.
### Inputs
- Reference audio
- Reference transcript
- Target text
### Output
- 16 kHz synthesized waveform
The model learns speaker characteristics directly from reference audio and generates speech matching the target text in the reference voice style.
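For handling the output outside ComfyUI, a 16 kHz waveform can be written to a WAV file with the Python standard library alone. This sketch assumes the waveform is a mono sequence of floats in [-1.0, 1.0]; the actual output layout of the node is not specified here, and `save_wav_16k` and `sine_wave` are hypothetical helpers.

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000  # WavTTS output rate stated above


def save_wav_16k(path, samples):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM at 16 kHz."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


def sine_wave(freq_hz, seconds):
    """Generate a test tone as a stand-in for model output."""
    n = int(SAMPLE_RATE * seconds)
    return [0.5 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]
```

For example, `save_wav_16k("tone.wav", sine_wave(440.0, 0.5))` writes half a second of a 440 Hz tone.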
## Limitations
- Requires a transcript for the reference audio.
- Voice similarity is not guaranteed.
- Long generations may require chunking.
- Audio quality depends heavily on reference quality.
- Commercial use is restricted: the upstream license is CC BY-NC 4.0, which permits non-commercial use only.
- Numerical differences may occur between original and converted checkpoints.
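For the chunking limitation, one simple approach is to split the target text at sentence boundaries and synthesize each piece separately. The sketch below is an assumption about how this could be done, not the method used by WavTTS or WavTTS-ComfyUI; `chunk_text` and the 200-character default are hypothetical, and a suitable limit would need to be found empirically.

```python
import re

# A sentence is a run of non-terminators followed by any closing punctuation.
# Covers English and Chinese sentence-ending marks; decimals such as "3.14"
# will be split at the dot, which this simple pattern does not handle.
SENTENCE_RE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def chunk_text(text, max_chars=200):
    """Group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than
    cut mid-sentence. Sentences within a chunk are joined with a space.
    """
    sentences = [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized with the same reference audio and transcript, and the resulting waveforms concatenated.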
## Attribution
### Original Authors
**WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling**
Official repository:
https://github.com/cwx-worst-one/WavTTS
### ComfyUI Integration
https://github.com/Saganaki22/WavTTS-ComfyUI
### Dataset
https://huggingface.co/datasets/amphion/Emilia-Dataset
## License
This Safetensors conversion inherits the licensing terms of the original WavTTS release.
The original model weights are licensed under **CC BY-NC 4.0**.
Please review the upstream model card and license terms before redistribution or commercial use.
## Citation
```bibtex
@article{chen2026wavtts,
title={WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling},
author={TODO},
journal={arXiv preprint arXiv:2606.03455},
year={2026}
}
```