---
license: cc-by-nc-4.0
language:
- en
- zh
pipeline_tag: text-to-speech
tags:
- text-to-speech
- zero-shot-tts
- waveform-generation
- comfyui
- safetensors
- fp32
- bf16
- wavtts
- voice-cloning
- reference-audio
datasets:
- amphion/Emilia-Dataset
---

# WavTTS 16k Safetensors for ComfyUI

Safetensors conversion of the official **WavTTS** model for use with **WavTTS-ComfyUI**.

![Screenshot 2026-06-04 015737](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/vv1P0WoTgUfWi6MFaYJoH.png)

This release converts the original PyTorch checkpoint into the Safetensors format for safer loading, improved compatibility with ComfyUI workflows, and reduced storage overhead.

## Model Introduction

WavTTS is a zero-shot text-to-speech model that directly generates raw audio waveforms from text using reference-audio prompting.

Unlike token-based TTS systems that generate intermediate acoustic representations, WavTTS models speech directly in the waveform domain, enabling highly natural speech synthesis while preserving speaker characteristics from short reference samples.

### Links

- **ComfyUI Node:** https://github.com/Saganaki22/WavTTS-ComfyUI
- **Official Repository:** https://github.com/cwx-worst-one/WavTTS
- **Research Paper:** https://arxiv.org/abs/2606.03455

## Usage

This checkpoint is intended for use with:

https://github.com/Saganaki22/WavTTS-ComfyUI

Place the model inside:

```text
ComfyUI/models/wavtts/
```

Example:

```text
ComfyUI/models/wavtts/
├── wavtts-fp32.safetensors
└── wavtts-mixed-bf16.safetensors
```

Restart ComfyUI and load the checkpoint using the **WavTTS Load Model** node.
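
Before restarting, it can help to confirm that the files landed in the right folder. The following is a minimal sketch, not part of WavTTS-ComfyUI: the helper name `find_checkpoints` is hypothetical, and the file names are simply the ones from the example layout above.

```python
from pathlib import Path

# File names taken from the example layout above.
EXPECTED_FILES = ("wavtts-fp32.safetensors", "wavtts-mixed-bf16.safetensors")


def find_checkpoints(comfyui_root):
    """Map each expected file name to its size in bytes, or None if it is missing."""
    model_dir = Path(comfyui_root) / "models" / "wavtts"
    found = {}
    for name in EXPECTED_FILES:
        path = model_dir / name
        found[name] = path.stat().st_size if path.is_file() else None
    return found


if __name__ == "__main__":
    # "ComfyUI" is a placeholder; point this at your own installation root.
    for name, size in find_checkpoints("ComfyUI").items():
        status = "missing" if size is None else f"{size} bytes"
        print(f"{name}: {status}")
```

Only one of the two variants needs to be present, so a single `missing` entry is not necessarily a problem.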

## Model Summary

| Item | Value |
|--------|--------|
| Model | WavTTS |
| Format | Safetensors |
| Task | Zero-Shot Text-to-Speech |
| Sample Rate | 16 kHz |
| Architecture | Direct Waveform Generation |
| Conditioning | Reference Audio + Reference Transcript |
| Intended Platform | ComfyUI |
| Languages | English, Chinese |
| Training Dataset | Emilia Dataset |
| License | CC BY-NC 4.0 |

## Included Variants

| File | Description |
|--------|--------|
| wavtts-fp32.safetensors | Clean FP32 inference checkpoint |
| wavtts-mixed-bf16.safetensors | Mixed BF16 checkpoint optimized for lower VRAM usage |

The original `.pt` checkpoint contains training-related state that is unnecessary for inference.

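The Safetensors files in this release store inference weights only, which makes them smaller. They are also safer to load, because the Safetensors format holds raw tensor data and does not rely on Python pickle.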
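
To see which precision each tensor is stored in, the file header can be read without loading any weights. A Safetensors file begins with an 8-byte little-endian length, followed by a JSON header describing every tensor. The sketch below uses only the standard library; the helper names `read_safetensors_header` and `dtype_summary` are illustrative and not part of any WavTTS tooling.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Parse only the JSON header of a .safetensors file; tensor data is not read."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form string metadata
    return header


def dtype_summary(path):
    """Count tensors per dtype string, for example 'F32' or 'BF16'."""
    header = read_safetensors_header(path)
    return Counter(info["dtype"] for info in header.values())
```

Calling `dtype_summary("ComfyUI/models/wavtts/wavtts-mixed-bf16.safetensors")` (path assumed from the Usage section) should report how many tensors are stored as `BF16` and how many remain `F32`.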

## Intended Use

This model is intended for:

- Zero-shot text-to-speech
- Voice continuation
- Reference-audio conditioned generation
- Voice adaptation from short prompts
- ComfyUI speech workflows
- Local offline TTS generation

A transcript of the reference audio is required by WavTTS.

## Precision Notes

### FP32

Recommended for maximum stability.

### Mixed BF16

Recommended for reduced VRAM usage while preserving numerically sensitive tensors in FP32.

No model weights have been retrained or modified beyond precision conversion.
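
To illustrate why some tensors are kept in FP32, the sketch below rounds a single FP32 value to BF16 and back using only the standard library. BF16 keeps the sign and 8 exponent bits of FP32 but only 7 mantissa bits, so each value carries roughly two to three significant decimal digits. This is a demonstration of the number format, not the conversion script used for this release, and the function names are hypothetical.

```python
import struct


def fp32_to_bf16_bits(x):
    """Round a finite float to BF16 (round-to-nearest-even) and return its 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that dropping the low 16 bits rounds to nearest, ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b):
    """Expand a 16-bit BF16 pattern back to a Python float (exact)."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def bf16_round_trip(x):
    """Return the value x takes after being stored as BF16. NaN and infinity are not handled."""
    return bf16_bits_to_float(fp32_to_bf16_bits(x))


if __name__ == "__main__":
    print(bf16_round_trip(1.0))  # 1.0
    print(bf16_round_trip(0.1))  # 0.10009765625
```

The relative error per value stays within about 2^-8 (roughly 0.4%). Most weights tolerate that, while small errors in numerically sensitive tensors can accumulate, which is the usual motivation for a mixed layout.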

## Architecture

WavTTS performs direct waveform modeling rather than generating intermediate acoustic tokens.

### Inputs

- Reference audio
- Reference transcript
- Target text

### Output

- 16 kHz synthesized waveform

The model learns speaker characteristics directly from reference audio and generates speech matching the target text in the reference voice style.

## Limitations

- Requires a transcript for the reference audio.
- Voice similarity is not guaranteed.
- Long generations may require chunking.
- Audio quality depends heavily on reference quality.
- Commercial use is not permitted under the upstream CC BY-NC 4.0 license.
- Numerical differences may occur between original and converted checkpoints.
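
For the chunking limitation above, one simple approach is to split the target text at sentence boundaries, synthesize each chunk separately, and concatenate the audio. The sketch below covers only the text-splitting step. It is an assumption about how chunking could be done, not a description of what WavTTS-ComfyUI does internally; `chunk_text` and `max_chars` are hypothetical names, and the default of 200 characters is arbitrary.

```python
import re

# Split after ASCII sentence punctuation followed by whitespace,
# or directly after full-width Chinese sentence punctuation.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*")
_CJK_ENDINGS = ("。", "！", "？")


def chunk_text(text, max_chars=200):
    """Group sentences into chunks of at most max_chars characters where possible.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    sentences = [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if not current:
            current = sentence
            continue
        # Re-join with a space, except after Chinese punctuation.
        sep = "" if current.endswith(_CJK_ENDINGS) else " "
        candidate = current + sep + sentence
        if len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(chunk_text("One. Two. Three.", max_chars=10))  # ['One. Two.', 'Three.']
```

Each chunk would then be passed as the target text with the same reference audio and transcript, so that the voice stays consistent across chunks.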

## Attribution

### Original Authors

**WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling**

Official repository:

https://github.com/cwx-worst-one/WavTTS

### ComfyUI Integration

https://github.com/Saganaki22/WavTTS-ComfyUI

### Dataset

https://huggingface.co/datasets/amphion/Emilia-Dataset

## License

This Safetensors conversion inherits the licensing terms of the original WavTTS release.

The original model weights are licensed under **CC BY-NC 4.0**.

Please review the upstream model card and license terms before redistribution or commercial use.

## Citation

```bibtex
@article{chen2026wavtts,
  title={WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling},
  author={TODO},
  journal={arXiv preprint arXiv:2606.03455},
  year={2026}
}
```