	---
	license: cc-by-nc-4.0
	language:
	- en
	- zh
	pipeline_tag: text-to-speech
	tags:
	- text-to-speech
	- zero-shot-tts
	- waveform-generation
	- comfyui
	- safetensors
	- fp32
	- bf16
	- wavtts
	- voice-cloning
	- reference-audio
	datasets:
	- amphion/Emilia-Dataset
	---

	# WavTTS 16k Safetensors for ComfyUI

	Safetensors conversion of the official WavTTS model for use with WavTTS-ComfyUI.

	![Screenshot 2026-06-04 015737](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/vv1P0WoTgUfWi6MFaYJoH.png)

	This release converts the original PyTorch checkpoint into the Safetensors format for safer loading, improved compatibility with ComfyUI workflows, and reduced storage overhead.

	## Model Introduction

	WavTTS is a zero-shot text-to-speech model that directly generates raw audio waveforms from text using reference-audio prompting.

	Unlike token-based TTS systems, which generate intermediate acoustic representations, WavTTS models speech directly in the waveform domain. The goal is natural-sounding synthesis that preserves speaker characteristics from short reference samples.

	### Links

	- ComfyUI Node: https://github.com/Saganaki22/WavTTS-ComfyUI
	- Official Repository: https://github.com/cwx-worst-one/WavTTS
	- Research Paper: https://arxiv.org/abs/2606.03455

	## Usage

	This checkpoint is intended for use with:

	https://github.com/Saganaki22/WavTTS-ComfyUI

	Place the model inside:

	```text
	ComfyUI/models/wavtts/
	```

	Example:

	```text
	ComfyUI/models/wavtts/
	├── wavtts-fp32.safetensors
	└── wavtts-mixed-bf16.safetensors
	```

	Restart ComfyUI and load the checkpoint using the WavTTS Load Model node.
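Before restarting, it can help to confirm that the files landed in the expected folder. The following is a minimal sketch using only the Python standard library; the function name `find_wavtts_checkpoints` is hypothetical and is not part of WavTTS-ComfyUI.

```python
from pathlib import Path


def find_wavtts_checkpoints(comfyui_root):
    """Return the .safetensors files found in <comfyui_root>/models/wavtts/.

    Hypothetical helper for illustration; it only inspects the folder layout
    described in this README and does not load any model.
    """
    model_dir = Path(comfyui_root) / "models" / "wavtts"
    if not model_dir.is_dir():
        # The folder does not exist yet, so nothing can be listed.
        return []
    return sorted(p for p in model_dir.iterdir() if p.suffix == ".safetensors")
```

Calling `find_wavtts_checkpoints("ComfyUI")` from the directory that contains your ComfyUI install should return the paths of both variants if they were placed as in the example tree.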

	## Model Summary

	| Item | Value |
	|--------|--------|
	| Model | WavTTS |
	| Format | Safetensors |
	| Task | Zero-Shot Text-to-Speech |
	| Sample Rate | 16 kHz |
	| Architecture | Direct Waveform Generation |
	| Conditioning | Reference Audio + Reference Transcript |
	| Intended Platform | ComfyUI |
	| Languages | English, Chinese |
	| Training Dataset | Emilia Dataset |
	| License | CC BY-NC 4.0 |

	## Included Variants

	| File | Description |
	|--------|--------|
	| wavtts-fp32.safetensors | Clean FP32 inference checkpoint |
	| wavtts-mixed-bf16.safetensors | Mixed BF16 checkpoint optimized for lower VRAM usage |

	The original `.pt` checkpoint contains training-related state that is not needed for inference.

	The Safetensors files in this release store inference weights only, which makes them smaller and safer to load.
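As a rough illustration of what dropping training state can look like, here is a minimal sketch. It is not the script used to produce this release. The top-level key name `model` is an assumption, as are the function names, so inspect the keys of the real checkpoint before relying on it. The `torch.load` and `safetensors.torch.save_file` calls are imported lazily so the pure-Python helper works without PyTorch installed.

```python
def extract_inference_weights(checkpoint, weights_key="model"):
    """Return only the weight mapping from a loaded checkpoint dict.

    `weights_key` is an assumed name for the entry holding the weights.
    Pass None if the checkpoint is already a flat name -> tensor mapping.
    """
    if weights_key is None:
        return dict(checkpoint)
    if weights_key not in checkpoint:
        raise KeyError(
            f"{weights_key!r} not found; available keys: {sorted(checkpoint)}"
        )
    # Everything else (optimizer state, step counters, ...) is left behind.
    return dict(checkpoint[weights_key])


def convert_checkpoint(pt_path, out_path, weights_key="model"):
    """Load a .pt checkpoint and write its weights as a Safetensors file."""
    # Imported here so the helper above has no hard dependency on PyTorch.
    import torch
    from safetensors.torch import save_file

    # weights_only=True avoids unpickling arbitrary Python objects.
    checkpoint = torch.load(pt_path, map_location="cpu", weights_only=True)
    weights = extract_inference_weights(checkpoint, weights_key)
    # save_file rejects non-contiguous tensors, so make each one contiguous
    # and skip any non-tensor entries.
    tensors = {
        name: value.contiguous()
        for name, value in weights.items()
        if isinstance(value, torch.Tensor)
    }
    save_file(tensors, out_path)
```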

	## Intended Use

	This model is intended for:

	- Zero-shot text-to-speech
	- Voice continuation
	- Reference-audio conditioned generation
	- Voice adaptation from short prompts
	- ComfyUI speech workflows
	- Local offline TTS generation

	A transcript of the reference audio is required by WavTTS.

	## Precision Notes

	### FP32

	Recommended for maximum stability.

	### Mixed BF16

	Recommended when VRAM is limited. Numerically sensitive tensors are kept in FP32.

	No model weights have been retrained or modified beyond precision conversion.
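To see which precision each tensor in a file actually uses, you can read the Safetensors header directly. A Safetensors file starts with an 8-byte little-endian length, followed by a JSON header that lists each tensor's dtype, shape, and byte offsets. The sketch below uses only the standard library; the function names are illustrative.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the tensor entries from a Safetensors file header."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize_dtypes(header):
    """Return {dtype: (tensor_count, total_bytes)} for a parsed header."""
    counts = Counter()
    sizes = Counter()
    for info in header.values():
        begin, end = info["data_offsets"]
        counts[info["dtype"]] += 1
        sizes[info["dtype"]] += end - begin
    return {dtype: (counts[dtype], sizes[dtype]) for dtype in counts}
```

For the mixed BF16 file, the summary should contain both `BF16` and `F32` entries; for the FP32 file, you would expect `F32` only.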

	## Architecture

	WavTTS performs direct waveform modeling rather than generating intermediate acoustic tokens.

	### Inputs

	- Reference audio
	- Reference transcript
	- Target text

	### Output

	- 16 kHz synthesized waveform

	The model learns speaker characteristics directly from reference audio and generates speech matching the target text in the reference voice style.
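The three inputs and the 16 kHz output can be captured in a small container like the one below. This is only a sketch of the data the model needs. The class and field names are hypothetical and do not mirror the ComfyUI node's actual parameters.

```python
from dataclasses import dataclass

OUTPUT_SAMPLE_RATE = 16_000  # Hz, as stated in the model summary


@dataclass(frozen=True)
class SynthesisRequest:
    """Hypothetical bundle of the three inputs WavTTS needs."""

    reference_audio_path: str
    reference_transcript: str
    target_text: str

    def __post_init__(self):
        # The reference transcript is mandatory for WavTTS.
        if not self.reference_transcript.strip():
            raise ValueError("reference_transcript must not be empty")
        if not self.target_text.strip():
            raise ValueError("target_text must not be empty")


def expected_num_samples(duration_seconds):
    """Number of output samples for a clip of the given length at 16 kHz."""
    return round(duration_seconds * OUTPUT_SAMPLE_RATE)
```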

	## Limitations

	- Requires a transcript for the reference audio.
	- Voice similarity is not guaranteed.
	- Long generations may require chunking.
	- Audio quality depends heavily on reference quality.
	- Commercial use is not permitted under the upstream CC BY-NC 4.0 license.
	- Numerical differences may occur between original and converted checkpoints.
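For the chunking limitation, one possible approach is to split the target text at sentence boundaries and pack the sentences into chunks below a length budget. The sketch below is an assumption about how you might do this yourself, not a description of how WavTTS-ComfyUI handles long text. The 200-character default is arbitrary.

```python
import re

# English terminators must be followed by whitespace (so "3.14" stays whole);
# Chinese terminators split immediately, after the last one in a run.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])(?![。！？])\s*")


def split_into_chunks(text, max_chars=200):
    """Group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]
    chunks = []
    current = ""
    for sentence in sentences:
        if current:
            # No space is needed after Chinese punctuation.
            sep = "" if current[-1] in "。！？" else " "
            candidate = current + sep + sentence
        else:
            candidate = sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately with the same reference audio and transcript, and the resulting waveforms concatenated.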

	## Attribution

	### Original Authors

	WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

	Official repository:

	https://github.com/cwx-worst-one/WavTTS

	### ComfyUI Integration

	https://github.com/Saganaki22/WavTTS-ComfyUI

	### Dataset

	https://huggingface.co/datasets/amphion/Emilia-Dataset

	## License

	This Safetensors conversion inherits the licensing terms of the original WavTTS release.

	The original model weights are licensed under CC BY-NC 4.0.

	Please review the upstream model card and license terms before redistribution or commercial use.

	## Citation

	```bibtex
	@article{chen2026wavtts,
	title={WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling},
	author={TODO},
	journal={TODO},
	year={2026}
	}
	```