license: cc-by-nc-4.0
language:
  - en
  - zh
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - zero-shot-tts
  - waveform-generation
  - comfyui
  - safetensors
  - fp32
  - bf16
  - wavtts
  - voice-cloning
  - reference-audio
datasets:
  - amphion/Emilia-Dataset

WavTTS 16k Safetensors for ComfyUI

Safetensors conversion of the official WavTTS model for use with WavTTS-ComfyUI.

This release converts the original PyTorch checkpoint into the Safetensors format for safer loading, improved compatibility with ComfyUI workflows, and reduced storage overhead.

Model Introduction

WavTTS is a zero-shot text-to-speech model that directly generates raw audio waveforms from text using reference-audio prompting.

Unlike token-based TTS systems that generate intermediate acoustic representations, WavTTS models speech directly in the waveform domain, enabling highly natural speech synthesis while preserving speaker characteristics from short reference samples.

Usage

This checkpoint is intended for use with:

https://github.com/Saganaki22/WavTTS-ComfyUI

Place the model inside:

ComfyUI/models/wavtts/

Example:

ComfyUI/models/wavtts/
├── wavtts-fp32.safetensors
└── wavtts-mixed-bf16.safetensors

Restart ComfyUI and load the checkpoint using the WavTTS Load Model node.
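
Before restarting, it can help to confirm the files landed in the expected folder. The following is a minimal sketch that uses only the Python standard library; the function name `find_checkpoints` and the `"ComfyUI"` root path are illustrative assumptions, while the file names come from the example layout above.

```python
from pathlib import Path

# File names taken from the example layout above.
EXPECTED_FILES = ("wavtts-fp32.safetensors", "wavtts-mixed-bf16.safetensors")


def find_checkpoints(comfyui_root):
    """Map each expected checkpoint name to True if it exists under models/wavtts/."""
    model_dir = Path(comfyui_root) / "models" / "wavtts"
    return {name: (model_dir / name).is_file() for name in EXPECTED_FILES}


if __name__ == "__main__":
    # "ComfyUI" is an assumed install location; point this at your own checkout.
    for name, present in find_checkpoints("ComfyUI").items():
        print(f"{name}: {'found' if present else 'missing'}")
```

The script only reports which of the two variants is present; it does not load or validate the weights.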

Model Summary

Item	Value
Model	WavTTS
Format	Safetensors
Task	Zero-Shot Text-to-Speech
Sample Rate	16 kHz
Architecture	Direct Waveform Generation
Conditioning	Reference Audio + Reference Transcript
Intended Platform	ComfyUI
Languages	English, Chinese
Training Dataset	Emilia Dataset
License	CC BY-NC 4.0

Included Variants

File	Description
wavtts-fp32.safetensors	Clean FP32 inference checkpoint
wavtts-mixed-bf16.safetensors	Mixed BF16 checkpoint optimized for lower VRAM usage

The original .pt checkpoint contains training-related state that is unnecessary for inference.

The Safetensors files store inference weights only, which makes them smaller. Because the Safetensors format does not rely on pickle, loading a file cannot execute arbitrary code.

Intended Use

This model is intended for:

Zero-shot text-to-speech
Voice continuation
Reference-audio conditioned generation
Voice adaptation from short prompts
ComfyUI speech workflows
Local offline TTS generation

A transcript of the reference audio is required by WavTTS.

Precision Notes

FP32

Recommended for maximum stability.

Mixed BF16

Recommended for reduced VRAM usage while preserving numerically sensitive tensors in FP32.

No model weights have been retrained or modified beyond precision conversion.
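
To see which tensors a given file stores in which precision, the Safetensors header can be read without loading any tensor data. The sketch below relies only on the published Safetensors file layout (an 8-byte little-endian header length followed by a JSON header); the helper names `read_safetensors_header` and `dtype_counts` are illustrative and not part of WavTTS or WavTTS-ComfyUI.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # The file starts with an unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional string-to-string map, not a tensor entry.
    header.pop("__metadata__", None)
    return header


def dtype_counts(header):
    """Count tensors per dtype string, e.g. "F32" or "BF16"."""
    return Counter(entry["dtype"] for entry in header.values())
```

Running `dtype_counts(read_safetensors_header("wavtts-mixed-bf16.safetensors"))` should show a mix of `BF16` and `F32` entries if the mixed checkpoint is laid out as described above. This card does not list which tensors stay in FP32, so the header is the place to check.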

Architecture

WavTTS performs direct waveform modeling rather than generating intermediate acoustic tokens.

Inputs

Reference audio
Reference transcript
Target text

Output

16 kHz synthesized waveform

The model learns speaker characteristics directly from reference audio and generates speech matching the target text in the reference voice style.
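
The three inputs and the fixed output rate can be summarized in a small data structure. This is a hypothetical sketch for illustration only: `SynthesisRequest` and `duration_seconds` are assumed names and do not correspond to any WavTTS or ComfyUI node API. The only facts taken from this card are the three inputs, the required transcript, and the 16 kHz output rate.

```python
from dataclasses import dataclass

SAMPLE_RATE_HZ = 16_000  # output sample rate stated in this card


@dataclass(frozen=True)
class SynthesisRequest:
    """Hypothetical container for the three inputs; not part of any WavTTS API."""

    reference_audio_path: str
    reference_transcript: str
    target_text: str

    def __post_init__(self):
        # The card states that a transcript of the reference audio is required.
        if not self.reference_transcript.strip():
            raise ValueError("a transcript of the reference audio is required")
        if not self.target_text.strip():
            raise ValueError("target text must not be empty")


def duration_seconds(num_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Length in seconds of a mono waveform with num_samples samples."""
    return num_samples / sample_rate_hz
```

For example, a generated waveform of 48,000 samples at 16 kHz corresponds to 3 seconds of audio.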

Limitations

Requires a transcript for the reference audio.
Voice similarity is not guaranteed.
Long generations may require chunking.
Audio quality depends heavily on reference quality.
Commercial use is not permitted under the upstream CC BY-NC 4.0 license.
Small numerical differences may occur between the original and converted checkpoints, particularly with the mixed BF16 variant.
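
For the chunking limitation, one possible approach is to split the target text at sentence boundaries and synthesize each chunk separately. The sketch below is an assumption about how that could be done, not the method used by WavTTS or WavTTS-ComfyUI; `chunk_text` is an illustrative name and `max_chars=200` is an arbitrary default rather than a documented model limit.

```python
import re

# Split after English sentence punctuation followed by whitespace,
# or directly after Chinese sentence punctuation (no space required).
_SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s+|(?<=[。！？])")


def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    max_chars is an illustrative default, not a limit documented for WavTTS.
    A single sentence longer than max_chars is kept whole as its own chunk.
    Sentences within a chunk are rejoined with a single space.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be generated with the same reference audio and transcript, and the resulting waveforms concatenated. How well the voice stays consistent across chunks is not covered by this card.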

Attribution

Original Authors

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

Official repository:

https://github.com/cwx-worst-one/WavTTS

ComfyUI Integration

https://github.com/Saganaki22/WavTTS-ComfyUI

Dataset

https://huggingface.co/datasets/amphion/Emilia-Dataset

License

This Safetensors conversion inherits the licensing terms of the original WavTTS release.

The original model weights are licensed under CC BY-NC 4.0.

Please review the upstream model card and license terms before redistribution or commercial use.

Citation

@article{chen2026wavtts,
  title={WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling},
  author={TODO},
  journal={TODO},
  year={2026}
}
