---
license: cc-by-nc-4.0
language:
- en
- zh
pipeline_tag: text-to-speech
tags:
- text-to-speech
- zero-shot-tts
- waveform-generation
- comfyui
- safetensors
- fp32
- bf16
- wavtts
- voice-cloning
- reference-audio
datasets:
- amphion/Emilia-Dataset
---

# WavTTS 16k Safetensors for ComfyUI

Safetensors conversion of the official **WavTTS** model for use with **WavTTS-ComfyUI**.

![Screenshot 2026-06-04 015737](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/vv1P0WoTgUfWi6MFaYJoH.png)

This release converts the original PyTorch checkpoint into the Safetensors format for safer loading, improved compatibility with ComfyUI workflows, and reduced storage overhead.

## Model Introduction

WavTTS is a zero-shot text-to-speech model that directly generates raw audio waveforms from text using reference-audio prompting.

Unlike token-based TTS systems that generate intermediate acoustic representations, WavTTS models speech directly in the waveform domain, enabling highly natural speech synthesis while preserving speaker characteristics from short reference samples.

### Links

- **ComfyUI Node:** https://github.com/Saganaki22/WavTTS-ComfyUI
- **Official Repository:** https://github.com/cwx-worst-one/WavTTS
- **Research Paper:** https://arxiv.org/abs/2606.03455

## Usage

This checkpoint is intended for use with:

https://github.com/Saganaki22/WavTTS-ComfyUI

Place the model inside:

```text
ComfyUI/models/wavtts/
```

Example:

```text
ComfyUI/models/wavtts/
├── wavtts-fp32.safetensors
└── wavtts-mixed-bf16.safetensors
```

Restart ComfyUI and load the checkpoint using the **WavTTS Load Model** node.

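To check that the files are in place and see which precision each one stores, the Safetensors header can be read with the Python standard library alone. The sketch below relies only on the published Safetensors layout (an 8-byte little-endian header length followed by a JSON header). The helper names `read_safetensors_header`, `count_dtypes`, and `report` are hypothetical and not part of WavTTS-ComfyUI, and the directory path assumes the script runs from the folder that contains `ComfyUI/`.

```python
import json
import struct
from collections import Counter
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a Safetensors file as a dict."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def count_dtypes(header):
    """Count tensors per dtype, skipping the optional "__metadata__" entry."""
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )


def report(model_dir):
    """Print a per-file dtype summary for every .safetensors file found."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        print(f"Directory not found: {model_dir}")
        return
    for file in sorted(model_dir.glob("*.safetensors")):
        print(file.name, dict(count_dtypes(read_safetensors_header(file))))


if __name__ == "__main__":
    # Assumed install location; adjust to your own ComfyUI root.
    report("ComfyUI/models/wavtts")
```

Only the header is read, so no tensor data is loaded into memory. Safetensors reports dtypes with short codes such as `F32` and `BF16`, so a mixed-precision file would be expected to list more than one code.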
## Model Summary

| Item | Value |
|--------|--------|
| Model | WavTTS |
| Format | Safetensors |
| Task | Zero-Shot Text-to-Speech |
| Sample Rate | 16 kHz |
| Architecture | Direct Waveform Generation |
| Conditioning | Reference Audio + Reference Transcript |
| Intended Platform | ComfyUI |
| Languages | English, Chinese |
| Training Dataset | Emilia Dataset |
| License | CC BY-NC 4.0 |

## Included Variants

| File | Description |
|--------|--------|
| wavtts-fp32.safetensors | Clean FP32 inference checkpoint |
| wavtts-mixed-bf16.safetensors | Mixed BF16 checkpoint optimized for lower VRAM usage |

The original `.pt` checkpoint contains training-related state that is unnecessary for inference.

Safetensors releases store inference weights only, resulting in smaller file sizes and safer loading behavior.

## Intended Use

This model is intended for:

- Zero-shot text-to-speech
- Voice continuation
- Reference-audio conditioned generation
- Voice adaptation from short prompts
- ComfyUI speech workflows
- Local offline TTS generation

A transcript of the reference audio is required by WavTTS.

## Precision Notes

### FP32

Recommended for maximum stability.

### Mixed BF16

Recommended for reduced VRAM usage while preserving numerically sensitive tensors in FP32.

No model weights have been retrained or modified beyond precision conversion.

## Architecture

WavTTS performs direct waveform modeling rather than generating intermediate acoustic tokens.

### Inputs

- Reference audio
- Reference transcript
- Target text

### Output

- 16 kHz synthesized waveform

The model learns speaker characteristics directly from reference audio and generates speech matching the target text in the reference voice style.

## Limitations

- Requires a transcript for the reference audio.
- Voice similarity is not guaranteed.
- Long generations may require chunking.
- Audio quality depends heavily on reference quality.
- Commercial usage may be restricted by the upstream license.
- Numerical differences may occur between original and converted checkpoints.

## Attribution

### Original Authors

**WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling**

Official repository:

https://github.com/cwx-worst-one/WavTTS

### ComfyUI Integration

https://github.com/Saganaki22/WavTTS-ComfyUI

### Dataset

https://huggingface.co/datasets/amphion/Emilia-Dataset

## License

This Safetensors conversion inherits the licensing terms of the original WavTTS release.

The original model weights are licensed under **CC BY-NC 4.0**.

Please review the upstream model card and license terms before redistribution or commercial use.

## Citation

```bibtex
@article{chen2026wavtts,
  title={WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling},
  author={TODO},
  journal={TODO},
  year={2026}
}
```