---
license: cc-by-nc-4.0
language:
- en
- zh
pipeline_tag: text-to-speech
tags:
- text-to-speech
- zero-shot-tts
- waveform-generation
datasets:
- amphion/Emilia-Dataset
---

# WavTTS

This repository hosts the official WavTTS checkpoint for zero-shot text-to-speech generation at 16 kHz. Please refer to the [GitHub repository](https://github.com/cwx-worst-one/WavTTS) and [paper](https://arxiv.org/abs/2606.03455) for more details.

## Files

- `model_1200000.pt`: official WavTTS checkpoint.
- `vocab.txt`: vocabulary file matching the released checkpoint.

## Usage

This checkpoint is meant to be used with the WavTTS codebase. Install it from source:

```bash
git clone https://github.com/cwx-worst-one/WavTTS
cd WavTTS
pip install -e .
```

Run inference with the default checkpoint:

```bash
wavtts_infer-cli \
  --model WavTTS \
  --ref_audio "path/to/reference.wav" \
  --ref_text "The transcription of the reference audio." \
  --gen_text "The text you want to synthesize."
```
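
When calling the CLI from your own scripts, a small wrapper can check inputs before inference starts. The sketch below is hypothetical: the function name `wavtts_say` and the `DRY_RUN` variable are illustrative and not part of WavTTS, and it passes only the four flags shown above. Where the audio is written follows the CLI's own defaults, which this card does not document.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of WavTTS) around wavtts_infer-cli.
# It checks its inputs, then passes only the flags documented in this card.
# Set DRY_RUN=1 to print the command instead of running it.

wavtts_say() {
  local ref_audio="$1" ref_text="$2" gen_text="$3"

  # Fail early if the reference clip is missing.
  if [[ ! -f "$ref_audio" ]]; then
    echo "reference audio not found: $ref_audio" >&2
    return 1
  fi
  # There is nothing to synthesize without target text.
  if [[ -z "$gen_text" ]]; then
    echo "gen_text is empty" >&2
    return 1
  fi

  # Store the command in an array so texts with spaces stay single arguments.
  local cmd=(wavtts_infer-cli
    --model WavTTS
    --ref_audio "$ref_audio"
    --ref_text "$ref_text"
    --gen_text "$gen_text")

  if [[ "${DRY_RUN:-0}" == 1 ]]; then
    # %q shell-quotes each argument so the printed line can be pasted back.
    printf '%q ' "${cmd[@]}"
    echo
  else
    "${cmd[@]}"
  fi
}
```

After sourcing the file, `DRY_RUN=1 wavtts_say path/to/reference.wav "The transcription of the reference audio." "The text you want to synthesize."` prints the command without running it; drop `DRY_RUN=1` to run inference.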

The default WavTTS configuration downloads `model_1200000.pt` from this repository automatically. To point to the files explicitly, set the following keys in the WavTTS configuration:

```toml
ckpt_file = "hf://worstchan/WavTTS/model_1200000.pt"
vocab_file = "infer/examples/vocab.txt"
```
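
The `hf://` value above names a file in this Hugging Face repository. A minimal sketch for downloading the files ahead of time, assuming the `hf://<user>/<repo>/<path>` convention (here repository `worstchan/WavTTS`, file `model_1200000.pt`) and that `huggingface_hub` is installed so the `huggingface-cli` tool is available. The helper names and the `ckpts/WavTTS` directory are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of WavTTS) for fetching the released files.
# Assumption: hf://<user>/<repo>/<path> means file <path> in the Hugging Face
# repository <user>/<repo>, as with the cached_path library's hf:// scheme.

# Print "<repo_id> <path>" for an hf:// URI, or fail if it is malformed.
split_hf_uri() {
  if [[ "$1" != hf://*/*/* ]]; then
    echo "not an hf://<user>/<repo>/<path> URI: $1" >&2
    return 1
  fi
  local rest="${1#hf://}"
  local user="${rest%%/*}"
  rest="${rest#*/}"
  local repo="${rest%%/*}"
  local path="${rest#*/}"
  printf '%s/%s %s\n' "$user" "$repo" "$path"
}

# Download one hf:// file into a local directory (default is hypothetical).
# Requires huggingface-cli from the huggingface_hub package.
fetch_hf_uri() {
  local uri="$1" local_dir="${2:-ckpts/WavTTS}"
  local parts repo_id path
  parts=$(split_hf_uri "$uri") || return 1
  read -r repo_id path <<< "$parts"
  huggingface-cli download "$repo_id" "$path" --local-dir "$local_dir"
}
```

For example, `fetch_hf_uri hf://worstchan/WavTTS/model_1200000.pt` followed by `fetch_hf_uri hf://worstchan/WavTTS/vocab.txt` places both files under `ckpts/WavTTS`. `ckpt_file` and `vocab_file` could then point at the downloaded copies, assuming the codebase accepts local paths for these keys, as the relative `vocab_file` path above suggests.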

## License

The released model weights are licensed under CC BY-NC 4.0 because of the license restrictions of the Emilia training dataset. The WavTTS codebase is released under the MIT License.

## Citation

If you find WavTTS useful, please cite the paper:

```bibtex
@article{chen2026wavtts,
  title={WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling},
  author={TODO},
  journal={TODO},
  year={2026}
}
```