metadata
title: E2 TTS
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: cc-by-nc-4.0
E2 TTS Text-to-Speech
E2 TTS is a fully non-autoregressive, zero-shot text-to-speech model built on a masked, flow-matching U-Net transformer.
Features
- Zero-shot voice cloning
- Multilingual support: English and Chinese
- High-quality 24 kHz audio output
Usage
- Upload a reference audio clip (3-10 seconds recommended)
- Enter the transcript of the reference audio
- Enter the text you want to synthesize
- Select the language
- Click "Synthesize"
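Before uploading, it can help to confirm that the reference clip falls in the recommended 3 to 10 second range. The following is a minimal sketch using only the Python standard library; it assumes the clip is an uncompressed PCM WAV file, and the function names `clip_duration_seconds` and `is_recommended_length` are hypothetical helpers, not part of this Space.

```python
import wave


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration is the number of frames divided by frames per second.
        return wav.getnframes() / wav.getframerate()


def is_recommended_length(path, min_seconds=3.0, max_seconds=10.0):
    """Check a clip against the 3-10 second recommendation (inclusive)."""
    duration = clip_duration_seconds(path)
    return min_seconds <= duration <= max_seconds
```

For example, `is_recommended_length("reference.wav")` returns `False` for a one-second clip. Other formats such as MP3 or FLAC would need a different reader.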
Model Information
- Architecture: Non-Autoregressive, Masked, Flow Matching, U-Net Transformer
- Sample Rate: 24000 Hz
- Parameters: 335M
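To give a feel for what "Flow Matching" means at inference time, here is a toy sketch of the sampling loop: start from a noise sample and integrate a velocity field from t = 0 to t = 1 with Euler steps. Everything in it is an illustrative assumption. The real model predicts the velocity with its U-Net transformer over audio features conditioned on text and the masked reference audio, whereas `toy_velocity` below is a closed-form stand-in that points along a straight line toward a known target.

```python
def euler_sample(velocity, x0, steps=32):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for the learned network (assumption, not the model)."""
    def toy_velocity(x, t):
        # Velocity of the straight-line path that reaches `target` at t=1.
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return toy_velocity


noise = [0.3, -1.2, 0.7]    # stand-in for a Gaussian noise sample
target = [1.0, 2.0, -0.5]   # stand-in for the clean audio features
sample = euler_sample(make_toy_velocity(target), noise)
```

Because all output positions are updated together at each step, generation is non-autoregressive: the cost depends on the number of integration steps, not on the length of the utterance.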
Citation
@inproceedings{e2-tts,
title={{E2 TTS}: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot {TTS}},
author={Eskimez, Sefik Emre and Wang, Xiaofei and Thakker, Manthan and Li, Canrun and Tsai, Chung-Hsien and Xiao, Zhen and Yang, Hemin and Zhu, Zirun and Tang, Min and Tan, Xu and others},
booktitle={2024 IEEE Spoken Language Technology Workshop (SLT)},
pages={682--689},
year={2024},
organization={IEEE}
}