---
license: cc-by-nc-4.0
language:
- eng
- zho
tags:
- tts
- text-to-speech
- speech-synthesis
- voice-cloning
library_name: ttsdb
pipeline_tag: text-to-speech
---

# E2 TTS

> **This is a mirror of the original weights for use with [TTSDB](https://github.com/ttsds/ttsdb).**
>
> Original weights: [https://huggingface.co/SWivid/E2-TTS](https://huggingface.co/SWivid/E2-TTS)
> Original code: [https://github.com/SWivid/F5-TTS](https://github.com/SWivid/F5-TTS)

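E2 TTS is a non-autoregressive text-to-speech model that uses flow matching with a masked U-Net transformer. It supports English and Chinese and clones a voice from a reference recording and its transcript.
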
## Original Work

E2 TTS was developed by Eskimez et al. Please cite their work if you use this model:

```bibtex
@inproceedings{e2-tts,
  title={{E2 TTS}: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot {TTS}},
  author={Eskimez, Sefik Emre and Wang, Xiaofei and Thakker, Manthan and Li, Canrun and Tsai, Chung-Hsien and Xiao, Zhen and Yang, Hemin and Zhu, Zirun and Tang, Min and Tan, Xu and others},
  booktitle={2024 IEEE Spoken Language Technology Workshop (SLT)},
  pages={682--689},
  year={2024},
  organization={IEEE}
}
```

**Papers:**

- https://ieeexplore.ieee.org/abstract/document/10832320

## Installation

```bash
pip install ttsdb-e2-tts
```

## Usage

```python
from ttsdb_e2_tts import E2TTS

# Load the model (downloads weights automatically)
model = E2TTS(model_id="ttsds/E2 TTS")

# Synthesize speech
audio, sample_rate = model.synthesize(
    text="Hello, this is a test of E2 TTS.",
    reference_audio="path/to/reference.wav",
    text_reference="Transcript of the reference audio.",
    language="en",
)

# Save the output
model.save_audio(audio, sample_rate, "output.wav")
```

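To synthesize several sentences with the same reference voice, the single call above can be wrapped in a loop. The following is a minimal sketch, not part of the TTSDB package: `synthesize_batch`, the `out_dir` parameter, and the `utterance_NNN.wav` naming are hypothetical, and the helper only assumes that `model` exposes `synthesize` and `save_audio` with the arguments shown in the Usage example.

```python
from pathlib import Path


def synthesize_batch(model, texts, reference_audio, text_reference,
                     language="en", out_dir="outputs"):
    """Synthesize each text with one reference voice and return the output paths.

    Hypothetical helper: `model` is assumed to behave like the E2TTS object
    from the Usage example (same `synthesize` and `save_audio` arguments).
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    for index, text in enumerate(texts):
        # Same keyword arguments as the single-utterance call.
        audio, sample_rate = model.synthesize(
            text=text,
            reference_audio=reference_audio,
            text_reference=text_reference,
            language=language,
        )
        # File naming is an illustrative choice, not a TTSDB convention.
        target = out_path / f"utterance_{index:03d}.wav"
        model.save_audio(audio, sample_rate, str(target))
        written.append(target)
    return written
```

With the model loaded as in the Usage example, a call might look like `synthesize_batch(model, ["First sentence.", "Second sentence."], "path/to/reference.wav", "Transcript of the reference audio.")`.
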
## Model Details

| Property | Value |
|----------|-------|
| **Sample Rate** | 24000 Hz |
| **Parameters** | 335M |
| **Architecture** | Non-Autoregressive, Masked, Flow Matching, U-Net Transformer |
| **Languages** | English, Chinese |
| **Release Date** | 2024-10-30 |

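
The 24000 Hz sample rate determines how sample counts map to playback time: duration in seconds is the number of samples divided by the sample rate. The sketch below illustrates that arithmetic. `describe_audio` and `EXPECTED_SAMPLE_RATE` are hypothetical names, and the code assumes `audio` is a one-dimensional sequence of mono samples (a list or NumPy array), which this card does not specify.

```python
EXPECTED_SAMPLE_RATE = 24_000  # Hz, taken from the Model Details table


def describe_audio(audio, sample_rate):
    """Return (num_samples, duration_seconds) after checking the sample rate.

    Assumption: `audio` is a 1-D sequence of mono samples, so len(audio)
    is the number of samples.
    """
    if sample_rate != EXPECTED_SAMPLE_RATE:
        raise ValueError(
            f"expected {EXPECTED_SAMPLE_RATE} Hz, got {sample_rate} Hz"
        )
    num_samples = len(audio)
    return num_samples, num_samples / sample_rate
```

For example, a clip of 48,000 samples at 24000 Hz lasts 2.0 seconds.
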
### Training Data

- [Emilia Dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset) (100,000 hours)

## License

- **Weights:** Creative Commons Attribution-NonCommercial 4.0
- **Code:** MIT License

Please refer to the original repositories for full license terms.

## Links

- **Original Code:** [https://github.com/SWivid/F5-TTS](https://github.com/SWivid/F5-TTS)
- **Original Weights:** [https://huggingface.co/SWivid/E2-TTS](https://huggingface.co/SWivid/E2-TTS)
- **TTSDB Package:** [ttsdb-e2-tts](https://pypi.org/project/ttsdb-e2-tts/)
- **TTSDB GitHub:** [https://github.com/ttsds/ttsdb](https://github.com/ttsds/ttsdb)