---
license: mit
---
<!-- # EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting -->
<div align="center">
<p align="center">
<h1>EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting</h1>
<!-- <a href=>Paper</a> | <a href="https://meanaudio.github.io/">Webpage</a> -->
[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2504.12867)
[![Code](https://img.shields.io/badge/Code-Repo-black?style=flat&logo=github&logoColor=white)](https://github.com/yanghaha0908/EmoVoice?tab=readme-ov-file)
[![Hugging Face Space](https://img.shields.io/badge/Space-HuggingFace-orange?logo=huggingface)](https://huggingface.co/spaces/chenxie95/EmoVoice)
[![Hugging Face Dataset](https://img.shields.io/badge/Dataset-HuggingFace-green?logo=huggingface)](https://huggingface.co/datasets/yhaha/EmoVoice-DB)
[![Webpage](https://img.shields.io/badge/Website-DemoPage-pink?logo=googlechrome&logoColor=white)](https://yanghaha0908.github.io/EmoVoice/)
</p>
</div>
## Overview
EmoVoice is an emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained, freestyle natural-language emotion control. It achieves state-of-the-art (SOTA) performance on the English EmoVoice-DB and Chinese Secap test sets.
<!-- ### Model
<div align="center">
<img src="pics/emovoice_overview.png" alt="" width="500">
</div>
### Performance
<table width="100%">
<tr>
<td align="center">
<img src="pics/table2.png" alt="Image description 1" width="333">
</td>
<td align="center">
<img src="pics/table3.png" alt="Image description 2" width="333">
</td>
<td align="center">
<img src="pics/table4.png" alt="Image description 3" width="333">
</td>
</tr>
</table>
-->
<!-- ## Environmental Setup
```bash
### Create a separate environment if needed
conda create -n EmoVoice python=3.10
conda activate EmoVoice
pip install -r requirements.txt
```
## Train and Inference
### Infer with checkpoints
```bash
bash examples/tts/scripts/inference_EmoVoice.sh
bash examples/tts/scripts/inference_EmoVoice-PP.sh
bash examples/tts/scripts/inference_EmoVoice_1.5B.sh
```
### Train from scratch
```bash
# First Stage: Pretrain TTS
bash examples/tts/scripts/pretrain_EmoVoice.sh
bash examples/tts/scripts/pretrain_EmoVoice-PP.sh
bash examples/tts/scripts/pretrain_EmoVoice_1.5B.sh
# Second Stage: Finetune Emotional TTS
bash examples/tts/scripts/ft_EmoVoice.sh
bash examples/tts/scripts/ft_EmoVoice-PP.sh
bash examples/tts/scripts/ft_EmoVoice_1.5B.sh
``` -->
### Checkpoints
English model checkpoints for EmoVoice (0.5B), EmoVoice (1.5B), and EmoVoice-PP (0.5B) are uploaded to this repository.
Qwen2.5-0.5B-phn, the Qwen2.5-0.5B tokenizer with a phoneme-extended vocabulary, is also uploaded.
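
The uploaded files can be fetched in one go with the `huggingface_hub` library. The following is a minimal sketch, not an official download script: the repository id `yhaha/EmoVoice` is taken from this model card's Hugging Face URL, while the `ckpts/EmoVoice` target directory and the two helper function names are assumptions made for illustration. The layout of the downloaded files is not documented here, so the sketch only lists what it finds.

```python
from pathlib import Path

# Assumption: repo id taken from this model card's Hugging Face URL.
REPO_ID = "yhaha/EmoVoice"


def download_checkpoints(local_dir: str = "ckpts/EmoVoice") -> Path:
    """Download every file in the model repo into local_dir (hypothetical path)."""
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))


def list_checkpoint_files(root: Path) -> list[str]:
    """Return the sorted relative paths of all files found under root."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
```

Calling `list_checkpoint_files(download_checkpoints())` should then show which checkpoint and tokenizer files were retrieved, which is a convenient way to find the paths to pass to the inference scripts.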
<!-- - Model Checkpoints can be found on hugging face: https://huggingface.co/yhaha/EmoVoice. -->
<!-- [EmoVoice](https://drive.google.com/file/d/1WLVshIIaAXtP0wrRPd7KUeomuNIwWL96/view?usp=sharing)
[EmoVoice-PP](https://drive.google.com/file/d/1NSDW8dsxXMdwPeoOdmAyiK3ueLgnePnN/view?usp=sharing) -->
<!-- ### Datasets
- Datasets for Pretraining TTS: [VoiceAssistant](https://huggingface.co/datasets/worstchan/VoiceAssistant-400K-SLAM-Omni) and [Belle](https://huggingface.co/datasets/worstchan/Belle_1.4M-SLAM-Omni).
- Datasets for Finetuning Emotional TTS: [EmoVoice-DB](https://huggingface.co/datasets/yhaha/EmoVoice-DB) and part of [laions_got_talent](https://huggingface.co/datasets/laion/laions_got_talent) (the part we use is also uploaded to [EmoVoice-DB](https://huggingface.co/datasets/yhaha/EmoVoice-DB)).
-->
<!-- ## Acknowledgements
- Our code is built on [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM).
- We also thank [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for its valuable repo.
-->
<!-- ## [Paper](https://arxiv.org/abs/2504.12867); [Demo Page](https://yanghaha0908.github.io/EmoVoice/); -->
## Citation
If you find our work useful, please cite it as:
```bibtex
@article{yang2025emovoice,
title={EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting},
author={Yang, Guanrou and Yang, Chen and Chen, Qian and Ma, Ziyang and Chen, Wenxi and Wang, Wen and Wang, Tianrui and Yang, Yifan and Niu, Zhikang and Liu, Wenrui and others},
journal={arXiv preprint arXiv:2504.12867},
year={2025}
}
```
<!-- Paper link: https://arxiv.org/abs/2504.12867 -->
<!-- ## License
Our code is released under MIT License. The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.
-->