---
license: mit
---
<!-- # EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting -->
<div align="center">
<p align="center">
<h1>EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting</h1>
<!-- <a href=>Paper</a> | <a href="https://meanaudio.github.io/">Webpage</a> -->
[Paper](https://arxiv.org/abs/2504.12867)
[Code](https://github.com/yanghaha0908/EmoVoice?tab=readme-ov-file)
[Hugging Face Demo](https://huggingface.co/spaces/chenxie95/EmoVoice)
[Dataset](https://huggingface.co/datasets/yhaha/EmoVoice-DB)
[Demo Page](https://yanghaha0908.github.io/EmoVoice/)
</p>
</div>
## Overview
EmoVoice is an emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained, freestyle natural-language emotion control. It achieves state-of-the-art (SOTA) performance on the English EmoVoice-DB and Chinese Secap test sets.
<!-- ### Model
<div align="center">
<img src="pics/emovoice_overview.png" alt="" width="500">
</div>
### Performance
<table width="100%">
<tr>
<td align="center">
<img src="pics/table2.png" alt="Image description 1" width="333">
</td>
<td align="center">
<img src="pics/table3.png" alt="Image description 2" width="333">
</td>
<td align="center">
<img src="pics/table4.png" alt="Image description 3" width="333">
</td>
</tr>
</table>
-->
<!-- ## Environmental Setup
```bash
### Create a separate environment if needed
conda create -n EmoVoice python=3.10
conda activate EmoVoice
pip install -r requirements.txt
```
## Train and Inference
### Infer with checkpoints
```bash
bash examples/tts/scripts/inference_EmoVoice.sh
bash examples/tts/scripts/inference_EmoVoice-PP.sh
bash examples/tts/scripts/inference_EmoVoice_1.5B.sh
```
### Train from scratch
```bash
# First Stage: Pretrain TTS
bash examples/tts/scripts/pretrain_EmoVoice.sh
bash examples/tts/scripts/pretrain_EmoVoice-PP.sh
bash examples/tts/scripts/pretrain_EmoVoice_1.5B.sh
# Second Stage: Finetune Emotional TTS
bash examples/tts/scripts/ft_EmoVoice.sh
bash examples/tts/scripts/ft_EmoVoice-PP.sh
bash examples/tts/scripts/ft_EmoVoice_1.5B.sh
``` -->
### Checkpoints
English model checkpoints for EmoVoice (0.5B), EmoVoice (1.5B), and EmoVoice-PP (0.5B) have been uploaded.
Qwen2.5-0.5B-phn, the Qwen2.5-0.5B tokenizer with a phoneme-extended vocabulary, has also been uploaded.
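
To make the idea of a phoneme-extended vocabulary concrete, here is a toy sketch in plain Python. It is not the code used to build Qwen2.5-0.5B-phn: the helper `extend_vocab`, the two-word base vocabulary, and the phoneme symbols are all hypothetical and chosen only for illustration. It shows the general pattern of appending new tokens at fresh ids so that the existing text-token ids stay unchanged.

```python
# Toy illustration only: the real Qwen2.5-0.5B-phn vocabulary and its phoneme
# inventory are not reproduced here; every token below is hypothetical.

def extend_vocab(base_vocab: dict[str, int], new_tokens: list[str]) -> dict[str, int]:
    """Return a copy of base_vocab with new_tokens appended at fresh ids."""
    extended = dict(base_vocab)
    next_id = max(extended.values(), default=-1) + 1
    for token in new_tokens:
        if token not in extended:  # tokens already present keep their id
            extended[token] = next_id
            next_id += 1
    return extended

base_vocab = {"hello": 0, "world": 1}          # stand-in for the text vocabulary
phonemes = ["<HH>", "<AH0>", "<L>", "<OW1>"]   # hypothetical phoneme tokens
vocab = extend_vocab(base_vocab, phonemes)
phoneme_ids = [vocab[p] for p in phonemes]
print(len(vocab), phoneme_ids)
```

With the Hugging Face `transformers` library, the same pattern is commonly expressed with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. Whether EmoVoice was built that way is an assumption, not something this README states.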
<!-- - Model Checkpoints can be found on hugging face: https://huggingface.co/yhaha/EmoVoice. -->
<!-- [EmoVoice](https://drive.google.com/file/d/1WLVshIIaAXtP0wrRPd7KUeomuNIwWL96/view?usp=sharing)
[EmoVoice-PP](https://drive.google.com/file/d/1NSDW8dsxXMdwPeoOdmAyiK3ueLgnePnN/view?usp=sharing) -->
<!-- ### Datasets
- Datasets for Pretraining TTS: [VoiceAssistant](https://huggingface.co/datasets/worstchan/VoiceAssistant-400K-SLAM-Omni) and [Belle](https://huggingface.co/datasets/worstchan/Belle_1.4M-SLAM-Omni).
- Datasets for Finetuning Emotional TTS: [EmoVoice-DB](https://huggingface.co/datasets/yhaha/EmoVoice-DB) and part of [laions_got_talent](https://huggingface.co/datasets/laion/laions_got_talent)(the part we use is also uploaded to [EmoVoice-DB](https://huggingface.co/datasets/yhaha/EmoVoice-DB)).
-->
<!-- ## Acknowledgements
- Our code is built on [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM).
- We also thank [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for its valuable repository.
-->
<!-- ## [Paper](https://arxiv.org/abs/2504.12867); [Demo Page](https://yanghaha0908.github.io/EmoVoice/); -->
## Citation
If you find our work useful, please cite it as:
```bibtex
@article{yang2025emovoice,
title={EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting},
author={Yang, Guanrou and Yang, Chen and Chen, Qian and Ma, Ziyang and Chen, Wenxi and Wang, Wen and Wang, Tianrui and Yang, Yifan and Niu, Zhikang and Liu, Wenrui and others},
journal={arXiv preprint arXiv:2504.12867},
year={2025}
}
```
<!-- Paper link: https://arxiv.org/abs/2504.12867 -->
<!-- ## License
Our code is released under MIT License. The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.
-->