---
license: mit
---
<!-- # EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting -->
<div align="center">
<p align="center">
<h1>EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting</h1>
<!-- <a href=>Paper</a> | <a href="https://meanaudio.github.io/">Webpage</a> -->

[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2504.12867)
[![Code](https://img.shields.io/badge/Code-Repo-black?style=flat&logo=github&logoColor=white)](https://github.com/yanghaha0908/EmoVoice?tab=readme-ov-file)
[![Hugging Face Space](https://img.shields.io/badge/Space-HuggingFace-orange?logo=huggingface)](https://huggingface.co/spaces/chenxie95/EmoVoice)
[![Hugging Face Dataset](https://img.shields.io/badge/Dataset-HuggingFace-green?logo=huggingface)](https://huggingface.co/datasets/yhaha/EmoVoice-DB)
[![Webpage](https://img.shields.io/badge/Website-DemoPage-pink?logo=googlechrome&logoColor=white)](https://yanghaha0908.github.io/EmoVoice/)

</p>
</div>

## Overview

EmoVoice is an emotion-controllable TTS model that leverages large language models (LLMs) to enable fine-grained, freestyle natural-language emotion control. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB and Chinese SECap test sets.
<!-- ### Model

<div align="center">
<img src="pics/emovoice_overview.png" alt="" width="500">
</div>

### Performance

<table width="100%">
<tr>
<td align="center">
<img src="pics/table2.png" alt="Image description 1" width="333">
</td>
<td align="center">
<img src="pics/table3.png" alt="Image description 2" width="333">
</td>
<td align="center">
<img src="pics/table4.png" alt="Image description 3" width="333">
</td>
</tr>
</table>
-->

<!-- ## Environmental Setup
```bash
### Create a separate environment if needed

conda create -n EmoVoice python=3.10
conda activate EmoVoice
pip install -r requirements.txt
```

## Train and Inference
### Infer with checkpoints
```bash
bash examples/tts/scripts/inference_EmoVoice.sh
bash examples/tts/scripts/inference_EmoVoice-PP.sh
bash examples/tts/scripts/inference_EmoVoice_1.5B.sh
```
### Train from scratch
```bash
# First Stage: Pretrain TTS
bash examples/tts/scripts/pretrain_EmoVoice.sh
bash examples/tts/scripts/pretrain_EmoVoice-PP.sh
bash examples/tts/scripts/pretrain_EmoVoice_1.5B.sh

# Second Stage: Finetune Emotional TTS
bash examples/tts/scripts/ft_EmoVoice.sh
bash examples/tts/scripts/ft_EmoVoice-PP.sh
bash examples/tts/scripts/ft_EmoVoice_1.5B.sh
``` -->

### Checkpoints
The English model checkpoints of EmoVoice (0.5B), EmoVoice (1.5B), and EmoVoice-PP (0.5B) have been uploaded.
Qwen2.5-0.5B-phn, the Qwen2.5-0.5B tokenizer with a phoneme-extended vocabulary, has also been uploaded.

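For context, a "phoneme-extended vocabulary" means phoneme symbols are appended to the base tokenizer's vocabulary as additional tokens. The following is a minimal, self-contained sketch of that idea only — it is not the actual EmoVoice or Qwen2.5 tokenizer code, and every token name and phoneme symbol below is illustrative:

```python
# Conceptual sketch of a phoneme-extended vocabulary (illustrative only;
# not the real EmoVoice/Qwen2.5-0.5B-phn tokenizer).
base_vocab = {"hello": 0, "world": 1}  # stand-in for the base LLM vocabulary

# ARPAbet-style phoneme symbols appended as brand-new token ids, so that
# phoneme sequences can be fed to the LLM alongside ordinary text tokens.
phonemes = ["HH", "AH", "L", "OW", "W", "ER", "D"]
phn_vocab = dict(base_vocab)
for p in phonemes:
    phn_vocab[f"<phn_{p}>"] = len(phn_vocab)

def encode_phonemes(seq):
    """Map a phoneme sequence to the newly appended token ids."""
    return [phn_vocab[f"<phn_{p}>"] for p in seq]

ids = encode_phonemes(["HH", "AH", "L", "OW"])  # "hello" as phonemes
```

In an actual phoneme-extended LLM tokenizer, the base vocabulary would be the LLM's own and the model's embedding table would gain one new row per added phoneme token; the mechanism sketched here is analogous, not a description of EmoVoice's implementation.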
<!-- - Model checkpoints can be found on Hugging Face: https://huggingface.co/yhaha/EmoVoice. -->
<!-- [EmoVoice](https://drive.google.com/file/d/1WLVshIIaAXtP0wrRPd7KUeomuNIwWL96/view?usp=sharing)
[EmoVoice-PP](https://drive.google.com/file/d/1NSDW8dsxXMdwPeoOdmAyiK3ueLgnePnN/view?usp=sharing) -->

<!-- ### Datasets

- Datasets for pretraining TTS: [VoiceAssistant](https://huggingface.co/datasets/worstchan/VoiceAssistant-400K-SLAM-Omni) and [Belle](https://huggingface.co/datasets/worstchan/Belle_1.4M-SLAM-Omni).
- Datasets for finetuning emotional TTS: [EmoVoice-DB](https://huggingface.co/datasets/yhaha/EmoVoice-DB) and part of [laions_got_talent](https://huggingface.co/datasets/laion/laions_got_talent) (the part we use is also uploaded to [EmoVoice-DB](https://huggingface.co/datasets/yhaha/EmoVoice-DB)).

-->
<!-- ## Acknowledgements
- Our code is built on [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM).
- Thanks to the valuable [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) repo.
-->
<!-- ## [Paper](https://arxiv.org/abs/2504.12867); [Demo Page](https://yanghaha0908.github.io/EmoVoice/); -->

## Citation
If you find our work useful, please cite:
```bibtex
@article{yang2025emovoice,
  title={EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting},
  author={Yang, Guanrou and Yang, Chen and Chen, Qian and Ma, Ziyang and Chen, Wenxi and Wang, Wen and Wang, Tianrui and Yang, Yifan and Niu, Zhikang and Liu, Wenrui and others},
  journal={arXiv preprint arXiv:2504.12867},
  year={2025}
}
```
<!-- Paper link: https://arxiv.org/abs/2504.12867 -->
<!-- ## License

Our code is released under the MIT License. The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.
-->