---
license: mit
---

<!-- # EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting -->

<div align="center">

<p align="center">

<h1>EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting</h1>

<!-- <a href=>Paper</a> | <a href="https://meanaudio.github.io/">Webpage</a> -->

[Paper](https://arxiv.org/abs/2504.12867)
[Code](https://github.com/yanghaha0908/EmoVoice?tab=readme-ov-file)
[Demo Space](https://huggingface.co/spaces/chenxie95/EmoVoice)
[EmoVoice-DB Dataset](https://huggingface.co/datasets/yhaha/EmoVoice-DB)
[Demo Page](https://yanghaha0908.github.io/EmoVoice/)

</p>

</div>

## Overview

EmoVoice is an emotion-controllable TTS model that leverages large language models (LLMs) to enable fine-grained, freestyle natural-language emotion control. EmoVoice achieves state-of-the-art (SOTA) performance on the English EmoVoice-DB and Chinese Secap test sets.
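
The emotion condition is free-form natural-language text rather than a fixed label set: each synthesis request pairs the transcript to speak with a plain-text description of how it should sound. The sketch below only illustrates that input pairing; the helper function and field names are hypothetical, not the official EmoVoice API (actual inference is driven by the scripts in the GitHub repository).

```python
# Illustrative sketch of freestyle emotion prompting.
# The helper and field names below are hypothetical, not the official EmoVoice API;
# they only show the two text conditions the model consumes.

def build_emovoice_input(target_text: str, emotion_description: str) -> dict:
    """Pair a transcript with a free-form natural-language emotion prompt."""
    return {
        "text": target_text,              # what the voice should say
        "emotion": emotion_description,   # how it should sound, described freely
    }


example = build_emovoice_input(
    target_text="It's finally over. I can rest now.",
    emotion_description="A weary but relieved tone, spoken softly with a lingering trace of sadness.",
)
print(example)
```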

<!-- ### Model

<div align="center">
<img src="pics/emovoice_overview.png" alt="" width="500">
</div>

### Performance

<table width="100%">
<tr>
<td align="center">
<img src="pics/table2.png" alt="Image description 1" width="333">
</td>
<td align="center">
<img src="pics/table3.png" alt="Image description 2" width="333">
</td>
<td align="center">
<img src="pics/table4.png" alt="Image description 3" width="333">
</td>
</tr>
</table>
-->

<!-- ## Environmental Setup

```bash
### Create a separate environment if needed
conda create -n EmoVoice python=3.10
conda activate EmoVoice
pip install -r requirements.txt
```

## Train and Inference

### Infer with checkpoints

```bash
bash examples/tts/scripts/inference_EmoVoice.sh
bash examples/tts/scripts/inference_EmoVoice-PP.sh
bash examples/tts/scripts/inference_EmoVoice_1.5B.sh
```

### Train from scratch

```bash
# First Stage: Pretrain TTS
bash examples/tts/scripts/pretrain_EmoVoice.sh
bash examples/tts/scripts/pretrain_EmoVoice-PP.sh
bash examples/tts/scripts/pretrain_EmoVoice_1.5B.sh

# Second Stage: Finetune Emotional TTS
bash examples/tts/scripts/ft_EmoVoice.sh
bash examples/tts/scripts/ft_EmoVoice-PP.sh
bash examples/tts/scripts/ft_EmoVoice_1.5B.sh
```
-->

## Checkpoints

English model checkpoints of EmoVoice (0.5B), EmoVoice (1.5B), and EmoVoice-PP (0.5B) are uploaded.

Qwen2.5-0.5B-phn, the Qwen2.5-0.5B tokenizer with a phoneme-extended vocabulary, is also uploaded.
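
As a quick sanity check, the phoneme-extended tokenizer can be loaded like any standard Hugging Face tokenizer. This is a minimal sketch, assuming the files follow the usual tokenizer layout; the local path and the example phoneme string are placeholders, so point the path at wherever the Qwen2.5-0.5B-phn files actually live.

```python
# Minimal sketch: load the phoneme-extended tokenizer with Hugging Face transformers.
# The path is a placeholder (assumption), not an official repository id; replace it with
# the directory containing the Qwen2.5-0.5B-phn files.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./Qwen2.5-0.5B-phn")

print(len(tokenizer))  # vocabulary size, including the phoneme extension

# The exact phoneme input format is not documented here; this string is illustrative only.
encoded = tokenizer("HH AH0 L OW1 W ER1 L D")
print(encoded["input_ids"])
```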

<!-- - Model checkpoints can be found on Hugging Face: https://huggingface.co/yhaha/EmoVoice. -->

<!-- [EmoVoice](https://drive.google.com/file/d/1WLVshIIaAXtP0wrRPd7KUeomuNIwWL96/view?usp=sharing)
[EmoVoice-PP](https://drive.google.com/file/d/1NSDW8dsxXMdwPeoOdmAyiK3ueLgnePnN/view?usp=sharing) -->

<!-- ### Datasets

- Datasets for pretraining TTS: [VoiceAssistant](https://huggingface.co/datasets/worstchan/VoiceAssistant-400K-SLAM-Omni) and [Belle](https://huggingface.co/datasets/worstchan/Belle_1.4M-SLAM-Omni).
- Datasets for finetuning emotional TTS: [EmoVoice-DB](https://huggingface.co/datasets/yhaha/EmoVoice-DB) and part of [laions_got_talent](https://huggingface.co/datasets/laion/laions_got_talent) (the part we use is also uploaded to [EmoVoice-DB](https://huggingface.co/datasets/yhaha/EmoVoice-DB)).
-->

<!-- ## Acknowledgements

- Our code is built on [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM).
- Thanks to [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for their valuable repository.
-->

<!-- ## [Paper](https://arxiv.org/abs/2504.12867); [Demo Page](https://yanghaha0908.github.io/EmoVoice/) -->

## Citation

If you find our work useful, please cite it as:

```
@article{yang2025emovoice,
  title={EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting},
  author={Yang, Guanrou and Yang, Chen and Chen, Qian and Ma, Ziyang and Chen, Wenxi and Wang, Wen and Wang, Tianrui and Yang, Yifan and Niu, Zhikang and Liu, Wenrui and others},
  journal={arXiv preprint arXiv:2504.12867},
  year={2025}
}
```

<!-- Paper link: https://arxiv.org/abs/2504.12867 -->

<!-- ## License

Our code is released under the MIT License. The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.
-->