---
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
tags:
- tts
- speech-synthesis
- emotion-control
---

# EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

EmoVoice is a novel emotion-controllable Text-to-Speech (TTS) model that uses large language models (LLMs) to enable fine-grained emotion control through freestyle natural-language prompts. It also includes a phoneme boost variant designed to enhance content consistency.

This model was presented in the paper [EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting](https://huggingface.co/papers/2504.12867).

For more details, check out the [project page](https://yanghaha0908.github.io/EmoVoice/) and the [GitHub repository](https://github.com/yanghaha0908/EmoVoice).

## Installation

### Create a separate environment if needed

```bash
conda create -n EmoVoice python=3.10
conda activate EmoVoice
pip install -r requirements.txt
```

## Usage

### Decode with checkpoints

```bash
bash examples/tts/scripts/inference_EmoVoice.sh
bash examples/tts/scripts/inference_EmoVoice-PP.sh
bash examples/tts/scripts/inference_EmoVoice_1.5B.sh
```

## Train from scratch

```bash
# First stage: pretrain TTS
bash examples/tts/scripts/pretrain_EmoVoice.sh
bash examples/tts/scripts/pretrain_EmoVoice-PP.sh
bash examples/tts/scripts/pretrain_EmoVoice_1.5B.sh

# Second stage: fine-tune emotional TTS
bash examples/tts/scripts/ft_EmoVoice.sh
bash examples/tts/scripts/ft_EmoVoice-PP.sh
bash examples/tts/scripts/ft_EmoVoice_1.5B.sh
```

## Checkpoints

- Checkpoints can be found on Hugging Face: https://huggingface.co/yhaha/EmoVoice
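
One way to fetch the checkpoint repository locally is sketched below. It assumes the `huggingface-cli` tool from the `huggingface_hub` package is installed (depending on the installed version, the same subcommand may be exposed as `hf download`). The target directory `CKPT_DIR`, the `EXECUTE` switch, and the helper name `download_checkpoints` are illustrative choices, not part of the EmoVoice codebase. By default the script only prints the command so it can be reviewed first.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical local directory for the downloaded files; override as needed.
CKPT_DIR="${CKPT_DIR:-./ckpt/EmoVoice}"

download_checkpoints() {
  # Build the download command for the model repo named in this README.
  local cmd=(huggingface-cli download yhaha/EmoVoice --local-dir "$CKPT_DIR")
  if [[ "${EXECUTE:-0}" == "1" ]]; then
    "${cmd[@]}"          # actually download
  else
    echo "${cmd[*]}"     # dry run: print the command only
  fi
}

download_checkpoints
```

Run it with `EXECUTE=1` to perform the download. The inference scripts may expect checkpoints at specific paths, so check the variables inside each script before running it.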

## Dataset

- Pretrain TTS: [VoiceAssistant](https://huggingface.co/datasets/worstchan/VoiceAssistant-400K-SLAM-Omni)
- Finetune Emotional TTS: [EmoVoice-DB](https://huggingface.co/datasets/yhaha/EmoVoice-DB) and part of [laions_got_talent](https://huggingface.co/datasets/laion/laions_got_talent)
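
The datasets listed above could be fetched in a similar way. The sketch below again assumes `huggingface-cli` from `huggingface_hub` is available. `DATA_DIR`, `EXECUTE`, and `fetch_dataset` are hypothetical names introduced for illustration. Only the two datasets used in full are included: this README does not say which part of laions_got_talent is used, so that dataset is left out rather than guessed.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical root directory for datasets; override as needed.
DATA_DIR="${DATA_DIR:-./data}"

# Dataset repos taken from the links in this README.
DATASETS=(
  "worstchan/VoiceAssistant-400K-SLAM-Omni"   # TTS pretraining
  "yhaha/EmoVoice-DB"                         # emotional TTS fine-tuning
)

fetch_dataset() {
  local repo="$1"
  # Store each dataset under DATA_DIR/<repo name without the owner prefix>.
  local cmd=(huggingface-cli download "$repo" --repo-type dataset
             --local-dir "$DATA_DIR/${repo##*/}")
  if [[ "${EXECUTE:-0}" == "1" ]]; then
    "${cmd[@]}"          # actually download
  else
    echo "${cmd[*]}"     # dry run: print the command only
  fi
}

for repo in "${DATASETS[@]}"; do
  fetch_dataset "$repo"
done
```

As written, the script prints one command per dataset. Set `EXECUTE=1` to download, and point the training scripts at the resulting directories.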

## Acknowledgements

- Our code is built on [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM).
- We also thank [CosyVoice](https://github.com/FunAudioLLM/CosyVoice), a valuable repository.

## Citation

If our work and codebase are useful to you, please cite:

```
@article{yang2025emovoice,
  title={EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting},
  author={Yang, Guanrou and Yang, Chen and Chen, Qian and Ma, Ziyang and Chen, Wenxi and Wang, Wen and Wang, Tianrui and Yang, Yifan and Niu, Zhikang and Liu, Wenrui and others},
  journal={arXiv preprint arXiv:2504.12867},
  year={2025}
}
```