# Muyan-TTS

Muyan-TTS is a trainable TTS model designed for podcast applications and built within a $50,000 budget. It is pre-trained on over 100,000 hours of podcast audio data, enabling high-quality zero-shot TTS synthesis. Muyan-TTS also supports speaker adaptation with only dozens of minutes of target speech, making it highly customizable for individual voices.

## Install

### Clone & Install
```sh
git clone https://github.com/MYZY-AI/Muyan-TTS.git
cd Muyan-TTS

conda create -n muyan-tts python=3.10 -y
conda activate muyan-tts
make build
```

You need to install ```FFmpeg```. If you're using Ubuntu, you can install it with the following commands:
```sh
sudo apt update
sudo apt install ffmpeg
```

Additionally, you need to download the weights of [chinese-hubert-base](https://huggingface.co/TencentGameMate/chinese-hubert-base).

Place all the downloaded models in the ```pretrained_models``` directory. Your directory structure should look similar to the following:
```
pretrained_models
├── chinese-hubert-base
├── Muyan-TTS
└── Muyan-TTS-SFT
```
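
One convenient way to fetch these weights is with the Hugging Face CLI (an assumption on our part — any download method that produces the ```pretrained_models``` layout above works):
```sh
# Assumes the CLI is available: pip install "huggingface_hub[cli]"
huggingface-cli download TencentGameMate/chinese-hubert-base \
  --local-dir pretrained_models/chinese-hubert-base
```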

## Quickstart
```sh
python tts.py
```
This will synthesize speech through inference. The core code is as follows:
```py
async def main(model_type, model_path):
    tts = Inference(model_type, model_path, enable_vllm_acc=False)
    wavs = await tts.generate(
        ref_wav_path="assets/Claire.wav",
        prompt_text="Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
        text="Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
    )
    output_path = "logs/tts.wav"
    with open(output_path, "wb") as f:
        f.write(next(wavs))
    print(f"Speech generated in {output_path}")
```

You need to specify the ```prompt speech```, consisting of the ```ref_wav_path``` and its ```prompt_text```, as well as the ```text``` to be synthesized. The synthesized speech is saved to ```logs/tts.wav``` by default.

Additionally, you need to specify ```model_type``` as either ```base``` or ```sft```; the default is ```base```.

When ```model_type``` is ```base```, you can change the ```prompt speech``` to an arbitrary speaker for zero-shot TTS synthesis.

When ```model_type``` is ```sft```, you need to keep the ```prompt speech``` unchanged, because the ```sft``` model is trained on Claire's voice.

## API Usage
```sh
python api.py
```
Using the API mode automatically enables vLLM acceleration, and the above command will start a service on the default port ```8020```. Additionally, LLM logs will be saved in ```logs/llm.log```.

You can send a request to the API using the example below:
```py
import time
import requests

TTS_PORT = 8020
payload = {
    "ref_wav_path": "assets/Claire.wav",
    "prompt_text": "Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
    "text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
}
start = time.time()

url = f"http://localhost:{TTS_PORT}/get_tts"
response = requests.post(url, json=payload)
audio_file_path = "logs/tts.wav"
with open(audio_file_path, "wb") as f:
    f.write(response.content)

print(time.time() - start)
```
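
If the request fails, ```response.content``` may hold a JSON error body rather than audio. A small sanity check on the WAV header (a hypothetical helper, not part of this repo) avoids writing a corrupt file:
```py
def looks_like_wav(data: bytes) -> bool:
    # A RIFF/WAV file starts with b"RIFF" and has b"WAVE" at byte offset 8.
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"

# Illustration with synthetic bytes:
print(looks_like_wav(b"RIFF" + (36).to_bytes(4, "little") + b"WAVE"))  # True
print(looks_like_wav(b'{"detail": "error"}'))  # False
```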

By default, the synthesized speech will be saved at ```logs/tts.wav```.

As with the quickstart, you need to specify ```model_type``` as either ```base``` or ```sft```; the default is ```base```.

## Training

We use ```LibriSpeech``` as an example. You can use your own dataset instead, but you need to organize the data into the format shown in ```data_process/examples```.

If you haven't downloaded ```LibriSpeech``` yet, you can download the dev-clean set using:
```sh
wget https://www.openslr.org/resources/12/dev-clean.tar.gz -P path/to/save
```
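
The archive then needs to be unpacked before processing. A minimal sketch using Python's standard ```tarfile``` module (the helper name is ours, not the repo's; LibriSpeech archives conventionally unpack to ```LibriSpeech/dev-clean/...```):
```py
import tarfile

def extract_dataset(archive_path: str, dest_dir: str) -> None:
    # Unpack a gzipped tarball; dev-clean lands under <dest_dir>/LibriSpeech/dev-clean/...
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)

# Example: extract_dataset("path/to/save/dev-clean.tar.gz", "path/to/save")
```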

After downloading, set ```librispeech_dir``` in ```prepare_sft_dataset.py``` to the download location, then run ```./train.sh```, which will automatically process the data and generate ```data/tts_sft_data.json```. By default, the first speaker in the LibriSpeech subset is used for fine-tuning; you can select a different speaker in ```data_process/text_format_conversion.py```.

Note that if an error occurs during processing, resolve it, delete the existing contents of the ```data``` folder, and then rerun ```train.sh```.

After ```data/tts_sft_data.json``` is generated, ```train.sh``` will automatically copy it to ```llama-factory/data``` and add the following entry to ```dataset_info.json```:
```json
"tts_sft_data": {
  "file_name": "tts_sft_data.json"
}
```

Finally, it will automatically execute the ```llamafactory-cli train``` command to start training. You can adjust training settings in ```training/sft.yaml```. By default, the trained weights are saved to ```pretrained_models/Muyan-TTS-new-SFT```.

You can deploy your trained model directly using the API tool above. During inference, specify ```model_type``` as ```sft``` and replace ```ref_wav_path``` and ```prompt_text``` with a sample of the speaker's voice you trained on.