Muyan-TTS is a trainable TTS model designed for podcast applications within a $50,000 budget. It is pre-trained on over 100,000 hours of podcast audio data, enabling zero-shot TTS synthesis with high-quality voice generation. Muyan-TTS also supports speaker adaptation with dozens of minutes of target speech, making it highly customizable for individual voices.

## Install
### Clone & Install
```sh
git clone https://github.com/MYZY-AI/Muyan-TTS.git
cd Muyan-TTS

conda create -n muyan-tts python=3.10 -y
conda activate muyan-tts
make build
```

You need to install `FFmpeg`. If you're using Ubuntu, you can install it with the following command:
```sh
sudo apt update
sudo apt install ffmpeg
```
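
Before moving on, you can confirm the binary is actually discoverable. This is a minimal standard-library check, not part of the repo:

```python
import shutil

def ffmpeg_available():
    """Return True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None

if __name__ == "__main__":
    print("ffmpeg found" if ffmpeg_available() else "ffmpeg missing - install it first")
```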

Additionally, you need to download the weights of [chinese-hubert-base](https://huggingface.co/TencentGameMate/chinese-hubert-base).

Place all the downloaded models in the `pretrained_models` directory. Your directory structure should look similar to the following:
```
pretrained_models
├── chinese-hubert-base
├── Muyan-TTS
└── Muyan-TTS-SFT
```
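
A quick way to verify the layout is a small sketch that checks for the three expected subdirectories (the names come from the tree above; the helper itself is hypothetical):

```python
from pathlib import Path

# Subdirectories expected under pretrained_models, per the tree above.
EXPECTED = ["chinese-hubert-base", "Muyan-TTS", "Muyan-TTS-SFT"]

def missing_models(root="pretrained_models"):
    """Return the expected model directories that are absent under root."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).is_dir()]

print(missing_models())  # an empty list means everything is in place
```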

## Quickstart
```sh
python tts.py
```
This will synthesize speech through inference. The core code is as follows:
```py
async def main(model_type, model_path):
    tts = Inference(model_type, model_path, enable_vllm_acc=False)
    wavs = await tts.generate(
        ref_wav_path="assets/Claire.wav",
        prompt_text="Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
        text="Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
    )
    output_path = "logs/tts.wav"
    with open(output_path, "wb") as f:
        f.write(next(wavs))  # take the first generated waveform
    print(f"Speech generated in {output_path}")
```
You need to specify the prompt speech, consisting of the `ref_wav_path` and its `prompt_text`, as well as the `text` to be synthesized. The synthesized speech is saved to `logs/tts.wav` by default.

Additionally, you need to specify `model_type` as either `base` or `sft`; the default is `base`.

When you set `model_type` to `base`, you can change the prompt speech to an arbitrary speaker's voice for zero-shot TTS synthesis.

When you set `model_type` to `sft`, you should keep the prompt speech unchanged, because the `sft` model is trained on Claire's voice.
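
The two modes correspond to the two checkpoints under `pretrained_models`. A small helper (hypothetical, not part of the repo) makes the mapping explicit:

```python
# Hypothetical mapping from model_type to the checkpoint directories shown above.
MODEL_PATHS = {
    "base": "pretrained_models/Muyan-TTS",
    "sft": "pretrained_models/Muyan-TTS-SFT",
}

def resolve_model_path(model_type="base"):
    """Return the checkpoint path for a given model_type ('base' or 'sft')."""
    if model_type not in MODEL_PATHS:
        raise ValueError(f"model_type must be one of {sorted(MODEL_PATHS)}")
    return MODEL_PATHS[model_type]
```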

## API Usage
```sh
python api.py
```
Using the API mode automatically enables vLLM acceleration, and the above command starts a service on the default port `8020`. Additionally, LLM logs are saved to `logs/llm.log`.

You can send a request to the API using the example below:
```py
import time

import requests

TTS_PORT = 8020
payload = {
    "ref_wav_path": "assets/Claire.wav",
    "prompt_text": "Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
    "text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
}
start = time.time()

url = f"http://localhost:{TTS_PORT}/get_tts"
response = requests.post(url, json=payload)
audio_file_path = "logs/tts.wav"
with open(audio_file_path, "wb") as f:
    f.write(response.content)

print(time.time() - start)
```

By default, the synthesized speech is saved to `logs/tts.wav`.

Additionally, you need to specify `model_type` as either `base` or `sft`; the default is `base`.
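
If you prefer to keep the request wiring in one place, a hypothetical helper like the following assembles the endpoint URL and payload (the `/get_tts` route and port `8020` come from `api.py` above):

```python
def build_tts_request(ref_wav_path, prompt_text, text, port=8020):
    """Assemble the URL and JSON payload expected by the /get_tts endpoint."""
    url = f"http://localhost:{port}/get_tts"
    payload = {
        "ref_wav_path": ref_wav_path,  # reference speaker clip
        "prompt_text": prompt_text,    # transcript of the reference clip
        "text": text,                  # text to synthesize
    }
    return url, payload
```

You would then send it with `requests.post(url, json=payload)` as in the example above.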

## Training

We use `LibriSpeech` as an example. You can use your own dataset instead, but you need to organize the data into the format shown in `data_process/examples`.

If you haven't downloaded `LibriSpeech` yet, you can download the dev-clean subset using:
```sh
wget https://www.openslr.org/resources/12/dev-clean.tar.gz -P path/to/save
```
After downloading, set `librispeech_dir` in `prepare_sft_dataset.py` to the download location, then run `./train.sh`, which will automatically process the data and generate `data/tts_sft_data.json`. The first speaker from the LibriSpeech subset is used for fine-tuning; you can specify a different speaker as needed in `data_process/text_format_conversion.py`.

Note that if an error occurs during the process, resolve it, delete the existing contents of the `data` folder, and then rerun `train.sh`.

After `data/tts_sft_data.json` is generated, `train.sh` will automatically copy it to `llama-factory/data` and add the following field to `dataset_info.json`:
```json
"tts_sft_data": {
  "file_name": "tts_sft_data.json"
}
```
Finally, it will automatically execute the `llamafactory-cli train` command to start training. You can adjust the training settings in `training/sft.yaml`. By default, the trained weights are saved to `pretrained_models/Muyan-TTS-new-SFT`.
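
For orientation, a LLaMA-Factory SFT config typically contains fields along these lines. This is an illustrative sketch, not the repo's shipped `training/sft.yaml`; only `dataset` and `output_dir` are grounded in the steps above, and the base-model path is an assumption:

```yaml
model_name_or_path: pretrained_models/Muyan-TTS   # assumed base checkpoint
stage: sft
do_train: true
finetuning_type: full
dataset: tts_sft_data                             # registered in dataset_info.json above
output_dir: pretrained_models/Muyan-TTS-new-SFT
per_device_train_batch_size: 1
learning_rate: 1.0e-5
num_train_epochs: 3.0
```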

You can directly deploy your trained model using the API tool above. During inference, set `model_type` to `sft` and replace `ref_wav_path` and `prompt_text` with a sample of the speaker's voice you trained on.
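
As a sketch of that deployment step, the request is the same as in the API example, with the reference swapped for your speaker. The clip path `assets/my_speaker.wav`, its transcript, and the output path are hypothetical placeholders; it assumes `api.py` is running with `model_type` set to `sft`:

```python
# Hypothetical reference clip and its exact transcript from your fine-tuning speaker.
payload = {
    "ref_wav_path": "assets/my_speaker.wav",
    "prompt_text": "Exact transcript of the reference clip goes here.",
    "text": "This sentence will be synthesized in the fine-tuned voice.",
}

def synthesize(payload, port=8020, out_path="logs/tts_sft.wav"):
    """POST to the running /get_tts service and save the returned audio."""
    import requests  # third-party; needed only when actually sending the request
    response = requests.post(f"http://localhost:{port}/get_tts", json=payload)
    response.raise_for_status()  # surface HTTP errors instead of saving garbage
    with open(out_path, "wb") as f:
        f.write(response.content)
    return out_path
```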