ngovanh committed on
Commit 97ea237 · verified · 1 Parent(s): f2bf2c6

Upload core.py

Files changed (1)
  1. core.py +2403 -0
core.py ADDED
@@ -0,0 +1,2403 @@
1
+ import os
2
+ import sys
3
+ import json
4
+ import argparse
5
+ import subprocess
6
+ from functools import lru_cache
7
+ from distutils.util import strtobool
8
+
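Note on the `distutils` import above: `distutils` was removed from the standard library in Python 3.12, so `from distutils.util import strtobool` fails on newer interpreters. A minimal sketch of a drop-in replacement follows. The name `str_to_bool` and the demo parser are hypothetical and not part of this commit; the accepted spellings mirror those of `strtobool`.

```python
import argparse

# Spellings accepted by the old distutils.util.strtobool.
_TRUE_VALUES = {"y", "yes", "t", "true", "on", "1"}
_FALSE_VALUES = {"n", "no", "f", "false", "off", "0"}


def str_to_bool(value: str) -> bool:
    """Hypothetical replacement for bool(strtobool(value))."""
    normalized = value.strip().lower()
    if normalized in _TRUE_VALUES:
        return True
    if normalized in _FALSE_VALUES:
        return False
    # argparse turns this into a regular usage error message.
    raise argparse.ArgumentTypeError(f"invalid boolean value: {value!r}")


def build_demo_parser() -> argparse.ArgumentParser:
    """Demo parser with one boolean option declared like those in core.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--split_audio",
        type=str_to_bool,
        choices=[True, False],
        default=False,
    )
    return parser


args = build_demo_parser().parse_args(["--split_audio", "yes"])
print(args.split_audio)
```

With a helper like this, each `type=lambda x: bool(strtobool(x))` could become `type=str_to_bool` without changing the command-line interface.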
9
+ now_dir = os.getcwd()
10
+ sys.path.append(now_dir)
11
+
12
+ current_script_directory = os.path.dirname(os.path.realpath(__file__))
13
+ logs_path = os.path.join(current_script_directory, "logs")
14
+
15
+ from rvc.lib.tools.prerequisites_download import prequisites_download_pipeline
16
+ from rvc.train.process.model_blender import model_blender
17
+ from rvc.train.process.model_information import model_information
18
+ from rvc.lib.tools.analyzer import analyze_audio
19
+ from rvc.lib.tools.launch_tensorboard import launch_tensorboard_pipeline
20
+ from rvc.lib.tools.model_download import model_download_pipeline
21
+
22
+ python = sys.executable
23
+
24
+
25
+ # Get TTS Voices -> https://speech.platform.bing.com/consumer/speech/synthesize/readaloud/voices/list?trustedclienttoken=6A5AA1D4EAFF4E9FB37E23D68491D6F4
26
+ @lru_cache(maxsize=1) # Cache only one result since the file is static
27
+ def load_voices_data():
28
+ with open(
29
+ os.path.join("rvc", "lib", "tools", "tts_voices.json"), "r", encoding="utf-8"
30
+ ) as file:
31
+ return json.load(file)
32
+
33
+
34
+ voices_data = load_voices_data()
35
+ locales = list({voice["ShortName"] for voice in voices_data})
36
+
37
+
38
+ @lru_cache(maxsize=None)
39
+ def import_voice_converter():
40
+ from rvc.infer.infer import VoiceConverter
41
+
42
+ return VoiceConverter()
43
+
44
+
45
+ @lru_cache(maxsize=1)
46
+ def get_config():
47
+ from rvc.configs.config import Config
48
+
49
+ return Config()
50
+
51
+
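The three helpers above use `functools.lru_cache` on zero-argument functions, which effectively turns each one into a lazily created singleton: the body runs on the first call and later calls return the same object. A small sketch of that behaviour, using a hypothetical `ExpensiveResource` class as a stand-in for `VoiceConverter` or `Config`:

```python
from functools import lru_cache


class ExpensiveResource:
    """Hypothetical stand-in for an object that is costly to construct."""

    instances_created = 0

    def __init__(self) -> None:
        ExpensiveResource.instances_created += 1


@lru_cache(maxsize=1)
def get_resource() -> ExpensiveResource:
    # The body runs only on a cache miss; with no arguments there is a single key.
    return ExpensiveResource()


first = get_resource()
second = get_resource()
print(first is second, ExpensiveResource.instances_created)

# cache_clear() forces the next call to build a fresh instance.
get_resource.cache_clear()
third = get_resource()
```

One consequence worth keeping in mind is that any state held by the cached object persists across calls until `cache_clear()` is invoked.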
52
+ # Infer
53
+ def run_infer_script(
54
+ pitch: int,
55
+ index_rate: float,
56
+ volume_envelope: float,
57
+ protect: float,
58
+ hop_length: int,
59
+ f0_method: str,
60
+ input_path: str,
61
+ output_path: str,
62
+ pth_path: str,
63
+ index_path: str,
64
+ split_audio: bool,
65
+ f0_autotune: bool,
66
+ f0_autotune_strength: float,
67
+ clean_audio: bool,
68
+ clean_strength: float,
69
+ export_format: str,
70
+ f0_file: str,
71
+ embedder_model: str,
72
+ embedder_model_custom: str = None,
73
+ formant_shifting: bool = False,
74
+ formant_qfrency: float = 1.0,
75
+ formant_timbre: float = 1.0,
76
+ post_process: bool = False,
77
+ reverb: bool = False,
78
+ pitch_shift: bool = False,
79
+ limiter: bool = False,
80
+ gain: bool = False,
81
+ distortion: bool = False,
82
+ chorus: bool = False,
83
+ bitcrush: bool = False,
84
+ clipping: bool = False,
85
+ compressor: bool = False,
86
+ delay: bool = False,
87
+ reverb_room_size: float = 0.5,
88
+ reverb_damping: float = 0.5,
89
+ reverb_wet_gain: float = 0.5,
90
+ reverb_dry_gain: float = 0.5,
91
+ reverb_width: float = 0.5,
92
+ reverb_freeze_mode: float = 0.5,
93
+ pitch_shift_semitones: float = 0.0,
94
+ limiter_threshold: float = -6,
95
+ limiter_release_time: float = 0.01,
96
+ gain_db: float = 0.0,
97
+ distortion_gain: float = 25,
98
+ chorus_rate: float = 1.0,
99
+ chorus_depth: float = 0.25,
100
+ chorus_center_delay: float = 7,
101
+ chorus_feedback: float = 0.0,
102
+ chorus_mix: float = 0.5,
103
+ bitcrush_bit_depth: int = 8,
104
+ clipping_threshold: float = -6,
105
+ compressor_threshold: float = 0,
106
+ compressor_ratio: float = 1,
107
+ compressor_attack: float = 1.0,
108
+ compressor_release: float = 100,
109
+ delay_seconds: float = 0.5,
110
+ delay_feedback: float = 0.0,
111
+ delay_mix: float = 0.5,
112
+ sid: int = 0,
113
+ ):
114
+ kwargs = {
115
+ "audio_input_path": input_path,
116
+ "audio_output_path": output_path,
117
+ "model_path": pth_path,
118
+ "index_path": index_path,
119
+ "pitch": pitch,
120
+ "index_rate": index_rate,
121
+ "volume_envelope": volume_envelope,
122
+ "protect": protect,
123
+ "hop_length": hop_length,
124
+ "f0_method": f0_method,
125
+ "pth_path": pth_path,
127
+ "split_audio": split_audio,
128
+ "f0_autotune": f0_autotune,
129
+ "f0_autotune_strength": f0_autotune_strength,
130
+ "clean_audio": clean_audio,
131
+ "clean_strength": clean_strength,
132
+ "export_format": export_format,
133
+ "f0_file": f0_file,
134
+ "embedder_model": embedder_model,
135
+ "embedder_model_custom": embedder_model_custom,
136
+ "post_process": post_process,
137
+ "formant_shifting": formant_shifting,
138
+ "formant_qfrency": formant_qfrency,
139
+ "formant_timbre": formant_timbre,
140
+ "reverb": reverb,
141
+ "pitch_shift": pitch_shift,
142
+ "limiter": limiter,
143
+ "gain": gain,
144
+ "distortion": distortion,
145
+ "chorus": chorus,
146
+ "bitcrush": bitcrush,
147
+ "clipping": clipping,
148
+ "compressor": compressor,
149
+ "delay": delay,
150
+ "reverb_room_size": reverb_room_size,
151
+ "reverb_damping": reverb_damping,
152
+ "reverb_wet_level": reverb_wet_gain,
153
+ "reverb_dry_level": reverb_dry_gain,
154
+ "reverb_width": reverb_width,
155
+ "reverb_freeze_mode": reverb_freeze_mode,
156
+ "pitch_shift_semitones": pitch_shift_semitones,
157
+ "limiter_threshold": limiter_threshold,
158
+ "limiter_release": limiter_release_time,
159
+ "gain_db": gain_db,
160
+ "distortion_gain": distortion_gain,
161
+ "chorus_rate": chorus_rate,
162
+ "chorus_depth": chorus_depth,
163
+ "chorus_delay": chorus_center_delay,
164
+ "chorus_feedback": chorus_feedback,
165
+ "chorus_mix": chorus_mix,
166
+ "bitcrush_bit_depth": bitcrush_bit_depth,
167
+ "clipping_threshold": clipping_threshold,
168
+ "compressor_threshold": compressor_threshold,
169
+ "compressor_ratio": compressor_ratio,
170
+ "compressor_attack": compressor_attack,
171
+ "compressor_release": compressor_release,
172
+ "delay_seconds": delay_seconds,
173
+ "delay_feedback": delay_feedback,
174
+ "delay_mix": delay_mix,
175
+ "sid": sid,
176
+ }
177
+ infer_pipeline = import_voice_converter()
178
+ infer_pipeline.convert_audio(
179
+ **kwargs,
180
+ )
181
+ return f"File {input_path} inferred successfully.", output_path.replace(
182
+ ".wav", f".{export_format.lower()}"
183
+ )
184
+
185
+
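The return value of `run_infer_script` derives the exported file name with `output_path.replace(".wav", ...)`. `str.replace` substitutes every occurrence, so a `.wav` fragment elsewhere in the path would be rewritten as well. A hedged sketch of an extension-only swap follows; `exported_path` is a hypothetical helper, not part of this commit, and it keeps the original behaviour of leaving non-`.wav` paths untouched.

```python
import os


def exported_path(output_path: str, export_format: str) -> str:
    """Swap only a trailing .wav extension for the export format's extension."""
    root, ext = os.path.splitext(output_path)
    if ext.lower() != ".wav":
        # Leave paths that do not end in .wav unchanged.
        return output_path
    return f"{root}.{export_format.lower()}"


tricky = "runs/take.wav.d/out.wav"
print(tricky.replace(".wav", ".mp3"))  # rewrites the directory name too
print(exported_path(tricky, "MP3"))    # changes only the final extension
```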
186
+ # Batch infer
187
+ def run_batch_infer_script(
188
+ pitch: int,
189
+ index_rate: float,
190
+ volume_envelope: float,
191
+ protect: float,
192
+ hop_length: int,
193
+ f0_method: str,
194
+ input_folder: str,
195
+ output_folder: str,
196
+ pth_path: str,
197
+ index_path: str,
198
+ split_audio: bool,
199
+ f0_autotune: bool,
200
+ f0_autotune_strength: float,
201
+ clean_audio: bool,
202
+ clean_strength: float,
203
+ export_format: str,
204
+ f0_file: str,
205
+ embedder_model: str,
206
+ embedder_model_custom: str = None,
207
+ formant_shifting: bool = False,
208
+ formant_qfrency: float = 1.0,
209
+ formant_timbre: float = 1.0,
210
+ post_process: bool = False,
211
+ reverb: bool = False,
212
+ pitch_shift: bool = False,
213
+ limiter: bool = False,
214
+ gain: bool = False,
215
+ distortion: bool = False,
216
+ chorus: bool = False,
217
+ bitcrush: bool = False,
218
+ clipping: bool = False,
219
+ compressor: bool = False,
220
+ delay: bool = False,
221
+ reverb_room_size: float = 0.5,
222
+ reverb_damping: float = 0.5,
223
+ reverb_wet_gain: float = 0.5,
224
+ reverb_dry_gain: float = 0.5,
225
+ reverb_width: float = 0.5,
226
+ reverb_freeze_mode: float = 0.5,
227
+ pitch_shift_semitones: float = 0.0,
228
+ limiter_threshold: float = -6,
229
+ limiter_release_time: float = 0.01,
230
+ gain_db: float = 0.0,
231
+ distortion_gain: float = 25,
232
+ chorus_rate: float = 1.0,
233
+ chorus_depth: float = 0.25,
234
+ chorus_center_delay: float = 7,
235
+ chorus_feedback: float = 0.0,
236
+ chorus_mix: float = 0.5,
237
+ bitcrush_bit_depth: int = 8,
238
+ clipping_threshold: float = -6,
239
+ compressor_threshold: float = 0,
240
+ compressor_ratio: float = 1,
241
+ compressor_attack: float = 1.0,
242
+ compressor_release: float = 100,
243
+ delay_seconds: float = 0.5,
244
+ delay_feedback: float = 0.0,
245
+ delay_mix: float = 0.5,
246
+ sid: int = 0,
247
+ ):
248
+ kwargs = {
249
+ "audio_input_paths": input_folder,
250
+ "audio_output_path": output_folder,
251
+ "model_path": pth_path,
252
+ "index_path": index_path,
253
+ "pitch": pitch,
254
+ "index_rate": index_rate,
255
+ "volume_envelope": volume_envelope,
256
+ "protect": protect,
257
+ "hop_length": hop_length,
258
+ "f0_method": f0_method,
259
+ "pth_path": pth_path,
261
+ "split_audio": split_audio,
262
+ "f0_autotune": f0_autotune,
263
+ "f0_autotune_strength": f0_autotune_strength,
264
+ "clean_audio": clean_audio,
265
+ "clean_strength": clean_strength,
266
+ "export_format": export_format,
267
+ "f0_file": f0_file,
268
+ "embedder_model": embedder_model,
269
+ "embedder_model_custom": embedder_model_custom,
270
+ "post_process": post_process,
271
+ "formant_shifting": formant_shifting,
272
+ "formant_qfrency": formant_qfrency,
273
+ "formant_timbre": formant_timbre,
274
+ "reverb": reverb,
275
+ "pitch_shift": pitch_shift,
276
+ "limiter": limiter,
277
+ "gain": gain,
278
+ "distortion": distortion,
279
+ "chorus": chorus,
280
+ "bitcrush": bitcrush,
281
+ "clipping": clipping,
282
+ "compressor": compressor,
283
+ "delay": delay,
284
+ "reverb_room_size": reverb_room_size,
285
+ "reverb_damping": reverb_damping,
286
+ "reverb_wet_level": reverb_wet_gain,
287
+ "reverb_dry_level": reverb_dry_gain,
288
+ "reverb_width": reverb_width,
289
+ "reverb_freeze_mode": reverb_freeze_mode,
290
+ "pitch_shift_semitones": pitch_shift_semitones,
291
+ "limiter_threshold": limiter_threshold,
292
+ "limiter_release": limiter_release_time,
293
+ "gain_db": gain_db,
294
+ "distortion_gain": distortion_gain,
295
+ "chorus_rate": chorus_rate,
296
+ "chorus_depth": chorus_depth,
297
+ "chorus_delay": chorus_center_delay,
298
+ "chorus_feedback": chorus_feedback,
299
+ "chorus_mix": chorus_mix,
300
+ "bitcrush_bit_depth": bitcrush_bit_depth,
301
+ "clipping_threshold": clipping_threshold,
302
+ "compressor_threshold": compressor_threshold,
303
+ "compressor_ratio": compressor_ratio,
304
+ "compressor_attack": compressor_attack,
305
+ "compressor_release": compressor_release,
306
+ "delay_seconds": delay_seconds,
307
+ "delay_feedback": delay_feedback,
308
+ "delay_mix": delay_mix,
309
+ "sid": sid,
310
+ }
311
+ infer_pipeline = import_voice_converter()
312
+ infer_pipeline.convert_audio_batch(
313
+ **kwargs,
314
+ )
315
+
316
+ return f"Files from {input_folder} inferred successfully."
317
+
318
+
319
+ # TTS
320
+ def run_tts_script(
321
+ tts_file: str,
322
+ tts_text: str,
323
+ tts_voice: str,
324
+ tts_rate: int,
325
+ pitch: int,
326
+ index_rate: float,
327
+ volume_envelope: float,
328
+ protect: float,
329
+ hop_length: int,
330
+ f0_method: str,
331
+ output_tts_path: str,
332
+ output_rvc_path: str,
333
+ pth_path: str,
334
+ index_path: str,
335
+ split_audio: bool,
336
+ f0_autotune: bool,
337
+ f0_autotune_strength: float,
338
+ clean_audio: bool,
339
+ clean_strength: float,
340
+ export_format: str,
341
+ f0_file: str,
342
+ embedder_model: str,
343
+ embedder_model_custom: str = None,
344
+ sid: int = 0,
345
+ ):
346
+
347
+ tts_script_path = os.path.join("rvc", "lib", "tools", "tts.py")
348
+
349
+ if os.path.exists(output_tts_path) and os.path.abspath(output_tts_path).startswith(
350
+ os.path.abspath("assets")
351
+ ):
352
+ os.remove(output_tts_path)
353
+
354
+ command_tts = [
355
+ *map(
356
+ str,
357
+ [
358
+ python,
359
+ tts_script_path,
360
+ tts_file,
361
+ tts_text,
362
+ tts_voice,
363
+ tts_rate,
364
+ output_tts_path,
365
+ ],
366
+ ),
367
+ ]
368
+ subprocess.run(command_tts)
369
+ infer_pipeline = import_voice_converter()
370
+ infer_pipeline.convert_audio(
371
+ pitch=pitch,
372
+ index_rate=index_rate,
373
+ volume_envelope=volume_envelope,
374
+ protect=protect,
375
+ hop_length=hop_length,
376
+ f0_method=f0_method,
377
+ audio_input_path=output_tts_path,
378
+ audio_output_path=output_rvc_path,
379
+ model_path=pth_path,
380
+ index_path=index_path,
381
+ split_audio=split_audio,
382
+ f0_autotune=f0_autotune,
383
+ f0_autotune_strength=f0_autotune_strength,
384
+ clean_audio=clean_audio,
385
+ clean_strength=clean_strength,
386
+ export_format=export_format,
387
+ f0_file=f0_file,
388
+ embedder_model=embedder_model,
389
+ embedder_model_custom=embedder_model_custom,
390
+ sid=sid,
391
+ formant_shifting=None,
392
+ formant_qfrency=None,
393
+ formant_timbre=None,
394
+ post_process=None,
395
+ reverb=None,
396
+ pitch_shift=None,
397
+ limiter=None,
398
+ gain=None,
399
+ distortion=None,
400
+ chorus=None,
401
+ bitcrush=None,
402
+ clipping=None,
403
+ compressor=None,
404
+ delay=None,
405
+ sliders=None,
406
+ )
407
+
408
+ return f"Text {tts_text} synthesized successfully.", output_rvc_path.replace(
409
+ ".wav", f".{export_format.lower()}"
410
+ )
411
+
412
+
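`run_tts_script` deletes a stale TTS output only when its absolute path starts with the absolute path of `assets`. A plain string prefix test also matches sibling directories such as `assets_backup`. The sketch below shows a stricter containment check based on `os.path.commonpath`; `is_inside` is a hypothetical helper and the paths are illustrative only.

```python
import os


def is_inside(path: str, directory: str) -> bool:
    """Return True only if `path` lies within `directory` (or is the directory itself)."""
    path_abs = os.path.abspath(path)
    dir_abs = os.path.abspath(directory)
    try:
        return os.path.commonpath([path_abs, dir_abs]) == dir_abs
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


candidate = os.path.join("assets_backup", "out.wav")
prefix_match = os.path.abspath(candidate).startswith(os.path.abspath("assets"))
print(prefix_match)                      # the prefix test accepts the sibling directory
print(is_inside(candidate, "assets"))    # the component-wise check rejects it
```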
413
+ # Preprocess
414
+ def run_preprocess_script(
415
+ model_name: str,
416
+ dataset_path: str,
417
+ sample_rate: int,
418
+ cpu_cores: int,
419
+ cut_preprocess: str,
420
+ process_effects: bool,
421
+ noise_reduction: bool,
422
+ clean_strength: float,
423
+ chunk_len: float,
424
+ overlap_len: float,
425
+ ):
426
+ preprocess_script_path = os.path.join("rvc", "train", "preprocess", "preprocess.py")
427
+ command = [
428
+ python,
429
+ preprocess_script_path,
430
+ *map(
431
+ str,
432
+ [
433
+ os.path.join(logs_path, model_name),
434
+ dataset_path,
435
+ sample_rate,
436
+ cpu_cores,
437
+ cut_preprocess,
438
+ process_effects,
439
+ noise_reduction,
440
+ clean_strength,
441
+ chunk_len,
442
+ overlap_len,
443
+ ],
444
+ ),
445
+ ]
446
+ subprocess.run(command)
447
+ return f"Model {model_name} preprocessed successfully."
448
+
449
+
450
+ # Extract
451
+ def run_extract_script(
452
+ model_name: str,
453
+ f0_method: str,
454
+ hop_length: int,
455
+ cpu_cores: int,
456
+ gpu: int,
457
+ sample_rate: int,
458
+ embedder_model: str,
459
+ embedder_model_custom: str = None,
460
+ include_mutes: int = 2,
461
+ ):
462
+
463
+ model_path = os.path.join(logs_path, model_name)
464
+ extract = os.path.join("rvc", "train", "extract", "extract.py")
465
+
466
+ command_1 = [
467
+ python,
468
+ extract,
469
+ *map(
470
+ str,
471
+ [
472
+ model_path,
473
+ f0_method,
474
+ hop_length,
475
+ cpu_cores,
476
+ gpu,
477
+ sample_rate,
478
+ embedder_model,
479
+ embedder_model_custom,
480
+ include_mutes,
481
+ ],
482
+ ),
483
+ ]
484
+
485
+ subprocess.run(command_1)
486
+
487
+ return f"Model {model_name} extracted successfully."
488
+
489
+
490
+ # Train
491
+ def run_train_script(
492
+ model_name: str,
493
+ save_every_epoch: int,
494
+ save_only_latest: bool,
495
+ save_every_weights: bool,
496
+ total_epoch: int,
497
+ sample_rate: int,
498
+ batch_size: int,
499
+ gpu: int,
500
+ overtraining_detector: bool,
501
+ overtraining_threshold: int,
502
+ pretrained: bool,
503
+ cleanup: bool,
504
+ index_algorithm: str = "Auto",
505
+ cache_data_in_gpu: bool = False,
506
+ custom_pretrained: bool = False,
507
+ g_pretrained_path: str = None,
508
+ d_pretrained_path: str = None,
509
+ vocoder: str = "HiFi-GAN",
510
+ checkpointing: bool = False,
511
+ ):
512
+
513
+ if pretrained:
514
+ from rvc.lib.tools.pretrained_selector import pretrained_selector
515
+
516
+ if not custom_pretrained:
517
+ pg, pd = pretrained_selector(str(vocoder), int(sample_rate))
518
+ else:
519
+ if g_pretrained_path is None or d_pretrained_path is None:
520
+ raise ValueError(
521
+ "Please provide the path to the pretrained G and D models."
522
+ )
523
+ pg, pd = g_pretrained_path, d_pretrained_path
524
+ else:
525
+ pg, pd = "", ""
526
+
527
+ train_script_path = os.path.join("rvc", "train", "train.py")
528
+ command = [
529
+ python,
530
+ train_script_path,
531
+ *map(
532
+ str,
533
+ [
534
+ model_name,
535
+ save_every_epoch,
536
+ total_epoch,
537
+ pg,
538
+ pd,
539
+ gpu,
540
+ batch_size,
541
+ sample_rate,
542
+ save_only_latest,
543
+ save_every_weights,
544
+ cache_data_in_gpu,
545
+ overtraining_detector,
546
+ overtraining_threshold,
547
+ cleanup,
548
+ vocoder,
549
+ checkpointing,
550
+ ],
551
+ ),
552
+ ]
553
+ subprocess.run(command)
554
+ run_index_script(model_name, index_algorithm)
555
+ return f"Model {model_name} trained successfully."
556
+
557
+
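The preprocess, extract, and train runners above all build their command by mapping `str` over the arguments and then call `subprocess.run(command)` without inspecting the result, so the success message is returned even if the child process exits with an error. The sketch below, which assumes nothing about the real scripts, shows what the child actually receives (booleans and `None` arrive as the strings `"True"` and `"None"`) and how `check=True` surfaces failures. The inline child program and `stringify_args` are hypothetical.

```python
import json
import subprocess
import sys


def stringify_args(*args: object) -> list:
    """Convert arguments to strings, as `*map(str, [...])` does in core.py."""
    return [str(arg) for arg in args]


# Tiny child program that echoes its arguments back as JSON.
child_code = "import sys, json; print(json.dumps(sys.argv[1:]))"
command = [sys.executable, "-c", child_code, *stringify_args("my_model", 40000, True, None)]

# check=True raises CalledProcessError if the child exits with a non-zero status.
result = subprocess.run(command, capture_output=True, text=True, check=True)
received = json.loads(result.stdout)
print(received)

# A failing child is reported instead of being silently ignored.
try:
    subprocess.run([sys.executable, "-c", "raise SystemExit(3)"], check=True)
    failed = False
except subprocess.CalledProcessError as error:
    failed = error.returncode == 3
```

Whether the runners should raise or return an error message on failure is a design choice for the maintainers; the point here is only that the exit status is available if it is wanted.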
558
+ # Index
559
+ def run_index_script(model_name: str, index_algorithm: str):
560
+ index_script_path = os.path.join("rvc", "train", "process", "extract_index.py")
561
+ command = [
562
+ python,
563
+ index_script_path,
564
+ os.path.join(logs_path, model_name),
565
+ index_algorithm,
566
+ ]
567
+
568
+ subprocess.run(command)
569
+ return f"Index file for {model_name} generated successfully."
570
+
571
+
572
+ # Model information
573
+ def run_model_information_script(pth_path: str):
574
+ info = model_information(pth_path)
+ print(info)
+ return info
576
+
577
+
578
+ # Model blender
579
+ def run_model_blender_script(
580
+ model_name: str, pth_path_1: str, pth_path_2: str, ratio: float
581
+ ):
582
+ message, model_blended = model_blender(model_name, pth_path_1, pth_path_2, ratio)
583
+ return message, model_blended
584
+
585
+
586
+ # Tensorboard
587
+ def run_tensorboard_script():
588
+ launch_tensorboard_pipeline()
589
+
590
+
591
+ # Download
592
+ def run_download_script(model_link: str):
593
+ model_download_pipeline(model_link)
594
+ return "Model downloaded successfully."
595
+
596
+
597
+ # Prerequisites
598
+ def run_prerequisites_script(
599
+ pretraineds_hifigan: bool,
600
+ models: bool,
601
+ exe: bool,
602
+ ):
603
+ prequisites_download_pipeline(
604
+ pretraineds_hifigan,
605
+ models,
606
+ exe,
607
+ )
608
+ return "Prerequisites installed successfully."
609
+
610
+
611
+ # Audio analyzer
612
+ def run_audio_analyzer_script(
613
+ input_path: str, save_plot_path: str = "logs/audio_analysis.png"
614
+ ):
615
+ audio_info, plot_path = analyze_audio(input_path, save_plot_path)
616
+ print(
617
+ f"Audio info of {input_path}: {audio_info}",
618
+ f"Audio file {input_path} analyzed successfully. Plot saved at: {plot_path}",
619
+ )
620
+ return audio_info, plot_path
621
+
622
+
623
+ # Parse arguments
624
+ def parse_arguments():
625
+ parser = argparse.ArgumentParser(
626
+ description="Run the core.py script with specific parameters."
627
+ )
628
+ subparsers = parser.add_subparsers(
629
+ title="subcommands", dest="mode", help="Choose a mode"
630
+ )
631
+
632
+ # Parser for 'infer' mode
633
+ infer_parser = subparsers.add_parser("infer", help="Run inference")
634
+ pitch_description = (
635
+ "Set the pitch of the audio. Higher values result in a higher pitch."
636
+ )
637
+ infer_parser.add_argument(
638
+ "--pitch",
639
+ type=int,
640
+ help=pitch_description,
641
+ choices=range(-24, 25),
642
+ default=0,
643
+ )
644
+ index_rate_description = "Control the influence of the index file on the output. Higher values mean stronger influence. Lower values can help reduce artifacts but may result in less accurate voice cloning."
645
+ infer_parser.add_argument(
646
+ "--index_rate",
647
+ type=float,
648
+ help=index_rate_description,
649
+ choices=[i / 100.0 for i in range(0, 101)],
650
+ default=0.3,
651
+ )
652
+ volume_envelope_description = "Control the blending of the output's volume envelope. A value of 1 means the output envelope is fully used."
653
+ infer_parser.add_argument(
654
+ "--volume_envelope",
655
+ type=float,
656
+ help=volume_envelope_description,
657
+ choices=[i / 100.0 for i in range(0, 101)],
658
+ default=1,
659
+ )
660
+ protect_description = "Protect consonants and breathing sounds from artifacts. A value of 0.5 offers the strongest protection, while lower values may reduce the protection level but potentially mitigate the indexing effect."
661
+ infer_parser.add_argument(
662
+ "--protect",
663
+ type=float,
664
+ help=protect_description,
665
+ choices=[i / 1000.0 for i in range(0, 501)],
666
+ default=0.33,
667
+ )
668
+ hop_length_description = "Only applicable for the Crepe pitch extraction method. Determines the time it takes for the system to react to a significant pitch change. Smaller values require more processing time but can lead to better pitch accuracy."
669
+ infer_parser.add_argument(
670
+ "--hop_length",
671
+ type=int,
672
+ help=hop_length_description,
673
+ choices=range(1, 513),
674
+ default=128,
675
+ )
676
+ f0_method_description = "Choose the pitch extraction algorithm for the conversion. 'rmvpe' is the default and generally recommended."
677
+ infer_parser.add_argument(
678
+ "--f0_method",
679
+ type=str,
680
+ help=f0_method_description,
681
+ choices=[
682
+ "crepe",
683
+ "crepe-tiny",
684
+ "rmvpe",
685
+ "fcpe",
686
+ "hybrid[crepe+rmvpe]",
687
+ "hybrid[crepe+fcpe]",
688
+ "hybrid[rmvpe+fcpe]",
689
+ "hybrid[crepe+rmvpe+fcpe]",
690
+ ],
691
+ default="rmvpe",
692
+ )
693
+ infer_parser.add_argument(
694
+ "--input_path",
695
+ type=str,
696
+ help="Full path to the input audio file.",
697
+ required=True,
698
+ )
699
+ infer_parser.add_argument(
700
+ "--output_path",
701
+ type=str,
702
+ help="Full path to the output audio file.",
703
+ required=True,
704
+ )
705
+ pth_path_description = "Full path to the RVC model file (.pth)."
706
+ infer_parser.add_argument(
707
+ "--pth_path", type=str, help=pth_path_description, required=True
708
+ )
709
+ index_path_description = "Full path to the index file (.index)."
710
+ infer_parser.add_argument(
711
+ "--index_path", type=str, help=index_path_description, required=True
712
+ )
713
+ split_audio_description = "Split the audio into smaller segments before inference. This can improve the quality of the output for longer audio files."
714
+ infer_parser.add_argument(
715
+ "--split_audio",
716
+ type=lambda x: bool(strtobool(x)),
717
+ choices=[True, False],
718
+ help=split_audio_description,
719
+ default=False,
720
+ )
721
+ f0_autotune_description = "Apply a light autotune to the inferred audio. Particularly useful for singing voice conversions."
722
+ infer_parser.add_argument(
723
+ "--f0_autotune",
724
+ type=lambda x: bool(strtobool(x)),
725
+ choices=[True, False],
726
+ help=f0_autotune_description,
727
+ default=False,
728
+ )
729
+ f0_autotune_strength_description = "Set the autotune strength. Higher values snap the pitch more tightly to the chromatic grid."
730
+ infer_parser.add_argument(
731
+ "--f0_autotune_strength",
732
+ type=float,
733
+ help=f0_autotune_strength_description,
734
+ choices=[(i / 10) for i in range(11)],
735
+ default=1.0,
736
+ )
737
+ clean_audio_description = "Clean the output audio using noise reduction algorithms. Recommended for speech conversions."
738
+ infer_parser.add_argument(
739
+ "--clean_audio",
740
+ type=lambda x: bool(strtobool(x)),
741
+ choices=[True, False],
742
+ help=clean_audio_description,
743
+ default=False,
744
+ )
745
+ clean_strength_description = "Adjust the intensity of the audio cleaning process. Higher values result in stronger cleaning, but may lead to a more compressed sound."
746
+ infer_parser.add_argument(
747
+ "--clean_strength",
748
+ type=float,
749
+ help=clean_strength_description,
750
+ choices=[(i / 10) for i in range(11)],
751
+ default=0.7,
752
+ )
753
+ export_format_description = "Select the desired output audio format."
754
+ infer_parser.add_argument(
755
+ "--export_format",
756
+ type=str,
757
+ help=export_format_description,
758
+ choices=["WAV", "MP3", "FLAC", "OGG", "M4A"],
759
+ default="WAV",
760
+ )
761
+ embedder_model_description = (
762
+ "Choose the model used for generating speaker embeddings."
763
+ )
764
+ infer_parser.add_argument(
765
+ "--embedder_model",
766
+ type=str,
767
+ help=embedder_model_description,
768
+ choices=[
769
+ "contentvec",
770
+ "chinese-hubert-base",
771
+ "japanese-hubert-base",
772
+ "korean-hubert-base",
773
+ "custom",
774
+ ],
775
+ default="contentvec",
776
+ )
777
+ embedder_model_custom_description = "Specify the path to a custom model for speaker embedding. Only applicable if 'embedder_model' is set to 'custom'."
778
+ infer_parser.add_argument(
779
+ "--embedder_model_custom",
780
+ type=str,
781
+ help=embedder_model_custom_description,
782
+ default=None,
783
+ )
784
+ f0_file_description = "Full path to an external F0 file (.f0). This allows you to use pre-computed pitch values for the input audio."
785
+ infer_parser.add_argument(
786
+ "--f0_file",
787
+ type=str,
788
+ help=f0_file_description,
789
+ default=None,
790
+ )
791
+ formant_shifting_description = "Apply formant shifting to the input audio. This can help adjust the timbre of the voice."
792
+ infer_parser.add_argument(
793
+ "--formant_shifting",
794
+ type=lambda x: bool(strtobool(x)),
795
+ choices=[True, False],
796
+ help=formant_shifting_description,
797
+ default=False,
798
+ required=False,
799
+ )
800
+ formant_qfrency_description = "Control the frequency of the formant shifting effect. Higher values result in a more pronounced effect."
801
+ infer_parser.add_argument(
802
+ "--formant_qfrency",
803
+ type=float,
804
+ help=formant_qfrency_description,
805
+ default=1.0,
806
+ required=False,
807
+ )
808
+ formant_timbre_description = "Control the timbre of the formant shifting effect. Higher values result in a more pronounced effect."
809
+ infer_parser.add_argument(
810
+ "--formant_timbre",
811
+ type=float,
812
+ help=formant_timbre_description,
813
+ default=1.0,
814
+ required=False,
815
+ )
816
+ sid_description = "Speaker ID for multi-speaker models."
817
+ infer_parser.add_argument(
818
+ "--sid",
819
+ type=int,
820
+ help=sid_description,
821
+ default=0,
822
+ required=False,
823
+ )
824
+ post_process_description = "Apply post-processing effects to the output audio."
825
+ infer_parser.add_argument(
826
+ "--post_process",
827
+ type=lambda x: bool(strtobool(x)),
828
+ choices=[True, False],
829
+ help=post_process_description,
830
+ default=False,
831
+ required=False,
832
+ )
833
+ reverb_description = "Apply reverb effect to the output audio."
834
+ infer_parser.add_argument(
835
+ "--reverb",
836
+ type=lambda x: bool(strtobool(x)),
837
+ choices=[True, False],
838
+ help=reverb_description,
839
+ default=False,
840
+ required=False,
841
+ )
842
+
843
+ pitch_shift_description = "Apply pitch shifting effect to the output audio."
844
+ infer_parser.add_argument(
845
+ "--pitch_shift",
846
+ type=lambda x: bool(strtobool(x)),
847
+ choices=[True, False],
848
+ help=pitch_shift_description,
849
+ default=False,
850
+ required=False,
851
+ )
852
+
853
+ limiter_description = "Apply limiter effect to the output audio."
854
+ infer_parser.add_argument(
855
+ "--limiter",
856
+ type=lambda x: bool(strtobool(x)),
857
+ choices=[True, False],
858
+ help=limiter_description,
859
+ default=False,
860
+ required=False,
861
+ )
862
+
863
+ gain_description = "Apply gain effect to the output audio."
864
+ infer_parser.add_argument(
865
+ "--gain",
866
+ type=lambda x: bool(strtobool(x)),
867
+ choices=[True, False],
868
+ help=gain_description,
869
+ default=False,
870
+ required=False,
871
+ )
872
+
873
+ distortion_description = "Apply distortion effect to the output audio."
874
+ infer_parser.add_argument(
875
+ "--distortion",
876
+ type=lambda x: bool(strtobool(x)),
877
+ choices=[True, False],
878
+ help=distortion_description,
879
+ default=False,
880
+ required=False,
881
+ )
882
+
883
+ chorus_description = "Apply chorus effect to the output audio."
884
+ infer_parser.add_argument(
885
+ "--chorus",
886
+ type=lambda x: bool(strtobool(x)),
887
+ choices=[True, False],
888
+ help=chorus_description,
889
+ default=False,
890
+ required=False,
891
+ )
892
+
893
+ bitcrush_description = "Apply bitcrush effect to the output audio."
894
+ infer_parser.add_argument(
895
+ "--bitcrush",
896
+ type=lambda x: bool(strtobool(x)),
897
+ choices=[True, False],
898
+ help=bitcrush_description,
899
+ default=False,
900
+ required=False,
901
+ )
902
+
903
+ clipping_description = "Apply clipping effect to the output audio."
904
+ infer_parser.add_argument(
905
+ "--clipping",
906
+ type=lambda x: bool(strtobool(x)),
907
+ choices=[True, False],
908
+ help=clipping_description,
909
+ default=False,
910
+ required=False,
911
+ )
912
+
913
+ compressor_description = "Apply compressor effect to the output audio."
914
+ infer_parser.add_argument(
915
+ "--compressor",
916
+ type=lambda x: bool(strtobool(x)),
917
+ choices=[True, False],
918
+ help=compressor_description,
919
+ default=False,
920
+ required=False,
921
+ )
922
+
923
+ delay_description = "Apply delay effect to the output audio."
924
+ infer_parser.add_argument(
925
+ "--delay",
926
+ type=lambda x: bool(strtobool(x)),
927
+ choices=[True, False],
928
+ help=delay_description,
929
+ default=False,
930
+ required=False,
931
+ )
932
+
933
+ reverb_room_size_description = "Control the room size of the reverb effect. Higher values result in a larger room size."
934
+ infer_parser.add_argument(
935
+ "--reverb_room_size",
936
+ type=float,
937
+ help=reverb_room_size_description,
938
+ default=0.5,
939
+ required=False,
940
+ )
941
+
942
+ reverb_damping_description = "Control the damping of the reverb effect. Higher values result in a more damped sound."
943
+ infer_parser.add_argument(
944
+ "--reverb_damping",
945
+ type=float,
946
+ help=reverb_damping_description,
947
+ default=0.5,
948
+ required=False,
949
+ )
950
+
951
+ reverb_wet_gain_description = "Control the wet gain of the reverb effect. Higher values result in a stronger reverb effect."
952
+ infer_parser.add_argument(
953
+ "--reverb_wet_gain",
954
+ type=float,
955
+ help=reverb_wet_gain_description,
956
+ default=0.5,
957
+ required=False,
958
+ )
959
+
960
+ reverb_dry_gain_description = "Control the dry gain of the reverb effect. Higher values result in a stronger dry signal."
961
+ infer_parser.add_argument(
962
+ "--reverb_dry_gain",
963
+ type=float,
964
+ help=reverb_dry_gain_description,
965
+ default=0.5,
966
+ required=False,
967
+ )
968
+
969
+ reverb_width_description = "Control the stereo width of the reverb effect. Higher values result in a wider stereo image."
970
+ infer_parser.add_argument(
971
+ "--reverb_width",
972
+ type=float,
973
+ help=reverb_width_description,
974
+ default=0.5,
975
+ required=False,
976
+ )
977
+
978
+ reverb_freeze_mode_description = "Control the freeze mode of the reverb effect. Higher values result in a stronger freeze effect."
979
+ infer_parser.add_argument(
980
+ "--reverb_freeze_mode",
981
+ type=float,
982
+ help=reverb_freeze_mode_description,
983
+ default=0.5,
984
+ required=False,
985
+ )
986
+
987
+ pitch_shift_semitones_description = "Control the pitch shift in semitones. Positive values increase the pitch, while negative values decrease it."
988
+ infer_parser.add_argument(
989
+ "--pitch_shift_semitones",
990
+ type=float,
991
+ help=pitch_shift_semitones_description,
992
+ default=0.0,
993
+ required=False,
994
+ )
995
+
996
+ limiter_threshold_description = "Control the threshold of the limiter effect. Lower values result in a stronger limiting effect."
997
+ infer_parser.add_argument(
998
+ "--limiter_threshold",
999
+ type=float,
1000
+ help=limiter_threshold_description,
1001
+ default=-6,
1002
+ required=False,
1003
+ )
1004
+
1005
+ limiter_release_time_description = "Control the release time of the limiter effect. Higher values result in a longer release time."
1006
+ infer_parser.add_argument(
1007
+ "--limiter_release_time",
1008
+ type=float,
1009
+ help=limiter_release_time_description,
1010
+ default=0.01,
1011
+ required=False,
1012
+ )
1013
+
1014
+ gain_db_description = "Control the gain in decibels. Positive values increase the gain, while negative values decrease it."
1015
+ infer_parser.add_argument(
1016
+ "--gain_db",
1017
+ type=float,
1018
+ help=gain_db_description,
1019
+ default=0.0,
1020
+ required=False,
1021
+ )
1022
+
1023
+ distortion_gain_description = "Control the gain of the distortion effect. Higher values result in a stronger distortion effect."
1024
+ infer_parser.add_argument(
1025
+ "--distortion_gain",
1026
+ type=float,
1027
+ help=distortion_gain_description,
1028
+ default=25,
1029
+ required=False,
1030
+ )
1031
+
1032
+ chorus_rate_description = "Control the rate of the chorus effect. Higher values result in a faster chorus effect."
1033
+ infer_parser.add_argument(
1034
+ "--chorus_rate",
1035
+ type=float,
1036
+ help=chorus_rate_description,
1037
+ default=1.0,
1038
+ required=False,
1039
+ )
1040
+
1041
+ chorus_depth_description = "Control the depth of the chorus effect. Higher values result in a stronger chorus effect."
1042
+ infer_parser.add_argument(
1043
+ "--chorus_depth",
1044
+ type=float,
1045
+ help=chorus_depth_description,
1046
+ default=0.25,
1047
+ required=False,
1048
+ )
1049
+
1050
+ chorus_center_delay_description = "Control the center delay of the chorus effect. Higher values result in a longer center delay."
1051
+ infer_parser.add_argument(
1052
+ "--chorus_center_delay",
1053
+ type=float,
1054
+ help=chorus_center_delay_description,
1055
+ default=7,
1056
+ required=False,
1057
+ )
1058
+
1059
+ chorus_feedback_description = "Control the feedback of the chorus effect. Higher values result in a stronger feedback effect."
1060
+ infer_parser.add_argument(
1061
+ "--chorus_feedback",
1062
+ type=float,
1063
+ help=chorus_feedback_description,
1064
+ default=0.0,
1065
+ required=False,
1066
+ )
1067
+
1068
+ chorus_mix_description = "Control the mix of the chorus effect. Higher values result in a stronger chorus effect."
1069
+ infer_parser.add_argument(
1070
+ "--chorus_mix",
1071
+ type=float,
1072
+ help=chorus_mix_description,
1073
+ default=0.5,
1074
+ required=False,
1075
+ )
1076
+
1077
+ bitcrush_bit_depth_description = "Control the bit depth of the bitcrush effect. Lower values result in a stronger bitcrush effect."
1078
+ infer_parser.add_argument(
1079
+ "--bitcrush_bit_depth",
1080
+ type=int,
1081
+ help=bitcrush_bit_depth_description,
1082
+ default=8,
1083
+ required=False,
1084
+ )
1085
+
1086
+ clipping_threshold_description = "Control the threshold of the clipping effect. Lower values result in a stronger clipping effect."
1087
+ infer_parser.add_argument(
1088
+ "--clipping_threshold",
1089
+ type=float,
1090
+ help=clipping_threshold_description,
1091
+ default=-6,
1092
+ required=False,
1093
+ )
1094
+
1095
+ compressor_threshold_description = "Control the threshold of the compressor effect. Lower values result in a stronger compressor effect."
1096
+ infer_parser.add_argument(
1097
+ "--compressor_threshold",
1098
+ type=float,
1099
+ help=compressor_threshold_description,
1100
+ default=0,
1101
+ required=False,
1102
+ )
1103
+
1104
+ compressor_ratio_description = "Control the ratio of the compressor effect. Higher values result in a stronger compressor effect."
1105
+ infer_parser.add_argument(
1106
+ "--compressor_ratio",
1107
+ type=float,
1108
+ help=compressor_ratio_description,
1109
+ default=1,
1110
+ required=False,
1111
+ )
1112
+
1113
+ compressor_attack_description = "Control the attack time of the compressor effect. Higher values result in a slower response to level increases."
1114
+ infer_parser.add_argument(
1115
+ "--compressor_attack",
1116
+ type=float,
1117
+ help=compressor_attack_description,
1118
+ default=1.0,
1119
+ required=False,
1120
+ )
1121
+
1122
+ compressor_release_description = "Control the release time of the compressor effect. Higher values result in a longer release time."
1123
+ infer_parser.add_argument(
1124
+ "--compressor_release",
1125
+ type=float,
1126
+ help=compressor_release_description,
1127
+ default=100,
1128
+ required=False,
1129
+ )
1130
+
1131
+ delay_seconds_description = "Control the delay time in seconds. Higher values result in a longer delay time."
1132
+ infer_parser.add_argument(
1133
+ "--delay_seconds",
1134
+ type=float,
1135
+ help=delay_seconds_description,
1136
+ default=0.5,
1137
+ required=False,
1138
+ )
1139
+ delay_feedback_description = "Control the feedback of the delay effect. Higher values result in a stronger feedback effect."
1140
+ infer_parser.add_argument(
1141
+ "--delay_feedback",
1142
+ type=float,
1143
+ help=delay_feedback_description,
1144
+ default=0.0,
1145
+ required=False,
1146
+ )
1147
+ delay_mix_description = "Control the mix of the delay effect. Higher values result in a stronger delay effect."
1148
+ infer_parser.add_argument(
1149
+ "--delay_mix",
1150
+ type=float,
1151
+ help=delay_mix_description,
1152
+ default=0.5,
1153
+ required=False,
1154
+ )
1155
+
1156
+ # Parser for 'batch_infer' mode
1157
+ batch_infer_parser = subparsers.add_parser(
1158
+ "batch_infer",
1159
+ help="Run batch inference",
1160
+ )
1161
+ batch_infer_parser.add_argument(
1162
+ "--pitch",
1163
+ type=int,
1164
+ help=pitch_description,
1165
+ choices=range(-24, 25),
1166
+ default=0,
1167
+ )
1168
+ batch_infer_parser.add_argument(
1169
+ "--index_rate",
1170
+ type=float,
1171
+ help=index_rate_description,
1172
+ choices=[i / 100.0 for i in range(0, 101)],
1173
+ default=0.3,
1174
+ )
1175
+ batch_infer_parser.add_argument(
1176
+ "--volume_envelope",
1177
+ type=float,
1178
+ help=volume_envelope_description,
1179
+ choices=[i / 100.0 for i in range(0, 101)],
1180
+ default=1,
1181
+ )
1182
+ batch_infer_parser.add_argument(
1183
+ "--protect",
1184
+ type=float,
1185
+ help=protect_description,
1186
+ choices=[i / 1000.0 for i in range(0, 501)],
1187
+ default=0.33,
1188
+ )
1189
+ batch_infer_parser.add_argument(
1190
+ "--hop_length",
1191
+ type=int,
1192
+ help=hop_length_description,
1193
+ choices=range(1, 513),
1194
+ default=128,
1195
+ )
1196
+ batch_infer_parser.add_argument(
1197
+ "--f0_method",
1198
+ type=str,
1199
+ help=f0_method_description,
1200
+ choices=[
1201
+ "crepe",
1202
+ "crepe-tiny",
1203
+ "rmvpe",
1204
+ "fcpe",
1205
+ "hybrid[crepe+rmvpe]",
1206
+ "hybrid[crepe+fcpe]",
1207
+ "hybrid[rmvpe+fcpe]",
1208
+ "hybrid[crepe+rmvpe+fcpe]",
1209
+ ],
1210
+ default="rmvpe",
1211
+ )
1212
+ batch_infer_parser.add_argument(
1213
+ "--input_folder",
1214
+ type=str,
1215
+ help="Path to the folder containing input audio files.",
1216
+ required=True,
1217
+ )
1218
+ batch_infer_parser.add_argument(
1219
+ "--output_folder",
1220
+ type=str,
1221
+ help="Path to the folder for saving output audio files.",
1222
+ required=True,
1223
+ )
1224
+ batch_infer_parser.add_argument(
1225
+ "--pth_path", type=str, help=pth_path_description, required=True
1226
+ )
1227
+ batch_infer_parser.add_argument(
1228
+ "--index_path", type=str, help=index_path_description, required=True
1229
+ )
1230
+ batch_infer_parser.add_argument(
1231
+ "--split_audio",
1232
+ type=lambda x: bool(strtobool(x)),
1233
+ choices=[True, False],
1234
+ help=split_audio_description,
1235
+ default=False,
1236
+ )
1237
+ batch_infer_parser.add_argument(
1238
+ "--f0_autotune",
1239
+ type=lambda x: bool(strtobool(x)),
1240
+ choices=[True, False],
1241
+ help=f0_autotune_description,
1242
+ default=False,
1243
+ )
1244
+ batch_infer_parser.add_argument(
1245
+ "--f0_autotune_strength",
1246
+ type=float,
1247
+ help="Control the strength of the autotune effect applied to the inferred pitch. Higher values result in a stronger correction.",
1248
+ choices=[(i / 10) for i in range(11)],
1249
+ default=1.0,
1250
+ )
1251
+ batch_infer_parser.add_argument(
1252
+ "--clean_audio",
1253
+ type=lambda x: bool(strtobool(x)),
1254
+ choices=[True, False],
1255
+ help=clean_audio_description,
1256
+ default=False,
1257
+ )
1258
+ batch_infer_parser.add_argument(
1259
+ "--clean_strength",
1260
+ type=float,
1261
+ help=clean_strength_description,
1262
+ choices=[(i / 10) for i in range(11)],
1263
+ default=0.7,
1264
+ )
1265
+ batch_infer_parser.add_argument(
1266
+ "--export_format",
1267
+ type=str,
1268
+ help=export_format_description,
1269
+ choices=["WAV", "MP3", "FLAC", "OGG", "M4A"],
1270
+ default="WAV",
1271
+ )
1272
+ batch_infer_parser.add_argument(
1273
+ "--embedder_model",
1274
+ type=str,
1275
+ help=embedder_model_description,
1276
+ choices=[
1277
+ "contentvec",
1278
+ "chinese-hubert-base",
1279
+ "japanese-hubert-base",
1280
+ "korean-hubert-base",
1281
+ "custom",
1282
+ ],
1283
+ default="contentvec",
1284
+ )
1285
+ batch_infer_parser.add_argument(
1286
+ "--embedder_model_custom",
1287
+ type=str,
1288
+ help=embedder_model_custom_description,
1289
+ default=None,
1290
+ )
1291
+ batch_infer_parser.add_argument(
1292
+ "--f0_file",
1293
+ type=str,
1294
+ help=f0_file_description,
1295
+ default=None,
1296
+ )
1297
+ batch_infer_parser.add_argument(
1298
+ "--formant_shifting",
1299
+ type=lambda x: bool(strtobool(x)),
1300
+ choices=[True, False],
1301
+ help=formant_shifting_description,
1302
+ default=False,
1303
+ required=False,
1304
+ )
1305
+ batch_infer_parser.add_argument(
1306
+ "--formant_qfrency",
1307
+ type=float,
1308
+ help=formant_qfrency_description,
1309
+ default=1.0,
1310
+ required=False,
1311
+ )
1312
+ batch_infer_parser.add_argument(
1313
+ "--formant_timbre",
1314
+ type=float,
1315
+ help=formant_timbre_description,
1316
+ default=1.0,
1317
+ required=False,
1318
+ )
1319
+ batch_infer_parser.add_argument(
1320
+ "--sid",
1321
+ type=int,
1322
+ help=sid_description,
1323
+ default=0,
1324
+ required=False,
1325
+ )
1326
+ batch_infer_parser.add_argument(
1327
+ "--post_process",
1328
+ type=lambda x: bool(strtobool(x)),
1329
+ choices=[True, False],
1330
+ help=post_process_description,
1331
+ default=False,
1332
+ required=False,
1333
+ )
1334
+ batch_infer_parser.add_argument(
1335
+ "--reverb",
1336
+ type=lambda x: bool(strtobool(x)),
1337
+ choices=[True, False],
1338
+ help=reverb_description,
1339
+ default=False,
1340
+ required=False,
1341
+ )
1342
+
1343
+ batch_infer_parser.add_argument(
1344
+ "--pitch_shift",
1345
+ type=lambda x: bool(strtobool(x)),
1346
+ choices=[True, False],
1347
+ help=pitch_shift_description,
1348
+ default=False,
1349
+ required=False,
1350
+ )
1351
+
1352
+ batch_infer_parser.add_argument(
1353
+ "--limiter",
1354
+ type=lambda x: bool(strtobool(x)),
1355
+ choices=[True, False],
1356
+ help=limiter_description,
1357
+ default=False,
1358
+ required=False,
1359
+ )
1360
+
1361
+ batch_infer_parser.add_argument(
1362
+ "--gain",
1363
+ type=lambda x: bool(strtobool(x)),
1364
+ choices=[True, False],
1365
+ help=gain_description,
1366
+ default=False,
1367
+ required=False,
1368
+ )
1369
+
1370
+ batch_infer_parser.add_argument(
1371
+ "--distortion",
1372
+ type=lambda x: bool(strtobool(x)),
1373
+ choices=[True, False],
1374
+ help=distortion_description,
1375
+ default=False,
1376
+ required=False,
1377
+ )
1378
+
1379
+ batch_infer_parser.add_argument(
1380
+ "--chorus",
1381
+ type=lambda x: bool(strtobool(x)),
1382
+ choices=[True, False],
1383
+ help=chorus_description,
1384
+ default=False,
1385
+ required=False,
1386
+ )
1387
+
1388
+ batch_infer_parser.add_argument(
1389
+ "--bitcrush",
1390
+ type=lambda x: bool(strtobool(x)),
1391
+ choices=[True, False],
1392
+ help=bitcrush_description,
1393
+ default=False,
1394
+ required=False,
1395
+ )
1396
+
1397
+ batch_infer_parser.add_argument(
1398
+ "--clipping",
1399
+ type=lambda x: bool(strtobool(x)),
1400
+ choices=[True, False],
1401
+ help=clipping_description,
1402
+ default=False,
1403
+ required=False,
1404
+ )
1405
+
1406
+ batch_infer_parser.add_argument(
1407
+ "--compressor",
1408
+ type=lambda x: bool(strtobool(x)),
1409
+ choices=[True, False],
1410
+ help=compressor_description,
1411
+ default=False,
1412
+ required=False,
1413
+ )
1414
+
1415
+ batch_infer_parser.add_argument(
1416
+ "--delay",
1417
+ type=lambda x: bool(strtobool(x)),
1418
+ choices=[True, False],
1419
+ help=delay_description,
1420
+ default=False,
1421
+ required=False,
1422
+ )
1423
+
1424
+ batch_infer_parser.add_argument(
1425
+ "--reverb_room_size",
1426
+ type=float,
1427
+ help=reverb_room_size_description,
1428
+ default=0.5,
1429
+ required=False,
1430
+ )
1431
+
1432
+ batch_infer_parser.add_argument(
1433
+ "--reverb_damping",
1434
+ type=float,
1435
+ help=reverb_damping_description,
1436
+ default=0.5,
1437
+ required=False,
1438
+ )
1439
+
1440
+ batch_infer_parser.add_argument(
1441
+ "--reverb_wet_gain",
1442
+ type=float,
1443
+ help=reverb_wet_gain_description,
1444
+ default=0.5,
1445
+ required=False,
1446
+ )
1447
+
1448
+ batch_infer_parser.add_argument(
1449
+ "--reverb_dry_gain",
1450
+ type=float,
1451
+ help=reverb_dry_gain_description,
1452
+ default=0.5,
1453
+ required=False,
1454
+ )
1455
+
1456
+ batch_infer_parser.add_argument(
1457
+ "--reverb_width",
1458
+ type=float,
1459
+ help=reverb_width_description,
1460
+ default=0.5,
1461
+ required=False,
1462
+ )
1463
+
1464
+ batch_infer_parser.add_argument(
1465
+ "--reverb_freeze_mode",
1466
+ type=float,
1467
+ help=reverb_freeze_mode_description,
1468
+ default=0.5,
1469
+ required=False,
1470
+ )
1471
+
1472
+ batch_infer_parser.add_argument(
1473
+ "--pitch_shift_semitones",
1474
+ type=float,
1475
+ help=pitch_shift_semitones_description,
1476
+ default=0.0,
1477
+ required=False,
1478
+ )
1479
+
1480
+ batch_infer_parser.add_argument(
1481
+ "--limiter_threshold",
1482
+ type=float,
1483
+ help=limiter_threshold_description,
1484
+ default=-6,
1485
+ required=False,
1486
+ )
1487
+
1488
+ batch_infer_parser.add_argument(
1489
+ "--limiter_release_time",
1490
+ type=float,
1491
+ help=limiter_release_time_description,
1492
+ default=0.01,
1493
+ required=False,
1494
+ )
1495
+ batch_infer_parser.add_argument(
1496
+ "--gain_db",
1497
+ type=float,
1498
+ help=gain_db_description,
1499
+ default=0.0,
1500
+ required=False,
1501
+ )
1502
+
1503
+ batch_infer_parser.add_argument(
1504
+ "--distortion_gain",
1505
+ type=float,
1506
+ help=distortion_gain_description,
1507
+ default=25,
1508
+ required=False,
1509
+ )
1510
+
1511
+ batch_infer_parser.add_argument(
1512
+ "--chorus_rate",
1513
+ type=float,
1514
+ help=chorus_rate_description,
1515
+ default=1.0,
1516
+ required=False,
1517
+ )
1518
+
1519
+ batch_infer_parser.add_argument(
1520
+ "--chorus_depth",
1521
+ type=float,
1522
+ help=chorus_depth_description,
1523
+ default=0.25,
1524
+ required=False,
1525
+ )
1526
+ batch_infer_parser.add_argument(
1527
+ "--chorus_center_delay",
1528
+ type=float,
1529
+ help=chorus_center_delay_description,
1530
+ default=7,
1531
+ required=False,
1532
+ )
1533
+
1534
+ batch_infer_parser.add_argument(
1535
+ "--chorus_feedback",
1536
+ type=float,
1537
+ help=chorus_feedback_description,
1538
+ default=0.0,
1539
+ required=False,
1540
+ )
1541
+
1542
+ batch_infer_parser.add_argument(
1543
+ "--chorus_mix",
1544
+ type=float,
1545
+ help=chorus_mix_description,
1546
+ default=0.5,
1547
+ required=False,
1548
+ )
1549
+
1550
+ batch_infer_parser.add_argument(
1551
+ "--bitcrush_bit_depth",
1552
+ type=int,
1553
+ help=bitcrush_bit_depth_description,
1554
+ default=8,
1555
+ required=False,
1556
+ )
1557
+
1558
+ batch_infer_parser.add_argument(
1559
+ "--clipping_threshold",
1560
+ type=float,
1561
+ help=clipping_threshold_description,
1562
+ default=-6,
1563
+ required=False,
1564
+ )
1565
+
1566
+ batch_infer_parser.add_argument(
1567
+ "--compressor_threshold",
1568
+ type=float,
1569
+ help=compressor_threshold_description,
1570
+ default=0,
1571
+ required=False,
1572
+ )
1573
+
1574
+ batch_infer_parser.add_argument(
1575
+ "--compressor_ratio",
1576
+ type=float,
1577
+ help=compressor_ratio_description,
1578
+ default=1,
1579
+ required=False,
1580
+ )
1581
+
1582
+ batch_infer_parser.add_argument(
1583
+ "--compressor_attack",
1584
+ type=float,
1585
+ help=compressor_attack_description,
1586
+ default=1.0,
1587
+ required=False,
1588
+ )
1589
+
1590
+ batch_infer_parser.add_argument(
1591
+ "--compressor_release",
1592
+ type=float,
1593
+ help=compressor_release_description,
1594
+ default=100,
1595
+ required=False,
1596
+ )
1597
+ batch_infer_parser.add_argument(
1598
+ "--delay_seconds",
1599
+ type=float,
1600
+ help=delay_seconds_description,
1601
+ default=0.5,
1602
+ required=False,
1603
+ )
1604
+ batch_infer_parser.add_argument(
1605
+ "--delay_feedback",
1606
+ type=float,
1607
+ help=delay_feedback_description,
1608
+ default=0.0,
1609
+ required=False,
1610
+ )
1611
+ batch_infer_parser.add_argument(
1612
+ "--delay_mix",
1613
+ type=float,
1614
+ help=delay_mix_description,
1615
+ default=0.5,
1616
+ required=False,
1617
+ )
1618
+
1619
+ # Parser for 'tts' mode
1620
+ tts_parser = subparsers.add_parser("tts", help="Run TTS inference")
1621
+ tts_parser.add_argument(
1622
+ "--tts_file", type=str, help="File containing the text to be synthesized.", required=True
1623
+ )
1624
+ tts_parser.add_argument(
1625
+ "--tts_text", type=str, help="Text to be synthesized", required=True
1626
+ )
1627
+ tts_parser.add_argument(
1628
+ "--tts_voice",
1629
+ type=str,
1630
+ help="Voice to be used for TTS synthesis.",
1631
+ choices=locales,
1632
+ required=True,
1633
+ )
1634
+ tts_parser.add_argument(
1635
+ "--tts_rate",
1636
+ type=int,
1637
+ help="Control the speaking rate of the TTS. Values range from -100 (slower) to 100 (faster).",
1638
+ choices=range(-100, 101),
1639
+ default=0,
1640
+ )
1641
+ tts_parser.add_argument(
1642
+ "--pitch",
1643
+ type=int,
1644
+ help=pitch_description,
1645
+ choices=range(-24, 25),
1646
+ default=0,
1647
+ )
1648
+ tts_parser.add_argument(
1649
+ "--index_rate",
1650
+ type=float,
1651
+ help=index_rate_description,
1652
+ choices=[(i / 10) for i in range(11)],
1653
+ default=0.3,
1654
+ )
1655
+ tts_parser.add_argument(
1656
+ "--volume_envelope",
1657
+ type=float,
1658
+ help=volume_envelope_description,
1659
+ choices=[(i / 10) for i in range(11)],
1660
+ default=1,
1661
+ )
1662
+ tts_parser.add_argument(
1663
+ "--protect",
1664
+ type=float,
1665
+ help=protect_description,
1666
+ choices=[i / 1000.0 for i in range(0, 501)],
1667
+ default=0.33,
1668
+ )
1669
+ tts_parser.add_argument(
1670
+ "--hop_length",
1671
+ type=int,
1672
+ help=hop_length_description,
1673
+ choices=range(1, 513),
1674
+ default=128,
1675
+ )
1676
+ tts_parser.add_argument(
1677
+ "--f0_method",
1678
+ type=str,
1679
+ help=f0_method_description,
1680
+ choices=[
1681
+ "crepe",
1682
+ "crepe-tiny",
1683
+ "rmvpe+",
1684
+ "fcpe",
1685
+ "hybrid[crepe+rmvpe]",
1686
+ "hybrid[crepe+fcpe]",
1687
+ "hybrid[rmvpe+fcpe]",
1688
+ "hybrid[crepe+rmvpe+fcpe]",
1689
+ ],
1690
+ default="rmvpe+",
1691
+ )
1692
+ tts_parser.add_argument(
1693
+ "--output_tts_path",
1694
+ type=str,
1695
+ help="Full path to save the synthesized TTS audio.",
1696
+ required=True,
1697
+ )
1698
+ tts_parser.add_argument(
1699
+ "--output_rvc_path",
1700
+ type=str,
1701
+ help="Full path to save the voice-converted audio using the synthesized TTS.",
1702
+ required=True,
1703
+ )
1704
+ tts_parser.add_argument(
1705
+ "--pth_path", type=str, help=pth_path_description, required=True
1706
+ )
1707
+ tts_parser.add_argument(
1708
+ "--index_path", type=str, help=index_path_description, required=True
1709
+ )
1710
+ tts_parser.add_argument(
1711
+ "--split_audio",
1712
+ type=lambda x: bool(strtobool(x)),
1713
+ choices=[True, False],
1714
+ help=split_audio_description,
1715
+ default=False,
1716
+ )
1717
+ tts_parser.add_argument(
1718
+ "--f0_autotune",
1719
+ type=lambda x: bool(strtobool(x)),
1720
+ choices=[True, False],
1721
+ help=f0_autotune_description,
1722
+ default=False,
1723
+ )
1724
+ tts_parser.add_argument(
1725
+ "--f0_autotune_strength",
1726
+ type=float,
1727
+ help="Control the strength of the autotune effect applied to the inferred pitch. Higher values result in a stronger correction.",
1728
+ choices=[(i / 10) for i in range(11)],
1729
+ default=1.0,
1730
+ )
1731
+ tts_parser.add_argument(
1732
+ "--clean_audio",
1733
+ type=lambda x: bool(strtobool(x)),
1734
+ choices=[True, False],
1735
+ help=clean_audio_description,
1736
+ default=False,
1737
+ )
1738
+ tts_parser.add_argument(
1739
+ "--clean_strength",
1740
+ type=float,
1741
+ help=clean_strength_description,
1742
+ choices=[(i / 10) for i in range(11)],
1743
+ default=0.7,
1744
+ )
1745
+ tts_parser.add_argument(
1746
+ "--export_format",
1747
+ type=str,
1748
+ help=export_format_description,
1749
+ choices=["WAV", "MP3", "FLAC", "OGG", "M4A"],
1750
+ default="WAV",
1751
+ )
1752
+ tts_parser.add_argument(
1753
+ "--embedder_model",
1754
+ type=str,
1755
+ help=embedder_model_description,
1756
+ choices=[
1757
+ "contentvec",
1758
+ "chinese-hubert-base",
1759
+ "japanese-hubert-base",
1760
+ "korean-hubert-base",
1761
+ "custom",
1762
+ ],
1763
+ default="contentvec",
1764
+ )
1765
+ tts_parser.add_argument(
1766
+ "--embedder_model_custom",
1767
+ type=str,
1768
+ help=embedder_model_custom_description,
1769
+ default=None,
1770
+ )
1771
+ tts_parser.add_argument(
1772
+ "--f0_file",
1773
+ type=str,
1774
+ help=f0_file_description,
1775
+ default=None,
1776
+ )
1777
+
1778
+ # Parser for 'preprocess' mode
1779
+ preprocess_parser = subparsers.add_parser(
1780
+ "preprocess", help="Preprocess a dataset for training."
1781
+ )
1782
+ preprocess_parser.add_argument(
1783
+ "--model_name", type=str, help="Name of the model to be trained.", required=True
1784
+ )
1785
+ preprocess_parser.add_argument(
1786
+ "--dataset_path", type=str, help="Path to the dataset directory.", required=True
1787
+ )
1788
+ preprocess_parser.add_argument(
1789
+ "--sample_rate",
1790
+ type=int,
1791
+ help="Target sampling rate for the audio data.",
1792
+ choices=[32000, 40000, 48000],
1793
+ required=True,
1794
+ )
1795
+ preprocess_parser.add_argument(
1796
+ "--cpu_cores",
1797
+ type=int,
1798
+ help="Number of CPU cores to use for preprocessing.",
1799
+ choices=range(1, 65),
1800
+ )
1801
+ preprocess_parser.add_argument(
1802
+ "--cut_preprocess",
1803
+ type=str,
1804
+ choices=["Skip", "Simple", "Automatic"],
1805
+ help="Cut the dataset into smaller segments for faster preprocessing.",
1806
+ default="Automatic",
1807
+ required=True,
1808
+ )
1809
+ preprocess_parser.add_argument(
1810
+ "--process_effects",
1811
+ type=lambda x: bool(strtobool(x)),
1812
+ choices=[True, False],
1813
+ help="Apply filtering effects during preprocessing. Set to False to disable all filters.",
1814
+ default=False,
1815
+ required=False,
1816
+ )
1817
+ preprocess_parser.add_argument(
1818
+ "--noise_reduction",
1819
+ type=lambda x: bool(strtobool(x)),
1820
+ choices=[True, False],
1821
+ help="Enable noise reduction during preprocessing.",
1822
+ default=False,
1823
+ required=False,
1824
+ )
1825
+ preprocess_parser.add_argument(
1826
+ "--noise_reduction_strength",
1827
+ type=float,
1828
+ help="Strength of the noise reduction filter.",
1829
+ choices=[(i / 10) for i in range(11)],
1830
+ default=0.7,
1831
+ required=False,
1832
+ )
1833
+ preprocess_parser.add_argument(
1834
+ "--chunk_len",
1835
+ type=float,
1836
+ help="Chunk length.",
1837
+ choices=[i * 0.5 for i in range(1, 11)],
1838
+ default=3.0,
1839
+ required=False,
1840
+ )
1841
+ preprocess_parser.add_argument(
1842
+ "--overlap_len",
1843
+ type=float,
1844
+ help="Overlap length.",
1845
+ choices=[0.0, 0.1, 0.2, 0.3, 0.4],
1846
+ default=0.3,
1847
+ required=False,
1848
+ )
1849
+
1850
+ # Parser for 'extract' mode
1851
+ extract_parser = subparsers.add_parser(
1852
+ "extract", help="Extract features from a dataset."
1853
+ )
1854
+ extract_parser.add_argument(
1855
+ "--model_name", type=str, help="Name of the model.", required=True
1856
+ )
1857
+ extract_parser.add_argument(
1858
+ "--f0_method",
1859
+ type=str,
1860
+ help="Pitch extraction method to use.",
1861
+ choices=[
1862
+ "crepe",
1863
+ "crepe-tiny",
1864
+ "rmvpe",
1865
+ ],
1866
+ default="rmvpe",
1867
+ )
1868
+ extract_parser.add_argument(
1869
+ "--hop_length",
1870
+ type=int,
1871
+ help="Hop length for feature extraction. Only applicable for Crepe pitch extraction.",
1872
+ choices=range(1, 513),
1873
+ default=128,
1874
+ )
1875
+ extract_parser.add_argument(
1876
+ "--cpu_cores",
1877
+ type=int,
1878
+ help="Number of CPU cores to use for feature extraction (optional).",
1879
+ choices=range(1, 65),
1880
+ default=None,
1881
+ )
1882
+ extract_parser.add_argument(
1883
+ "--gpu",
1884
+ type=str,
1885
+ help="GPU device to use for feature extraction (optional).",
1886
+ default="-",
1887
+ )
1888
+ extract_parser.add_argument(
1889
+ "--sample_rate",
1890
+ type=int,
1891
+ help="Target sampling rate for the audio data.",
1892
+ choices=[32000, 40000, 44100, 48000],
1893
+ required=True,
1894
+ )
1895
+ extract_parser.add_argument(
1896
+ "--embedder_model",
1897
+ type=str,
1898
+ help=embedder_model_description,
1899
+ choices=[
1900
+ "contentvec",
1901
+ "chinese-hubert-base",
1902
+ "japanese-hubert-base",
1903
+ "korean-hubert-base",
1904
+ "custom",
1905
+ ],
1906
+ default="contentvec",
1907
+ )
1908
+ extract_parser.add_argument(
1909
+ "--embedder_model_custom",
1910
+ type=str,
1911
+ help=embedder_model_custom_description,
1912
+ default=None,
1913
+ )
1914
+ extract_parser.add_argument(
1915
+ "--include_mutes",
1916
+ type=int,
1917
+ help="Number of silent files to include.",
1918
+ choices=range(0, 11),
1919
+ default=2,
1920
+ required=True,
1921
+ )
1922
+
1923
+ # Parser for 'train' mode
1924
+ train_parser = subparsers.add_parser("train", help="Train an RVC model.")
1925
+ train_parser.add_argument(
1926
+        "--model_name", type=str, help="Name of the model to be trained.", required=True
+    )
+    train_parser.add_argument(
+        "--vocoder",
+        type=str,
+        help="Vocoder name",
+        choices=["HiFi-GAN", "MRF HiFi-GAN", "RefineGAN"],
+        default="HiFi-GAN",
+    )
+    train_parser.add_argument(
+        "--checkpointing",
+        type=lambda x: bool(strtobool(x)),
+        choices=[True, False],
+        help="Enables memory-efficient training.",
+        default=False,
+        required=False,
+    )
+    train_parser.add_argument(
+        "--save_every_epoch",
+        type=int,
+        help="Save the model every specified number of epochs.",
+        choices=range(1, 10001),
+        required=True,
+    )
+    train_parser.add_argument(
+        "--save_only_latest",
+        type=lambda x: bool(strtobool(x)),
+        choices=[True, False],
+        help="Save only the latest model checkpoint.",
+        default=False,
+    )
+    train_parser.add_argument(
+        "--save_every_weights",
+        type=lambda x: bool(strtobool(x)),
+        choices=[True, False],
+        help="Save model weights every epoch.",
+        default=True,
+    )
+    train_parser.add_argument(
+        "--total_epoch",
+        type=int,
+        help="Total number of epochs to train for.",
+        choices=range(1, 10001),
+        default=1000,
+    )
+    train_parser.add_argument(
+        "--sample_rate",
+        type=int,
+        help="Sampling rate of the training data.",
+        choices=[32000, 40000, 48000],
+        required=True,
+    )
+    train_parser.add_argument(
+        "--batch_size",
+        type=int,
+        help="Batch size for training.",
+        choices=range(1, 51),
+        default=8,
+    )
+    train_parser.add_argument(
+        "--gpu",
+        type=str,
+        help="GPU device to use for training (e.g., '0').",
+        default="0",
+    )
+    train_parser.add_argument(
+        "--pretrained",
+        type=lambda x: bool(strtobool(x)),
+        choices=[True, False],
+        help="Use a pretrained model for initialization.",
+        default=True,
+    )
+    train_parser.add_argument(
+        "--custom_pretrained",
+        type=lambda x: bool(strtobool(x)),
+        choices=[True, False],
+        help="Use a custom pretrained model.",
+        default=False,
+    )
+    train_parser.add_argument(
+        "--g_pretrained_path",
+        type=str,
+        nargs="?",
+        default=None,
+        help="Path to the pretrained generator model file.",
+    )
+    train_parser.add_argument(
+        "--d_pretrained_path",
+        type=str,
+        nargs="?",
+        default=None,
+        help="Path to the pretrained discriminator model file.",
+    )
+    train_parser.add_argument(
+        "--overtraining_detector",
+        type=lambda x: bool(strtobool(x)),
+        choices=[True, False],
+        help="Enable overtraining detection.",
+        default=False,
+    )
+    train_parser.add_argument(
+        "--overtraining_threshold",
+        type=int,
+        help="Threshold for overtraining detection.",
+        choices=range(1, 101),
+        default=50,
+    )
+    train_parser.add_argument(
+        "--cleanup",
+        type=lambda x: bool(strtobool(x)),
+        choices=[True, False],
+        help="Cleanup previous training attempt.",
+        default=False,
+    )
+    train_parser.add_argument(
+        "--cache_data_in_gpu",
+        type=lambda x: bool(strtobool(x)),
+        choices=[True, False],
+        help="Cache training data in GPU memory.",
+        default=False,
+    )
+    train_parser.add_argument(
+        "--index_algorithm",
+        type=str,
+        choices=["Auto", "Faiss", "KMeans"],
+        help="Choose the method for generating the index file.",
+        default="Auto",
+        required=False,
+    )
+
+    # Parser for 'index' mode
+    index_parser = subparsers.add_parser(
+        "index", help="Generate an index file for an RVC model."
+    )
+    index_parser.add_argument(
+        "--model_name", type=str, help="Name of the model.", required=True
+    )
+    index_parser.add_argument(
+        "--index_algorithm",
+        type=str,
+        choices=["Auto", "Faiss", "KMeans"],
+        help="Choose the method for generating the index file.",
+        default="Auto",
+        required=False,
+    )
+
+    # Parser for 'model_information' mode
+    model_information_parser = subparsers.add_parser(
+        "model_information", help="Display information about a trained model."
+    )
+    model_information_parser.add_argument(
+        "--pth_path", type=str, help="Path to the .pth model file.", required=True
+    )
+
+    # Parser for 'model_blender' mode
+    model_blender_parser = subparsers.add_parser(
+        "model_blender", help="Fuse two RVC models together."
+    )
+    model_blender_parser.add_argument(
+        "--model_name", type=str, help="Name of the new fused model.", required=True
+    )
+    model_blender_parser.add_argument(
+        "--pth_path_1",
+        type=str,
+        help="Path to the first .pth model file.",
+        required=True,
+    )
+    model_blender_parser.add_argument(
+        "--pth_path_2",
+        type=str,
+        help="Path to the second .pth model file.",
+        required=True,
+    )
+    model_blender_parser.add_argument(
+        "--ratio",
+        type=float,
+        help="Ratio for blending the two models (0.0 to 1.0, in steps of 0.1).",
+        choices=[(i / 10) for i in range(11)],
+        default=0.5,
+    )
+
+    # Parser for 'tensorboard' mode
+    subparsers.add_parser(
+        "tensorboard", help="Launch TensorBoard for monitoring training progress."
+    )
+
+    # Parser for 'download' mode
+    download_parser = subparsers.add_parser(
+        "download", help="Download a model from a provided link."
+    )
+    download_parser.add_argument(
+        "--model_link", type=str, help="Direct link to the model file.", required=True
+    )
+
+    # Parser for 'prerequisites' mode
+    prerequisites_parser = subparsers.add_parser(
+        "prerequisites", help="Install prerequisites for RVC."
+    )
+    prerequisites_parser.add_argument(
+        "--pretraineds_hifigan",
+        type=lambda x: bool(strtobool(x)),
+        choices=[True, False],
+        default=True,
+        help="Download pretrained HiFi-GAN models for RVC v2.",
+    )
+    prerequisites_parser.add_argument(
+        "--models",
+        type=lambda x: bool(strtobool(x)),
+        choices=[True, False],
+        default=True,
+        help="Download additional models.",
+    )
+    prerequisites_parser.add_argument(
+        "--exe",
+        type=lambda x: bool(strtobool(x)),
+        choices=[True, False],
+        default=True,
+        help="Download required executables.",
+    )
+
+    # Parser for 'audio_analyzer' mode
+    audio_analyzer = subparsers.add_parser(
+        "audio_analyzer", help="Analyze an audio file."
+    )
+    audio_analyzer.add_argument(
+        "--input_path", type=str, help="Path to the input audio file.", required=True
+    )
+
+    return parser.parse_args()
+
+
+def main():
+    """Parse the command-line arguments and dispatch to the matching run_*_script function."""
+    if len(sys.argv) == 1:
+        print("Please run the script with '-h' for more information.")
+        sys.exit(1)
+
+    args = parse_arguments()
+
+    try:
+        if args.mode == "infer":
+            run_infer_script(
+                pitch=args.pitch,
+                index_rate=args.index_rate,
+                volume_envelope=args.volume_envelope,
+                protect=args.protect,
+                hop_length=args.hop_length,
+                f0_method=args.f0_method,
+                input_path=args.input_path,
+                output_path=args.output_path,
+                pth_path=args.pth_path,
+                index_path=args.index_path,
+                split_audio=args.split_audio,
+                f0_autotune=args.f0_autotune,
+                f0_autotune_strength=args.f0_autotune_strength,
+                clean_audio=args.clean_audio,
+                clean_strength=args.clean_strength,
+                export_format=args.export_format,
+                embedder_model=args.embedder_model,
+                embedder_model_custom=args.embedder_model_custom,
+                f0_file=args.f0_file,
+                formant_shifting=args.formant_shifting,
+                formant_qfrency=args.formant_qfrency,
+                formant_timbre=args.formant_timbre,
+                sid=args.sid,
+                post_process=args.post_process,
+                reverb=args.reverb,
+                pitch_shift=args.pitch_shift,
+                limiter=args.limiter,
+                gain=args.gain,
+                distortion=args.distortion,
+                chorus=args.chorus,
+                bitcrush=args.bitcrush,
+                clipping=args.clipping,
+                compressor=args.compressor,
+                delay=args.delay,
+                reverb_room_size=args.reverb_room_size,
+                reverb_damping=args.reverb_damping,
+                reverb_wet_gain=args.reverb_wet_gain,
+                reverb_dry_gain=args.reverb_dry_gain,
+                reverb_width=args.reverb_width,
+                reverb_freeze_mode=args.reverb_freeze_mode,
+                pitch_shift_semitones=args.pitch_shift_semitones,
+                limiter_threshold=args.limiter_threshold,
+                limiter_release_time=args.limiter_release_time,
+                gain_db=args.gain_db,
+                distortion_gain=args.distortion_gain,
+                chorus_rate=args.chorus_rate,
+                chorus_depth=args.chorus_depth,
+                chorus_center_delay=args.chorus_center_delay,
+                chorus_feedback=args.chorus_feedback,
+                chorus_mix=args.chorus_mix,
+                bitcrush_bit_depth=args.bitcrush_bit_depth,
+                clipping_threshold=args.clipping_threshold,
+                compressor_threshold=args.compressor_threshold,
+                compressor_ratio=args.compressor_ratio,
+                compressor_attack=args.compressor_attack,
+                compressor_release=args.compressor_release,
+                delay_seconds=args.delay_seconds,
+                delay_feedback=args.delay_feedback,
+                delay_mix=args.delay_mix,
+            )
+        elif args.mode == "batch_infer":
+            run_batch_infer_script(
+                pitch=args.pitch,
+                index_rate=args.index_rate,
+                volume_envelope=args.volume_envelope,
+                protect=args.protect,
+                hop_length=args.hop_length,
+                f0_method=args.f0_method,
+                input_folder=args.input_folder,
+                output_folder=args.output_folder,
+                pth_path=args.pth_path,
+                index_path=args.index_path,
+                split_audio=args.split_audio,
+                f0_autotune=args.f0_autotune,
+                f0_autotune_strength=args.f0_autotune_strength,
+                clean_audio=args.clean_audio,
+                clean_strength=args.clean_strength,
+                export_format=args.export_format,
+                embedder_model=args.embedder_model,
+                embedder_model_custom=args.embedder_model_custom,
+                f0_file=args.f0_file,
+                formant_shifting=args.formant_shifting,
+                formant_qfrency=args.formant_qfrency,
+                formant_timbre=args.formant_timbre,
+                sid=args.sid,
+                post_process=args.post_process,
+                reverb=args.reverb,
+                pitch_shift=args.pitch_shift,
+                limiter=args.limiter,
+                gain=args.gain,
+                distortion=args.distortion,
+                chorus=args.chorus,
+                bitcrush=args.bitcrush,
+                clipping=args.clipping,
+                compressor=args.compressor,
+                delay=args.delay,
+                reverb_room_size=args.reverb_room_size,
+                reverb_damping=args.reverb_damping,
+                reverb_wet_gain=args.reverb_wet_gain,
+                reverb_dry_gain=args.reverb_dry_gain,
+                reverb_width=args.reverb_width,
+                reverb_freeze_mode=args.reverb_freeze_mode,
+                pitch_shift_semitones=args.pitch_shift_semitones,
+                limiter_threshold=args.limiter_threshold,
+                limiter_release_time=args.limiter_release_time,
+                gain_db=args.gain_db,
+                distortion_gain=args.distortion_gain,
+                chorus_rate=args.chorus_rate,
+                chorus_depth=args.chorus_depth,
+                chorus_center_delay=args.chorus_center_delay,
+                chorus_feedback=args.chorus_feedback,
+                chorus_mix=args.chorus_mix,
+                bitcrush_bit_depth=args.bitcrush_bit_depth,
+                clipping_threshold=args.clipping_threshold,
+                compressor_threshold=args.compressor_threshold,
+                compressor_ratio=args.compressor_ratio,
+                compressor_attack=args.compressor_attack,
+                compressor_release=args.compressor_release,
+                delay_seconds=args.delay_seconds,
+                delay_feedback=args.delay_feedback,
+                delay_mix=args.delay_mix,
+            )
+        elif args.mode == "tts":
+            run_tts_script(
+                tts_file=args.tts_file,
+                tts_text=args.tts_text,
+                tts_voice=args.tts_voice,
+                tts_rate=args.tts_rate,
+                pitch=args.pitch,
+                index_rate=args.index_rate,
+                volume_envelope=args.volume_envelope,
+                protect=args.protect,
+                hop_length=args.hop_length,
+                f0_method=args.f0_method,
+                output_tts_path=args.output_tts_path,
+                output_rvc_path=args.output_rvc_path,
+                pth_path=args.pth_path,
+                index_path=args.index_path,
+                split_audio=args.split_audio,
+                f0_autotune=args.f0_autotune,
+                f0_autotune_strength=args.f0_autotune_strength,
+                clean_audio=args.clean_audio,
+                clean_strength=args.clean_strength,
+                export_format=args.export_format,
+                embedder_model=args.embedder_model,
+                embedder_model_custom=args.embedder_model_custom,
+                f0_file=args.f0_file,
+            )
+        elif args.mode == "preprocess":
+            run_preprocess_script(
+                model_name=args.model_name,
+                dataset_path=args.dataset_path,
+                sample_rate=args.sample_rate,
+                cpu_cores=args.cpu_cores,
+                cut_preprocess=args.cut_preprocess,
+                process_effects=args.process_effects,
+                noise_reduction=args.noise_reduction,
+                clean_strength=args.noise_reduction_strength,
+                chunk_len=args.chunk_len,
+                overlap_len=args.overlap_len,
+            )
+        elif args.mode == "extract":
+            run_extract_script(
+                model_name=args.model_name,
+                f0_method=args.f0_method,
+                hop_length=args.hop_length,
+                cpu_cores=args.cpu_cores,
+                gpu=args.gpu,
+                sample_rate=args.sample_rate,
+                embedder_model=args.embedder_model,
+                embedder_model_custom=args.embedder_model_custom,
+                include_mutes=args.include_mutes,
+            )
+        elif args.mode == "train":
+            run_train_script(
+                model_name=args.model_name,
+                save_every_epoch=args.save_every_epoch,
+                save_only_latest=args.save_only_latest,
+                save_every_weights=args.save_every_weights,
+                total_epoch=args.total_epoch,
+                sample_rate=args.sample_rate,
+                batch_size=args.batch_size,
+                gpu=args.gpu,
+                overtraining_detector=args.overtraining_detector,
+                overtraining_threshold=args.overtraining_threshold,
+                pretrained=args.pretrained,
+                custom_pretrained=args.custom_pretrained,
+                cleanup=args.cleanup,
+                index_algorithm=args.index_algorithm,
+                cache_data_in_gpu=args.cache_data_in_gpu,
+                g_pretrained_path=args.g_pretrained_path,
+                d_pretrained_path=args.d_pretrained_path,
+                vocoder=args.vocoder,
+                checkpointing=args.checkpointing,
+            )
+        elif args.mode == "index":
+            run_index_script(
+                model_name=args.model_name,
+                index_algorithm=args.index_algorithm,
+            )
+        elif args.mode == "model_information":
+            run_model_information_script(
+                pth_path=args.pth_path,
+            )
+        elif args.mode == "model_blender":
+            run_model_blender_script(
+                model_name=args.model_name,
+                pth_path_1=args.pth_path_1,
+                pth_path_2=args.pth_path_2,
+                ratio=args.ratio,
+            )
+        elif args.mode == "tensorboard":
+            run_tensorboard_script()
+        elif args.mode == "download":
+            run_download_script(
+                model_link=args.model_link,
+            )
+        elif args.mode == "prerequisites":
+            run_prerequisites_script(
+                pretraineds_hifigan=args.pretraineds_hifigan,
+                models=args.models,
+                exe=args.exe,
+            )
+        elif args.mode == "audio_analyzer":
+            run_audio_analyzer_script(
+                input_path=args.input_path,
+            )
+    except Exception as error:
+        print(f"An error occurred during execution: {error}")
+
+        import traceback
+
+        traceback.print_exc()
+        # Exit with a non-zero status so callers and shell scripts can detect the failure.
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()