Hector Li committed on
Commit df93d13 · 0 parents

Initial commit for Hugging Face

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +9 -0
  2. .gitignore +40 -0
  3. LICENSE +21 -0
  4. README.md +283 -0
  5. README_OLD.md +382 -0
  6. README_V1.md +102 -0
  7. app.py +356 -0
  8. automated_pipeline.sh +97 -0
  9. build_faiss_index.py +51 -0
  10. build_mmap.py +85 -0
  11. compare_audio.py +20 -0
  12. configs/base.yaml +71 -0
  13. configs/singers/singer0001.npy +0 -0
  14. configs/singers/singer0002.npy +0 -0
  15. configs/singers/singer0003.npy +0 -0
  16. configs/singers/singer0004.npy +0 -0
  17. configs/singers/singer0005.npy +0 -0
  18. configs/singers/singer0006.npy +0 -0
  19. configs/singers/singer0007.npy +0 -0
  20. configs/singers/singer0008.npy +0 -0
  21. configs/singers/singer0009.npy +0 -0
  22. configs/singers/singer0010.npy +0 -0
  23. configs/singers/singer0011.npy +0 -0
  24. configs/singers/singer0012.npy +0 -0
  25. configs/singers/singer0013.npy +0 -0
  26. configs/singers/singer0014.npy +0 -0
  27. configs/singers/singer0015.npy +0 -0
  28. configs/singers/singer0016.npy +0 -0
  29. configs/singers/singer0017.npy +0 -0
  30. configs/singers/singer0018.npy +0 -0
  31. configs/singers/singer0019.npy +0 -0
  32. configs/singers/singer0020.npy +0 -0
  33. configs/singers/singer0021.npy +0 -0
  34. configs/singers/singer0022.npy +0 -0
  35. configs/singers/singer0023.npy +0 -0
  36. configs/singers/singer0024.npy +0 -0
  37. configs/singers/singer0025.npy +0 -0
  38. configs/singers/singer0026.npy +0 -0
  39. configs/singers/singer0027.npy +0 -0
  40. configs/singers/singer0028.npy +0 -0
  41. configs/singers/singer0029.npy +0 -0
  42. configs/singers/singer0030.npy +0 -0
  43. configs/singers/singer0031.npy +0 -0
  44. configs/singers/singer0032.npy +0 -0
  45. configs/singers/singer0033.npy +0 -0
  46. configs/singers/singer0034.npy +0 -0
  47. configs/singers/singer0035.npy +0 -0
  48. configs/singers/singer0036.npy +0 -0
  49. configs/singers/singer0037.npy +0 -0
  50. configs/singers/singer0038.npy +0 -0
.gitattributes ADDED
@@ -0,0 +1,9 @@
*.hdf5 filter=lfs diff=lfs merge=lfs -text
*.wav filter=lfs diff=lfs merge=lfs -text
*.cWG5V7 filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
*.pdf filter=lfs diff=lfs merge=lfs -text
*.exe filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,40 @@
__pycache__/

runtime/
.venv/
venv/
.venv_linux/
.vscode/

*_pretrain/
crepe/assets/full.pth

chkpt/
data_svc/
dataset_raw/
files/
logs/

sovits5.0.pth
svc_out_pit.wav
svc_out.wav
svc_tmp.pit.csv
svc_tmp.ppg.npy
svc_tmp.vec.npy
test.wav

so-vits-svc-5.0-*.zip

# Ignore model checkpoints and large audio arrays
*.pt
*.pth
model_1200000.safetensors
*.wav
chkpt/
chkpt_cfm/
logs/

opensinger/
dataset_raw_old/
data_svc_infer/
stable-audio-tools/
LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 PlayVoice

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md ADDED
@@ -0,0 +1,283 @@
# CFM-SVC / F5-SVC — Singing Voice Conversion

Two implementations of a flow-matching-based Singing Voice Conversion (SVC) system.

| | V1 (CFM-SVC) | V2 (F5-SVC) |
|---|---|---|
| Backbone | DiT trained from scratch | F5-TTS pretrained (LoRA) |
| Output space | DAC codec latents (1024-dim) | Log-mel spectrogram (100-dim) |
| Vocoder | DAC decoder (frozen) | Vocos (frozen) |
| Params trained | ~82M | ~5M (adapter + LoRA) |
| Training data | Multi-speaker singing | Multi-speaker singing |
| Speaker adaptation | Speaker d-vector | Stage 2: spk_proj on speech clips |

---
## Project Structure

```
matcha_svc/
├── models/
│   ├── cfm.py                V1: Diffusion Transformer (DiT)
│   ├── cond_encoder.py       V1: PPG+HuBERT+F0+Speaker → conditioning
│   ├── codec_wrapper.py      V1: DAC codec + projector head
│   ├── svc_cond_adapter.py   V2: PPG+HuBERT+F0+Speaker → F5-TTS text_dim
│   ├── lora_utils.py         V2: LoRALinear, inject_lora(), freeze_non_lora()
│   └── f5_svc.py             V2: F5SVCModel wrapper + build_f5svc() factory
│
├── losses/
│   └── cfm_loss.py           V1: flow matching + projector commitment loss
│
├── svc_data/
│   └── mel_svc_dataset.py    V2: log-mel dataset (same directory layout as V1)
│
├── train_cfm.py              V1 training script
├── train_f5_stage1.py        V2 Stage 1: SVCCondAdapter + LoRA on singing data
├── train_f5_stage2.py        V2 Stage 2: spk_proj on target speaker speech
├── infer_f5_svc.py           V2 inference: Euler sampling → Vocos → .wav
├── submit_train.sh           SLURM job script for V1
│
├── data_svc/                 Preprocessed features (generated by svc_preprocessing.py)
│   ├── audio/<spk>/<id>.wav
│   ├── whisper/<spk>/<id>.ppg.npy
│   ├── hubert/<spk>/<id>.vec.npy
│   ├── pitch/<spk>/<id>.pit.npy
│   ├── speaker/<spk>/<id>.spk.npy
│   └── codec_targets/<spk>/<id>.pt   ← V1 only
│
├── chkpt_cfm/                V1 checkpoints
└── chkpt_f5svc/              V2 checkpoints
```

---
## Prerequisites

```bash
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install descript-audio-codec                      # V1
pip install f5-tts vocos safetensors huggingface_hub  # V2
```

Pretrained feature extractors (shared by V1 and V2):

| File | Destination |
|---|---|
| `best_model.pth.tar` (speaker encoder) | `speaker_pretrain/` |
| `large-v2.pt` (Whisper) | `whisper_pretrain/` |
| `hubert-soft-0d54a1f4.pt` | `hubert_pretrain/` |
| `full.pth` (CREPE) | `crepe/assets/` |

---
## Data Preparation (shared by V1 and V2)

### 1. Raw audio layout

```
dataset_raw/
├── speaker0/
│   ├── 000001.wav
│   └── ...
└── speaker1/
    └── ...
```

Clips should be clean vocals, under 30 seconds, with no accompaniment.
Use UVR for source separation and audio-slicer for cutting.

### 2. Extract features

```bash
python svc_preprocessing.py -t 2
```

Produces under `data_svc/`:
- `whisper/<spk>/<id>.ppg.npy` — Whisper PPG (1280-dim, 50 Hz)
- `hubert/<spk>/<id>.vec.npy` — HuBERT (256-dim, 50 Hz)
- `pitch/<spk>/<id>.pit.npy` — F0 in Hz (50 Hz, 0 = unvoiced)
- `speaker/<spk>/<id>.spk.npy` — Speaker d-vector (256-dim)
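The three frame-level streams above share a 50 Hz frame rate but can differ in length by a frame or two due to padding. A small sanity check might look like the following sketch (`trim_to_common` is illustrative, not part of the repo; real arrays would be loaded from `data_svc/` with `np.load`):

```python
import numpy as np

def trim_to_common(ppg, vec, pit):
    # PPG, HuBERT, and F0 are all 50 Hz; trim to the shortest stream
    # so the conditioning tensors line up frame-for-frame.
    T = min(ppg.shape[0], vec.shape[0], pit.shape[0])
    return ppg[:T], vec[:T], pit[:T]

# Synthetic stand-ins with off-by-one lengths (real files come from data_svc/)
ppg = np.zeros((501, 1280), dtype=np.float32)  # Whisper PPG
vec = np.zeros((500, 256), dtype=np.float32)   # HuBERT
pit = np.zeros((499,), dtype=np.float32)       # F0 in Hz, 0 = unvoiced
ppg, vec, pit = trim_to_common(ppg, vec, pit)
assert ppg.shape[0] == vec.shape[0] == pit.shape[0] == 499
```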
### 3. V1 only: extract codec targets

```bash
python data/codec_targets.py -w ./data_svc/waves-32k -o ./data_svc/codec_targets
```

V2 computes mel spectrograms on the fly from the raw audio — no offline codec step needed.

---

## V1: CFM-SVC (Training from Scratch)

### Train

```bash
python train_cfm.py \
    --data_dir ./data_svc/codec_targets \
    --batch_size 64 \
    --lr 2e-5 \
    --epochs 250 \
    --save_interval 1

# or via SLURM:
sbatch submit_train.sh
```

Training automatically resumes from the latest checkpoint in `chkpt_cfm/`.

Key arguments:

| Argument | Default | Description |
|---|---|---|
| `--lr` | `1e-4` | Learning rate |
| `--batch_size` | `2` | Batch size |
| `--grad_accum` | `1` | Gradient accumulation steps |
| `--grad_clip` | `1.0` | Gradient clip max norm |
| `--save_interval` | `50` | Save every N epochs |
| `--use_checkpointing` | off | Enable gradient checkpointing (saves VRAM) |
| `--freeze_norm` | off | Freeze latent norm stats (for fine-tuning) |

### Inference (V1)

```bash
python infer.py --wave /path/to/source_singing.wav
```

---

## V2: F5-SVC (LoRA on F5-TTS)

### Architecture

- F5-TTS's DiT is loaded with pretrained weights and kept mostly frozen.
- `SVCCondAdapter` replaces the text encoder: PPG + HuBERT + F0 + speaker → (B, T, 512).
- LoRA (rank 16) is injected into every DiT attention projection (Q, K, V, Out).
- Vocos decodes mel spectrograms to audio.
- Two-stage training protocol:
  - **Stage 1** (singing): SVCCondAdapter + LoRA trained on multi-speaker singing data.
  - **Stage 2** (per-speaker): only `spk_proj` trained on the target speaker's speech clips.
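The core idea behind the LoRA injection (what `lora_utils.py` presumably implements; names, scale, and shapes here are illustrative, not the repo's actual code) is a frozen base weight plus a low-rank update whose up-projection starts at zero, so training begins exactly from the pretrained behaviour:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 16  # rank 16, as in the attention projections

W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.02   # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-init
scale = 1.0 / rank

def lora_linear(x):
    # y = x W^T + scale * (x A^T) B^T — only A and B would receive gradients.
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
# With B = 0 the LoRA path is inactive: output equals the frozen layer.
assert np.allclose(lora_linear(x), x @ W.T)
```

Only `A` and `B` (a few thousand parameters per projection) are saved and trained, which is how the trainable-parameter count stays around ~5M.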
### Download F5-TTS checkpoint

```python
from huggingface_hub import hf_hub_download
path = hf_hub_download("SWivid/F5-TTS", "F5TTS_Base/model_1200000.safetensors")
print(path)
```

### Stage 1 — Singing Adaptation

Trains: `SVCCondAdapter` (content projection + speaker projection) + LoRA adapters
Freezes: all other DiT weights

```bash
python train_f5_stage1.py \
    --f5tts_ckpt /path/to/model_1200000.safetensors \
    --audio_dir ./data_svc/audio \
    --epochs 200 \
    --batch_size 16 \
    --lr 1e-4

# Checkpoints saved to ./chkpt_f5svc/stage1_epoch_N.pt
```

All PPG/HuBERT/F0/speaker features from V1 preprocessing are reused directly.
The only difference is the audio directory name: V1 produces `data_svc/waves-32k/`
while V2 defaults to `data_svc/audio/`. Pass `--audio_dir ./data_svc/waves-32k` to
reuse V1 audio (it is resampled to 24 kHz on the fly; no re-extraction needed).
The codec targets directory (`data_svc/codec_targets/`) is V1-only and not needed here.
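The on-the-fly 32 kHz → 24 kHz resampling mentioned above is a 4:3 rate change. A minimal linear-interpolation sketch of the idea (a real pipeline would use a proper polyphase filter such as `torchaudio.transforms.Resample` rather than this):

```python
import numpy as np

def resample(x, sr_in=32000, sr_out=24000):
    # Interpolate onto the new sample grid; length scales by sr_out / sr_in.
    n_out = int(round(len(x) * sr_out / sr_in))
    t_in = np.arange(len(x)) / sr_in
    t_out = np.arange(n_out) / sr_out
    return np.interp(t_out, t_in, x)

one_second = np.zeros(32000)           # 1 s of 32 kHz audio
assert len(resample(one_second)) == 24000
```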
### Stage 2 — Per-Speaker Fine-tuning

Trains: `svc_adapter.spk_proj` only
Freezes: DiT + LoRA (locked in from Stage 1)
Data: speech clips of the target speaker (no singing required)

```bash
python train_f5_stage2.py \
    --stage1_ckpt ./chkpt_f5svc/stage1_epoch_200.pt \
    --audio_dir ./data_svc/audio/my_speaker \
    --speaker_id my_speaker \
    --epochs 50

# Saved to ./chkpt_f5svc/stage2_my_speaker.pt
```

The target speaker's speech clips need the same feature extraction as Stage 1:
run `svc_preprocessing.py` pointing at the speech audio directory.

### Inference (V2)

```bash
python infer_f5_svc.py \
    --ckpt ./chkpt_f5svc/stage1_epoch_200.pt \
    --source ./source_singing.wav \
    --target_spk ./data_svc/speaker/my_speaker/ref.spk.npy \
    --ref_audio ./data_svc/audio/my_speaker/ref.wav \
    --output ./converted.wav \
    --steps 32
```

For a Stage 2 speaker-adapted checkpoint:
```bash
python infer_f5_svc.py \
    --ckpt ./chkpt_f5svc/stage2_my_speaker.pt \
    --source ./source_singing.wav \
    --target_spk ./data_svc/speaker/my_speaker/ref.spk.npy \
    --ref_audio ./data_svc/audio/my_speaker/ref.wav \
    --output ./converted.wav
```

Inference arguments:

| Argument | Default | Description |
|---|---|---|
| `--ckpt` | required | Stage 1 or Stage 2 checkpoint |
| `--source` | required | Source singing .wav |
| `--target_spk` | required | Target speaker .spk.npy |
| `--ref_audio` | `None` | Short .wav of the target speaker for timbre reference |
| `--ref_sec` | `3.0` | Seconds of ref_audio to use |
| `--steps` | `32` | Euler ODE steps (more = higher quality, slower) |
| `--output` | `./converted.wav` | Output path |

The source audio must have pre-extracted features (PPG, HuBERT, F0) in the standard
`data_svc/` directory structure. Run `svc_preprocessing.py` on the source if needed.
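`--steps` controls a fixed-step Euler integration of the learned ODE from noise to mel frames. Schematically (with a toy closed-form velocity field standing in for the conditioned DiT):

```python
import numpy as np

def euler_sample(v_field, x0, steps):
    # Integrate dx/dt = v(x, t) from t=0 to t=1 in `steps` uniform steps.
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * v_field(x, t)
    return x

# Toy field dx/dt = -x: exact solution x(1) = x0 * exp(-1) ≈ 0.3679.
x1 = euler_sample(lambda x, t: -x, np.array([1.0]), steps=32)
assert abs(x1[0] - np.exp(-1.0)) < 0.01
```

Doubling `--steps` roughly halves the Euler discretization error, which is why more steps trade speed for quality.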
---

## Checkpoints

V1 saves the full model state per epoch to `chkpt_cfm/`:
```
chkpt_cfm/
├── dit_epoch_N.pt
├── cond_encoder_epoch_N.pt
├── projector_epoch_N.pt
├── ema_dit_epoch_N.pt
├── optimizer_epoch_N.pt
├── scheduler_epoch_N.pt
└── latent_norm.pt          ← cached normalization stats
```

V2 saves adapter + LoRA state per epoch to `chkpt_f5svc/`:
```
chkpt_f5svc/
├── stage1_epoch_N.pt       ← full model state (adapter + LoRA + frozen DiT);
│                             also contains a lora_only key for lightweight sharing
└── stage2_<speaker_id>.pt  ← speaker-adapted state
```
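When only the lightweight adapters need to be shared, the `lora_only` entry can be pulled out of a Stage 1 checkpoint. A sketch, assuming the checkpoint is a plain dict with a `lora_only` key as described above (demonstrated on a stand-in dict; parameter names are hypothetical, and in practice the dict would come from `torch.load(..., map_location="cpu")`):

```python
# Stand-in for torch.load("chkpt_f5svc/stage1_epoch_200.pt")
ckpt = {
    "model": {"dit.blocks.0.attn.q.weight": [0.0]},      # full state (large)
    "lora_only": {"dit.blocks.0.attn.q.lora_A": [0.0]},  # adapters only (small)
}

shared = ckpt["lora_only"]   # this is all a collaborator needs on top of F5-TTS
assert all("lora" in name for name in shared)
```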
---

## References

- Rectified Flow / Flow Matching
- F5-TTS: [SWivid/F5-TTS](https://github.com/SWivid/F5-TTS)
- Vocos vocoder: [gemelo-ai/vocos](https://github.com/gemelo-ai/vocos)
- DAC: [descriptinc/descript-audio-codec](https://github.com/descriptinc/descript-audio-codec)
- so-vits-svc-5.0: preprocessing pipeline
README_OLD.md ADDED
@@ -0,0 +1,382 @@
<div align="center">
<h1> Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS </h1>

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/maxmax20160403/sovits5.0)
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub forks" src="https://img.shields.io/github/forks/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub issues" src="https://img.shields.io/github/issues/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub" src="https://img.shields.io/github/license/PlayVoice/so-vits-svc-5.0">

</div>

- This project targets deep learning beginners; basic knowledge of Python and PyTorch is the prerequisite;
- This project aims to help deep learning beginners get past boring, purely theoretical study and master basic deep learning knowledge by combining it with practice;
- This project does not support real-time voice conversion (replace whisper if real-time conversion is what you are looking for);
- This project will not develop one-click packages for other purposes;

![vits-5.0-frame](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/3854b281-8f97-4016-875b-6eb663c92466)

- Minimum 6 GB VRAM required for training

- Support for multiple speakers

- Create unique speakers through speaker mixing

- Even voices with light accompaniment can be converted

- F0 can be edited using Excel

https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/6a09805e-ab93-47fe-9a14-9cbc1e0e7c3a

Powered by [@ShadowVap](https://space.bilibili.com/491283091)

## Model properties

| Feature | From | Status | Function |
| :--- | :--- | :--- | :--- |
| whisper | OpenAI | ✅ | strong noise immunity |
| bigvgan | NVIDIA | ✅ | anti-aliased snake activation; clearer formants, noticeably improved sound quality |
| natural speech | Microsoft | ✅ | reduce mispronunciation |
| neural source-filter | NII | ✅ | solve the problem of audio F0 discontinuity |
| speaker encoder | Google | ✅ | timbre encoding and clustering |
| GRL for speaker | Ubisoft | ✅ | prevent the encoder from leaking timbre |
| SNAC | Samsung | ✅ | one-shot cloning for VITS |
| SCLN | Microsoft | ✅ | improve cloning |
| PPG perturbation | this project | ✅ | improved noise immunity and de-timbre |
| HuBERT perturbation | this project | ✅ | improved noise immunity and de-timbre |
| VAE perturbation | this project | ✅ | improve sound quality |
| MIX encoder | this project | ✅ | improve conversion stability |
| USP infer | this project | ✅ | improve conversion stability |

Due to the use of data perturbation, training takes longer than in comparable projects.

**USP: Unvoiced and Silence with Pitch during inference**
![vits_svc_usp](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/ba733b48-8a89-4612-83e0-a0745587d150)

## Quick Installation

```PowerShell
# clone project
git clone https://github.com/ouor/so-vits-svc-5.0

# create virtual environment
python -m venv .venv

# activate virtual environment
.venv\Scripts\activate

# install pytorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

# install dependencies
pip install -r requirements.txt

# run app.py
python app.py
```

## Setup Environment

1. Install [PyTorch](https://pytorch.org/get-started/locally/).

2. Install project dependencies
```shell
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
```
**Note: whisper is already built in; do not install it again or it will cause conflicts and errors**
3. Download the timbre encoder [Speaker-Encoder by @mueller91](https://drive.google.com/drive/folders/15oeBYf6Qn1edONkVLXe82MzdIi3O_9m3) and put `best_model.pth.tar` into `speaker_pretrain/`.

4. Download the whisper model [whisper-large-v2](https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt). Make sure to download `large-v2.pt` and put it into `whisper_pretrain/`.

5. Download the [hubert_soft model](https://github.com/bshall/hubert/releases/tag/v0.1) and put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`.

6. Download the pitch extractor [crepe full](https://github.com/maxrmorrison/torchcrepe/tree/master/torchcrepe/assets) and put `full.pth` into `crepe/assets/`.

7. Download the pretrained model [sovits5.0.pretrain.pth](https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/5.0/) and put it into `vits_pretrain/`.
```shell
python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
```

## Dataset preparation

Necessary pre-processing:
1. Separate voice and accompaniment with [UVR](https://github.com/Anjok07/ultimatevocalremovergui) (skip if there is no accompaniment).
2. Cut the audio into shorter clips with [slicer](https://github.com/flutydeer/audio-slicer); whisper takes input shorter than 30 seconds.
3. Manually check the generated clips; remove any shorter than 2 seconds or with obvious noise.
4. Adjust loudness if necessary (Adobe Audition is recommended).
5. Put the dataset into the `dataset_raw` directory following the structure below.
```
dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
```

## Data preprocessing
```shell
python svc_preprocessing.py -t 2
```
`-t`: thread count; it should not exceed the number of CPU cores, and 2 is usually enough.
After preprocessing you will get output with the following structure.
```
data_svc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── hubert
│   ├── speaker0
│   │   ├── 000001.vec.npy
│   │   └── 000xxx.vec.npy
│   └── speaker1
│       ├── 000001.vec.npy
│       └── 000xxx.vec.npy
├── whisper
│   ├── speaker0
│   │   ├── 000001.ppg.npy
│   │   └── 000xxx.ppg.npy
│   └── speaker1
│       ├── 000001.ppg.npy
│       └── 000xxx.ppg.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
```

1. Re-sampling
- Generate audio with a sampling rate of 16000 Hz in `./data_svc/waves-16k`
```
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
```

- Generate audio with a sampling rate of 32000 Hz in `./data_svc/waves-32k`
```
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
```
2. Use the 16k audio to extract pitch
```
python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
```
3. Use the 16k audio to extract PPG
```
python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
```
4. Use the 16k audio to extract hubert vectors
```
python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert
```
5. Use the 16k audio to extract the timbre code
```
python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
```
6. Extract the average timbre code for inference; it can also replace the per-utterance timbre when generating the training index, serving as the speaker's unified timbre for training
```
python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
```
7. Use the 32k audio to extract the linear spectrum
```
python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
```
8. Use the 32k audio to generate the training index
```
python prepare/preprocess_train.py
```
9. Training file debugging
```
python prepare/preprocess_zzz.py
```

## Train
1. If fine-tuning from the pre-trained model, download it first: [sovits5.0.pretrain.pth](https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/5.0). Put the pretrained model under the project root and change this line
```
pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
```
in `configs/base.yaml`, and adjust the learning rate appropriately, e.g. 5e-5.

`batch_size`: for a GPU with 6 GB VRAM, 6 is the recommended value; 8 will work but step speed will be much slower.
2. Start training
```
python svc_trainer.py -c configs/base.yaml -n sovits5.0
```
3. Resume training
```
python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/***.pth
```
4. Log visualization
```
tensorboard --logdir logs/
```

![sovits5 0_base](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/1628e775-5888-4eac-b173-a28dca978faa)

![sovits_spec](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/c4223cf3-b4a0-4325-bec0-6d46d195a1fc)

## Inference

1. Export the inference model: text encoder, flow network, decoder network
```
python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
```
2. Inference
- If there is no need to adjust `f0`, just run the following command.
```
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0
```
- If `f0` will be adjusted manually, follow these steps:
1. Use whisper to extract the content encoding, generating `test.ppg.npy`.
```
python whisper/inference.py -w test.wav -p test.ppg.npy
```
2. Use hubert to extract the content vector separately (rather than via one-click inference) to reduce GPU memory usage.
```
python hubert/inference.py -w test.wav -v test.vec.npy
```
3. Extract the F0 parameters to CSV text format, open the CSV file in Excel, and manually fix incorrect F0 values with the help of Audition or SonicVisualiser.
```
python pitch/inference.py -w test.wav -p test.csv
```
4. Final inference
```
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --ppg test.ppg.npy --vec test.vec.npy --pit test.csv --shift 0
```
3. Notes

- When `--ppg` is specified, repeated inference on the same audio avoids re-extracting the content encoding; if it is not specified, it is extracted automatically.

- When `--vec` is specified, repeated inference on the same audio avoids re-extracting the content vector; if it is not specified, it is extracted automatically.

- When `--pit` is specified, the manually tuned F0 parameters are loaded; if not specified, they are extracted automatically.

- The generated file appears in the current directory: `svc_out.wav`.

4. Argument reference

| args |--config | --model | --spk | --wave | --ppg | --vec | --pit | --shift |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| name | config path | model path | speaker | wave input | wave ppg | wave hubert | wave pitch | pitch shift |

## Create singer
Named by pure coincidence: average -> ave -> eva; Eve (eva) represents conception and reproduction.

```
python svc_eva.py
```

```python
eva_conf = {
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}
```

The generated singer file will be `eva.spk.npy`.

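The values in `eva_conf` are per-singer mixing coefficients; the result is (presumably) a weighted sum of the 256-dim speaker d-vectors, along the lines of this sketch (stand-in arrays instead of the real `./configs/singers/*.npy` files):

```python
import numpy as np

# Stand-ins for ./configs/singers/*.npy d-vectors (256-dim each)
singers = {"singer_a": np.ones(256), "singer_b": 3 * np.ones(256)}
weights = {"singer_a": 0.5, "singer_b": 0.5}

mix = sum(w * singers[name] for name, w in weights.items())
# np.save("eva.spk.npy", mix) would then produce the new singer file
assert np.allclose(mix, 2 * np.ones(256))
```

Weights that sum to 1 keep the mixed vector on the same scale as the originals, which matters because the speaker encoder's d-vectors are what condition the decoder.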
## Datasets

| Name | URL |
| :--- | :--- |
|KiSing |http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/|
|PopCS |https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md|
|opencpop |https://wenet.org.cn/opencpop/download/|
|Multi-Singer |https://github.com/Multi-Singer/Multi-Singer.github.io|
|M4Singer |https://github.com/M4Singer/M4Singer/blob/master/apply_form.md|
|CSD |https://zenodo.org/record/4785016#.YxqrTbaOMU4|
|KSS |https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset|
|JVS Music |https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music|
|PJS |https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus|
|JSUT Song |https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song|
|MUSDB18 |https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems|
|DSD100 |https://sigsep.github.io/datasets/dsd100.html|
|Aishell-3 |http://www.aishelltech.com/aishell_3|
|VCTK |https://datashare.ed.ac.uk/handle/10283/2651|

324
+ ## Code sources and references
325
+
326
+ https://github.com/facebookresearch/speech-resynthesis [paper](https://arxiv.org/abs/2104.00355)
327
+
328
+ https://github.com/jaywalnut310/vits [paper](https://arxiv.org/abs/2106.06103)
329
+
330
+ https://github.com/openai/whisper/ [paper](https://arxiv.org/abs/2212.04356)
331
+
332
+ https://github.com/NVIDIA/BigVGAN [paper](https://arxiv.org/abs/2206.04658)
333
+
334
+ https://github.com/mindslab-ai/univnet [paper](https://arxiv.org/abs/2106.07889)
335
+
336
+ https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
337
+
338
+ https://github.com/brentspell/hifi-gan-bwe
339
+
340
+ https://github.com/mozilla/TTS
341
+
342
+ https://github.com/bshall/soft-vc
343
+
344
+ https://github.com/maxrmorrison/torchcrepe
345
+
346
+ https://github.com/OlaWod/FreeVC [paper](https://arxiv.org/abs/2210.15418)
347
+
348
+ [SNAC : Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech](https://github.com/hcy71o/SNAC)
349
+
350
+ [Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers](https://arxiv.org/abs/2211.00585)
351
+
352
+ [AdaSpeech: Adaptive Text to Speech for Custom Voice](https://arxiv.org/pdf/2103.00993.pdf)
353
+
354
+ [Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis](https://github.com/ubisoft/ubisoft-laforge-daft-exprt)
355
+
356
+ [Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings](https://arxiv.org/abs/2305.05401)
357
+
358
+ [Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion](https://arxiv.org/pdf/2305.09167.pdf)
359
+
360
+ [Speaker normalization (GRL) for self-supervised speech emotion recognition](https://arxiv.org/abs/2202.01252)
361
+
362
+ ## Method of Preventing Timbre Leakage Based on Data Perturbation
363
+
364
+ https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py
365
+
366
+ https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py
367
+
368
+ https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py
369
+
370
+ https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py
371
+
372
+ https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py
373
+
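The linked repositories implement this idea in different ways (formant shifting, random parametric EQ, pitch perturbation). A minimal NumPy sketch of the shared principle — corrupting the spectral envelope with a random smooth gain curve while leaving content intact — might look like the following; the function name and parameters are illustrative, not taken from any of the repos above.

```python
import numpy as np

def random_spectral_perturb(wav, n_knots=8, max_db=6.0, seed=None):
    """Apply a random, smooth gain curve over frequency to perturb timbre.

    Toy stand-in for the praat/peq-style augmentations linked above: the
    content (phonetics, pitch contour) survives, but the spectral envelope
    that a speaker encoder keys on is randomized, discouraging timbre leakage.
    """
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(wav)
    # random gains at a few frequency knots, interpolated into a smooth curve
    knots = rng.uniform(-max_db, max_db, n_knots)
    gain_db = np.interp(np.linspace(0, 1, len(spec)),
                        np.linspace(0, 1, n_knots), knots)
    return np.fft.irfft(spec * 10.0 ** (gain_db / 20.0), n=len(wav))
```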
374
+ ## Contributors
375
+
376
+ <a href="https://github.com/PlayVoice/so-vits-svc/graphs/contributors">
377
+ <img src="https://contrib.rocks/image?repo=PlayVoice/so-vits-svc" />
378
+ </a>
379
+
380
+ ## Relevant Projects
381
+ - [LoRA-SVC](https://github.com/PlayVoice/lora-svc): decoder-only SVC
+ - [NSF-BigVGAN](https://github.com/PlayVoice/NSF-BigVGAN): vocoder for further work
README_V1.md ADDED
@@ -0,0 +1,102 @@
1
+ Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on CFM
2
+
3
+ This project targets deep-learning beginners; basic knowledge of Python and PyTorch is the only prerequisite. It implements a highly modular, mathematically rigorous Conditional Flow Matching (CFM) based Singing Voice Conversion (SVC) system using a pretrained codec and a learned projection (Option C*).
4
+
5
+ By replacing the VITS/VAE monoliths with a Diffusion Transformer (DiT) and an explicit codebook projector, we achieve stronger temporal dependency modeling and faster, more stable training without the overhead of learning an autoencoder from scratch.
6
+
7
+ ## Architecture Highlights
8
+ - **Frozen Pretrained Codec**: Uses a pretrained neural codec (e.g., DAC 44 kHz) purely for encoding and decoding, freezing its weights to save VRAM.
9
+ - **Offline Data Processing**: `z_target` latents are extracted once before training, preventing massive CPU/GPU bottlenecks in dataloaders.
10
+ - **Diffusion Transformer (DiT)**: Velocity field prediction $v_\theta$ uses a DiT instead of 1D U-Nets for state-of-the-art long-sequence audio modeling.
11
+ - **Dual-Loss Formulation with Implied Targets**: Avoids the mathematical trap of backpropagating through an ODE solver during training by computing the projector's commitment target directly from the implied target velocity.
12
+
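Concretely, the dual-loss idea can be sketched in a few lines of NumPy. Here `v_theta` and `project` are placeholders for the DiT and the projection network, not the actual APIs of this repo: the target velocity of the linear probability path is known in closed form, and a one-step extrapolation of the predicted velocity gives an implied endpoint for the projector's commitment loss, so no ODE solve appears in the training loop.

```python
import numpy as np

def dual_loss(v_theta, project, z1, cond, t, rng=None):
    """Flow-matching MSE plus projector commitment, with implied targets.

    z1: (B, T, D) target codec latents; t: scalar in (0, 1).
    v_theta and project are stand-ins for the DiT and projection network.
    """
    rng = np.random.default_rng(rng)
    x0 = rng.standard_normal(z1.shape)          # noise endpoint of the path
    xt = (1.0 - t) * x0 + t * z1                # linear interpolation path
    v_target = z1 - x0                          # implied velocity, closed form
    v_pred = v_theta(xt, cond, t)
    loss_fm = np.mean((v_pred - v_target) ** 2)
    # One-step implied endpoint: no ODE integration, no backprop through a solver
    z_implied = xt + (1.0 - t) * v_pred
    loss_commit = np.mean((project(z_implied) - z1) ** 2)
    return loss_fm, loss_commit
```

With a perfect velocity predictor both terms vanish, which is a useful sanity check for an implementation.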
13
+ ## Quick Installation
14
+
15
+ ```bash
16
+ # clone project
17
+ git clone https://github.com/ouor/so-vits-svc-5.0
18
+
19
+ # create virtual environment
20
+ python -m venv .venv
21
+
22
+ # activate virtual environment (Windows; on Linux/macOS: source .venv/bin/activate)
+ .venv\Scripts\activate
24
+
25
+ # install pytorch
26
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
27
+
28
+ # install dependencies
29
+ pip install -r requirements.txt
30
+ pip install descript-audio-codec
31
+
32
+ # launch the Gradio UI
33
+ python ui_cfm.py
34
+ ```
35
+
36
+ ## Setup Environment
37
+
38
+ - Download the Timbre Encoder: Speaker-Encoder by @mueller91, put `best_model.pth.tar` into `speaker_pretrain/`.
39
+ - Download whisper model whisper-large-v2. Make sure to download `large-v2.pt`, put it into `whisper_pretrain/`.
40
+ - Download hubert_soft model, put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`.
41
+ - Download pitch extractor crepe full, put `full.pth` into `crepe/assets`.
42
+
43
+ ## Dataset preparation
44
+
45
+ Necessary pre-processing:
46
+ 1. Separate voice and accompaniment with UVR (skip if no accompaniment).
47
+ 2. Cut audio input to shorter length with slicer (< 30s).
48
+ 3. Put the dataset into the `dataset_raw` directory following the structure below.
49
+
50
+ ```
51
+ dataset_raw
52
+ ├───speaker0
53
+ │ ├───000001.wav
54
+ │ └───000xxx.wav
55
+ └───speaker1
56
+ ├───000001.wav
57
+ └───000xxx.wav
58
+ ```
59
+
60
+ ## Data preprocessing (Offline Shift)
61
+
62
+ Unlike traditional VAE-based SVC, which performs encoding inside the dataloader, this pipeline pre-extracts both the conditioning and the quantized continuous vectors to save GPU resources.
63
+
64
+ 1. **Standard Extractors**: Extract PPG (Whisper), F0 (Crepe), and Speaker embeddings into their respective `data_svc/` folders:
65
+ ```bash
66
+ python svc_preprocessing.py -t 2
67
+ ```
68
+
69
+ 2. **Codec Targets Extraction**: Run the new offline generation script to pass all waveforms through the frozen codec and cache `z_target` tensors.
70
+ ```bash
71
+ python data/codec_targets.py -w ./data_svc/waves-32k -o ./data_svc/codec_targets
72
+ ```
73
+
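The caching step above amounts to one idempotent pass over the wave directory. A hedged sketch (the real script lives in `data/codec_targets.py`; `encode_fn` here is a placeholder for the frozen DAC encoder):

```python
import glob
import os

import numpy as np

def extract_codec_targets(wave_dir, out_dir, encode_fn):
    """Cache one z_target latent array per wav, mirroring the input tree.

    encode_fn: callable mapping a wav path -> (T, D) ndarray; in the real
    pipeline this wraps the frozen codec's encoder on the GPU.
    """
    for wav in sorted(glob.glob(os.path.join(wave_dir, "**", "*.wav"), recursive=True)):
        rel = os.path.relpath(wav, wave_dir)
        out_path = os.path.join(out_dir, os.path.splitext(rel)[0] + ".npy")
        if os.path.isfile(out_path):
            continue                      # already cached: reruns are cheap
        os.makedirs(os.path.dirname(out_path), exist_ok=True)
        np.save(out_path, encode_fn(wav))
```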
74
+ ## Train
75
+
76
+ You will jointly train the DiT velocity network $v_\theta$ and the lightweight projection network $P(u)$. The heavy codec encoder/decoder remains entirely offline.
77
+
78
+ ```bash
79
+ # Start Training
80
+ python train_cfm.py
81
+ ```
82
+ *The training script uses the dual-loss schema (flow-matching MSE + projector-commitment MSE) with the implicit velocity targets rather than integrating an ODE. Checkpoints are saved automatically to the `chkpt/` folder.*
83
+
84
+ ## Inference
85
+
86
+ The inference pipeline extracts the conditioning, samples the continuous latent with your preferred ODE solver (Euler, Heun, RK4), snaps the sample back to codebook space with the projector, and finally decodes the waveform via the DAC codec. **Long audio inputs are automatically chunked into 30 s segments to avoid VRAM overflow.**
87
+
88
+ ```bash
89
+ # Run Inference
90
+ python infer.py --wave /path/to/your/input.wav
91
+ ```
92
+
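The 30 s chunking mentioned above reduces to a split-process-concatenate loop. A minimal sketch — the real pipeline may also overlap and crossfade chunk boundaries, and `process` stands in for the full conversion call:

```python
import numpy as np

def chunk_and_stitch(wav, sr, process, max_sec=30.0):
    """Run `process` on fixed-length chunks so peak VRAM stays bounded."""
    hop = int(max_sec * sr)
    pieces = [process(wav[i:i + hop]) for i in range(0, len(wav), hop)]
    return np.concatenate(pieces)
```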
93
+ ### Notes on Inference Pipeline Components:
94
+ - The **ODE Solver** (`samplers/ode.py`) is modular. You can configure solver steps and methods (`solver='rk4'`) based on your quality-vs-speed needs.
95
+ - **Temporal Resampling** is handled automatically in `models/cond_encoder.py`, perfectly matching Whisper and Crepe conditionings to the target codec's continuous latent frame sequence length.
96
+
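For reference, the Euler and Heun updates the solver exposes boil down to the following — a self-contained NumPy sketch of the role `samplers/ode.py` plays, not its actual code:

```python
import numpy as np

def sample_ode(v_theta, cond, x0, steps=32, method="euler"):
    """Integrate dx/dt = v_theta(x, cond, t) from t=0 to t=1.

    Euler costs one network call per step; Heun (2nd order) costs two but
    usually reaches comparable quality with far fewer steps.
    """
    x = np.array(x0, dtype=float)
    ts = np.linspace(0.0, 1.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        dt = t1 - t0
        k1 = v_theta(x, cond, t0)
        if method == "euler":
            x = x + dt * k1
        else:  # heun
            k2 = v_theta(x + dt * k1, cond, t1)
            x = x + dt * 0.5 * (k1 + k2)
    return x
```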
97
+ ## Code sources and references
98
+
99
+ - Rectified Flow / Flow Matching literature
100
+ - Diffusion Transformers (DiT) based on [Peebles & Xie, 2022]
101
+ - Neural Audio Codecs (DAC / EnCodec)
102
+ - so-vits-svc-5.0 original repository components extracted for preprocessing
app.py ADDED
@@ -0,0 +1,356 @@
1
+ import os
2
+ import subprocess
3
+ import yaml
4
+ import sys
5
+ import webbrowser
6
+ import gradio as gr
7
+ import shutil
8
+ import soundfile
9
+ import shlex
10
+
11
+ class WebUI:
12
+ def __init__(self):
13
+ self.train_config_path = 'configs/train.yaml'
14
+ self.info = Info()
15
+ self.names = []
16
+ self.names2 = []
17
+ self.voice_names = []
18
+ base_config_path = 'configs/base.yaml'
19
+ if not os.path.exists(self.train_config_path):
20
+ shutil.copyfile(base_config_path, self.train_config_path)
21
+ print("초기화 성공")
22
+ else:
23
+ print("준비됨")
24
+ self.main_ui()
25
+
26
+ def main_ui(self):
27
+ with gr.Blocks(theme=gr.themes.Base(primary_hue=gr.themes.colors.green)) as ui:
28
+ gr.Markdown('# so-vits-svc5.0 WebUI')
29
+
30
+ with gr.Tab("학습"):
31
+ with gr.Accordion('학습 안내', open=False):
32
+ gr.Markdown(self.info.train)
33
+
34
+ gr.Markdown('### 데이터셋 파일 복사')
35
+ with gr.Row():
36
+ self.dataset_name = gr.Textbox(value='', placeholder='chopin', label='데이터셋 이름', info='데이터셋 화자의 이름을 입력하세요.', interactive=True)
37
+ self.dataset_src = gr.Textbox(value='', placeholder='C:/Users/Tacotron2/Downloads/chopin_dataset/', label='데이터셋 폴더', info='데이터셋 wav 파일이 있는 폴더를 지정하세요.', interactive=True)
38
+ self.bt_dataset_copy = gr.Button(value='복사', variant="primary")
39
+
40
+ gr.Markdown('### 전처리 파라미터 설정')
41
+ with gr.Row():
42
+ self.model_name = gr.Textbox(value='sovits5.0', label='model', info='모델명', interactive=True)
43
+ self.f0_extractor = gr.Dropdown(choices=['crepe'], value='crepe', label='f0_extractor', info='F0 추출기', interactive=True)
44
+ self.thread_count = gr.Slider(minimum=1, maximum=os.cpu_count(), step=1, value=2, label='thread_count', info='전처리 스레드 수', interactive=True)
45
+
46
+ gr.Markdown('### 학습 파라미터 설정')
47
+ with gr.Row():
48
+ self.learning_rate = gr.Number(value=5e-5, label='learning_rate', info='학습률', interactive=True)
49
+ self.batch_size = gr.Slider(minimum=1, maximum=50, step=1, value=6, label='batch_size', info='배치 크기', interactive=True)
50
+ self.epochs = gr.Textbox(value='100', label='epoch', info='학습 에포크 수', interactive=True)
51
+ with gr.Row():
52
+ self.info_interval = gr.Number(value=50, label='info_interval', info='학습 로깅 간격(step)', interactive=True)
+ self.eval_interval = gr.Number(value=1, label='eval_interval', info='검증 세트 간격(epoch)', interactive=True)
+ self.save_interval = gr.Number(value=5, label='save_interval', info='체크포인트 저장 간격(epoch)', interactive=True)
55
+ self.keep_ckpts = gr.Number(value=5, label='keep_ckpts', info='최신 체크포인트 파일 유지 갯수(0은 모두 저장)',interactive=True)
56
+ with gr.Row():
57
+ self.use_pretrained = gr.Checkbox(label="use_pretrained", info='사전학습모델 사용 여부', value=True, interactive=True, visible=False)
58
+
59
+ gr.Markdown('### 학습 시작')
60
+ with gr.Row():
61
+ self.bt_open_dataset_folder = gr.Button(value='데이터 세트 폴더 열기')
62
+ self.bt_onekey_train = gr.Button('원클릭 학습 시작', variant="primary")
63
+ self.bt_tb = gr.Button('Tensorboard 열기', variant="primary")
64
+
65
+ gr.Markdown('### 학습 재개')
66
+ with gr.Row():
67
+ self.resume_model = gr.Dropdown(choices=sorted(self.names), label='Resume training progress from checkpoints', info='체크포인트에서 학습 진행 재개', interactive=True)
68
+ with gr.Column():
69
+ self.bt_refersh = gr.Button('새로 고침')
70
+ self.bt_resume_train = gr.Button('학습 재개', variant="primary")
71
+
72
+ with gr.Tab("추론"):
73
+
74
+ with gr.Accordion('추론 안내', open=False):
75
+ gr.Markdown(self.info.inference)
76
+
77
+ gr.Markdown('### 추론 파라미터 설정')
78
+ with gr.Row():
79
+ with gr.Column():
80
+ self.keychange = gr.Slider(-12, 12, value=0, step=1, label='음높이 조절')
81
+ self.file_list = gr.Markdown(value="", label="파일 목록")
82
+
83
+ with gr.Row():
84
+ self.resume_model2 = gr.Dropdown(choices=sorted(self.names2), label='Select the model you want to export',
85
+ info='내보낼 모델 선택', interactive=True)
86
+ with gr.Column():
87
+ self.bt_refersh2 = gr.Button(value='모델 및 사운드 새로 고침')
88
+ self.bt_out_model = gr.Button(value='모델 내보내기', variant="primary")
89
+ with gr.Row():
90
+ self.resume_voice = gr.Dropdown(choices=sorted(self.voice_names), label='Select the sound file',
91
+ info='*.spk.npy 파일 선택', interactive=True)
92
+ with gr.Row():
93
+ self.input_wav = gr.Audio(type='filepath', label='변환할 오디오 선택', source='upload')
94
+ with gr.Row():
95
+ self.bt_infer = gr.Button(value='변환 시작', variant="primary")
96
+ with gr.Row():
97
+ self.output_wav = gr.Audio(label='출력 오디오', interactive=False)
98
+
99
+ self.bt_dataset_copy.click(fn=self.copydataset, inputs=[self.dataset_name, self.dataset_src])
100
+ self.bt_open_dataset_folder.click(fn=self.openfolder)
101
+ self.bt_onekey_train.click(fn=self.onekey_training,inputs=[self.model_name, self.thread_count,self.learning_rate,self.batch_size, self.epochs, self.info_interval, self.eval_interval,self.save_interval, self.keep_ckpts, self.use_pretrained])
102
+ self.bt_out_model.click(fn=self.out_model, inputs=[self.model_name, self.resume_model2])
103
+ self.bt_tb.click(fn=self.tensorboard)
104
+ self.bt_refersh.click(fn=self.refresh_model, inputs=[self.model_name], outputs=[self.resume_model])
105
+ self.bt_resume_train.click(fn=self.resume_train, inputs=[self.model_name, self.resume_model, self.epochs])
106
+ self.bt_infer.click(fn=self.inference, inputs=[self.input_wav, self.resume_voice, self.keychange], outputs=[self.output_wav])
107
+ self.bt_refersh2.click(fn=self.refresh_model_and_voice, inputs=[self.model_name],outputs=[self.resume_model2, self.resume_voice])
108
+
109
+ ui.launch(inbrowser=True)
110
+
111
+ def copydataset(self, dataset_name, dataset_src):
112
+ assert dataset_name != '', '데이터셋 이름을 입력하세요'
113
+ assert dataset_src != '', '데이터셋 경로를 입력하세요'
114
+ assert os.path.isdir(dataset_src), '데이터셋 경로가 잘못되었습니다'
115
+ from glob import glob
116
+ wav_files = glob(os.path.join(dataset_src, '*.wav'))
117
+ assert len(wav_files) > 0, '데이터셋 경로에 wav 파일이 없습니다'
118
+
119
+ import shutil
120
+ dst_dir = os.path.join('dataset_raw', dataset_name)
121
+ if not os.path.exists(dst_dir): os.makedirs(dst_dir, exist_ok=True)
122
+ for wav_file in wav_files:
123
+ shutil.copy(wav_file, dst_dir)
124
+ print('데이터셋 복사 완료')
125
+
126
+ def openfolder(self):
127
+ if not os.path.exists('dataset_raw'): os.makedirs('dataset_raw', exist_ok=True)
128
+ try:
129
+ if sys.platform.startswith('win'):
130
+ os.startfile('dataset_raw')
131
+ elif sys.platform.startswith('linux'):
132
+ subprocess.call(['xdg-open', 'dataset_raw'])
133
+ elif sys.platform.startswith('darwin'):
134
+ subprocess.call(['open', 'dataset_raw'])
135
+ else:
136
+ print('폴더를 열지 못했습니다!')
137
+ except BaseException:
138
+ print('폴더를 열지 못했습니다!')
139
+
140
+ def preprocessing(self, thread_count):
141
+ print('전처리 시작')
142
+ train_process = subprocess.Popen(f'{sys.executable} -u svc_preprocessing.py -t {str(thread_count)}', stdout=subprocess.PIPE)
143
+ while train_process.poll() is None:
144
+ output = train_process.stdout.readline().decode('utf-8')
145
+ print(output, end='')
146
+
147
+ def create_config(self, model_name, learning_rate, batch_size, epochs, info_interval, eval_interval, save_interval,
148
+ keep_ckpts, use_pretrained):
149
+ with open("configs/train.yaml", "r") as f:
150
+ config = yaml.load(f, Loader=yaml.FullLoader)
151
+ config['train']['model'] = model_name
152
+ config['train']['learning_rate'] = learning_rate
153
+ config['train']['batch_size'] = batch_size
154
+ config['train']['epochs'] = int(epochs)
155
+ config["log"]["info_interval"] = int(info_interval)
156
+ config["log"]["eval_interval"] = int(eval_interval)
157
+ config["log"]["save_interval"] = int(save_interval)
158
+ config["log"]["keep_ckpts"] = int(keep_ckpts)
159
+ if use_pretrained:
160
+ config["train"]["pretrain"] = "vits_pretrain/sovits5.0.pretrain.pth"
161
+ else:
162
+ config["train"]["pretrain"] = ""
163
+ with open("configs/train.yaml", "w") as f:
164
+ yaml.dump(config, f)
165
+ return f"로그 파라미터를 다음으로 업데이트했습니다.{config['log']}"
166
+
167
+ def training(self, model_name):
168
+ print('학습 시작')
169
+ print('학습을 수행하는 새로운 콘솔 창이 열립니다.')
170
+ print('학습 도중 학습을 중지하려면, 콘솔 창을 닫으세요.')
171
+ train_process = subprocess.Popen(f'{sys.executable} -u svc_trainer.py -c {self.train_config_path} -n {str(model_name)}', stdout=subprocess.PIPE, creationflags=subprocess.CREATE_NEW_CONSOLE)
172
+ while train_process.poll() is None:
173
+ output = train_process.stdout.readline().decode('utf-8')
174
+ print(output, end='')
175
+
176
+ def onekey_training(self, model_name, thread_count, learning_rate, batch_size, epochs, info_interval, eval_interval, save_interval, keep_ckpts, use_pretrained):
177
+ print(model_name, thread_count, learning_rate, batch_size, epochs, info_interval, eval_interval, save_interval, keep_ckpts)
178
+ self.create_config(model_name, learning_rate, batch_size, epochs, info_interval, eval_interval, save_interval, keep_ckpts, use_pretrained)
179
+ self.preprocessing(thread_count)
180
+ self.training(model_name)
181
+
182
+ def out_model(self, model_name, resume_model2):
183
+ print('모델 내보내기 시작')
184
+ try:
185
+ subprocess.Popen(f'{sys.executable} -u svc_export.py -c {self.train_config_path} -p "chkpt/{model_name}/{resume_model2}"',stdout=subprocess.PIPE)
186
+ print('모델 내보내기 성공')
187
+ except Exception as e:
188
+ print("에러 발생함:", e)
189
+
190
+
191
+ def tensorboard(self):
192
+ tensorboard_path = os.path.join(os.path.dirname(sys.executable), 'Scripts', 'tensorboard.exe')
193
+ print(tensorboard_path)
194
+ tb_process = subprocess.Popen(f'{tensorboard_path} --logdir=logs --port=6006', stdout=subprocess.PIPE, creationflags=subprocess.CREATE_NEW_CONSOLE)
195
+ webbrowser.open("http://localhost:6006")
196
+
197
+ while tb_process.poll() is None:
198
+ output = tb_process.stdout.readline().decode('utf-8')
199
+ print(output)
200
+
201
+ def refresh_model(self, model_name):
202
+ self.script_dir = os.path.dirname(os.path.abspath(__file__))
203
+ self.model_root = os.path.join(self.script_dir, f"chkpt/{model_name}")
204
+ self.names = []
205
+ try:
206
+ for self.name in os.listdir(self.model_root):
207
+ if self.name.endswith(".pt"):
208
+ self.names.append(self.name)
209
+ return {"choices": sorted(self.names), "__type__": "update"}
210
+ except FileNotFoundError:
211
+ return {"label": "모델 파일 누락", "__type__": "update"}
212
+
213
+ def refresh_model2(self, model_name):
214
+ self.script_dir = os.path.dirname(os.path.abspath(__file__))
215
+ self.model_root = os.path.join(self.script_dir, f"chkpt/{model_name}")
216
+ self.names2 = []
217
+ try:
218
+ for self.name in os.listdir(self.model_root):
219
+ if self.name.endswith(".pt"):
220
+ self.names2.append(self.name)
221
+ return {"choices": sorted(self.names2), "__type__": "update"}
222
+ except FileNotFoundError as e:
223
+ return {"label": "모델 파일 누락", "__type__": "update"}
224
+
225
+ def refresh_voice(self):
226
+ self.script_dir = os.path.dirname(os.path.abspath(__file__))
227
+ self.model_root = os.path.join(self.script_dir, "data_svc/singer")
228
+ self.voice_names = []
229
+ for self.name in os.listdir(self.model_root):
230
+ if self.name.endswith(".npy"):
231
+ self.voice_names.append(self.name)
232
+ return {"choices": sorted(self.voice_names), "__type__": "update"}
233
+
234
+ def refresh_model_and_voice(self, model_name):
235
+ model_update = self.refresh_model2(model_name)
236
+ voice_update = self.refresh_voice()
237
+ return model_update, voice_update
238
+
239
+ def resume_train(self, model_name, resume_model, epochs):
240
+ print('학습 재개')
241
+ with open("configs/train.yaml", "r") as f:
242
+ config = yaml.load(f, Loader=yaml.FullLoader)
243
+ config['train']['epochs'] = int(epochs)
244
+ with open("configs/train.yaml", "w") as f:
245
+ yaml.dump(config, f)
246
+ train_process = subprocess.Popen(f'{sys.executable} -u svc_trainer.py -c {self.train_config_path} -n {model_name} -p "chkpt/{model_name}/{resume_model}"', stdout=subprocess.PIPE, creationflags=subprocess.CREATE_NEW_CONSOLE)
247
+ while train_process.poll() is None:
248
+ output = train_process.stdout.readline().decode('utf-8')
249
+ print(output, end='')
250
+
251
+ def inference(self, input, resume_voice, keychange):
252
+ if os.path.isfile('test.wav'): os.remove('test.wav')
253
+ self.train_config_path = 'configs/train.yaml'
254
+ print('추론 시작')
255
+ shutil.copy(input, ".")
+ input_name = os.path.basename(input)
+ # convert non-wav uploads to wav before renaming, so the extension check actually runs
+ if not input_name.endswith(".wav"):
+ data, samplerate = soundfile.read(input_name)
+ soundfile.write("test.wav", data, samplerate)
+ os.remove(input_name)
+ else:
+ os.rename(input_name, "test.wav")
263
+ train_config_path = shlex.quote(self.train_config_path)
264
+ keychange = shlex.quote(str(keychange))
265
+ cmd = [f'{sys.executable}', "-u", "svc_inference.py", "--config", train_config_path, "--model", "sovits5.0.pth", "--spk",
266
+ f"data_svc/singer/{resume_voice}", "--wave", "test.wav", "--shift", keychange, '--clean']
267
+ train_process = subprocess.run(cmd, shell=False, capture_output=True, text=True)
268
+ print(train_process.stdout)
269
+ print(train_process.stderr)
270
+ print("추론 성공")
271
+ return "svc_out.wav"
272
+
273
+
274
+ class Info:
275
+ def __init__(self) -> None:
276
+ self.train = '''
277
+ ### 2023.7.11\n
278
+ @OOPPEENN(https://github.com/OOPPEENN)第一次编写\n
279
+ @thestmitsuk(https://github.com/thestmitsuki)二次补完\n
280
+ @OOPPEENN(https://github.com/OOPPEENN)is written for the first time\n
281
+ @thestmitsuki(https://github.com/thestmitsuki)Secondary completion
282
+
283
+ '''
284
+ self.inference = '''
285
+ ### 2023.7.11\n
286
+ @OOPPEENN(https://github.com/OOPPEENN)第一次编写\n
287
+ @thestmitsuk(https://github.com/thestmitsuki)二次补完\n
288
+ @OOPPEENN(https://github.com/OOPPEENN)is written for the first time\n
289
+ @thestmitsuki(https://github.com/thestmitsuki)Secondary completion
290
+
291
+ '''
292
+
293
+ def check_pretrained():
294
+ links = {
295
+ 'hubert_pretrain/hubert-soft-0d54a1f4.pt': 'https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt',
296
+ 'speaker_pretrain/best_model.pth.tar': 'https://drive.google.com/uc?id=1UPjQ2LVSIt3o-9QMKMJcdzT8aZRZCI-E',
297
+ 'speaker_pretrain/config.json': 'https://raw.githubusercontent.com/PlayVoice/so-vits-svc-5.0/9d415f9d7c7c7a131b89ec6ff633be10739f41ed/speaker_pretrain/config.json',
298
+ 'whisper_pretrain/large-v2.pt': 'https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt',
299
+ 'crepe/assets/full.pth': 'https://github.com/maxrmorrison/torchcrepe/raw/master/torchcrepe/assets/full.pth',
300
+ 'vits_pretrain/sovits5.0.pretrain.pth': 'https://github.com/PlayVoice/so-vits-svc-5.0/releases/download/5.0/sovits5.0.pretrain.pth',
301
+ }
302
+
303
+ links_to_download = {}
304
+ for path, link in links.items():
305
+ if not os.path.isfile(path):
306
+ links_to_download[path] = link
307
+
308
+ if len(links_to_download) == 0:
309
+ print("사전 학습 모델이 모두 존재합니다.")
310
+ return
311
+
312
+ import gdown
313
+ import requests
314
+
315
+ def download(url, path):
316
+ r = requests.get(url, allow_redirects=True)
317
+ open(path, 'wb').write(r.content)
318
+
319
+ for path, url in links_to_download.items():
320
+ if not os.path.exists(os.path.dirname(path)):
321
+ os.makedirs(os.path.dirname(path))
322
+ print(f"사전 학습 모델 {path} 다운로드 중...")
323
+ if "drive.google.com" in url:
324
+ gdown.download(url, path, quiet=False)
325
+ else:
326
+ download(url, path)
327
+ print(f"사전 학습 모델 {path} 다운로드 완료")
328
+
329
+ print("모든 사전 학습 모델이 다운로드 되었습니다.")
330
+ return
331
+
332
+ def check_transformers():
333
+ try:
334
+ import transformers
335
+ del transformers
336
+ except ImportError:
337
+ print("transformers 라이브러리를 설치합니다.")
338
+ os.system(f"{sys.executable} -m pip install transformers")
339
+ print("transformers 라이브러리 설치 완료")
340
+ return
341
+
342
+ def check_tensorboard():
343
+ try:
344
+ import tensorboard
345
+ del tensorboard
346
+ except ImportError:
347
+ print("tensorboard 라이브러리를 설치합니다.")
348
+ os.system(f"{sys.executable} -m pip install tensorboard")
349
+ print("tensorboard 라이브러리 설치 완료")
350
+ return
351
+
352
+ if __name__ == "__main__":
353
+ check_pretrained()
354
+ check_transformers()
355
+ check_tensorboard()
356
+ webui = WebUI()
automated_pipeline.sh ADDED
@@ -0,0 +1,97 @@
1
+ #!/bin/bash
2
+ #SBATCH --job-name=cfm_full_pipeline
3
+ #SBATCH --partition=a100
4
+ #SBATCH --gres=gpu:1
5
+ #SBATCH --cpus-per-task=8
6
+ #SBATCH --mem=64G
7
+ #SBATCH --time=120:00:00
8
+ #SBATCH --output=logs/pipeline_%j.out
9
+ #SBATCH --error=logs/pipeline_%j.err
10
+ #SBATCH --mail-type=ALL
11
+ #SBATCH --mail-user=hl3025@imperial.ac.uk
12
+
13
+ set -e # Exit on any error
14
+
15
+ # Navigate to project directory
16
+ cd /vol/bitbucket/hl3025/cfm_svc
17
+
18
+ # Activate environment
19
+ source .venv_linux/bin/activate
20
+
21
+ # Export environment variables
22
+ export PIP_CACHE_DIR=/vol/bitbucket/hl3025/pip_cache
23
+ export TMPDIR=/vol/bitbucket/hl3025/tmp
24
+
25
+ # Prevent BLAS/OpenMP from spawning too many threads
26
+ export OMP_NUM_THREADS=1
27
+ export OPENBLAS_NUM_THREADS=1
28
+ export MKL_NUM_THREADS=1
29
+ export VECLIB_MAXIMUM_THREADS=1
30
+ export NUMEXPR_NUM_THREADS=1
31
+
32
+ # Force Python output to be unbuffered so logs stream instantly
33
+ export PYTHONUNBUFFERED=1
34
+
35
+ # Create logs directory if it doesn't exist
36
+ mkdir -p logs
37
+
38
+ echo "======================================"
39
+ echo "Starting CFM SVC Automated Pipeline"
40
+ echo "======================================"
41
+ echo "Start time: $(date)"
42
+
43
+ # ============================================================================
44
+ # STAGE 1: Data Preprocessing
45
+ # ============================================================================
46
+ echo ""
47
+ echo "STAGE 1: Data Preprocessing with 8 threads..."
48
+ echo "Time: $(date)"
49
+ python svc_preprocessing.py -t 8
50
+
51
+ # ============================================================================
52
+ # STAGE 2: Codec Targets Generation
53
+ # ============================================================================
54
+ echo ""
55
+ echo "STAGE 2: Generating Codec Targets..."
56
+ echo "Time: $(date)"
57
+ python data/codec_targets.py -w ./data_svc/waves-32k -o ./data_svc/codec_targets
58
+
59
+ # ============================================================================
60
+ # STAGE 3: Teacher Model Distillation (Offline)
61
+ # ============================================================================
62
+ echo ""
63
+ echo "STAGE 3: Offline Teacher Distillation..."
64
+ echo "Time: $(date)"
65
+ python preprocess_teacher.py \
66
+ --teacher_ckpt vits_pretrain/sovits5.0.pretrain.pth \
67
+ --teacher_config configs/base.yaml \
68
+ --codec_target_dir ./data_svc/codec_targets \
69
+ --data_root ./data_svc \
70
+ --out_dir ./data_svc/teacher_codec_targets \
71
+ --log_interval 200
72
+
73
+ # ============================================================================
74
+ # STAGE 4: CFM Training
75
+ # ============================================================================
76
+ echo ""
77
+ echo "STAGE 4: CFM Training with Teacher Distillation..."
78
+ echo "Time: $(date)"
79
+ python train_cfm.py \
80
+ --data_dir ./data_svc/codec_targets \
81
+ --teacher_target_dir ./data_svc/teacher_codec_targets \
82
+ --lambda_teacher 0 \
83
+ --batch_size 16 \
84
+ --lr 1e-4 \
85
+ --num_workers 4 \
86
+ --epochs 200 \
87
+ --log_interval 50 \
88
+ --save_interval 10
89
+
90
+ # ============================================================================
91
+ # Pipeline Complete
92
+ # ============================================================================
93
+ echo ""
94
+ echo "======================================"
95
+ echo "CFM SVC Automated Pipeline Complete!"
96
+ echo "======================================"
97
+ echo "End time: $(date)"
build_faiss_index.py ADDED
@@ -0,0 +1,51 @@
1
+ import argparse
2
+ import glob
3
+ import os
4
+ import faiss
5
+ import numpy as np
6
+ from tqdm import tqdm
7
+
8
+ def build_index(speaker_dir, output_path):
9
+ print(f"Finding HuBERT features in {speaker_dir}...")
10
+ vec_files = glob.glob(os.path.join(speaker_dir, "*.vec.npy"))
11
+
12
+ if not vec_files:
13
+ print(f"No .vec.npy files found in {speaker_dir}!")
14
+ return
15
+
16
+ print(f"Found {len(vec_files)} files. Loading vectors...")
17
+
18
+ all_vectors = []
19
+ for f in tqdm(vec_files):
20
+ vec = np.load(f) # (T, 256)
21
+ all_vectors.append(vec)
22
+
23
+ all_vectors = np.concatenate(all_vectors, axis=0).astype(np.float32)
24
+ print(f"Total frames: {all_vectors.shape[0]}, Feature dimension: {all_vectors.shape[1]}")
25
+
26
+ # Initialize FAISS index
27
+ # We use IndexFlatL2 for exact nearest neighbor search based on L2 distance.
28
+ index = faiss.IndexFlatL2(all_vectors.shape[1])
29
+
30
+ print("Adding vectors to FAISS index...")
31
+ index.add(all_vectors)
32
+
33
+ os.makedirs(os.path.dirname(output_path), exist_ok=True)
34
+
35
+ print(f"Saving index to {output_path}...")
36
+ faiss.write_index(index, output_path)
37
+
38
+ # Save the original vectors as well so we can retrieve them and average them
39
+ vectors_path = output_path.replace(".index", "_vectors.npy")
40
+ print(f"Saving source vectors to {vectors_path}...")
41
+ np.save(vectors_path, all_vectors)
42
+
43
+ print("Done!")
44
+
45
+ if __name__ == "__main__":
46
+ parser = argparse.ArgumentParser()
47
+ parser.add_argument("--speaker_dir", type=str, required=True, help="Path to speaker's HuBERT directory (e.g. data_svc/hubert/singer_0005)")
48
+ parser.add_argument("--output_path", type=str, required=True, help="Where to save the .index file (e.g. data_svc/hubert/singer_0005/feature.index)")
49
+ args = parser.parse_args()
50
+
51
+ build_index(args.speaker_dir, args.output_path)
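At inference time the index is queried per frame and the retrieved neighbours are blended back into the content feature. A brute-force NumPy equivalent of what the `IndexFlatL2` lookup computes (the `k` and blend ratio are illustrative defaults, not values from this repo):

```python
import numpy as np

def retrieve_blend(bank, query, k=4, ratio=0.5):
    """Replace each query frame with the mean of its k nearest bank frames,
    blended with the original feature to trade accent fidelity for timbre.

    bank: (N, D) vectors saved next to the index; query: (T, D) HuBERT frames.
    """
    d2 = ((query[:, None, :] - bank[None, :, :]) ** 2).sum(axis=-1)  # (T, N)
    idx = np.argsort(d2, axis=1)[:, :k]
    retrieved = bank[idx].mean(axis=1)
    return ratio * retrieved + (1.0 - ratio) * query
```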
build_mmap.py ADDED
@@ -0,0 +1,85 @@
1
+ import os
2
+ import glob
3
+ import numpy as np
4
+ from tqdm import tqdm
5
+ import argparse
6
+
7
+ def build_mmap(data_dir, feature_name, output_prefix):
8
+ """
9
+ Combines all .npy files for a feature (e.g. hubert, ppg) into a single large
10
+ memory-mapped array alongside an index file for fast O(1) lookups.
11
+ """
12
+ print(f"Finding {feature_name} features in {data_dir}...")
13
+ files = glob.glob(os.path.join(data_dir, "**", "*.npy"), recursive=True)
14
+
15
+ if not files:
16
+ print(f"No {feature_name} files found!")
17
+ return
18
+
19
+ # We don't want to load them all into RAM at once, so we do two passes.
20
+ # First pass: find total frames and index mapping.
21
+ print("Pass 1: calculating total frames and indexing...")
22
+ total_frames = 0
23
+ dim = None
24
+ dtype = None
25
+
26
+ index_map = {} # { filename: (start_idx, length) }
27
+
28
+ valid_files = []
29
+
30
+ for f in tqdm(files):
31
+ # We can memory map just to get shape/dtype quickly
32
+ try:
33
+ arr = np.load(f, mmap_mode='r')
34
+ if dim is None:
35
+ dim = arr.shape[1] if len(arr.shape) > 1 else 1
36
+ dtype = arr.dtype
37
+
38
+ length = arr.shape[0]
39
+
40
+ # Use relative path as key
41
+ rel_path = os.path.relpath(f, start=data_dir)
42
+
43
+ index_map[rel_path] = (total_frames, length)
44
+ total_frames += length
45
+ valid_files.append((f, rel_path, length))
46
+
47
+ except Exception as e:
+ print(f"Skipping unreadable file {f}: {e}")
49
+
50
+ print(f"Total valid files: {len(valid_files)}")
51
+ print(f"Total frames: {total_frames}, Feature dim: {dim}")
52
+
53
+ # Second pass: allocate mmap and write
54
+ mmap_path = f"{output_prefix}.npy"
55
+ index_path = f"{output_prefix}_index.npy"
56
+
57
+ shape = (total_frames, dim) if dim > 1 else (total_frames,)
58
+ print(f"Allocating mmap at {mmap_path} with shape {shape}...")
59
+
60
+ mmap_arr = np.lib.format.open_memmap(mmap_path, mode='w+', dtype=dtype, shape=shape)
61
+
62
+ print("Pass 2: writing data to mmap...")
63
+ for f, rel_path, length in tqdm(valid_files):
64
+ start_idx, _ = index_map[rel_path]
65
+
66
+ arr = np.load(f)
67
+ if dim == 1:
68
+ mmap_arr[start_idx : start_idx + length] = arr
69
+ else:
70
+ mmap_arr[start_idx : start_idx + length, :] = arr
71
+
72
+ mmap_arr.flush()
73
+
74
+ print(f"Saving index map to {index_path}...")
75
+ np.save(index_path, index_map)
76
+ print("Done!")
77
+
78
+ if __name__ == "__main__":
79
+ parser = argparse.ArgumentParser()
80
+ parser.add_argument("--data_dir", type=str, required=True, help="Base dir (e.g. data_svc/hubert)")
81
+ parser.add_argument("--feature", type=str, required=True, help="Feature name for printing")
82
+ parser.add_argument("--out_prefix", type=str, required=True, help="Output path prefix (e.g. data_svc/hubert_mmap)")
83
+ args = parser.parse_args()
84
+
85
+ build_mmap(args.data_dir, args.feature, args.out_prefix)
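Once build_mmap.py has run, downstream code only needs the index map and a read-only memory map to pull out any file's span of frames. A minimal consumer-side sketch, using a throwaway two-file layout in a temp directory in place of real features (the file names and sizes here are illustrative, but the `{prefix}.npy` / `{prefix}_index.npy` layout matches what the script writes):

```python
# Consumer-side sketch (toy data): read one file's frames back out of the
# consolidated mmap pair without loading the whole array into RAM.
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()
mmap_path = os.path.join(tmp, "hubert_mmap.npy")
index_path = os.path.join(tmp, "hubert_mmap_index.npy")

# Stand-in for build_mmap's output: two files of 3 and 2 frames, dim 4.
a = np.arange(12, dtype=np.float32).reshape(3, 4)
b = np.arange(8, dtype=np.float32).reshape(2, 4)
mm = np.lib.format.open_memmap(mmap_path, mode="w+", dtype=np.float32, shape=(5, 4))
mm[0:3] = a
mm[3:5] = b
mm.flush()
np.save(index_path, {"a.npy": (0, 3), "b.npy": (3, 2)})

# Reader side: the index map is a pickled dict, so allow_pickle + .item().
index_map = np.load(index_path, allow_pickle=True).item()
feats = np.load(mmap_path, mmap_mode="r")  # a lazy view, not a bulk read
start, length = index_map["b.npy"]
chunk = feats[start : start + length]
print(chunk.shape)  # (2, 4)
```

Slicing the memory map only pages in the requested rows, which is the point of consolidating many small `.npy` files into one.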
compare_audio.py ADDED
@@ -0,0 +1,20 @@
+ import soundfile as sf
+ import librosa
+ import numpy as np
+
+ wav_gt, sr = librosa.load('test_train_gt.wav', sr=44100)
+ wav_pred, _ = librosa.load('test_overfit_pe.wav', sr=44100)
+
+ min_len = min(len(wav_gt), len(wav_pred))
+
+ # Calculate the spectral difference
+ S_gt = np.abs(librosa.stft(wav_gt[:min_len]))
+ S_pred = np.abs(librosa.stft(wav_pred[:min_len]))
+
+ diff = np.mean(np.abs(S_gt - S_pred))
+ print("Spectral Mean Absolute Error:", diff)
+
+ # Mix the two signals to hear whether they are identical but delayed
+ mix = (wav_gt[:min_len] + wav_pred[:min_len]) / 2
+ sf.write('test_mix.wav', mix, 44100)
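If the mix sounds comb-filtered, the two renders are likely identical up to a constant offset. A cross-correlation lag estimate can confirm this before any sample-wise comparison; a sketch on synthetic noise (not the project's wav files), where the recovered lag equals the injected 100-sample delay:

```python
# Lag estimation sketch (synthetic signals): find the constant delay of one
# waveform relative to another via the peak of their cross-correlation.
import numpy as np

rng = np.random.default_rng(0)
ref = rng.standard_normal(4096).astype(np.float32)

delay = 100  # inject a known 100-sample delay
delayed = np.concatenate([np.zeros(delay, dtype=np.float32), ref])[: len(ref)]

# Peak index of the full cross-correlation, re-centered, gives the lag
# such that delayed[n] ~= ref[n - lag].
corr = np.correlate(delayed, ref, mode="full")
lag = int(np.argmax(corr)) - (len(ref) - 1)
print(lag)  # 100

# Align before computing a sample-wise or spectral difference.
aligned = delayed[lag:] if lag > 0 else delayed
```

On real audio the same recipe applies after truncating both signals to `min_len`; a nonzero lag would explain a large spectral error between otherwise matching renders.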
configs/base.yaml ADDED
@@ -0,0 +1,71 @@
+ train:
+   model: "sovits"
+   seed: 1234
+   epochs: 10
+   learning_rate: 5e-5
+   betas: [0.8, 0.99]
+   lr_decay: 0.999875
+   eps: 1e-9
+   batch_size: 2
+   c_stft: 9
+   c_mel: 1.
+   c_kl: 0.2
+   port: 8001
+   pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
+ #############################
+ data:
+   training_files: "files/train.txt"
+   validation_files: "files/valid.txt"
+   segment_size: 8000  # WARNING: must be a multiple of hop_length
+   max_wav_value: 32768.0
+   sampling_rate: 32000
+   filter_length: 1024
+   hop_length: 320
+   win_length: 1024
+   mel_channels: 100
+   mel_fmin: 50.0
+   mel_fmax: 16000.0
+ #############################
+ vits:
+   ppg_dim: 1280
+   vec_dim: 256
+   spk_dim: 256
+   gin_channels: 256
+   inter_channels: 192
+   hidden_channels: 192
+   filter_channels: 640
+ #############################
+ gen:
+   upsample_input: 192
+   upsample_rates: [5,4,4,2,2]
+   upsample_kernel_sizes: [15,8,8,4,4]
+   upsample_initial_channel: 320
+   resblock_kernel_sizes: [3,7,11]
+   resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
+ #############################
+ mpd:
+   periods: [2,3,5,7,11]
+   kernel_size: 5
+   stride: 3
+   use_spectral_norm: False
+   lReLU_slope: 0.2
+ #############################
+ mrd:
+   resolutions: "[(1024, 120, 600), (2048, 240, 1200), (4096, 480, 2400), (512, 50, 240)]"  # (filter_length, hop_length, win_length)
+   use_spectral_norm: False
+   lReLU_slope: 0.2
+ #############################
+ log:
+   info_interval: 100
+   eval_interval: 10
+   save_interval: 10
+   num_audio: 6
+   pth_dir: 'chkpt'
+   log_dir: 'logs'
+   keep_ckpts: 0
+ #############################
+ dist_config:
+   dist_backend: "nccl"
+   dist_url: "tcp://localhost:54321"
+   world_size: 1
+
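The WARNING on `segment_size` in the data block can be checked mechanically: with this hop length, a training segment must span a whole number of STFT frames. A quick sanity check using values copied from this file (a plain dict stands in for a YAML parser):

```python
# Sanity check (values copied from configs/base.yaml): segment_size must be
# a multiple of hop_length so a segment maps to an integer frame count.
data = {"segment_size": 8000, "hop_length": 320, "sampling_rate": 32000}

assert data["segment_size"] % data["hop_length"] == 0
frames_per_segment = data["segment_size"] // data["hop_length"]
print(frames_per_segment)  # 25
```

Changing either `segment_size` or `hop_length` independently would break this invariant, which is what the inline WARNING guards against.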
configs/singers/singer0001.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0002.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0003.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0004.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0005.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0006.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0007.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0008.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0009.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0010.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0011.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0012.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0013.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0014.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0015.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0016.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0017.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0018.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0019.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0020.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0021.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0022.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0023.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0024.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0025.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0026.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0027.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0028.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0029.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0030.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0031.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0032.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0033.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0034.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0035.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0036.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0037.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0038.npy ADDED
Binary file (1.15 kB)