Hector Li committed on
Commit df93d13 · 0 parents

Initial commit for Hugging Face

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +9 -0
  2. .gitignore +40 -0
  3. LICENSE +21 -0
  4. README.md +283 -0
  5. README_OLD.md +382 -0
  6. README_V1.md +102 -0
  7. app.py +356 -0
  8. automated_pipeline.sh +97 -0
  9. build_faiss_index.py +51 -0
  10. build_mmap.py +85 -0
  11. compare_audio.py +20 -0
  12. configs/base.yaml +71 -0
  13. configs/singers/singer0001.npy +0 -0
  14. configs/singers/singer0002.npy +0 -0
  15. configs/singers/singer0003.npy +0 -0
  16. configs/singers/singer0004.npy +0 -0
  17. configs/singers/singer0005.npy +0 -0
  18. configs/singers/singer0006.npy +0 -0
  19. configs/singers/singer0007.npy +0 -0
  20. configs/singers/singer0008.npy +0 -0
  21. configs/singers/singer0009.npy +0 -0
  22. configs/singers/singer0010.npy +0 -0
  23. configs/singers/singer0011.npy +0 -0
  24. configs/singers/singer0012.npy +0 -0
  25. configs/singers/singer0013.npy +0 -0
  26. configs/singers/singer0014.npy +0 -0
  27. configs/singers/singer0015.npy +0 -0
  28. configs/singers/singer0016.npy +0 -0
  29. configs/singers/singer0017.npy +0 -0
  30. configs/singers/singer0018.npy +0 -0
  31. configs/singers/singer0019.npy +0 -0
  32. configs/singers/singer0020.npy +0 -0
  33. configs/singers/singer0021.npy +0 -0
  34. configs/singers/singer0022.npy +0 -0
  35. configs/singers/singer0023.npy +0 -0
  36. configs/singers/singer0024.npy +0 -0
  37. configs/singers/singer0025.npy +0 -0
  38. configs/singers/singer0026.npy +0 -0
  39. configs/singers/singer0027.npy +0 -0
  40. configs/singers/singer0028.npy +0 -0
  41. configs/singers/singer0029.npy +0 -0
  42. configs/singers/singer0030.npy +0 -0
  43. configs/singers/singer0031.npy +0 -0
  44. configs/singers/singer0032.npy +0 -0
  45. configs/singers/singer0033.npy +0 -0
  46. configs/singers/singer0034.npy +0 -0
  47. configs/singers/singer0035.npy +0 -0
  48. configs/singers/singer0036.npy +0 -0
  49. configs/singers/singer0037.npy +0 -0
  50. configs/singers/singer0038.npy +0 -0
.gitattributes ADDED
@@ -0,0 +1,9 @@
*.hdf5 filter=lfs diff=lfs merge=lfs -text
*.wav filter=lfs diff=lfs merge=lfs -text
*.cWG5V7 filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
*.pdf filter=lfs diff=lfs merge=lfs -text
*.exe filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,40 @@
__pycache__/

runtime/
.venv/
venv/
.venv_linux/
.vscode/

*_pretrain/
crepe/assets/full.pth

chkpt/
data_svc/
dataset_raw/
files/
logs/

sovits5.0.pth
svc_out_pit.wav
svc_out.wav
svc_tmp.pit.csv
svc_tmp.ppg.npy
svc_tmp.vec.npy
test.wav

so-vits-svc-5.0-*.zip

# Ignore model checkpoints and large audio arrays
*.pt
*.pth
model_1200000.safetensors
*.wav
chkpt/
chkpt_cfm/
logs/

opensinger/
dataset_raw_old/
data_svc_infer/
stable-audio-tools/
LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 PlayVoice

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md ADDED
@@ -0,0 +1,283 @@
# CFM-SVC / F5-SVC — Singing Voice Conversion

Two implementations of a flow-matching-based Singing Voice Conversion (SVC) system.

| | V1 (CFM-SVC) | V2 (F5-SVC) |
|---|---|---|
| Backbone | DiT trained from scratch | F5-TTS pretrained (LoRA) |
| Output space | DAC codec latents (1024-dim) | Log-mel spectrogram (100-dim) |
| Vocoder | DAC decoder (frozen) | Vocos (frozen) |
| Params trained | ~82M | ~5M (adapter + LoRA) |
| Training data | Multi-speaker singing | Multi-speaker singing |
| Speaker adaptation | Speaker d-vector | Stage 2: spk_proj on speech clips |

---
## Project Structure

```
matcha_svc/
├── models/
│   ├── cfm.py                V1: Diffusion Transformer (DiT)
│   ├── cond_encoder.py       V1: PPG+HuBERT+F0+Speaker → conditioning
│   ├── codec_wrapper.py      V1: DAC codec + projector head
│   ├── svc_cond_adapter.py   V2: PPG+HuBERT+F0+Speaker → F5-TTS text_dim
│   ├── lora_utils.py         V2: LoRALinear, inject_lora(), freeze_non_lora()
│   └── f5_svc.py             V2: F5SVCModel wrapper + build_f5svc() factory
│
├── losses/
│   └── cfm_loss.py           V1: flow matching + projector commitment loss
│
├── svc_data/
│   └── mel_svc_dataset.py    V2: log-mel dataset (same directory layout as V1)
│
├── train_cfm.py              V1 training script
├── train_f5_stage1.py        V2 Stage 1: SVCCondAdapter + LoRA on singing data
├── train_f5_stage2.py        V2 Stage 2: spk_proj on target speaker speech
├── infer_f5_svc.py           V2 inference: Euler sampling → Vocos → .wav
├── submit_train.sh           SLURM job script for V1
│
├── data_svc/                 Preprocessed features (generated by svc_preprocessing.py)
│   ├── audio/<spk>/<id>.wav
│   ├── whisper/<spk>/<id>.ppg.npy
│   ├── hubert/<spk>/<id>.vec.npy
│   ├── pitch/<spk>/<id>.pit.npy
│   ├── speaker/<spk>/<id>.spk.npy
│   └── codec_targets/<spk>/<id>.pt   ← V1 only
│
├── chkpt_cfm/                V1 checkpoints
└── chkpt_f5svc/              V2 checkpoints
```

---
## Prerequisites

```bash
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install descript-audio-codec                      # V1
pip install f5-tts vocos safetensors huggingface_hub  # V2
```

Pretrained feature extractors (shared by V1 and V2):

| File | Destination |
|---|---|
| `best_model.pth.tar` (speaker encoder) | `speaker_pretrain/` |
| `large-v2.pt` (Whisper) | `whisper_pretrain/` |
| `hubert-soft-0d54a1f4.pt` | `hubert_pretrain/` |
| `full.pth` (CREPE) | `crepe/assets/` |

---
## Data Preparation (shared by V1 and V2)

### 1. Raw audio layout

```
dataset_raw/
├── speaker0/
│   ├── 000001.wav
│   └── ...
└── speaker1/
    └── ...
```

Clips should be clean vocals, under 30 seconds, with no accompaniment.
Use UVR for source separation and audio-slicer for cutting.

### 2. Extract features

```bash
python svc_preprocessing.py -t 2
```

Produces under `data_svc/`:
- `whisper/<spk>/<id>.ppg.npy` — Whisper PPG (1280-dim, 50 Hz)
- `hubert/<spk>/<id>.vec.npy` — HuBERT (256-dim, 50 Hz)
- `pitch/<spk>/<id>.pit.npy` — F0 in Hz (50 Hz, 0 = unvoiced)
- `speaker/<spk>/<id>.spk.npy` — Speaker d-vector (256-dim)
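The three frame-level streams above share a 50 Hz frame rate but can differ in length by a frame or two due to padding. A small sanity check might look like the following sketch (`trim_to_common` is illustrative, not part of the repo; real arrays would be loaded from `data_svc/` with `np.load`):

```python
import numpy as np

def trim_to_common(ppg, vec, pit):
    # PPG, HuBERT, and F0 are all 50 Hz; trim to the shortest stream
    # so the conditioning tensors line up frame-for-frame.
    T = min(ppg.shape[0], vec.shape[0], pit.shape[0])
    return ppg[:T], vec[:T], pit[:T]

# Synthetic stand-ins with off-by-one lengths (real files come from data_svc/)
ppg = np.zeros((501, 1280), dtype=np.float32)  # Whisper PPG
vec = np.zeros((500, 256), dtype=np.float32)   # HuBERT
pit = np.zeros((499,), dtype=np.float32)       # F0 in Hz, 0 = unvoiced
ppg, vec, pit = trim_to_common(ppg, vec, pit)
assert ppg.shape[0] == vec.shape[0] == pit.shape[0] == 499
```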
### 3. V1 only: extract codec targets

```bash
python data/codec_targets.py -w ./data_svc/waves-32k -o ./data_svc/codec_targets
```

V2 computes mel spectrograms on the fly from the raw audio — no offline codec step needed.

---

## V1: CFM-SVC (Training from Scratch)

### Train

```bash
python train_cfm.py \
    --data_dir ./data_svc/codec_targets \
    --batch_size 64 \
    --lr 2e-5 \
    --epochs 250 \
    --save_interval 1

# or via SLURM:
sbatch submit_train.sh
```

Training automatically resumes from the latest checkpoint in `chkpt_cfm/`.

Key arguments:

| Argument | Default | Description |
|---|---|---|
| `--lr` | `1e-4` | Learning rate |
| `--batch_size` | `2` | Batch size |
| `--grad_accum` | `1` | Gradient accumulation steps |
| `--grad_clip` | `1.0` | Gradient clip max norm |
| `--save_interval` | `50` | Save every N epochs |
| `--use_checkpointing` | off | Enable gradient checkpointing (saves VRAM) |
| `--freeze_norm` | off | Freeze latent norm stats (for fine-tuning) |

### Inference (V1)

```bash
python infer.py --wave /path/to/source_singing.wav
```

---

## V2: F5-SVC (LoRA on F5-TTS)

### Architecture

- F5-TTS's DiT is loaded with pretrained weights and kept mostly frozen.
- `SVCCondAdapter` replaces the text encoder: PPG + HuBERT + F0 + speaker → (B, T, 512).
- LoRA (rank 16) is injected into every DiT attention projection (Q, K, V, Out).
- Vocos decodes mel spectrograms to audio.
- Two-stage training protocol:
  - **Stage 1** (singing): SVCCondAdapter + LoRA trained on multi-speaker singing data.
  - **Stage 2** (per-speaker): only `spk_proj` trained on the target speaker's speech clips.
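The core idea behind the LoRA injection (what `lora_utils.py` presumably implements; names, scale, and shapes here are illustrative, not the repo's actual code) is a frozen base weight plus a low-rank update whose up-projection starts at zero, so training begins exactly from the pretrained behaviour:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 16  # rank 16, as in the attention projections

W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.02   # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-init
scale = 1.0 / rank

def lora_linear(x):
    # y = x W^T + scale * (x A^T) B^T — only A and B would receive gradients.
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
# With B = 0 the LoRA path is inactive: output equals the frozen layer.
assert np.allclose(lora_linear(x), x @ W.T)
```

Only `A` and `B` (a few thousand parameters per projection) are saved and trained, which is how the trainable-parameter count stays around ~5M.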
### Download F5-TTS checkpoint

```python
from huggingface_hub import hf_hub_download
path = hf_hub_download("SWivid/F5-TTS", "F5TTS_Base/model_1200000.safetensors")
print(path)
```

### Stage 1 — Singing Adaptation

Trains: `SVCCondAdapter` (content projection + speaker projection) + LoRA adapters
Freezes: all other DiT weights

```bash
python train_f5_stage1.py \
    --f5tts_ckpt /path/to/model_1200000.safetensors \
    --audio_dir ./data_svc/audio \
    --epochs 200 \
    --batch_size 16 \
    --lr 1e-4

# Checkpoints saved to ./chkpt_f5svc/stage1_epoch_N.pt
```

All PPG/HuBERT/F0/speaker features from V1 preprocessing are reused directly.
The only difference is the audio directory name: V1 produces `data_svc/waves-32k/`
while V2 defaults to `data_svc/audio/`. Pass `--audio_dir ./data_svc/waves-32k` to
reuse V1 audio (it is resampled to 24 kHz on the fly; no re-extraction needed).
The codec targets directory (`data_svc/codec_targets/`) is V1-only and not needed here.
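The on-the-fly 32 kHz → 24 kHz resampling mentioned above is a 4:3 rate change. A minimal linear-interpolation sketch of the idea (a real pipeline would use a proper polyphase filter such as `torchaudio.transforms.Resample` rather than this):

```python
import numpy as np

def resample(x, sr_in=32000, sr_out=24000):
    # Interpolate onto the new sample grid; length scales by sr_out / sr_in.
    n_out = int(round(len(x) * sr_out / sr_in))
    t_in = np.arange(len(x)) / sr_in
    t_out = np.arange(n_out) / sr_out
    return np.interp(t_out, t_in, x)

one_second = np.zeros(32000)           # 1 s of 32 kHz audio
assert len(resample(one_second)) == 24000
```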
### Stage 2 — Per-Speaker Fine-tuning

Trains: `svc_adapter.spk_proj` only
Freezes: DiT + LoRA (locked in from Stage 1)
Data: speech clips of the target speaker (no singing required)

```bash
python train_f5_stage2.py \
    --stage1_ckpt ./chkpt_f5svc/stage1_epoch_200.pt \
    --audio_dir ./data_svc/audio/my_speaker \
    --speaker_id my_speaker \
    --epochs 50

# Saved to ./chkpt_f5svc/stage2_my_speaker.pt
```

The target speaker's speech clips need the same feature extraction as Stage 1:
run `svc_preprocessing.py` pointing at the speech audio directory.

### Inference (V2)

```bash
python infer_f5_svc.py \
    --ckpt ./chkpt_f5svc/stage1_epoch_200.pt \
    --source ./source_singing.wav \
    --target_spk ./data_svc/speaker/my_speaker/ref.spk.npy \
    --ref_audio ./data_svc/audio/my_speaker/ref.wav \
    --output ./converted.wav \
    --steps 32
```

For a Stage 2 speaker-adapted checkpoint:
```bash
python infer_f5_svc.py \
    --ckpt ./chkpt_f5svc/stage2_my_speaker.pt \
    --source ./source_singing.wav \
    --target_spk ./data_svc/speaker/my_speaker/ref.spk.npy \
    --ref_audio ./data_svc/audio/my_speaker/ref.wav \
    --output ./converted.wav
```

Inference arguments:

| Argument | Default | Description |
|---|---|---|
| `--ckpt` | required | Stage 1 or Stage 2 checkpoint |
| `--source` | required | Source singing .wav |
| `--target_spk` | required | Target speaker .spk.npy |
| `--ref_audio` | `None` | Short .wav of the target speaker for timbre reference |
| `--ref_sec` | `3.0` | Seconds of ref_audio to use |
| `--steps` | `32` | Euler ODE steps (more = higher quality, slower) |
| `--output` | `./converted.wav` | Output path |

The source audio must have pre-extracted features (PPG, HuBERT, F0) in the standard
`data_svc/` directory structure. Run `svc_preprocessing.py` on the source if needed.
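`--steps` controls a fixed-step Euler integration of the learned ODE from noise to mel frames. Schematically (with a toy closed-form velocity field standing in for the conditioned DiT):

```python
import numpy as np

def euler_sample(v_field, x0, steps):
    # Integrate dx/dt = v(x, t) from t=0 to t=1 in `steps` uniform steps.
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * v_field(x, t)
    return x

# Toy field dx/dt = -x: exact solution x(1) = x0 * exp(-1) ≈ 0.3679.
x1 = euler_sample(lambda x, t: -x, np.array([1.0]), steps=32)
assert abs(x1[0] - np.exp(-1.0)) < 0.01
```

Doubling `--steps` roughly halves the Euler discretization error, which is why more steps trade speed for quality.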
---

## Checkpoints

V1 saves the full model state per epoch to `chkpt_cfm/`:
```
chkpt_cfm/
├── dit_epoch_N.pt
├── cond_encoder_epoch_N.pt
├── projector_epoch_N.pt
├── ema_dit_epoch_N.pt
├── optimizer_epoch_N.pt
├── scheduler_epoch_N.pt
└── latent_norm.pt          ← cached normalization stats
```

V2 saves adapter + LoRA state per epoch to `chkpt_f5svc/`:
```
chkpt_f5svc/
├── stage1_epoch_N.pt       ← full model state (adapter + LoRA + frozen DiT);
│                             also contains a lora_only key for lightweight sharing
└── stage2_<speaker_id>.pt  ← speaker-adapted state
```
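When only the lightweight adapters need to be shared, the `lora_only` entry can be pulled out of a Stage 1 checkpoint. A sketch, assuming the checkpoint is a plain dict with a `lora_only` key as described above (demonstrated on a stand-in dict; parameter names are hypothetical, and in practice the dict would come from `torch.load(..., map_location="cpu")`):

```python
# Stand-in for torch.load("chkpt_f5svc/stage1_epoch_200.pt")
ckpt = {
    "model": {"dit.blocks.0.attn.q.weight": [0.0]},      # full state (large)
    "lora_only": {"dit.blocks.0.attn.q.lora_A": [0.0]},  # adapters only (small)
}

shared = ckpt["lora_only"]   # this is all a collaborator needs on top of F5-TTS
assert all("lora" in name for name in shared)
```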
---

## References

- Rectified Flow / Flow Matching
- F5-TTS: [SWivid/F5-TTS](https://github.com/SWivid/F5-TTS)
- Vocos vocoder: [gemelo-ai/vocos](https://github.com/gemelo-ai/vocos)
- DAC: [descriptinc/descript-audio-codec](https://github.com/descriptinc/descript-audio-codec)
- so-vits-svc-5.0: preprocessing pipeline
README_OLD.md ADDED
@@ -0,0 +1,382 @@
<div align="center">
<h1> Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS </h1>

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/maxmax20160403/sovits5.0)
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub forks" src="https://img.shields.io/github/forks/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub issues" src="https://img.shields.io/github/issues/PlayVoice/so-vits-svc-5.0">
<img alt="GitHub" src="https://img.shields.io/github/license/PlayVoice/so-vits-svc-5.0">

</div>

- This project targets deep learning beginners; basic knowledge of Python and PyTorch is the prerequisite;
- This project aims to help deep learning beginners get past boring, purely theoretical study and master basic deep learning knowledge by combining it with practice;
- This project does not support real-time voice conversion (replace whisper if real-time conversion is what you are looking for);
- This project will not develop one-click packages for other purposes;

![vits-5.0-frame](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/3854b281-8f97-4016-875b-6eb663c92466)

- Minimum 6 GB VRAM required for training

- Support for multiple speakers

- Create unique speakers through speaker mixing

- Even voices with light accompaniment can be converted

- F0 can be edited using Excel

https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/6a09805e-ab93-47fe-9a14-9cbc1e0e7c3a

Powered by [@ShadowVap](https://space.bilibili.com/491283091)

## Model properties

| Feature | From | Status | Function |
| :--- | :--- | :--- | :--- |
| whisper | OpenAI | ✅ | strong noise immunity |
| bigvgan | NVIDIA | ✅ | anti-aliased snake activation; clearer formants, noticeably improved sound quality |
| natural speech | Microsoft | ✅ | reduce mispronunciation |
| neural source-filter | NII | ✅ | solve the problem of audio F0 discontinuity |
| speaker encoder | Google | ✅ | timbre encoding and clustering |
| GRL for speaker | Ubisoft | ✅ | prevent the encoder from leaking timbre |
| SNAC | Samsung | ✅ | one-shot cloning for VITS |
| SCLN | Microsoft | ✅ | improve cloning |
| PPG perturbation | this project | ✅ | improved noise immunity and de-timbre |
| HuBERT perturbation | this project | ✅ | improved noise immunity and de-timbre |
| VAE perturbation | this project | ✅ | improve sound quality |
| MIX encoder | this project | ✅ | improve conversion stability |
| USP infer | this project | ✅ | improve conversion stability |

Due to the use of data perturbation, training takes longer than in comparable projects.

**USP: Unvoiced and Silence with Pitch during inference**
![vits_svc_usp](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/ba733b48-8a89-4612-83e0-a0745587d150)

## Quick Installation

```PowerShell
# clone project
git clone https://github.com/ouor/so-vits-svc-5.0

# create virtual environment
python -m venv .venv

# activate virtual environment
.venv\Scripts\activate

# install pytorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

# install dependencies
pip install -r requirements.txt

# run app.py
python app.py
```

## Setup Environment

1. Install [PyTorch](https://pytorch.org/get-started/locally/).

2. Install project dependencies
```shell
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
```
**Note: whisper is already built in; do not install it again or it will cause conflicts and errors**
3. Download the timbre encoder [Speaker-Encoder by @mueller91](https://drive.google.com/drive/folders/15oeBYf6Qn1edONkVLXe82MzdIi3O_9m3) and put `best_model.pth.tar` into `speaker_pretrain/`.

4. Download the whisper model [whisper-large-v2](https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt). Make sure to download `large-v2.pt` and put it into `whisper_pretrain/`.

5. Download the [hubert_soft model](https://github.com/bshall/hubert/releases/tag/v0.1) and put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`.

6. Download the pitch extractor [crepe full](https://github.com/maxrmorrison/torchcrepe/tree/master/torchcrepe/assets) and put `full.pth` into `crepe/assets/`.

7. Download the pretrained model [sovits5.0.pretrain.pth](https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/5.0/) and put it into `vits_pretrain/`.
```shell
python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
```

## Dataset preparation

Necessary pre-processing:
1. Separate voice and accompaniment with [UVR](https://github.com/Anjok07/ultimatevocalremovergui) (skip if there is no accompaniment).
2. Cut the audio into shorter clips with [slicer](https://github.com/flutydeer/audio-slicer); whisper takes input shorter than 30 seconds.
3. Manually check the generated clips; remove any shorter than 2 seconds or with obvious noise.
4. Adjust loudness if necessary (Adobe Audition is recommended).
5. Put the dataset into the `dataset_raw` directory following the structure below.
```
dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
```

## Data preprocessing
```shell
python svc_preprocessing.py -t 2
```
`-t`: thread count; it should not exceed the number of CPU cores, and 2 is usually enough.
After preprocessing you will get output with the following structure.
```
data_svc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── hubert
│   ├── speaker0
│   │   ├── 000001.vec.npy
│   │   └── 000xxx.vec.npy
│   └── speaker1
│       ├── 000001.vec.npy
│       └── 000xxx.vec.npy
├── whisper
│   ├── speaker0
│   │   ├── 000001.ppg.npy
│   │   └── 000xxx.ppg.npy
│   └── speaker1
│       ├── 000001.ppg.npy
│       └── 000xxx.ppg.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
```

1. Re-sampling
- Generate audio with a sampling rate of 16000 Hz in `./data_svc/waves-16k`
```
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
```

- Generate audio with a sampling rate of 32000 Hz in `./data_svc/waves-32k`
```
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
```
2. Use the 16k audio to extract pitch
```
python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
```
3. Use the 16k audio to extract PPG
```
python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
```
4. Use the 16k audio to extract hubert vectors
```
python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert
```
5. Use the 16k audio to extract the timbre code
```
python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
```
6. Extract the average timbre code for inference; it can also replace the per-utterance timbre when generating the training index, serving as the speaker's unified timbre for training
```
python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
```
7. Use the 32k audio to extract the linear spectrum
```
python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
```
8. Use the 32k audio to generate the training index
```
python prepare/preprocess_train.py
```
9. Training file debugging
```
python prepare/preprocess_zzz.py
```

## Train
1. If fine-tuning from the pre-trained model, download it first: [sovits5.0.pretrain.pth](https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/5.0). Put the pretrained model under the project root and change this line
```
pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
```
in `configs/base.yaml`, and adjust the learning rate appropriately, e.g. 5e-5.

`batch_size`: for a GPU with 6 GB VRAM, 6 is the recommended value; 8 will work but step speed will be much slower.
2. Start training
```
python svc_trainer.py -c configs/base.yaml -n sovits5.0
```
3. Resume training
```
python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/***.pth
```
4. Log visualization
```
tensorboard --logdir logs/
```

![sovits5 0_base](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/1628e775-5888-4eac-b173-a28dca978faa)

![sovits_spec](https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/c4223cf3-b4a0-4325-bec0-6d46d195a1fc)

## Inference

1. Export the inference model: text encoder, flow network, decoder network
```
python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
```
2. Inference
- If there is no need to adjust `f0`, just run the following command.
```
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0
```
- If `f0` will be adjusted manually, follow these steps:
1. Use whisper to extract the content encoding, generating `test.ppg.npy`.
```
python whisper/inference.py -w test.wav -p test.ppg.npy
```
2. Use hubert to extract the content vector separately (rather than via one-click inference) to reduce GPU memory usage.
```
python hubert/inference.py -w test.wav -v test.vec.npy
```
3. Extract the F0 parameters to CSV text format, open the CSV file in Excel, and manually fix incorrect F0 values with the help of Audition or SonicVisualiser.
```
python pitch/inference.py -w test.wav -p test.csv
```
4. Final inference
```
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --ppg test.ppg.npy --vec test.vec.npy --pit test.csv --shift 0
```
3. Notes

- When `--ppg` is specified, repeated inference on the same audio avoids re-extracting the content encoding; if it is not specified, it is extracted automatically.

- When `--vec` is specified, repeated inference on the same audio avoids re-extracting the content vector; if it is not specified, it is extracted automatically.

- When `--pit` is specified, the manually tuned F0 parameters are loaded; if not specified, they are extracted automatically.

- The generated file appears in the current directory: `svc_out.wav`.

4. Argument reference

| args |--config | --model | --spk | --wave | --ppg | --vec | --pit | --shift |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| name | config path | model path | speaker | wave input | wave ppg | wave hubert | wave pitch | pitch shift |

## Create singer
Named by pure coincidence: average -> ave -> eva; Eve (eva) represents conception and reproduction.

```
python svc_eva.py
```

```python
eva_conf = {
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}
```

The generated singer file will be `eva.spk.npy`.

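The values in `eva_conf` are per-singer mixing coefficients; the result is (presumably) a weighted sum of the 256-dim speaker d-vectors, along the lines of this sketch (stand-in arrays instead of the real `./configs/singers/*.npy` files):

```python
import numpy as np

# Stand-ins for ./configs/singers/*.npy d-vectors (256-dim each)
singers = {"singer_a": np.ones(256), "singer_b": 3 * np.ones(256)}
weights = {"singer_a": 0.5, "singer_b": 0.5}

mix = sum(w * singers[name] for name, w in weights.items())
# np.save("eva.spk.npy", mix) would then produce the new singer file
assert np.allclose(mix, 2 * np.ones(256))
```

Weights that sum to 1 keep the mixed vector on the same scale as the originals, which matters because the speaker encoder's d-vectors are what condition the decoder.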
## Datasets

| Name | URL |
| :--- | :--- |
|KiSing |http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/|
|PopCS |https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md|
|opencpop |https://wenet.org.cn/opencpop/download/|
|Multi-Singer |https://github.com/Multi-Singer/Multi-Singer.github.io|
|M4Singer |https://github.com/M4Singer/M4Singer/blob/master/apply_form.md|
|CSD |https://zenodo.org/record/4785016#.YxqrTbaOMU4|
|KSS |https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset|
|JVS Music |https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music|
|PJS |https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus|
|JSUT Song |https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song|
|MUSDB18 |https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems|
|DSD100 |https://sigsep.github.io/datasets/dsd100.html|
|Aishell-3 |http://www.aishelltech.com/aishell_3|
|VCTK |https://datashare.ed.ac.uk/handle/10283/2651|

324
+ ## Code sources and references
325
+
326
+ https://github.com/facebookresearch/speech-resynthesis [paper](https://arxiv.org/abs/2104.00355)
327
+
328
+ https://github.com/jaywalnut310/vits [paper](https://arxiv.org/abs/2106.06103)
329
+
330
+ https://github.com/openai/whisper/ [paper](https://arxiv.org/abs/2212.04356)
331
+
332
+ https://github.com/NVIDIA/BigVGAN [paper](https://arxiv.org/abs/2206.04658)
333
+
334
+ https://github.com/mindslab-ai/univnet [paper](https://arxiv.org/abs/2106.07889)
335
+
336
+ https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
337
+
338
+ https://github.com/brentspell/hifi-gan-bwe
339
+
340
+ https://github.com/mozilla/TTS
341
+
342
+ https://github.com/bshall/soft-vc
343
+
344
+ https://github.com/maxrmorrison/torchcrepe
345
+
346
+ https://github.com/OlaWod/FreeVC [paper](https://arxiv.org/abs/2210.15418)
347
+
348
+ [SNAC : Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech](https://github.com/hcy71o/SNAC)
349
+
350
+ [Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers](https://arxiv.org/abs/2211.00585)
351
+
352
+ [AdaSpeech: Adaptive Text to Speech for Custom Voice](https://arxiv.org/pdf/2103.00993.pdf)
353
+
354
+ [Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis](https://github.com/ubisoft/ubisoft-laforge-daft-exprt)
355
+
356
+ [Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings](https://arxiv.org/abs/2305.05401)
357
+
358
+ [Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion](https://arxiv.org/pdf/2305.09167.pdf)
359
+
360
+ [Speaker normalization (GRL) for self-supervised speech emotion recognition](https://arxiv.org/abs/2202.01252)
361
+
362
+ ## Method of Preventing Timbre Leakage Based on Data Perturbation
363
+
364
+ https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py
365
+
366
+ https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py
367
+
368
+ https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py
369
+
370
+ https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py
371
+
372
+ https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py
373
+
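The linked repositories implement this idea in different ways (formant shifting, random parametric EQ, pitch perturbation). A minimal NumPy sketch of the shared principle — corrupting the spectral envelope with a random smooth gain curve while leaving content intact — might look like the following; the function name and parameters are illustrative, not taken from any of the repos above.

```python
import numpy as np

def random_spectral_perturb(wav, n_knots=8, max_db=6.0, seed=None):
    """Apply a random, smooth gain curve over frequency to perturb timbre.

    Toy stand-in for the praat/peq-style augmentations linked above: the
    content (phonetics, pitch contour) survives, but the spectral envelope
    that a speaker encoder keys on is randomized, discouraging timbre leakage.
    """
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(wav)
    # random gains at a few frequency knots, interpolated into a smooth curve
    knots = rng.uniform(-max_db, max_db, n_knots)
    gain_db = np.interp(np.linspace(0, 1, len(spec)),
                        np.linspace(0, 1, n_knots), knots)
    return np.fft.irfft(spec * 10.0 ** (gain_db / 20.0), n=len(wav))
```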
374
+ ## Contributors
375
+
376
+ <a href="https://github.com/PlayVoice/so-vits-svc/graphs/contributors">
377
+ <img src="https://contrib.rocks/image?repo=PlayVoice/so-vits-svc" />
378
+ </a>
379
+
380
+ ## Relevant Projects
381
+ - [LoRA-SVC](https://github.com/PlayVoice/lora-svc): decoder-only SVC
+ - [NSF-BigVGAN](https://github.com/PlayVoice/NSF-BigVGAN): vocoder for further work
README_V1.md ADDED
@@ -0,0 +1,102 @@
1
+ Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on CFM
2
+
3
+ This project targets deep-learning beginners; basic knowledge of Python and PyTorch is the only prerequisite. It implements a highly modular, mathematically rigorous Conditional Flow Matching (CFM) based Singing Voice Conversion (SVC) system using a pretrained codec and a learned projection (Option C*).
4
+
5
+ By replacing the VITS/VAE monoliths with a Diffusion Transformer (DiT) and an explicit codebook projector, we achieve stronger temporal dependency modeling and faster, more stable training without the overhead of learning an autoencoder from scratch.
6
+
7
+ ## Architecture Highlights
8
+ - **Frozen Pretrained Codec**: Uses a pretrained neural codec (e.g., DAC 44 kHz) purely for encoding and decoding, freezing its weights to save VRAM.
9
+ - **Offline Data Processing**: `z_target` latents are extracted once before training, preventing massive CPU/GPU bottlenecks in dataloaders.
10
+ - **Diffusion Transformer (DiT)**: Velocity field prediction $v_\theta$ uses a DiT instead of 1D U-Nets for state-of-the-art long-sequence audio modeling.
11
+ - **Dual-Loss Formulation with Implied Targets**: Avoids the mathematical trap of backpropagating through an ODE solver during training by computing the projector's commitment target directly from the implied target velocity.
12
+
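Concretely, the dual-loss idea can be sketched in a few lines of NumPy. Here `v_theta` and `project` are placeholders for the DiT and the projection network, not the actual APIs of this repo: the target velocity of the linear probability path is known in closed form, and a one-step extrapolation of the predicted velocity gives an implied endpoint for the projector's commitment loss, so no ODE solve appears in the training loop.

```python
import numpy as np

def dual_loss(v_theta, project, z1, cond, t, rng=None):
    """Flow-matching MSE plus projector commitment, with implied targets.

    z1: (B, T, D) target codec latents; t: scalar in (0, 1).
    v_theta and project are stand-ins for the DiT and projection network.
    """
    rng = np.random.default_rng(rng)
    x0 = rng.standard_normal(z1.shape)          # noise endpoint of the path
    xt = (1.0 - t) * x0 + t * z1                # linear interpolation path
    v_target = z1 - x0                          # implied velocity, closed form
    v_pred = v_theta(xt, cond, t)
    loss_fm = np.mean((v_pred - v_target) ** 2)
    # One-step implied endpoint: no ODE integration, no backprop through a solver
    z_implied = xt + (1.0 - t) * v_pred
    loss_commit = np.mean((project(z_implied) - z1) ** 2)
    return loss_fm, loss_commit
```

With a perfect velocity predictor both terms vanish, which is a useful sanity check for an implementation.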
13
+ ## Quick Installation
14
+
15
+ ```bash
16
+ # clone project
17
+ git clone https://github.com/ouor/so-vits-svc-5.0
18
+
19
+ # create virtual environment
20
+ python -m venv .venv
21
+
22
+ # activate virtual environment (Windows; on Linux/macOS: source .venv/bin/activate)
+ .venv\Scripts\activate
24
+
25
+ # install pytorch
26
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
27
+
28
+ # install dependencies
29
+ pip install -r requirements.txt
30
+ pip install descript-audio-codec
31
+
32
+ # launch the Gradio UI
33
+ python ui_cfm.py
34
+ ```
35
+
36
+ ## Setup Environment
37
+
38
+ - Download the Timbre Encoder: Speaker-Encoder by @mueller91, put `best_model.pth.tar` into `speaker_pretrain/`.
39
+ - Download whisper model whisper-large-v2. Make sure to download `large-v2.pt`, put it into `whisper_pretrain/`.
40
+ - Download hubert_soft model, put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`.
41
+ - Download pitch extractor crepe full, put `full.pth` into `crepe/assets`.
42
+
43
+ ## Dataset preparation
44
+
45
+ Necessary pre-processing:
46
+ 1. Separate voice and accompaniment with UVR (skip if no accompaniment).
47
+ 2. Cut audio input to shorter length with slicer (< 30s).
48
+ 3. Put the dataset into the `dataset_raw` directory following the structure below.
49
+
50
+ ```
51
+ dataset_raw
52
+ ├───speaker0
53
+ │ ├───000001.wav
54
+ │ └───000xxx.wav
55
+ └───speaker1
56
+ ├───000001.wav
57
+ └───000xxx.wav
58
+ ```
59
+
60
+ ## Data preprocessing (Offline Shift)
61
+
62
+ Unlike traditional VAE-based SVC, which performs encoding inside the dataloader, this pipeline pre-extracts both the conditioning and the quantized continuous vectors to save GPU resources.
63
+
64
+ 1. **Standard Extractors**: Extract PPG (Whisper), F0 (Crepe), and Speaker embeddings into their respective `data_svc/` folders:
65
+ ```bash
66
+ python svc_preprocessing.py -t 2
67
+ ```
68
+
69
+ 2. **Codec Targets Extraction**: Run the new offline generation script to pass all waveforms through the frozen codec and cache `z_target` tensors.
70
+ ```bash
71
+ python data/codec_targets.py -w ./data_svc/waves-32k -o ./data_svc/codec_targets
72
+ ```
73
+
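The caching step above amounts to one idempotent pass over the wave directory. A hedged sketch (the real script lives in `data/codec_targets.py`; `encode_fn` here is a placeholder for the frozen DAC encoder):

```python
import glob
import os

import numpy as np

def extract_codec_targets(wave_dir, out_dir, encode_fn):
    """Cache one z_target latent array per wav, mirroring the input tree.

    encode_fn: callable mapping a wav path -> (T, D) ndarray; in the real
    pipeline this wraps the frozen codec's encoder on the GPU.
    """
    for wav in sorted(glob.glob(os.path.join(wave_dir, "**", "*.wav"), recursive=True)):
        rel = os.path.relpath(wav, wave_dir)
        out_path = os.path.join(out_dir, os.path.splitext(rel)[0] + ".npy")
        if os.path.isfile(out_path):
            continue                      # already cached: reruns are cheap
        os.makedirs(os.path.dirname(out_path), exist_ok=True)
        np.save(out_path, encode_fn(wav))
```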
74
+ ## Train
75
+
76
+ You will jointly train the DiT velocity network $v_\theta$ and the lightweight projection network $P(u)$. The heavy codec encoder/decoder remains entirely offline.
77
+
78
+ ```bash
79
+ # Start Training
80
+ python train_cfm.py
81
+ ```
82
+ *The training script uses the dual-loss schema (flow-matching MSE + projector-commitment MSE) with the implicit velocity targets rather than integrating an ODE. Checkpoints are saved automatically to the `chkpt/` folder.*
83
+
84
+ ## Inference
85
+
86
+ The inference pipeline extracts the conditioning, samples the continuous latent with your preferred ODE solver (Euler, Heun, RK4), snaps the sample back to codebook space with the projector, and finally decodes the waveform via the DAC codec. **Long audio inputs are automatically chunked into 30 s segments to avoid VRAM overflow.**
87
+
88
+ ```bash
89
+ # Run Inference
90
+ python infer.py --wave /path/to/your/input.wav
91
+ ```
92
+
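The 30 s chunking mentioned above reduces to a split-process-concatenate loop. A minimal sketch — the real pipeline may also overlap and crossfade chunk boundaries, and `process` stands in for the full conversion call:

```python
import numpy as np

def chunk_and_stitch(wav, sr, process, max_sec=30.0):
    """Run `process` on fixed-length chunks so peak VRAM stays bounded."""
    hop = int(max_sec * sr)
    pieces = [process(wav[i:i + hop]) for i in range(0, len(wav), hop)]
    return np.concatenate(pieces)
```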
93
+ ### Notes on Inference Pipeline Components:
94
+ - The **ODE Solver** (`samplers/ode.py`) is modular. You can configure solver steps and methods (`solver='rk4'`) based on your quality-vs-speed needs.
95
+ - **Temporal Resampling** is handled automatically in `models/cond_encoder.py`, perfectly matching Whisper and Crepe conditionings to the target codec's continuous latent frame sequence length.
96
+
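For reference, the Euler and Heun updates the solver exposes boil down to the following — a self-contained NumPy sketch of the role `samplers/ode.py` plays, not its actual code:

```python
import numpy as np

def sample_ode(v_theta, cond, x0, steps=32, method="euler"):
    """Integrate dx/dt = v_theta(x, cond, t) from t=0 to t=1.

    Euler costs one network call per step; Heun (2nd order) costs two but
    usually reaches comparable quality with far fewer steps.
    """
    x = np.array(x0, dtype=float)
    ts = np.linspace(0.0, 1.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        dt = t1 - t0
        k1 = v_theta(x, cond, t0)
        if method == "euler":
            x = x + dt * k1
        else:  # heun
            k2 = v_theta(x + dt * k1, cond, t1)
            x = x + dt * 0.5 * (k1 + k2)
    return x
```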
97
+ ## Code sources and references
98
+
99
+ - Rectified Flow / Flow Matching literature
100
+ - Diffusion Transformers (DiT) based on [Peebles & Xie, 2022]
101
+ - Neural Audio Codecs (DAC / EnCodec)
102
+ - so-vits-svc-5.0 original repository components extracted for preprocessing
app.py ADDED
@@ -0,0 +1,356 @@
1
+ import os
2
+ import subprocess
3
+ import yaml
4
+ import sys
5
+ import webbrowser
6
+ import gradio as gr
7
+ import shutil
8
+ import soundfile
9
+ import shlex
10
+
11
+ class WebUI:
12
+ def __init__(self):
13
+ self.train_config_path = 'configs/train.yaml'
14
+ self.info = Info()
15
+ self.names = []
16
+ self.names2 = []
17
+ self.voice_names = []
18
+ base_config_path = 'configs/base.yaml'
19
+ if not os.path.exists(self.train_config_path):
20
+ shutil.copyfile(base_config_path, self.train_config_path)
21
+ print("초기화 성공")
22
+ else:
23
+ print("준비됨")
24
+ self.main_ui()
25
+
26
+ def main_ui(self):
27
+ with gr.Blocks(theme=gr.themes.Base(primary_hue=gr.themes.colors.green)) as ui:
28
+ gr.Markdown('# so-vits-svc5.0 WebUI')
29
+
30
+ with gr.Tab("학습"):
31
+ with gr.Accordion('학습 안내', open=False):
32
+ gr.Markdown(self.info.train)
33
+
34
+ gr.Markdown('### 데이터셋 파일 복사')
35
+ with gr.Row():
36
+ self.dataset_name = gr.Textbox(value='', placeholder='chopin', label='데이터셋 이름', info='데이터셋 화자의 이름을 입력하세요.', interactive=True)
37
+ self.dataset_src = gr.Textbox(value='', placeholder='C:/Users/Tacotron2/Downloads/chopin_dataset/', label='데이터셋 폴더', info='데이터셋 wav 파일이 있는 폴더를 지정하세요.', interactive=True)
38
+ self.bt_dataset_copy = gr.Button(value='복사', variant="primary")
39
+
40
+ gr.Markdown('### 전처리 파라미터 설정')
41
+ with gr.Row():
42
+ self.model_name = gr.Textbox(value='sovits5.0', label='model', info='모델명', interactive=True)
43
+ self.f0_extractor = gr.Dropdown(choices=['crepe'], value='crepe', label='f0_extractor', info='F0 추출기', interactive=True)
44
+ self.thread_count = gr.Slider(minimum=1, maximum=os.cpu_count(), step=1, value=2, label='thread_count', info='전처리 스레드 수', interactive=True)
45
+
46
+ gr.Markdown('### 학습 파라미터 설정')
47
+ with gr.Row():
48
+ self.learning_rate = gr.Number(value=5e-5, label='learning_rate', info='학습률', interactive=True)
49
+ self.batch_size = gr.Slider(minimum=1, maximum=50, step=1, value=6, label='batch_size', info='배치 크기', interactive=True)
50
+ self.epochs = gr.Textbox(value='100', label='epoch', info='학습 에포크 수', interactive=True)
51
+ with gr.Row():
52
+ self.info_interval = gr.Number(value=50, label='info_interval', info='학습 로깅 간격(step)', interactive=True)
+ self.eval_interval = gr.Number(value=1, label='eval_interval', info='검증 세트 간격(epoch)', interactive=True)
+ self.save_interval = gr.Number(value=5, label='save_interval', info='체크포인트 저장 간격(epoch)', interactive=True)
55
+ self.keep_ckpts = gr.Number(value=5, label='keep_ckpts', info='최신 체크포인트 파일 유지 갯수(0은 모두 저장)',interactive=True)
56
+ with gr.Row():
57
+ self.use_pretrained = gr.Checkbox(label="use_pretrained", info='사전학습모델 사용 여부', value=True, interactive=True, visible=False)
58
+
59
+ gr.Markdown('### 학습 시작')
60
+ with gr.Row():
61
+ self.bt_open_dataset_folder = gr.Button(value='데이터 세트 폴더 열기')
62
+ self.bt_onekey_train = gr.Button('원클릭 학습 시작', variant="primary")
63
+ self.bt_tb = gr.Button('Tensorboard 열기', variant="primary")
64
+
65
+ gr.Markdown('### 학습 재개')
66
+ with gr.Row():
67
+ self.resume_model = gr.Dropdown(choices=sorted(self.names), label='Resume training progress from checkpoints', info='체크포인트에서 학습 진행 재개', interactive=True)
68
+ with gr.Column():
69
+ self.bt_refersh = gr.Button('새로 고침')
70
+ self.bt_resume_train = gr.Button('학습 재개', variant="primary")
71
+
72
+ with gr.Tab("추론"):
73
+
74
+ with gr.Accordion('추론 안내', open=False):
75
+ gr.Markdown(self.info.inference)
76
+
77
+ gr.Markdown('### 추론 파라미터 설정')
78
+ with gr.Row():
79
+ with gr.Column():
80
+ self.keychange = gr.Slider(-12, 12, value=0, step=1, label='음높이 조절')
81
+ self.file_list = gr.Markdown(value="", label="파일 목록")
82
+
83
+ with gr.Row():
84
+ self.resume_model2 = gr.Dropdown(choices=sorted(self.names2), label='Select the model you want to export',
85
+ info='내보낼 모델 선택', interactive=True)
86
+ with gr.Column():
87
+ self.bt_refersh2 = gr.Button(value='모델 및 사운드 새로 고침')
88
+ self.bt_out_model = gr.Button(value='모델 내보내기', variant="primary")
89
+ with gr.Row():
90
+ self.resume_voice = gr.Dropdown(choices=sorted(self.voice_names), label='Select the sound file',
91
+ info='*.spk.npy 파일 선택', interactive=True)
92
+ with gr.Row():
93
+ self.input_wav = gr.Audio(type='filepath', label='변환할 오디오 선택', source='upload')
94
+ with gr.Row():
95
+ self.bt_infer = gr.Button(value='변환 시작', variant="primary")
96
+ with gr.Row():
97
+ self.output_wav = gr.Audio(label='출력 오디오', interactive=False)
98
+
99
+ self.bt_dataset_copy.click(fn=self.copydataset, inputs=[self.dataset_name, self.dataset_src])
100
+ self.bt_open_dataset_folder.click(fn=self.openfolder)
101
+ self.bt_onekey_train.click(fn=self.onekey_training,inputs=[self.model_name, self.thread_count,self.learning_rate,self.batch_size, self.epochs, self.info_interval, self.eval_interval,self.save_interval, self.keep_ckpts, self.use_pretrained])
102
+ self.bt_out_model.click(fn=self.out_model, inputs=[self.model_name, self.resume_model2])
103
+ self.bt_tb.click(fn=self.tensorboard)
104
+ self.bt_refersh.click(fn=self.refresh_model, inputs=[self.model_name], outputs=[self.resume_model])
105
+ self.bt_resume_train.click(fn=self.resume_train, inputs=[self.model_name, self.resume_model, self.epochs])
106
+ self.bt_infer.click(fn=self.inference, inputs=[self.input_wav, self.resume_voice, self.keychange], outputs=[self.output_wav])
107
+ self.bt_refersh2.click(fn=self.refresh_model_and_voice, inputs=[self.model_name],outputs=[self.resume_model2, self.resume_voice])
108
+
109
+ ui.launch(inbrowser=True)
110
+
111
+ def copydataset(self, dataset_name, dataset_src):
112
+ assert dataset_name != '', '데이터셋 이름을 입력하세요'
113
+ assert dataset_src != '', '데이터셋 경로를 입력하세요'
114
+ assert os.path.isdir(dataset_src), '데이터셋 경로가 잘못되었습니다'
115
+ from glob import glob
116
+ wav_files = glob(os.path.join(dataset_src, '*.wav'))
117
+ assert len(wav_files) > 0, '데이터셋 경로에 wav 파일이 없습니다'
118
+
119
+ import shutil
120
+ dst_dir = os.path.join('dataset_raw', dataset_name)
121
+ if not os.path.exists(dst_dir): os.makedirs(dst_dir, exist_ok=True)
122
+ for wav_file in wav_files:
123
+ shutil.copy(wav_file, dst_dir)
124
+ print('데이터셋 복사 완료')
125
+
126
+ def openfolder(self):
127
+ if not os.path.exists('dataset_raw'): os.makedirs('dataset_raw', exist_ok=True)
128
+ try:
129
+ if sys.platform.startswith('win'):
130
+ os.startfile('dataset_raw')
131
+ elif sys.platform.startswith('linux'):
132
+ subprocess.call(['xdg-open', 'dataset_raw'])
133
+ elif sys.platform.startswith('darwin'):
134
+ subprocess.call(['open', 'dataset_raw'])
135
+ else:
136
+ print('폴더를 열지 못했습니다!')
137
+ except BaseException:
138
+ print('폴더를 열지 못했습니다!')
139
+
140
+ def preprocessing(self, thread_count):
141
+ print('전처리 시작')
142
+ train_process = subprocess.Popen(f'{sys.executable} -u svc_preprocessing.py -t {str(thread_count)}', stdout=subprocess.PIPE)
143
+ while train_process.poll() is None:
144
+ output = train_process.stdout.readline().decode('utf-8')
145
+ print(output, end='')
146
+
147
+ def create_config(self, model_name, learning_rate, batch_size, epochs, info_interval, eval_interval, save_interval,
148
+ keep_ckpts, use_pretrained):
149
+ with open("configs/train.yaml", "r") as f:
150
+ config = yaml.load(f, Loader=yaml.FullLoader)
151
+ config['train']['model'] = model_name
152
+ config['train']['learning_rate'] = learning_rate
153
+ config['train']['batch_size'] = batch_size
154
+ config['train']['epochs'] = int(epochs)
155
+ config["log"]["info_interval"] = int(info_interval)
156
+ config["log"]["eval_interval"] = int(eval_interval)
157
+ config["log"]["save_interval"] = int(save_interval)
158
+ config["log"]["keep_ckpts"] = int(keep_ckpts)
159
+ if use_pretrained:
160
+ config["train"]["pretrain"] = "vits_pretrain/sovits5.0.pretrain.pth"
161
+ else:
162
+ config["train"]["pretrain"] = ""
163
+ with open("configs/train.yaml", "w") as f:
164
+ yaml.dump(config, f)
165
+ return f"로그 파라미터를 다음으로 업데이트했습니다.{config['log']}"
166
+
167
+ def training(self, model_name):
168
+ print('학습 시작')
169
+ print('학습을 수행하는 새로운 콘솔 창이 열립니다.')
170
+ print('학습 도중 학습을 중지하려면, 콘솔 창을 닫으세요.')
171
+ train_process = subprocess.Popen(f'{sys.executable} -u svc_trainer.py -c {self.train_config_path} -n {str(model_name)}', stdout=subprocess.PIPE, creationflags=subprocess.CREATE_NEW_CONSOLE)
172
+ while train_process.poll() is None:
173
+ output = train_process.stdout.readline().decode('utf-8')
174
+ print(output, end='')
175
+
176
+ def onekey_training(self, model_name, thread_count, learning_rate, batch_size, epochs, info_interval, eval_interval, save_interval, keep_ckpts, use_pretrained):
177
+ print(model_name, thread_count, learning_rate, batch_size, epochs, info_interval, eval_interval, save_interval, keep_ckpts)
178
+ self.create_config(model_name, learning_rate, batch_size, epochs, info_interval, eval_interval, save_interval, keep_ckpts, use_pretrained)
179
+ self.preprocessing(thread_count)
180
+ self.training(model_name)
181
+
182
+ def out_model(self, model_name, resume_model2):
183
+ print('모델 내보내기 시작')
184
+ try:
185
+ subprocess.Popen(f'{sys.executable} -u svc_export.py -c {self.train_config_path} -p "chkpt/{model_name}/{resume_model2}"',stdout=subprocess.PIPE)
186
+ print('모델 내보내기 성공')
187
+ except Exception as e:
188
+ print("에러 발생함:", e)
189
+
190
+
191
+ def tensorboard(self):
192
+ tensorboard_path = os.path.join(os.path.dirname(sys.executable), 'Scripts', 'tensorboard.exe')
193
+ print(tensorboard_path)
194
+ tb_process = subprocess.Popen(f'{tensorboard_path} --logdir=logs --port=6006', stdout=subprocess.PIPE, creationflags=subprocess.CREATE_NEW_CONSOLE)
195
+ webbrowser.open("http://localhost:6006")
196
+
197
+ while tb_process.poll() is None:
198
+ output = tb_process.stdout.readline().decode('utf-8')
199
+ print(output)
200
+
201
+ def refresh_model(self, model_name):
202
+ self.script_dir = os.path.dirname(os.path.abspath(__file__))
203
+ self.model_root = os.path.join(self.script_dir, f"chkpt/{model_name}")
204
+ self.names = []
205
+ try:
206
+ for self.name in os.listdir(self.model_root):
207
+ if self.name.endswith(".pt"):
208
+ self.names.append(self.name)
209
+ return {"choices": sorted(self.names), "__type__": "update"}
210
+ except FileNotFoundError:
211
+ return {"label": "모델 파일 누락", "__type__": "update"}
212
+
213
+ def refresh_model2(self, model_name):
214
+ self.script_dir = os.path.dirname(os.path.abspath(__file__))
215
+ self.model_root = os.path.join(self.script_dir, f"chkpt/{model_name}")
216
+ self.names2 = []
217
+ try:
218
+ for self.name in os.listdir(self.model_root):
219
+ if self.name.endswith(".pt"):
220
+ self.names2.append(self.name)
221
+ return {"choices": sorted(self.names2), "__type__": "update"}
222
+ except FileNotFoundError as e:
223
+ return {"label": "모델 파일 누락", "__type__": "update"}
224
+
225
+ def refresh_voice(self):
226
+ self.script_dir = os.path.dirname(os.path.abspath(__file__))
227
+ self.model_root = os.path.join(self.script_dir, "data_svc/singer")
228
+ self.voice_names = []
229
+ for self.name in os.listdir(self.model_root):
230
+ if self.name.endswith(".npy"):
231
+ self.voice_names.append(self.name)
232
+ return {"choices": sorted(self.voice_names), "__type__": "update"}
233
+
234
+ def refresh_model_and_voice(self, model_name):
235
+ model_update = self.refresh_model2(model_name)
236
+ voice_update = self.refresh_voice()
237
+ return model_update, voice_update
238
+
239
+ def resume_train(self, model_name, resume_model, epochs):
240
+ print('학습 재개')
241
+ with open("configs/train.yaml", "r") as f:
242
+ config = yaml.load(f, Loader=yaml.FullLoader)
243
+ config['train']['epochs'] = int(epochs)
244
+ with open("configs/train.yaml", "w") as f:
245
+ yaml.dump(config, f)
246
+ train_process = subprocess.Popen(f'{sys.executable} -u svc_trainer.py -c {self.train_config_path} -n {model_name} -p "chkpt/{model_name}/{resume_model}"', stdout=subprocess.PIPE, creationflags=subprocess.CREATE_NEW_CONSOLE)
247
+ while train_process.poll() is None:
248
+ output = train_process.stdout.readline().decode('utf-8')
249
+ print(output, end='')
250
+
251
+ def inference(self, input, resume_voice, keychange):
252
+ if os.path.isfile('test.wav'): os.remove('test.wav')
253
+ self.train_config_path = 'configs/train.yaml'
254
+ print('추론 시작')
255
+ shutil.copy(input, ".")
+ input_name = os.path.basename(input)
+ # convert non-wav uploads to wav before renaming, so the extension check actually runs
+ if not input_name.endswith(".wav"):
+ data, samplerate = soundfile.read(input_name)
+ soundfile.write("test.wav", data, samplerate)
+ os.remove(input_name)
+ else:
+ os.rename(input_name, "test.wav")
263
+ train_config_path = shlex.quote(self.train_config_path)
264
+ keychange = shlex.quote(str(keychange))
265
+ cmd = [f'{sys.executable}', "-u", "svc_inference.py", "--config", train_config_path, "--model", "sovits5.0.pth", "--spk",
266
+ f"data_svc/singer/{resume_voice}", "--wave", "test.wav", "--shift", keychange, '--clean']
267
+ train_process = subprocess.run(cmd, shell=False, capture_output=True, text=True)
268
+ print(train_process.stdout)
269
+ print(train_process.stderr)
270
+ print("추론 성공")
271
+ return "svc_out.wav"
272
+
273
+
274
+ class Info:
275
+ def __init__(self) -> None:
276
+ self.train = '''
277
+ ### 2023.7.11\n
278
+ @OOPPEENN(https://github.com/OOPPEENN)第一次编写\n
279
+ @thestmitsuk(https://github.com/thestmitsuki)二次补完\n
280
+ @OOPPEENN(https://github.com/OOPPEENN)is written for the first time\n
281
+ @thestmitsuki(https://github.com/thestmitsuki)Secondary completion
282
+
283
+ '''
284
+ self.inference = '''
285
+ ### 2023.7.11\n
286
+ @OOPPEENN(https://github.com/OOPPEENN)第一次编写\n
287
+ @thestmitsuk(https://github.com/thestmitsuki)二次补完\n
288
+ @OOPPEENN(https://github.com/OOPPEENN)is written for the first time\n
289
+ @thestmitsuki(https://github.com/thestmitsuki)Secondary completion
290
+
291
+ '''
292
+
293
+ def check_pretrained():
294
+ links = {
295
+ 'hubert_pretrain/hubert-soft-0d54a1f4.pt': 'https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt',
296
+ 'speaker_pretrain/best_model.pth.tar': 'https://drive.google.com/uc?id=1UPjQ2LVSIt3o-9QMKMJcdzT8aZRZCI-E',
297
+ 'speaker_pretrain/config.json': 'https://raw.githubusercontent.com/PlayVoice/so-vits-svc-5.0/9d415f9d7c7c7a131b89ec6ff633be10739f41ed/speaker_pretrain/config.json',
298
+ 'whisper_pretrain/large-v2.pt': 'https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt',
299
+ 'crepe/assets/full.pth': 'https://github.com/maxrmorrison/torchcrepe/raw/master/torchcrepe/assets/full.pth',
300
+ 'vits_pretrain/sovits5.0.pretrain.pth': 'https://github.com/PlayVoice/so-vits-svc-5.0/releases/download/5.0/sovits5.0.pretrain.pth',
301
+ }
302
+
303
+ links_to_download = {}
304
+ for path, link in links.items():
305
+ if not os.path.isfile(path):
306
+ links_to_download[path] = link
307
+
308
+ if len(links_to_download) == 0:
309
+ print("사전 학습 모델이 모두 존재합니다.")
310
+ return
311
+
312
+ import gdown
313
+ import requests
314
+
315
+ def download(url, path):
316
+ r = requests.get(url, allow_redirects=True)
317
+ open(path, 'wb').write(r.content)
318
+
319
+ for path, url in links_to_download.items():
320
+ if not os.path.exists(os.path.dirname(path)):
321
+ os.makedirs(os.path.dirname(path))
322
+ print(f"사전 학습 모델 {path} 다운로드 중...")
323
+ if "drive.google.com" in url:
324
+ gdown.download(url, path, quiet=False)
325
+ else:
326
+ download(url, path)
327
+ print(f"사전 학습 모델 {path} 다운로드 완료")
328
+
329
+ print("모든 사전 학습 모델이 다운로드 되었습니다.")
330
+ return
331
+
332
+ def check_transformers():
333
+ try:
334
+ import transformers
335
+ del transformers
336
+ except ImportError:
337
+ print("transformers 라이브러리를 설치합니다.")
338
+ os.system(f"{sys.executable} -m pip install transformers")
339
+ print("transformers 라이브러리 설치 완료")
340
+ return
341
+
342
+ def check_tensorboard():
343
+ try:
344
+ import tensorboard
345
+ del tensorboard
346
+ except ImportError:
347
+ print("tensorboard 라이브러리를 설치합니다.")
348
+ os.system(f"{sys.executable} -m pip install tensorboard")
349
+ print("tensorboard 라이브러리 설치 완료")
350
+ return
351
+
352
+ if __name__ == "__main__":
353
+ check_pretrained()
354
+ check_transformers()
355
+ check_tensorboard()
356
+ webui = WebUI()
automated_pipeline.sh ADDED
@@ -0,0 +1,97 @@
1
+ #!/bin/bash
2
+ #SBATCH --job-name=cfm_full_pipeline
3
+ #SBATCH --partition=a100
4
+ #SBATCH --gres=gpu:1
5
+ #SBATCH --cpus-per-task=8
6
+ #SBATCH --mem=64G
7
+ #SBATCH --time=120:00:00
8
+ #SBATCH --output=logs/pipeline_%j.out
9
+ #SBATCH --error=logs/pipeline_%j.err
10
+ #SBATCH --mail-type=ALL
11
+ #SBATCH --mail-user=hl3025@imperial.ac.uk
12
+
13
+ set -e # Exit on any error
14
+
15
+ # Navigate to project directory
16
+ cd /vol/bitbucket/hl3025/cfm_svc
17
+
18
+ # Activate environment
19
+ source .venv_linux/bin/activate
20
+
21
+ # Export environment variables
22
+ export PIP_CACHE_DIR=/vol/bitbucket/hl3025/pip_cache
23
+ export TMPDIR=/vol/bitbucket/hl3025/tmp
24
+
25
+ # Prevent BLAS/OpenMP from spawning too many threads
26
+ export OMP_NUM_THREADS=1
27
+ export OPENBLAS_NUM_THREADS=1
28
+ export MKL_NUM_THREADS=1
29
+ export VECLIB_MAXIMUM_THREADS=1
30
+ export NUMEXPR_NUM_THREADS=1
31
+
32
+ # Force Python output to be unbuffered so logs stream instantly
33
+ export PYTHONUNBUFFERED=1
34
+
35
+ # Create logs directory if it doesn't exist
36
+ mkdir -p logs
37
+
38
+ echo "======================================"
39
+ echo "Starting CFM SVC Automated Pipeline"
40
+ echo "======================================"
41
+ echo "Start time: $(date)"
42
+
43
+ # ============================================================================
44
+ # STAGE 1: Data Preprocessing
45
+ # ============================================================================
46
+ echo ""
47
+ echo "STAGE 1: Data Preprocessing with 8 threads..."
48
+ echo "Time: $(date)"
49
+ python svc_preprocessing.py -t 8
50
+
51
+ # ============================================================================
52
+ # STAGE 2: Codec Targets Generation
53
+ # ============================================================================
54
+ echo ""
55
+ echo "STAGE 2: Generating Codec Targets..."
56
+ echo "Time: $(date)"
57
+ python data/codec_targets.py -w ./data_svc/waves-32k -o ./data_svc/codec_targets
58
+
59
+ # ============================================================================
60
+ # STAGE 3: Teacher Model Distillation (Offline)
61
+ # ============================================================================
62
+ echo ""
63
+ echo "STAGE 3: Offline Teacher Distillation..."
64
+ echo "Time: $(date)"
65
+ python preprocess_teacher.py \
66
+ --teacher_ckpt vits_pretrain/sovits5.0.pretrain.pth \
67
+ --teacher_config configs/base.yaml \
68
+ --codec_target_dir ./data_svc/codec_targets \
69
+ --data_root ./data_svc \
70
+ --out_dir ./data_svc/teacher_codec_targets \
71
+ --log_interval 200
72
+
73
+ # ============================================================================
74
+ # STAGE 4: CFM Training
75
+ # ============================================================================
76
+ echo ""
77
+ echo "STAGE 4: CFM Training with Teacher Distillation..."
78
+ echo "Time: $(date)"
79
+ python train_cfm.py \
80
+ --data_dir ./data_svc/codec_targets \
81
+ --teacher_target_dir ./data_svc/teacher_codec_targets \
82
+ --lambda_teacher 0 \
83
+ --batch_size 16 \
84
+ --lr 1e-4 \
85
+ --num_workers 4 \
86
+ --epochs 200 \
87
+ --log_interval 50 \
88
+ --save_interval 10
89
+
90
+ # ============================================================================
91
+ # Pipeline Complete
92
+ # ============================================================================
93
+ echo ""
94
+ echo "======================================"
95
+ echo "CFM SVC Automated Pipeline Complete!"
96
+ echo "======================================"
97
+ echo "End time: $(date)"
build_faiss_index.py ADDED
@@ -0,0 +1,51 @@
1
+ import argparse
2
+ import glob
3
+ import os
4
+ import faiss
5
+ import numpy as np
6
+ from tqdm import tqdm
7
+
8
+ def build_index(speaker_dir, output_path):
9
+ print(f"Finding HuBERT features in {speaker_dir}...")
10
+ vec_files = glob.glob(os.path.join(speaker_dir, "*.vec.npy"))
11
+
12
+ if not vec_files:
13
+ print(f"No .vec.npy files found in {speaker_dir}!")
14
+ return
15
+
16
+ print(f"Found {len(vec_files)} files. Loading vectors...")
17
+
18
+ all_vectors = []
19
+ for f in tqdm(vec_files):
20
+ vec = np.load(f) # (T, 256)
21
+ all_vectors.append(vec)
22
+
23
+ all_vectors = np.concatenate(all_vectors, axis=0).astype(np.float32)
24
+ print(f"Total frames: {all_vectors.shape[0]}, Feature dimension: {all_vectors.shape[1]}")
25
+
26
+ # Initialize FAISS index
27
+ # We use IndexFlatL2 for exact nearest neighbor search based on L2 distance.
28
+ index = faiss.IndexFlatL2(all_vectors.shape[1])
29
+
30
+ print("Adding vectors to FAISS index...")
31
+ index.add(all_vectors)
32
+
33
+ os.makedirs(os.path.dirname(output_path), exist_ok=True)
34
+
35
+ print(f"Saving index to {output_path}...")
36
+ faiss.write_index(index, output_path)
37
+
38
+ # Save the original vectors as well so we can retrieve them and average them
39
+ vectors_path = output_path.replace(".index", "_vectors.npy")
40
+ print(f"Saving source vectors to {vectors_path}...")
41
+ np.save(vectors_path, all_vectors)
42
+
43
+ print("Done!")
44
+
45
+ if __name__ == "__main__":
46
+ parser = argparse.ArgumentParser()
47
+ parser.add_argument("--speaker_dir", type=str, required=True, help="Path to speaker's HuBERT directory (e.g. data_svc/hubert/singer_0005)")
48
+ parser.add_argument("--output_path", type=str, required=True, help="Where to save the .index file (e.g. data_svc/hubert/singer_0005/feature.index)")
49
+ args = parser.parse_args()
50
+
51
+ build_index(args.speaker_dir, args.output_path)
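At inference time the index is queried per frame and the retrieved neighbours are blended back into the content feature. A brute-force NumPy equivalent of what the `IndexFlatL2` lookup computes (the `k` and blend ratio are illustrative defaults, not values from this repo):

```python
import numpy as np

def retrieve_blend(bank, query, k=4, ratio=0.5):
    """Replace each query frame with the mean of its k nearest bank frames,
    blended with the original feature to trade accent fidelity for timbre.

    bank: (N, D) vectors saved next to the index; query: (T, D) HuBERT frames.
    """
    d2 = ((query[:, None, :] - bank[None, :, :]) ** 2).sum(axis=-1)  # (T, N)
    idx = np.argsort(d2, axis=1)[:, :k]
    retrieved = bank[idx].mean(axis=1)
    return ratio * retrieved + (1.0 - ratio) * query
```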
build_mmap.py ADDED
@@ -0,0 +1,85 @@
1
+ import os
2
+ import glob
3
+ import numpy as np
4
+ from tqdm import tqdm
5
+ import argparse
6
+
7
+ def build_mmap(data_dir, feature_name, output_prefix):
8
+ """
9
+ Combines all .npy files for a feature (e.g. hubert, ppg) into a single large
10
+ memory-mapped array alongside an index file for fast O(1) lookups.
11
+ """
12
+ print(f"Finding {feature_name} features in {data_dir}...")
13
+ files = glob.glob(os.path.join(data_dir, "**", "*.npy"), recursive=True)
14
+
15
+ if not files:
16
+ print(f"No {feature_name} files found!")
17
+ return
18
+
19
+ # We don't want to load them all into RAM at once, so we do two passes.
20
+ # First pass: find total frames and index mapping.
21
+ print("Pass 1: calculating total frames and indexing...")
22
+ total_frames = 0
23
+ dim = None
24
+ dtype = None
25
+
26
+ index_map = {} # { filename: (start_idx, length) }
27
+
28
+ valid_files = []
29
+
30
+ for f in tqdm(files):
31
+ # We can memory map just to get shape/dtype quickly
32
+ try:
33
+ arr = np.load(f, mmap_mode='r')
34
+ if dim is None:
35
+ dim = arr.shape[1] if len(arr.shape) > 1 else 1
36
+ dtype = arr.dtype
37
+
38
+ length = arr.shape[0]
39
+
40
+ # Use relative path as key
41
+ rel_path = os.path.relpath(f, start=data_dir)
42
+
43
+ index_map[rel_path] = (total_frames, length)
44
+ total_frames += length
45
+ valid_files.append((f, rel_path, length))
46
+
47
+ except Exception as e:
+ print(f"Skipping unreadable file {f}: {e}")
49
+
50
+ print(f"Total valid files: {len(valid_files)}")
51
+ print(f"Total frames: {total_frames}, Feature dim: {dim}")
52
+
53
+ # Second pass: allocate mmap and write
54
+ mmap_path = f"{output_prefix}.npy"
55
+ index_path = f"{output_prefix}_index.npy"
56
+
57
+ shape = (total_frames, dim) if dim > 1 else (total_frames,)
58
+ print(f"Allocating mmap at {mmap_path} with shape {shape}...")
59
+
60
+ mmap_arr = np.lib.format.open_memmap(mmap_path, mode='w+', dtype=dtype, shape=shape)
61
+
62
+ print("Pass 2: writing data to mmap...")
63
+ for f, rel_path, length in tqdm(valid_files):
64
+ start_idx, _ = index_map[rel_path]
65
+
66
+ arr = np.load(f)
67
+ if dim == 1:
68
+ mmap_arr[start_idx : start_idx + length] = arr
69
+ else:
70
+ mmap_arr[start_idx : start_idx + length, :] = arr
71
+
72
+ mmap_arr.flush()
73
+
74
+ print(f"Saving index map to {index_path}...")
75
+ np.save(index_path, index_map)
76
+ print("Done!")
77
+
78
+ if __name__ == "__main__":
79
+ parser = argparse.ArgumentParser()
80
+ parser.add_argument("--data_dir", type=str, required=True, help="Base dir (e.g. data_svc/hubert)")
81
+ parser.add_argument("--feature", type=str, required=True, help="Feature name for printing")
82
+ parser.add_argument("--out_prefix", type=str, required=True, help="Output path prefix (e.g. data_svc/hubert_mmap)")
83
+ args = parser.parse_args()
84
+
85
+ build_mmap(args.data_dir, args.feature, args.out_prefix)
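Once build_mmap.py has run, downstream code only needs the index map and a read-only memory map to pull out any file's span of frames. A minimal consumer-side sketch, using a throwaway two-file layout in a temp directory in place of real features (the file names and sizes here are illustrative, but the `{prefix}.npy` / `{prefix}_index.npy` layout matches what the script writes):

```python
# Consumer-side sketch (toy data): read one file's frames back out of the
# consolidated mmap pair without loading the whole array into RAM.
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()
mmap_path = os.path.join(tmp, "hubert_mmap.npy")
index_path = os.path.join(tmp, "hubert_mmap_index.npy")

# Stand-in for build_mmap's output: two files of 3 and 2 frames, dim 4.
a = np.arange(12, dtype=np.float32).reshape(3, 4)
b = np.arange(8, dtype=np.float32).reshape(2, 4)
mm = np.lib.format.open_memmap(mmap_path, mode="w+", dtype=np.float32, shape=(5, 4))
mm[0:3] = a
mm[3:5] = b
mm.flush()
np.save(index_path, {"a.npy": (0, 3), "b.npy": (3, 2)})

# Reader side: the index map is a pickled dict, so allow_pickle + .item().
index_map = np.load(index_path, allow_pickle=True).item()
feats = np.load(mmap_path, mmap_mode="r")  # a lazy view, not a bulk read
start, length = index_map["b.npy"]
chunk = feats[start : start + length]
print(chunk.shape)  # (2, 4)
```

Slicing the memory map only pages in the requested rows, which is the point of consolidating many small `.npy` files into one.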
compare_audio.py ADDED
@@ -0,0 +1,20 @@
+ import soundfile as sf
+ import librosa
+ import numpy as np
+
+ wav_gt, sr = librosa.load('test_train_gt.wav', sr=44100)
+ wav_pred, _ = librosa.load('test_overfit_pe.wav', sr=44100)
+
+ min_len = min(len(wav_gt), len(wav_pred))
+
+ # Calculate the spectral difference
+ S_gt = np.abs(librosa.stft(wav_gt[:min_len]))
+ S_pred = np.abs(librosa.stft(wav_pred[:min_len]))
+
+ diff = np.mean(np.abs(S_gt - S_pred))
+ print("Spectral Mean Absolute Error:", diff)
+
+ # Mix the two signals to hear whether they are identical but delayed
+ mix = (wav_gt[:min_len] + wav_pred[:min_len]) / 2
+ sf.write('test_mix.wav', mix, 44100)
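If the mix sounds comb-filtered, the two renders are likely identical up to a constant offset. A cross-correlation lag estimate can confirm this before any sample-wise comparison; a sketch on synthetic noise (not the project's wav files), where the recovered lag equals the injected 100-sample delay:

```python
# Lag estimation sketch (synthetic signals): find the constant delay of one
# waveform relative to another via the peak of their cross-correlation.
import numpy as np

rng = np.random.default_rng(0)
ref = rng.standard_normal(4096).astype(np.float32)

delay = 100  # inject a known 100-sample delay
delayed = np.concatenate([np.zeros(delay, dtype=np.float32), ref])[: len(ref)]

# Peak index of the full cross-correlation, re-centered, gives the lag
# such that delayed[n] ~= ref[n - lag].
corr = np.correlate(delayed, ref, mode="full")
lag = int(np.argmax(corr)) - (len(ref) - 1)
print(lag)  # 100

# Align before computing a sample-wise or spectral difference.
aligned = delayed[lag:] if lag > 0 else delayed
```

On real audio the same recipe applies after truncating both signals to `min_len`; a nonzero lag would explain a large spectral error between otherwise matching renders.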
configs/base.yaml ADDED
@@ -0,0 +1,71 @@
+ train:
+   model: "sovits"
+   seed: 1234
+   epochs: 10
+   learning_rate: 5e-5
+   betas: [0.8, 0.99]
+   lr_decay: 0.999875
+   eps: 1e-9
+   batch_size: 2
+   c_stft: 9
+   c_mel: 1.
+   c_kl: 0.2
+   port: 8001
+   pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
+ #############################
+ data:
+   training_files: "files/train.txt"
+   validation_files: "files/valid.txt"
+   segment_size: 8000  # WARNING: must be a multiple of hop_length
+   max_wav_value: 32768.0
+   sampling_rate: 32000
+   filter_length: 1024
+   hop_length: 320
+   win_length: 1024
+   mel_channels: 100
+   mel_fmin: 50.0
+   mel_fmax: 16000.0
+ #############################
+ vits:
+   ppg_dim: 1280
+   vec_dim: 256
+   spk_dim: 256
+   gin_channels: 256
+   inter_channels: 192
+   hidden_channels: 192
+   filter_channels: 640
+ #############################
+ gen:
+   upsample_input: 192
+   upsample_rates: [5,4,4,2,2]
+   upsample_kernel_sizes: [15,8,8,4,4]
+   upsample_initial_channel: 320
+   resblock_kernel_sizes: [3,7,11]
+   resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
+ #############################
+ mpd:
+   periods: [2,3,5,7,11]
+   kernel_size: 5
+   stride: 3
+   use_spectral_norm: False
+   lReLU_slope: 0.2
+ #############################
+ mrd:
+   resolutions: "[(1024, 120, 600), (2048, 240, 1200), (4096, 480, 2400), (512, 50, 240)]"  # (filter_length, hop_length, win_length)
+   use_spectral_norm: False
+   lReLU_slope: 0.2
+ #############################
+ log:
+   info_interval: 100
+   eval_interval: 10
+   save_interval: 10
+   num_audio: 6
+   pth_dir: 'chkpt'
+   log_dir: 'logs'
+   keep_ckpts: 0
+ #############################
+ dist_config:
+   dist_backend: "nccl"
+   dist_url: "tcp://localhost:54321"
+   world_size: 1
+
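The WARNING on `segment_size` in the data block can be checked mechanically: with this hop length, a training segment must span a whole number of STFT frames. A quick sanity check using values copied from this file (a plain dict stands in for a YAML parser):

```python
# Sanity check (values copied from configs/base.yaml): segment_size must be
# a multiple of hop_length so a segment maps to an integer frame count.
data = {"segment_size": 8000, "hop_length": 320, "sampling_rate": 32000}

assert data["segment_size"] % data["hop_length"] == 0
frames_per_segment = data["segment_size"] // data["hop_length"]
print(frames_per_segment)  # 25
```

Changing either `segment_size` or `hop_length` independently would break this invariant, which is what the inline WARNING guards against.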
configs/singers/singer0001.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0002.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0003.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0004.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0005.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0006.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0007.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0008.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0009.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0010.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0011.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0012.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0013.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0014.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0015.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0016.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0017.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0018.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0019.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0020.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0021.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0022.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0023.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0024.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0025.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0026.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0027.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0028.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0029.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0030.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0031.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0032.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0033.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0034.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0035.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0036.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0037.npy ADDED
Binary file (1.15 kB)
configs/singers/singer0038.npy ADDED
Binary file (1.15 kB)