Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS
- This project targets deep learning beginners; basic knowledge of Python and PyTorch is the only prerequisite.
- This project aims to help deep learning beginners move beyond pure theoretical study and master the basics of deep learning through hands-on practice.
- This project does not support real-time voice conversion (replace whisper if real-time conversion is what you are looking for).
- This project will not develop one-click packages for other purposes.
- Low minimum VRAM requirement: 6 GB is enough for training
- Support for multiple speakers
- Create unique speakers through speaker mixing
- Even voices with light accompaniment can be converted
- F0 can be edited using Excel
https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/6a09805e-ab93-47fe-9a14-9cbc1e0e7c3a
Powered by @ShadowVap
Model properties
| Feature | From | Status | Function |
|---|---|---|---|
| whisper | OpenAI | β | strong noise immunity |
| bigvgan | NVIDIA | β | alias and snake |
| natural speech | Microsoft | β | reduce mispronunciation |
| neural source-filter | NII | β | solve the problem of audio F0 discontinuity |
| speaker encoder | | β | timbre encoding and clustering |
| GRL for speaker | Ubisoft | β | prevent the encoder from leaking timbre |
| SNAC | Samsung | β | one-shot clone of VITS |
| SCLN | Microsoft | β | improve cloning |
| PPG perturbation | this project | β | improve noise immunity and remove timbre |
| HuBERT perturbation | this project | β | improve noise immunity and remove timbre |
| VAE perturbation | this project | β | improve sound quality |
| MIX encoder | this project | β | improve conversion stability |
| USP infer | this project | β | improve conversion stability |
Due to the use of data perturbation, this project takes longer to train than comparable projects.

USP: Unvoice and Silence with Pitch during inference.
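The USP idea is that unvoiced and silent frames still carry a pitch value at inference time. As a rough illustration only (the project's actual pitch handling may differ), one common way to achieve this is to interpolate F0 through the frames where the extractor reports 0:

```python
def fill_unvoiced_pitch(f0):
    """Linearly interpolate F0 through unvoiced/silent frames (marked 0),
    so every frame carries a pitch value. Illustrative sketch only; not
    the project's actual USP implementation."""
    voiced = [i for i, v in enumerate(f0) if v > 0]
    if not voiced:
        return list(f0)
    out = list(f0)
    # Extend the first/last voiced values to the edges.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    # Interpolate between consecutive voiced frames.
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            out[i] = f0[a] + t * (f0[b] - f0[a])
    return out

# e.g. fill_unvoiced_pitch([0.0, 220.0, 0.0, 0.0, 440.0, 0.0])
```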
Quick Installation
```shell
# clone project
git clone https://github.com/ouor/so-vits-svc-5.0

# create virtual environment
python -m venv .venv

# activate virtual environment (Windows)
.venv\Scripts\activate

# install pytorch (CUDA 11.7 build)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

# install dependencies
pip install -r requirements.txt

# run app.py
python app.py
```
Setup Environment
- Install PyTorch.
- Install project dependencies:

```shell
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
```

Note: whisper is already built in; do not install it again, or it will cause conflicts and errors.
- Download the timbre encoder: Speaker-Encoder by @mueller91, and put `best_model.pth.tar` into `speaker_pretrain/`.
- Download the whisper model whisper-large-v2. Make sure to download `large-v2.pt` and put it into `whisper_pretrain/`.
- Download the hubert_soft model, and put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`.
- Download the pitch extractor crepe full, and put `full.pth` into `crepe/assets/`.
- Download the pretrained model sovits5.0.pretrain.pth, put it into `vits_pretrain/`, and test it:

```shell
python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
```
Dataset preparation
Necessary pre-processing:
- Separate voice and accompaniment with UVR (skip if there is no accompaniment).
- Cut the audio into shorter clips with a slicer; whisper accepts input shorter than 30 seconds.
- Manually check the generated clips; remove any shorter than 2 seconds or with obvious noise.
- Adjust loudness if necessary (Adobe Audition is recommended).
- Put the dataset into the `dataset_raw` directory following the structure below.
```
dataset_raw
ββββspeaker0
β   ββββ000001.wav
β   ββββ...
β   ββββ000xxx.wav
ββββspeaker1
    ββββ000001.wav
    ββββ...
    ββββ000xxx.wav
```
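The pre-processing checklist above (clips between 2 and 30 seconds, one folder per speaker) can be sanity-checked with a short script. This is a minimal sketch, not part of the official pipeline; it assumes PCM WAV files laid out as `dataset_raw/<speaker>/<clip>.wav` and uses only the standard library:

```python
import wave
from pathlib import Path

def check_dataset(root="dataset_raw", min_sec=2.0, max_sec=30.0):
    """Return (path, seconds) for clips outside the recommended 2-30 s range.
    Assumes PCM WAV files; other encodings need a different reader."""
    problems = []
    for wav_path in sorted(Path(root).glob("*/*.wav")):
        with wave.open(str(wav_path), "rb") as w:
            seconds = w.getnframes() / w.getframerate()
        if not (min_sec <= seconds <= max_sec):
            problems.append((str(wav_path), round(seconds, 2)))
    return problems
```

Run it before preprocessing; an empty list means every clip is within range.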
Data preprocessing
```shell
python svc_preprocessing.py -t 2
```

`-t`: number of threads; should not exceed the CPU core count. 2 is usually enough.

After preprocessing you will get output with the following structure.
```
data_svc/
βββ waves-16k
β   βββ speaker0
β   β   βββ 000001.wav
β   β   βββ 000xxx.wav
β   βββ speaker1
β       βββ 000001.wav
β       βββ 000xxx.wav
βββ waves-32k
β   βββ speaker0
β   β   βββ 000001.wav
β   β   βββ 000xxx.wav
β   βββ speaker1
β       βββ 000001.wav
β       βββ 000xxx.wav
βββ pitch
β   βββ speaker0
β   β   βββ 000001.pit.npy
β   β   βββ 000xxx.pit.npy
β   βββ speaker1
β       βββ 000001.pit.npy
β       βββ 000xxx.pit.npy
βββ hubert
β   βββ speaker0
β   β   βββ 000001.vec.npy
β   β   βββ 000xxx.vec.npy
β   βββ speaker1
β       βββ 000001.vec.npy
β       βββ 000xxx.vec.npy
βββ whisper
β   βββ speaker0
β   β   βββ 000001.ppg.npy
β   β   βββ 000xxx.ppg.npy
β   βββ speaker1
β       βββ 000001.ppg.npy
β       βββ 000xxx.ppg.npy
βββ speaker
β   βββ speaker0
β   β   βββ 000001.spk.npy
β   β   βββ 000xxx.spk.npy
β   βββ speaker1
β       βββ 000001.spk.npy
β       βββ 000xxx.spk.npy
βββ singer
    βββ speaker0.spk.npy
    βββ speaker1.spk.npy
```
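Each training wav should end up with a matching feature file in every directory of the tree above. A quick consistency check (a convenience sketch, not part of the official pipeline; directory names and suffixes are taken from the tree above):

```python
from pathlib import Path

# Feature directory -> file suffix, per the data_svc/ tree above.
FEATURES = {
    "pitch": ".pit.npy",
    "hubert": ".vec.npy",
    "whisper": ".ppg.npy",
    "speaker": ".spk.npy",
}

def missing_features(root="data_svc"):
    """List expected feature files that are absent for any 16 kHz
    training wav. An empty list means the feature dirs are consistent."""
    root = Path(root)
    missing = []
    for wav in sorted(root.glob("waves-16k/*/*.wav")):
        speaker, stem = wav.parent.name, wav.name[:-len(".wav")]
        for feat_dir, suffix in FEATURES.items():
            expected = root / feat_dir / speaker / (stem + suffix)
            if not expected.exists():
                missing.append(str(expected))
    return missing
```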
Re-sampling
- Generate audio with a sampling rate of 16000 Hz in `./data_svc/waves-16k`:

```shell
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
```

- Generate audio with a sampling rate of 32000 Hz in `./data_svc/waves-32k`:

```shell
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
```
- Use 16k audio to extract pitch:

```shell
python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
```

- Use 16k audio to extract ppg:

```shell
python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
```

- Use 16k audio to extract hubert:

```shell
python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert
```

- Use 16k audio to extract the timbre code:

```shell
python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
```

- Extract the average timbre code for inference; it can also replace the timbre of individual audio files when generating the training index, serving as the speaker's unified timbre for training:

```shell
python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
```

- Use 32k audio to extract the linear spectrum:

```shell
python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
```

- Use 32k audio to generate the training index:

```shell
python prepare/preprocess_train.py
```

- Training file debugging:

```shell
python prepare/preprocess_zzz.py
```
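The manual steps above are what a driver such as `svc_preprocessing.py -t 2` chains together for you. A minimal sketch of such a driver (the real script's internals may differ; the `dry_run` flag is this sketch's own addition, and the command list is copied from the steps above):

```python
import subprocess

# Preprocessing commands from the steps above, in order.
STEPS = [
    "python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000",
    "python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000",
    "python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch",
    "python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper",
    "python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert",
    "python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker",
    "python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer",
    "python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs",
    "python prepare/preprocess_train.py",
]

def run_pipeline(dry_run=False):
    """Run each preprocessing step in order, stopping on the first failure.
    With dry_run=True, just return the commands that would be executed."""
    if dry_run:
        return STEPS
    for cmd in STEPS:
        subprocess.run(cmd.split(), check=True)
    return STEPS
```

The steps run strictly in order because later ones consume earlier outputs; only the per-file work inside each script is threaded by `-t`.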
Train
- If fine-tuning from the pre-trained model, download sovits5.0.pretrain.pth, put it under the project root, and change this line in `configs/base.yaml`:

```
pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
```

  Also adjust the learning rate appropriately, e.g. 5e-5.
- `batch_size`: for a GPU with 6 GB VRAM, 6 is the recommended value; 8 will work, but each step will be much slower.
- Start training:

```shell
python svc_trainer.py -c configs/base.yaml -n sovits5.0
```

- Resume training:

```shell
python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/***.pth
```

- Log visualization:

```shell
tensorboard --logdir logs/
```
Inference
Export the inference model (text encoder, flow network, decoder network):

```shell
python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
```

- If there is no need to adjust `f0`, just run the following command:

```shell
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0
```

- If `f0` will be adjusted manually, follow these steps:
  1. Use whisper to extract the content encoding, generating `test.ppg.npy`:

```shell
python whisper/inference.py -w test.wav -p test.ppg.npy
```

  2. Use hubert to extract the content vector separately (instead of one-click inference), to reduce GPU memory usage:

```shell
python hubert/inference.py -w test.wav -v test.vec.npy
```

  3. Extract the F0 parameter to CSV text format, open the CSV file in Excel, and manually correct wrong F0 values according to Audition or SonicVisualiser:

```shell
python pitch/inference.py -w test.wav -p test.csv
```

  4. Final inference:

```shell
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --ppg test.ppg.npy --vec test.vec.npy --pit test.csv --shift 0
```
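Besides hand-editing the exported F0 CSV in Excel, you can batch-edit it in Python. This sketch assumes one F0 value in the first column of each row, with 0 meaning unvoiced (check the actual column layout of your exported `test.csv` first), and applies a uniform semitone shift via the factor 2^(n/12):

```python
import csv

def shift_f0_csv(src, dst, semitones):
    """Scale every non-zero F0 value by 2**(semitones/12); zeros stay
    unvoiced. Assumes one F0 value in the first column of each row."""
    factor = 2.0 ** (semitones / 12.0)
    with open(src, newline="") as f_in, open(dst, "w", newline="") as f_out:
        reader, writer = csv.reader(f_in), csv.writer(f_out)
        for row in reader:
            f0 = float(row[0])
            row[0] = f"{f0 * factor:.3f}" if f0 > 0 else row[0]
            writer.writerow(row)
```

For a uniform shift the `--shift` flag already does this; a script like the above is only useful for selective or non-uniform corrections.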
Notes
- When `--ppg` is specified, repeated inference on the same audio avoids re-extracting the content encoding; if it is not specified, it is extracted automatically.
- When `--vec` is specified, repeated inference on the same audio avoids re-extracting the content vector; if it is not specified, it is extracted automatically.
- When `--pit` is specified, the manually tuned F0 parameter is loaded; if it is not specified, it is extracted automatically.
- The output file is generated in the current directory as `svc_out.wav`.
Arguments ref
| args | --config | --model | --spk | --wave | --ppg | --vec | --pit | --shift |
|---|---|---|---|---|---|---|---|---|
| name | config path | model path | speaker | wave input | wave ppg | wave hubert | wave pitch | pitch shift |
Create singer

Named by pure coincidence: average -> ave -> eva; Eve (eva) represents conception and reproduction.
```shell
python svc_eva.py
```

```python
eva_conf = {
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}
```

The generated singer file will be `eva.spk.npy`.
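The weights in `eva_conf` plausibly blend the singer embeddings as a weighted sum; the actual math inside `svc_eva.py` may differ. A sketch of that arithmetic, using plain lists in place of the `.spk.npy` numpy arrays:

```python
def mix_speakers(embeddings, weights):
    """Weighted mix of speaker embedding vectors, illustrating how a
    config like eva_conf above could blend singers. Weights are used
    as given, not re-normalized; keep them summing to 1 for an average."""
    dim = len(embeddings[0])
    mixed = [0.0] * dim
    for vec, w in zip(embeddings, weights):
        for i, v in enumerate(vec):
            mixed[i] += w * v
    return mixed

# Two toy 2-dim "speakers" mixed half-and-half:
# mix_speakers([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```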
Data set
Code sources and references
https://github.com/facebookresearch/speech-resynthesis paper
https://github.com/jaywalnut310/vits paper
https://github.com/openai/whisper/ paper
https://github.com/NVIDIA/BigVGAN paper
https://github.com/mindslab-ai/univnet paper
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
https://github.com/brentspell/hifi-gan-bwe
https://github.com/mozilla/TTS
https://github.com/bshall/soft-vc
https://github.com/maxrmorrison/torchcrepe
https://github.com/OlaWod/FreeVC paper
Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
AdaSpeech: Adaptive Text to Speech for Custom Voice
Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis
Speaker normalization (GRL) for self-supervised speech emotion recognition
Method of Preventing Timbre Leakage Based on Data Perturbation
https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py
https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py
https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py
Contributors
Relevant Projects
- LoRA-SVC: decoder only svc
- NSF-BigVGAN: vocoder for more work