aaaaaaaaaff alefiury committed on
Commit
0eb9930
·
0 Parent(s):

Duplicate from alefiury/free-svc

Co-authored-by: Alef Iury Siqueira Ferreira <alefiury@users.noreply.huggingface.co>

Files changed (8)
  1. .gitattributes +35 -0
  2. G_00014_0225000.pth +3 -0
  3. README.md +94 -0
  4. common.yaml +22 -0
  5. config.yaml +93 -0
  6. hyperparams.yaml +33 -0
  7. rmvpe.pt +3 -0
  8. spin.ckpt +3 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
G_00014_0225000.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0a132e8cb7656b69064fc705c65daf79cda76be1e435de5a6cb6126802f84b1e
+ size 861739692
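
The three added lines are a Git LFS pointer, not the checkpoint itself: the `.pth` rule in `.gitattributes` routes the real file through LFS. As a rough illustration of what the pointer encodes, the sketch below parses it with a hypothetical helper, `parse_lfs_pointer`, which is not part of this repository:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is a key, one space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:0a132e8cb7656b69064fc705c65daf79cda76be1e435de5a6cb6126802f84b1e
size 861739692
"""

info = parse_lfs_pointer(POINTER)
size_mb = info["size"] / 1e6  # roughly 862 MB for the generator checkpoint
```

The same layout applies to the `rmvpe.pt` and `spin.ckpt` pointers later in this commit.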
README.md ADDED
@@ -0,0 +1,94 @@
+ ---
+ license: cc-by-nc-sa-4.0
+ language:
+ - en
+ - pt
+ - es
+ - zh
+ - nl
+ - fr
+ - de
+ - it
+ - ja
+ - pl
+ pipeline_tag: audio-to-audio
+ tags:
+ - audio
+ - voice
+ - voice conversion
+ - singing voice conversion
+ - vc
+ - svc
+ - multilingual
+ ---
+
+ # FreeSVC: Zero-shot Multilingual Singing Voice Conversion
+
+ **FreeSVC** is a promising multilingual zero-shot singing voice conversion model. It enables the conversion of singing voices across languages without the need for extensive language-specific training. [GitHub repository](https://github.com/freds0/free-svc). [Paper arXiv pre-print](https://arxiv.org/abs/2501.05586).
+
+ ## Supported Languages
+
+ | Language | ID | Status | Speech Data | Singing Data |
+ |------------|-----|--------------|-------------|--------------|
+ | Chinese | 0 | ✅ Full | 255h | 70h |
+ | Dutch | 1 | ✅ Full | Part of CML | - |
+ | English | 2 | ✅ Full | 921h | 47h |
+ | French | 3 | ✅ Full | Part of CML | - |
+ | German | 4 | ✅ Full | Part of CML | - |
+ | Italian | 5 | ✅ Full | Part of CML | - |
+ | Japanese | 6 | ✅ Full | 30h | - |
+ | Other* | 7 | ⚠️ Partial | - | 10h |
+ | Polish | 8 | ✅ Full | Part of CML | - |
+ | Portuguese | 9 | ✅ Full | Part of CML | - |
+ | Spanish | 10 | ✅ Full | Part of CML | - |
+
+ *Note: The "Other" category is used for vocal techniques without content.
+
+ ## Model Overview
+ FreeSVC leverages an enhanced VITS architecture integrated with Speaker-invariant Clustering (SPIN) and the ECAPA2 speaker encoder. This combination effectively separates speaker characteristics from linguistic content, ensuring high-quality and natural-sounding voice conversions across multiple languages.
+
+ ## Training Datasets
+
+ FreeSVC was trained on a diverse set of speech and singing datasets covering multiple languages:
+
+ | **Dataset** | **Hours** | **Language** | **Type** |
+ |----------------------|------------|--------------|--------------|
+ | AISHELL-1 | 170h | Chinese | Speech |
+ | AISHELL-3 | 85h | Chinese | Speech |
+ | CML-TTS | 3.1k | 7 Languages | Speech |
+ | HiFiTTS | 292h | English | Speech |
+ | JVS | 30h | Japanese | Speech |
+ | LibriTTS-R | 585h | English | Speech |
+ | NUS (NHSS) | 7h | English | Speech, Singing |
+ | OpenSinger | 50h | Chinese | Singing |
+ | Opencpop | 5h | Chinese | Singing |
+ | PopBuTFy | 10h, 40h | Chinese, English | Singing |
+ | POPCS | 5h | Chinese | Singing |
+ | VCTK | 44h | English | Speech |
+ | VocalSet | 10h | Other | Singing |
+
+ ## License
+
+ FreeSVC is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** license. This means:
+
+ - The model **can only be used for research and non-commercial purposes**. Any commercial use is strictly prohibited.
+ - Any derivative works must be **shared under the same license**.
+ - Proper attribution must be given when using the model.
+
+ Users must also **comply with the licenses of the original datasets** used for training. Some datasets may have additional restrictions beyond CC BY-NC-SA 4.0. Ensure you review and adhere to their terms before using the model.
+
+ For full details, refer to the [CC BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/).
+
+ ## Citation
+ ```
+ @INPROCEEDINGS{10890068,
+ author={Ferreira, Alef Iury and Gris, Lucas Rafael and Da Rosa, Augusto and Oliveira, Frederico and Casanova, Edresson and Sousa, Rafael and Junior, Arnaldo and Soares, Anderson and Filho, Arlindo Galvão},
+ booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+ title={FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion},
+ year={2025},
+ volume={},
+ number={},
+ pages={1-5},
+ keywords={Training;Source coding;Zero shot learning;Refining;Signal processing;Data models;Acoustics;Multilingual;Data mining;Speech synthesis;Singing Voice Conversion;Synthesis of Singing Voices;Cross-lingual and multilingual aspects in speech synthesis},
+ doi={10.1109/ICASSP49660.2025.10890068}}
+ ```
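
The per-language hours in the README's Supported Languages table appear to be sums over its Training Datasets table. The sketch below checks that reading. The rows are transcribed by hand from the README, CML-TTS is left out because only its aggregate figure is given, and counting NUS (NHSS) entirely as English singing is an assumption chosen because it reproduces the 47h total:

```python
from collections import defaultdict

# (dataset, hours, language, type), transcribed from the README table.
DATASETS = [
    ("AISHELL-1", 170, "Chinese", "Speech"),
    ("AISHELL-3", 85, "Chinese", "Speech"),
    ("HiFiTTS", 292, "English", "Speech"),
    ("JVS", 30, "Japanese", "Speech"),
    ("LibriTTS-R", 585, "English", "Speech"),
    ("NUS (NHSS)", 7, "English", "Singing"),  # assumption: counted as singing
    ("OpenSinger", 50, "Chinese", "Singing"),
    ("Opencpop", 5, "Chinese", "Singing"),
    ("PopBuTFy", 10, "Chinese", "Singing"),
    ("PopBuTFy", 40, "English", "Singing"),
    ("POPCS", 5, "Chinese", "Singing"),
    ("VCTK", 44, "English", "Speech"),
    ("VocalSet", 10, "Other", "Singing"),
]

# Sum hours per (language, type) pair.
totals = defaultdict(int)
for _name, hours, language, kind in DATASETS:
    totals[(language, kind)] += hours

for (language, kind), hours in sorted(totals.items()):
    print(f"{language:<9} {kind:<8} {hours}h")
```

Under these assumptions the sums match the language table: 255h and 70h for Chinese, 921h and 47h for English, 30h of Japanese speech, and 10h of "Other" singing.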
common.yaml ADDED
@@ -0,0 +1,22 @@
+ path: ./logs/${hydra.job.config_name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
+
+ log_level: INFO
+ seed: 1
+ tb_log_dir: tensorboard
+ tqdm: true
+
+ hydra:
+   run:
+     dir: ${path}
+   job_logging:
+     formatters:
+       colorlog:
+         format: '[%(cyan)s%(asctime)s%(reset)s][%(blue)s%(name)s:%(lineno)s:%(funcName)s()%(reset)s][%(log_color)s%(levelname)s%(reset)s]
+           - %(message)s'
+     handlers:
+       file:
+         filename: ${hydra.run.dir}/${hydra.job.name}_${now:%Y-%m-%d}_${now:%H-%M-%S}.log
+
+ defaults:
+   - override hydra/job_logging: colorlog
+   - override hydra/hydra_logging: colorlog
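
The `path` entry in `common.yaml` relies on Hydra's `now` resolver, which formats the current time with a `strftime` pattern. The sketch below mimics that expansion in plain Python to show the directory layout it produces. The function `run_dir` and the config name `hyperparams` are illustrative assumptions, and Hydra itself is not invoked:

```python
from datetime import datetime


def run_dir(config_name: str, now: datetime) -> str:
    """Approximate ./logs/${hydra.job.config_name}/${now:%Y-%m-%d}/${now:%H-%M-%S}."""
    # A datetime format spec inside an f-string is passed to strftime.
    return f"./logs/{config_name}/{now:%Y-%m-%d}/{now:%H-%M-%S}"


example = run_dir("hyperparams", datetime(2025, 1, 9, 14, 30, 5))
print(example)
```

Each run therefore gets its own timestamped directory, and `hydra.run.dir` points at it through `${path}`.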
config.yaml ADDED
@@ -0,0 +1,93 @@
+ defaults:
+   - common
+
+ train:
+   batch_size: 128
+   betas: [0.8, 0.99]
+   c_kl: 1.0
+   c_mel: 45
+   distributed: false # BUG: multi-gpu is not working
+   use_multiprocessing: false # BUG: multi-gpu is not working
+   epochs: 20
+   eps: 1e-9
+   fp16_run: false
+   init_lr_ratio: 1
+   raise_error: false
+   learning_rate: 2e-4
+   log_interval: 10
+   log_level: ${log_level}
+   lr_decay: 0.98
+   max_speclen: 128
+   port: 8005
+   resume_training: false # set to false to finetune from a model
+   seed: 1234
+   segment_size: 8960
+   use_sr: false
+   valid_epoch_interval: 1
+   valid_steps_interval: 1000
+   save_epoch_interval: 10
+   save_steps_interval: 1000
+   warmup_epochs: 0
+   # weighted_batch_speaker_sampling : false
+   # weighted_batch_lang_sampling : false
+   weighted_batch_speaker_sampling : 0.5
+   weighted_batch_lang_sampling : 0.5
+
+ data:
+   dataset_dir: /raid/lucasgris/free-svc/data
+   filter_length: 1280
+   hop_length: 320
+   max_wav_value: 32768.0
+   mel_fmax: null
+   mel_fmin: 0.0
+   n_mel_channels: 80
+   num_workers: 64
+   # For pitch extraction, set the pitch_predictor (will compute in dataloader) or pitch_features_dir (will load from disk)
+   pitch_predictor: rmvpe # pm | crepe | harvest | dio | rmvpe | fcpe
+   pitch_features_dir: ${data.dataset_dir}/pitch_features/
+   sampling_rate: 24000
+   spectrogram_dir: null #${data.dataset_dir}/spectrograms # it is recommended NOT to use if you have small disk space
+   # For speaker embedding extraction, set the use_spk_emb to True and spk_embeddings_dir (will load from disk) or configure the model to compute it on forward
+   use_spk_emb: true
+   spk_embeddings_dir: ${data.dataset_dir}/spk_embeddings
+   # SR augmentation is deprecated, set use_sr to False
+   sr_min_max: [68, 92]
+   # For content feature extraction, set the content_feature_dir (will load from disk) or configure the model to compute it on forward
+   content_feature_dir: null
+   training_files: data/train.csv
+   validation_files: data/valid.csv
+   win_length: 1280
+
+ model:
+   save_dir: null
+   filter_channels: 768
+   finetune_from_model:
+     discriminator: /raid/lucasgris/free-svc/D-freevc-24.pth
+     generator: /raid/lucasgris/free-svc/freevc-24.pth
+   hidden_channels: 192
+   inter_channels: 192
+   kernel_size: 3
+   n_heads: 2
+   n_layers_q: 3
+   n_layers: 6
+   p_dropout: 0.1
+   resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
+   resblock_kernel_sizes: [3,7,11]
+   resblock: 1
+   c_dim: 768
+   upsample_initial_channel: 512
+   upsample_kernel_sizes: [16,16,4,4]
+   upsample_rates: [10,8,2,2]
+   use_spectral_norm: false
+   freeze_external_spk: true
+   device: cuda
+   # For online speaker embedding extraction, set the use_spk_emb to True and spk_encoder_type
+   use_spk_emb: false
+   gin_channels: null # gin_channels = spk_encoder.embedding_dim
+   spk_encoder_type: null # ECAPA2SpeakerEncoder16k |
+   # For content feature extraction, set the content_encoder_type and content_encoder_ckpt
+   content_encoder_type: null # load from disk (data) - hubert | wavlm
+   content_encoder_ckpt: null # load from disk (data) - [path] | models/wavlm/WavLM-Large.pt | lengyue233/content-vec-best
+   post_content_encoder_type: vits-encoder-with-uv-emb # or freevc-bottleneck
+   coarse_f0: true
+   cond_f0_on_flow: false
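
Several values in `config.yaml` are tied together arithmetically. In VITS-style decoders the product of `upsample_rates` is normally expected to equal `hop_length`, so that each spectrogram frame expands to exactly one hop of audio samples. The sketch below checks that relation and derives a few related quantities. The values are copied from the config above, the variable names for derived quantities are illustrative, and the training code itself was not inspected:

```python
import math

# Values copied from config.yaml.
sampling_rate = 24000
hop_length = 320
segment_size = 8960
upsample_rates = [10, 8, 2, 2]

# Decoder upsampling factor: one frame becomes this many samples.
total_upsampling = math.prod(upsample_rates)

# Frame rate of the spectrogram and content features, in frames per second.
frames_per_second = sampling_rate / hop_length

# Length of one training segment, in frames and in seconds.
segment_frames = segment_size // hop_length
segment_seconds = segment_size / sampling_rate

if total_upsampling != hop_length:
    raise ValueError("upsample_rates product does not match hop_length")
if segment_size % hop_length != 0:
    raise ValueError("segment_size is not a whole number of frames")
```

With the committed values the upsampling factor is 320, the frame rate is 75 frames per second, and each training segment covers 28 frames, or about 0.37 seconds of audio.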
hyperparams.yaml ADDED
@@ -0,0 +1,33 @@
+ defaults:
+   - common
+   - config
+
+ data:
+   use_lang_emb: true
+   num_langs: 11
+   lang_dim: 192 # same size as hidden_channels to facilitate the addition
+   lang2id:
+     chinese: 0
+     dutch: 1
+     english: 2
+     french: 3
+     german: 4
+     italian: 5
+     japanese: 6
+     other: 7
+     polish: 8
+     portuguese: 9
+     spanish: 10
+   use_spk_emb: false
+   spk_embeddings_dir: null # compute on forward (model)
+   spk_encoder_type: null # compute on forward (model) | ECAPA2SpeakerEncoder16k
+   content_encoder_type: null # compute on forward (model) | hubert
+   content_encoder_ckpt: null # compute on forward (model) | lengyue233/content-vec-best
+
+ model:
+   use_spk_emb: true
+   spk_encoder_type: ECAPA2SpeakerEncoder16k
+   spk_encoder_ckpt: null # Not used for ECAPA2SpeakerEncoder16k
+   content_encoder_type: spin # hubert | wavlm | spin
+   content_encoder_config: models/spin/spin.yaml # path to the config file for the content encoder
+   content_encoder_ckpt: models/spin/spin.ckpt # or models/wavlm/WavLM-Large.pt
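
The `lang2id` mapping has to stay consistent with `num_langs`, because the IDs presumably index a language embedding table with `num_langs` rows. A minimal sanity check is sketched below. The helper `validate_lang2id` is hypothetical, and the embedding-table interpretation is an assumption based on the `use_lang_emb` and `lang_dim` keys:

```python
# Values copied from hyperparams.yaml.
NUM_LANGS = 11
LANG2ID = {
    "chinese": 0, "dutch": 1, "english": 2, "french": 3,
    "german": 4, "italian": 5, "japanese": 6, "other": 7,
    "polish": 8, "portuguese": 9, "spanish": 10,
}


def validate_lang2id(lang2id: dict, num_langs: int) -> None:
    """Raise if the IDs are not exactly 0..num_langs-1, each used once."""
    ids = sorted(lang2id.values())
    if ids != list(range(num_langs)):
        raise ValueError(f"expected IDs 0..{num_langs - 1}, got {ids}")


validate_lang2id(LANG2ID, NUM_LANGS)

# Inverse lookup, e.g. for turning a predicted or logged ID back into a name.
ID2LANG = {idx: name for name, idx in LANG2ID.items()}
```

The IDs follow the alphabetical order of the language names, which matches the ID column of the README's Supported Languages table.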
rmvpe.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1d49bd662038808878c9d7420e0f583f506fe69086cc384f0da88f0b3a4e1115
+ size 368492925
spin.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:08b2f5082bc4b4748640a67316feaf4bc577d333d1af7f85cabf5b8fe816f6ee
+ size 500185599