5Hyeons committed on
Commit ddc047c · 1 Parent(s): 371e46d

Add StyleTTS2+Vocos with AIHUB dataset model
README.md CHANGED
@@ -1,4 +1,12 @@
- # Vocos LibriTTS Model

This model was trained using the train-clean-100 and train-clean-360 subsets of the LibriTTS dataset.

@@ -10,15 +18,43 @@ This model was trained using the train-clean-100 and train-clean-360 subsets of

The training and inference code can be found at: [StyleTTS2-Vocos](https://github.com/5Hyeons/StyleTTS2-Vocos)

- ### Folder Structure
```
StyleTTS2
├── README.md
└── Vocos
- └── LibriTTS
└── [checkpoint files]
```

## License

- This model is released under the MIT License. This is one of the most permissive open-source licenses, allowing for both commercial and non-commercial use, modification, and distribution.
+ # StyleTTS2 + Vocos with LibriTTS Dataset
+
+ ```
+ StyleTTS2
+ ├── README.md
+ └── Vocos
+ └── LibriTTS
+ └── [checkpoint files]
+ ```
+

This model was trained using the train-clean-100 and train-clean-360 subsets of the LibriTTS dataset.

The training and inference code can be found at: [StyleTTS2-Vocos](https://github.com/5Hyeons/StyleTTS2-Vocos)

+ ## License
+
+ This model is released under the MIT License. This is one of the most permissive open-source licenses, allowing for both commercial and non-commercial use, modification, and distribution.
+
+ ---
+
+ # StyleTTS2 + Vocos with AIHUB Dataset
+
```
StyleTTS2
├── README.md
└── Vocos
+ └── AIHUB
└── [checkpoint files]
```

+ This model was trained using multiple datasets from AIHUB:
+
+ 1. **Korean Data** (~1000 hours)
+    - Source: [감성 및 발화스타일 동시 고려 음성합성 데이터](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71349) (speech-synthesis data jointly considering emotion and speaking style)
+
+ 2. **English & Japanese Data** (~1000 hours)
+    - Source: [다국어 통·번역 낭독체 데이터](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71524) (multilingual interpretation/translation read-speech data)
+
+ 3. **Chinese Data** (~500 hours)
+    - Source: [한-영 및 한-중 음성발화 데이터](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=&topMenu=&aihubDataSe=data&dataSetSn=71261) (Korean-English and Korean-Chinese speech utterance data)
+
+ Total samples: ~1.4M
+
+ ## Model Information
+
+ - **Model Architecture**: StyleTTS2 + Vocos
+ - **Training Data**: AIHUB Multilingual Dataset
+ - **License**: CC BY-NC 4.0
+
+ The training and inference code can be found at: [StyleTTS2-Vocos](https://github.com/5Hyeons/StyleTTS2-Vocos)
+
## License

+ This model is released under the CC BY-NC 4.0 License. This license allows for non-commercial use, modification, and distribution, as long as appropriate credit is given.
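The folder trees in the README can be turned into paths programmatically. A minimal sketch (the helper name is ours, and the layout follows the README trees; note that the files added in this commit actually sit under `Vocos/AIHUB_ML/`):

```python
from pathlib import Path

def vocos_checkpoint_dir(repo_root: str, dataset: str) -> Path:
    """Return the checkpoint directory for a dataset variant
    ("LibriTTS" or "AIHUB"), following the README folder tree."""
    return Path(repo_root) / "Vocos" / dataset

print(vocos_checkpoint_dir("StyleTTS2", "AIHUB").as_posix())
```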
Vocos/AIHUB_ML/Data/OOD_texts_en_jp_ko_zh.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c5887482e060d92e1bcb291b17a54e027df41e8269e53df3a47eaa6c0384f60d
+ size 41221638
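The added data files are stored as Git LFS pointers rather than raw content: three `key value` lines giving the spec version, the SHA-256 object id, and the byte size. A minimal sketch of reading one (function name is ours):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")  # split on first space only
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is the object's byte count
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:c5887482e060d92e1bcb291b17a54e027df41e8269e53df3a47eaa6c0384f60d
size 41221638"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # 41221638
```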
Vocos/AIHUB_ML/Data/train_multi_lingual_en_jp_ko_zh_filelist.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0c4245cc79786a292baa4dbd8b2302da554cd1e07c7027967b25c5931ff6b6b8
+ size 324929081
Vocos/AIHUB_ML/Data/valid_multi_lingual_en_jp_ko_zh_500_filelist.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dd5c9878539bdf9fc4c12606a389c54ad940e86eeb29691461ae77e818b6f868
+ size 113157
Vocos/AIHUB_ML/config_aihub_multi_lingual_en_jp_ko_zh_vocos.yml ADDED
@@ -0,0 +1,113 @@
+ log_dir: "Models/aihub_multi_lingual_en_jp_ko_zh_vocos"
+ first_stage_path: "aihub_multi_lingual_en_jp_ko_zh_vocos_first_stage.pth"
+ save_freq: 1
+ log_interval: 10
+ device: "cuda"
+ epochs_1st: 10 # number of epochs for first stage training (pre-training)
+ epochs_2nd: 8 # number of epochs for second stage training (joint training)
+ batch_size: 8
+ max_len: 300 # maximum number of frames
+ pretrained_model: "Models/aihub_multi_lingual_en_jp_ko_zh_vocos/epoch_1st_00007_batch_08_step_107999.pth"
+ second_stage_load_pretrained: false # set to true if the pre-trained model is for the 2nd stage
+ load_only_params: true # set to true if you do not want to load epoch numbers and optimizer parameters
+
+ F0_path: "Utils/JDC/bst.t7"
+ ASR_config: "Utils/ASR/config_aihub_multi_lingual_en_jp_ko_zh.yml"
+ ASR_path: "Utils/ASR/aihub_multi_lingual_en_jp_ko_zh_epoch_00011.pth"
+ PLBERT_dir: 'Utils/PLBERT/'
+
+ data_params:
+   # train_data: "Data/multi_lingual_train_filelist.txt.cleaned"
+   train_data: "Data/train_multi_lingual_en_jp_ko_zh_filelist.txt.cleaned.valid_word.after_asr"
+   val_data: "Data/valid_multi_lingual_en_jp_ko_zh_500_filelist.txt.cleaned.removed"
+   root_path: "wavs/"
+   OOD_data: "Data/OOD_texts_en_jp_ko_zh.txt"
+   min_length: 50 # sample until texts of this size are obtained for OOD texts
+
+ preprocess_params:
+   sr: 24000
+   spect_params:
+     n_fft: 2048
+     win_length: 1200
+     hop_length: 300
+
+ model_params:
+   multispeaker: true
+
+   dim_in: 64
+   hidden_dim: 512
+   max_conv_dim: 512
+   n_layer: 3
+   n_mels: 80
+
+   n_token: 498 # number of phoneme tokens
+   max_dur: 50 # maximum duration of a single phoneme
+   style_dim: 128 # style vector size
+
+   dropout: 0.2
+
+   # config for decoder
+   decoder:
+     type: 'vocos' # either hifigan or istftnet or vocos
+     intermediate_dim: 1536
+     num_layers: 8
+     gen_istft_n_fft: 1200
+     gen_istft_hop_size: 300
+
+   # speech language model config
+   slm:
+     model: 'microsoft/wavlm-base-plus'
+     sr: 16000 # sampling rate of SLM
+     hidden: 768 # hidden size of SLM
+     nlayers: 13 # number of layers of SLM
+     initial_channel: 64 # initial channels of SLM discriminator head
+
+   # style diffusion model config
+   diffusion:
+     embedding_mask_proba: 0.1
+     # transformer config
+     transformer:
+       num_layers: 3
+       num_heads: 8
+       head_features: 64
+       multiplier: 2
+
+     # diffusion distribution config
+     dist:
+       sigma_data: 0.2 # placeholder used when estimate_sigma_data is set to false
+       estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
+       mean: -3.0
+       std: 1.0
+
+ loss_params:
+   lambda_mel: 5. # mel reconstruction loss
+   lambda_gen: 1. # generator loss
+   lambda_slm: 1. # slm feature matching loss
+
+   lambda_mono: 1. # monotonic alignment loss (1st stage, TMA)
+   lambda_s2s: 1. # sequence-to-sequence loss (1st stage, TMA)
+   TMA_epoch: 0 # TMA starting epoch (1st stage)
+
+   lambda_F0: 1. # F0 reconstruction loss (2nd stage)
+   lambda_norm: 1. # norm reconstruction loss (2nd stage)
+   lambda_dur: 1. # duration loss (2nd stage)
+   lambda_ce: 20. # duration predictor probability output CE loss (2nd stage)
+   lambda_sty: 1. # style reconstruction loss (2nd stage)
+   lambda_diff: 1. # score matching loss (2nd stage)
+
+   diff_epoch: 1 # style diffusion starting epoch (2nd stage)
+   joint_epoch: 2 # joint training starting epoch (2nd stage)
+
+ optimizer_params:
+   lr: 0.0001 # general learning rate
+   bert_lr: 0.00001 # learning rate for PLBERT
+   ft_lr: 0.00001 # learning rate for acoustic modules
+
+ slmadv_params:
+   min_len: 400 # minimum length of samples
+   max_len: 500 # maximum length of samples
+   batch_percentage: 0.5 # to prevent out-of-memory, only use half of the original batch size
+   iter: 20 # update the discriminator every this many generator updates
+   thresh: 5 # gradient norm above which the gradient is scaled
+   scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
+   sig: 1.5 # sigma for differentiable duration modeling
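Given the `preprocess_params` above (`sr: 24000`, `hop_length: 300`), one spectrogram frame spans 12.5 ms, so `max_len: 300` frames caps training clips at 3.75 s. A quick arithmetic check (variable names are ours):

```python
SR = 24000      # preprocess_params.sr
HOP = 300       # preprocess_params.spect_params.hop_length
MAX_LEN = 300   # max_len, in frames

frame_ms = HOP / SR * 1000       # duration covered by one hop, in milliseconds
max_clip_s = MAX_LEN * HOP / SR  # longest training clip, in seconds

print(frame_ms, max_clip_s)  # 12.5 3.75
```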
Vocos/AIHUB_ML/epoch_1st_00014.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0ae9fd3788d9ed6e92228a298b4921d6cb4abf23a6a6644e70e4fb60d4b78654
+ size 2169017196
Vocos/AIHUB_ML/epoch_2nd_00006.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:62bffd7fbe35142f23c9628b952e1e88918edc0ec3cd067454be4c70d3d1560a
+ size 2484562307