baibaibai committed
Commit
2072d0c
1 Parent(s): f6881d0

Upload 39 files

.gitattributes CHANGED
@@ -32,3 +32,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ samples/source.wav filter=lfs diff=lfs merge=lfs -text
+ samples/svc-kiritan+12key.wav filter=lfs diff=lfs merge=lfs -text
+ samples/svc-opencpop_kiritan_mix+12key.wav filter=lfs diff=lfs merge=lfs -text
+ samples/svc-opencpop+12key.wav filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 yxlllc

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md CHANGED
@@ -1,12 +1,150 @@
- ---
- title: DDSP
- emoji: 👁
- colorFrom: pink
- colorTo: purple
- sdk: gradio
- sdk_version: 3.23.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Language: **English** [简体中文](./cn_README.md)
# DDSP-SVC
<div align="center">
<img src="https://storage.googleapis.com/ddsp/github_images/ddsp_logo.png" width="200px" alt="logo"></img>
</div>
End-to-end singing voice conversion system based on DDSP (Differentiable Digital Signal Processing).

## 0. Introduction
DDSP-SVC is a new open-source singing voice conversion project dedicated to developing free AI voice-changer software that can run on ordinary personal computers.

Compared with the better-known [Diff-SVC](https://github.com/prophesier/diff-svc) and [SO-VITS-SVC](https://github.com/svc-develop-team/so-vits-svc), its training and synthesis have much lower hardware requirements, and the training time can be shorter by orders of magnitude.

Although the raw synthesis quality of DDSP is not ideal (the raw output can be heard in TensorBoard during training), after applying the pretrained vocoder-based enhancer, the sound quality for some datasets can reach a level close to SO-VITS-SVC.

If the quality of the training data is very high, Diff-SVC will probably still have the best sound quality. The demo outputs are in the `samples` folder, and the related model checkpoints can be downloaded from the release page.

Disclaimer: Please make sure to only train DDSP-SVC models with **legally obtained authorized data**, and do not use these models or any audio they synthesize for illegal purposes. The author of this repository is not responsible for any infringement, fraud or other illegal acts caused by the use of these model checkpoints and audio.

Update log: not translated here; please see the Chinese version of the README.

## 1. Installing the dependencies
We recommend first installing PyTorch from the [**official website**](https://pytorch.org/), then run:
```bash
pip install -r requirements.txt
```
NOTE: The code has only been tested with Python 3.8 (Windows) + PyTorch 1.9.1 + torchaudio 0.6.0; dependencies that are much newer or older may not work.
## 2. Configuring the pretrained model
UPDATE: The ContentVec encoder is now supported. You can download the pretrained [ContentVec](https://ibm.ent.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr) encoder instead of the HubertSoft encoder and modify the configuration file to use it (a sketch follows this list).
- **(Required)** Download the pretrained [**HubertSoft**](https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt) encoder and put it under the `pretrain/hubert` folder.
- Get the pretrained vocoder-based enhancer from the [DiffSinger Community Vocoders Project](https://openvpi.github.io/vocoders) and unzip it into the `pretrain/` folder.
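For reference, here is a minimal sketch of what the encoder-related fields in `configs/combsub.yaml` might look like after switching to ContentVec. The checkpoint path and the output dimension are assumptions and must match the file you actually downloaded:
```yaml
data:
  encoder: 'contentvec'                 # instead of 'hubertsoft'
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 256             # assumption: set this to the dimension of your ContentVec variant
  encoder_ckpt: pretrain/contentvec/checkpoint_best_legacy_500.pt   # hypothetical path/filename
```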
## 3. Preprocessing

Put all the training data (.wav format audio clips) in the following directory:
`data/train/audio`.
Put all the validation data (.wav format audio clips) in the following directory:
`data/val/audio`.
You can also run
```bash
python draw.py
```
to help you select validation data (you can adjust the parameters in `draw.py` to modify the number of extracted files and other parameters).

Then run
```bash
python preprocess.py -c configs/combsub.yaml
```
for a model using the comb-tooth subtractive synthesiser (**recommended**), or run
```bash
python preprocess.py -c configs/sins.yaml
```
for a model using the sinusoidal additive synthesiser.

You can modify the configuration file `configs/<model_name>.yaml` before preprocessing. The default configuration is suitable for training a 44.1 kHz high-sampling-rate synthesiser on a GTX 1660 graphics card.

NOTE 1: Please keep the sampling rate of all audio clips consistent with the sampling rate in the yaml configuration file! If they are not consistent, the program can still run, but the resampling during training will be very slow.

NOTE 2: About 1000 audio clips are recommended for the training dataset; especially long clips can be cut into short segments, which speeds up training, but every clip should be at least 2 seconds long. If there are too many audio clips, you need a lot of RAM, or you can set the 'cache_all_data' option to false in the configuration file.

NOTE 3: About 10 audio clips are recommended for the validation dataset; please don't put in too many, or validation will be very slow.

NOTE 4: If your dataset is not of very high quality, set 'f0_extractor' to 'crepe' in the config file. The crepe algorithm has the best noise immunity, but at the cost of greatly increasing the time required for data preprocessing.

UPDATE: Multi-speaker training is now supported. The 'n_spk' parameter in the configuration file controls whether it is a multi-speaker model. If you want to train a **multi-speaker** model, the audio folders need to be named with **positive integers not greater than 'n_spk'** to represent speaker ids; the directory structure looks like this:
```bash
# training dataset
# the 1st speaker
data/train/audio/1/aaa.wav
data/train/audio/1/bbb.wav
...
# the 2nd speaker
data/train/audio/2/ccc.wav
data/train/audio/2/ddd.wav
...

# validation dataset
# the 1st speaker
data/val/audio/1/eee.wav
data/val/audio/1/fff.wav
...
# the 2nd speaker
data/val/audio/2/ggg.wav
data/val/audio/2/hhh.wav
...
```
If 'n_spk' = 1, the directory structure of the **single-speaker** model is still supported, i.e.:
```bash
# training dataset
data/train/audio/aaa.wav
data/train/audio/bbb.wav
...
# validation dataset
data/val/audio/ccc.wav
data/val/audio/ddd.wav
...
```

## 4. Training
```bash
# train a combsub model as an example
python train.py -c configs/combsub.yaml
```
The command line for training other models is similar.

You can safely interrupt training; running the same command line again will resume it.

You can also finetune the model: interrupt training first, re-preprocess the new dataset or change the training parameters (batch size, lr, etc.), and then run the same command line.

## 5. Visualization
```bash
# check the training status using tensorboard
tensorboard --logdir=exp
```
Test audio samples will be visible in TensorBoard after the first validation.

NOTE: The test audio samples in TensorBoard are the raw outputs of your DDSP-SVC model, not enhanced by an enhancer. If you want to test the synthesis quality after using the enhancer (which may be higher), please use the method described in the following chapter.
## 6. Testing
(**Recommended**) Enhance the output using the pretrained vocoder-based enhancer:
```bash
# high audio quality in the normal vocal range if enhancer_adaptive_key = 0 (default)
# set enhancer_adaptive_key > 0 to adapt the enhancer to a higher vocal range
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -eak <enhancer_adaptive_key (semitones)>
```
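As a concrete (hypothetical) illustration, a conversion like the `samples/svc-opencpop+12key.wav` demo added in this commit could be produced with a 12-semitone key change; the model path below is only an example and should point to your own trained checkpoint:
```bash
python main.py -i samples/source.wav -m exp/combsub-test/model_best.pt -o svc-out.wav -k 12 -id 1 -eak 0
```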
Raw output of DDSP:
```bash
# fast, but relatively low audio quality (like what you hear in tensorboard)
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -e false
```
For other options about the f0 extractor and the response threshold, see:
```bash
python main.py -h
```
(UPDATE) Speaker mixing is now supported. You can use the "-mix" option to design your own vocal timbre; below is an example:
```bash
# Mix the timbres of the 1st and 2nd speakers in a 0.5 : 0.5 ratio
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -mix "{1:0.5, 2:0.5}" -eak 0
```
## 7. HTTP server and VST support
Start the server with the following command:
```bash
# configs are in this python file, see the comments (Chinese only)
python flask_api.py
```
Currently supported VST client:
https://github.com/zhaohui8969/VST_NetProcess-

## 8. Acknowledgement
* [ddsp](https://github.com/magenta/ddsp)
* [pc-ddsp](https://github.com/yxlllc/pc-ddsp)
* [soft-vc](https://github.com/bshall/soft-vc)
* [DiffSinger (OpenVPI version)](https://github.com/openvpi/DiffSinger)
cn_README.md ADDED
@@ -0,0 +1,158 @@
Language: [English](./README.md) **简体中文**
# DDSP-SVC
<div align="center">
<img src="https://storage.googleapis.com/ddsp/github_images/ddsp_logo.png" width="200px" alt="logo"></img>
</div>
End-to-end singing voice conversion system based on DDSP (Differentiable Digital Signal Processing).

## 0. Introduction
DDSP-SVC is a new open-source singing voice conversion project dedicated to developing free AI voice-changer software that can run on ordinary personal computers.

Compared with the better-known [Diff-SVC](https://github.com/prophesier/diff-svc) and [SO-VITS-SVC](https://github.com/svc-develop-team/so-vits-svc), its training and synthesis have much lower hardware requirements, and the training time is shorter by orders of magnitude.

Although the raw synthesis quality of DDSP is not ideal (the raw output can be heard in TensorBoard during training), after enhancing the audio with the pretrained vocoder-based enhancer, the synthesis quality for some datasets can come close to SO-VITS-SVC.

If the quality of the training data is very high, Diff-SVC will probably still have the best synthesis quality. The `samples` folder contains synthesis examples, and the related model checkpoints can be downloaded from the repository's release page.

Disclaimer: Please make sure to only train DDSP-SVC models with **legally obtained authorized data**, and do not use these models or any audio they synthesize for illegal purposes. The author of this repository is not responsible for any infringement, fraud or other illegal acts caused by the use of these model checkpoints and audio.

1.1 update: multi-speaker training and timbre mixing are supported.

2.0 update: real-time VST plugin support has been started, and the combsub model has been optimized with a large speed-up in training. The old combsub model is still compatible and can be trained with combsub-old.yaml; the sins model is unaffected, but since its training is much slower than combsub, it is no longer recommended in the current version.

## 1. Installing the dependencies
We recommend downloading PyTorch from the [**official PyTorch website**](https://pytorch.org/) first.

Then run
```bash
pip install -r requirements.txt
```
NOTE: The code has only been tested with Python 3.8 (Windows) + PyTorch 1.9.1 + torchaudio 0.6.0; dependencies that are too old or too new may raise errors.
## 2. Configuring the pretrained model
UPDATE: The ContentVec encoder is now supported. You can download the pretrained [ContentVec](https://ibm.ent.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr) encoder instead of the HubertSoft encoder and modify the configuration file to use it.
- **(Required)** Download the pretrained [**HubertSoft**](https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt) encoder and put it in the `pretrain/hubert` folder.
- Download the pretrained vocoder-based enhancer from the [DiffSinger Community Vocoders Project](https://openvpi.github.io/vocoders) and unzip it into the `pretrain/` folder.
## 3. Preprocessing

Put all the training data (.wav format audio clips) in `data/train/audio`.

Put all the validation data (.wav format audio clips) in `data/val/audio`.

You can also run
```bash
python draw.py
```
to help you pick validation data (you can adjust the parameters in `draw.py`, such as the number of extracted files).

Then run
```bash
python preprocess.py -c configs/combsub.yaml
```

to train a model based on the comb-tooth subtractive synthesiser (**recommended**), or run

```bash
python preprocess.py -c configs/sins.yaml
```
to train a model based on the sinusoidal additive synthesiser.

You can modify the configuration file `configs/<model_name>.yaml` before preprocessing.

The default configuration is suitable for training a 44.1 kHz high-sampling-rate synthesiser on a GTX 1660 graphics card.

NOTE 1: Please keep the sampling rate of all audio clips consistent with the sampling rate in the yaml configuration file! If they are inconsistent, the program can still run, but resampling during training will be very slow.

NOTE 2: About 1000 audio clips are recommended for the training dataset; cutting long audio into short segments can speed up training, but no clip should be shorter than 2 seconds. If there are too many clips, a lot of RAM is needed; setting the `cache_all_data` option in the configuration file to false can solve this.

NOTE 3: About 10 audio clips are recommended for the validation dataset; do not put in too many, or validation will be very slow.

NOTE 4: If your dataset is not of very high quality, set 'f0_extractor' to 'crepe' in the configuration file. The crepe algorithm has the best noise immunity, but at the cost of greatly increasing the time required for data preprocessing.

UPDATE: Multi-speaker training is now supported. The 'n_spk' parameter in the configuration file controls whether a multi-speaker model is trained. If you want to train a **multi-speaker** model, the audio folders must be named with **positive integers not greater than 'n_spk'** to number the speakers; the directory structure looks like this:
```bash
# training dataset
# the 1st speaker
data/train/audio/1/aaa.wav
data/train/audio/1/bbb.wav
...
# the 2nd speaker
data/train/audio/2/ccc.wav
data/train/audio/2/ddd.wav
...

# validation dataset
# the 1st speaker
data/val/audio/1/eee.wav
data/val/audio/1/fff.wav
...
# the 2nd speaker
data/val/audio/2/ggg.wav
data/val/audio/2/hhh.wav
...
```
When 'n_spk' = 1, the previous **single-speaker** directory structure is still supported, i.e.:

```bash
# training dataset
data/train/audio/aaa.wav
data/train/audio/bbb.wav
...
# validation dataset
data/val/audio/ccc.wav
data/val/audio/ddd.wav
...
```
## 4. Training
```bash
# train a combsub model as an example
python train.py -c configs/combsub.yaml
```
Training other models is similar.

You can interrupt training at any time and then run the same command to resume it.

You can also finetune the model: after interrupting training, re-preprocess a new dataset or change the training parameters (batchsize, lr, etc.) and then run the same command.
## 5. Visualization
```bash
# check the training status using tensorboard
tensorboard --logdir=exp
```
After the first validation, the synthesized test audio can be heard in TensorBoard.

NOTE: The test audio in TensorBoard is the raw output of the DDSP-SVC model, not enhanced by the enhancer. If you want to test the synthesis quality with the enhancer (which may be higher), please use the method described in the next chapter.
## 6. Testing
(**Recommended**) Enhance the DDSP output with the pretrained vocoder-based enhancer:
```bash
# with the default enhancer_adaptive_key = 0, audio quality is higher within the normal vocal range
# set enhancer_adaptive_key > 0 to adapt the enhancer to a higher vocal range
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -e true -eak <enhancer_adaptive_key (semitones)>
```
Raw output of DDSP:
```bash
# fast, but relatively low audio quality (like what you hear in tensorboard)
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -e false -id <speaker_id>
```
For other options about the f0 extractor and the response threshold, see:

```bash
python main.py -h
```
UPDATE: Speaker mixing (timbre blending) is now supported. You can use the "-mix" option to design your own timbre; below is an example:
```bash
# mix the timbres of speaker 1 and speaker 2 in a 0.5 : 0.5 ratio
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -mix "{1:0.5, 2:0.5}" -e true -eak 0
```
## 7. HTTP server and VST support
Start the server with the following command:
```bash
# the configuration is inside this python file, see the comments
python flask_api.py
```
Currently supported VST front-end:
https://github.com/zhaohui8969/VST_NetProcess-

## 8. Acknowledgements
* [ddsp](https://github.com/magenta/ddsp)
* [pc-ddsp](https://github.com/yxlllc/pc-ddsp)
* [soft-vc](https://github.com/bshall/soft-vc)
* [DiffSinger (OpenVPI version)](https://github.com/openvpi/DiffSinger)
configs/combsub-old.yaml ADDED
@@ -0,0 +1,42 @@
data:
  f0_extractor: 'parselmouth' # 'parselmouth', 'dio', 'harvest', or 'crepe'
  f0_min: 65 # about C2
  f0_max: 800 # about G5
  sampling_rate: 44100
  block_size: 512 # Equal to hop_length
  duration: 2 # Audio duration during training, must be less than the duration of the shortest audio clip
  encoder: 'hubertsoft'
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 256
  encoder_ckpt: pretrain/hubert/hubert-soft-0d54a1f4.pt
  train_path: data/train # Create a folder named "audio" under this path and put the audio clips in it
  valid_path: data/val # Create a folder named "audio" under this path and put the audio clips in it
model:
  type: 'CombSub'
  n_mag_allpass: 256
  n_mag_harmonic: 512
  n_mag_noise: 256
  n_spk: 1 # max number of different speakers
enhancer:
  type: 'nsf-hifigan'
  ckpt: 'pretrain/nsf_hifigan/model'
loss:
  fft_min: 256
  fft_max: 2048
  n_scale: 4 # rss kernel numbers
device: cuda
env:
  expdir: exp/combsub-test
  gpu_id: 0
train:
  num_workers: 2 # If your cpu and gpu are both very strong, setting this to 0 may be faster!
  batch_size: 24
  cache_all_data: true # Set to false to save RAM or VRAM, but it may be slow
  cache_device: 'cpu' # Set to 'cuda' to cache the data in VRAM, fastest speed for a strong gpu
  cache_fp16: true
  epochs: 100000
  interval_log: 10
  interval_val: 2000
  lr: 0.0005
  weight_decay: 0
configs/combsub.yaml ADDED
@@ -0,0 +1,39 @@
data:
  f0_extractor: 'parselmouth' # 'parselmouth', 'dio', 'harvest', or 'crepe'
  f0_min: 65 # about C2
  f0_max: 800 # about G5
  sampling_rate: 44100
  block_size: 512 # Equal to hop_length
  duration: 2 # Audio duration during training, must be less than the duration of the shortest audio clip
  encoder: 'hubertsoft' # 'hubertsoft', 'hubertbase' or 'contentvec'
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 256
  encoder_ckpt: pretrain/hubert/hubert-soft-0d54a1f4.pt
  train_path: data/train # Create a folder named "audio" under this path and put the audio clips in it
  valid_path: data/val # Create a folder named "audio" under this path and put the audio clips in it
model:
  type: 'CombSubFast'
  n_spk: 1 # max number of different speakers
enhancer:
  type: 'nsf-hifigan'
  ckpt: 'pretrain/nsf_hifigan/model'
loss:
  fft_min: 256
  fft_max: 2048
  n_scale: 4 # rss kernel numbers
device: cuda
env:
  expdir: exp/combsub-test
  gpu_id: 0
train:
  num_workers: 2 # If your cpu and gpu are both very strong, setting this to 0 may be faster!
  batch_size: 24
  cache_all_data: true # Set to false to save RAM or VRAM, but it may be slow
  cache_device: 'cpu' # Set to 'cuda' to cache the data in VRAM, fastest speed for a strong gpu
  cache_fp16: true
  epochs: 100000
  interval_log: 10
  interval_val: 2000
  lr: 0.0005
  weight_decay: 0
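These YAML files are consumed by the training and preprocessing scripts as a nested, attribute-style object (e.g. `args.data.sampling_rate`, `args.train.batch_size`). The repository has its own loader; the snippet below is only a minimal sketch of the idea using PyYAML:

```python
import yaml
from types import SimpleNamespace

def to_namespace(node):
    # recursively turn nested dicts loaded from the YAML file into attribute-style objects
    if isinstance(node, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in node.items()})
    return node

with open('configs/combsub.yaml') as f:
    args = to_namespace(yaml.safe_load(f))

print(args.data.sampling_rate)  # 44100
print(args.train.batch_size)    # 24
```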
configs/sins.yaml ADDED
@@ -0,0 +1,42 @@
data:
  f0_extractor: 'parselmouth' # 'parselmouth', 'dio', 'harvest', or 'crepe'
  f0_min: 65 # about C2
  f0_max: 800 # about G5
  sampling_rate: 44100
  block_size: 512 # Equal to hop_length
  duration: 2 # Audio duration during training, must be less than the duration of the shortest audio clip
  encoder: 'hubertsoft' # 'hubertsoft', 'hubertbase' or 'contentvec'
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 256
  encoder_ckpt: pretrain/hubert/hubert-soft-0d54a1f4.pt
  train_path: data/train # Create a folder named "audio" under this path and put the audio clips in it
  valid_path: data/val # Create a folder named "audio" under this path and put the audio clips in it
model:
  type: 'Sins'
  n_harmonics: 128
  n_mag_allpass: 256
  n_mag_noise: 256
  n_spk: 1 # max number of different speakers
enhancer:
  type: 'nsf-hifigan'
  ckpt: 'pretrain/nsf_hifigan/model'
loss:
  fft_min: 256
  fft_max: 2048
  n_scale: 4 # rss kernel numbers
device: cuda
env:
  expdir: exp/sins-test
  gpu_id: 0
train:
  num_workers: 2 # If your cpu and gpu are both very strong, setting this to 0 may be faster!
  batch_size: 24
  cache_all_data: true # Set to false to save RAM or VRAM, but it may be slow
  cache_device: 'cpu' # Set to 'cuda' to cache the data in VRAM, fastest speed for a strong gpu
  cache_fp16: true
  epochs: 100000
  interval_log: 10
  interval_val: 2000
  lr: 0.0005
  weight_decay: 0
data/train/audio/gitkeep ADDED
File without changes
data/val/audio/gitkeep ADDED
File without changes
data_loaders.py ADDED
@@ -0,0 +1,230 @@
import os
import random
import numpy as np
import librosa
import torch
from tqdm import tqdm
from torch.utils.data import Dataset

def traverse_dir(
        root_dir,
        extension,
        amount=None,
        str_include=None,
        str_exclude=None,
        is_pure=False,
        is_sort=False,
        is_ext=True):

    file_list = []
    cnt = 0
    for root, _, files in os.walk(root_dir):
        for file in files:
            if file.endswith(extension):
                # path
                mix_path = os.path.join(root, file)
                pure_path = mix_path[len(root_dir)+1:] if is_pure else mix_path

                # amount
                if (amount is not None) and (cnt == amount):
                    if is_sort:
                        file_list.sort()
                    return file_list

                # check string
                if (str_include is not None) and (str_include not in pure_path):
                    continue
                if (str_exclude is not None) and (str_exclude in pure_path):
                    continue

                if not is_ext:
                    ext = pure_path.split('.')[-1]
                    pure_path = pure_path[:-(len(ext)+1)]
                file_list.append(pure_path)
                cnt += 1
    if is_sort:
        file_list.sort()
    return file_list


def get_data_loaders(args, whole_audio=False):
    data_train = AudioDataset(
        args.data.train_path,
        waveform_sec=args.data.duration,
        hop_size=args.data.block_size,
        sample_rate=args.data.sampling_rate,
        load_all_data=args.train.cache_all_data,
        whole_audio=whole_audio,
        n_spk=args.model.n_spk,
        device=args.train.cache_device,
        fp16=args.train.cache_fp16)
    loader_train = torch.utils.data.DataLoader(
        data_train,
        batch_size=args.train.batch_size if not whole_audio else 1,
        shuffle=True,
        num_workers=args.train.num_workers if args.train.cache_device=='cpu' else 0,
        persistent_workers=(args.train.num_workers > 0) if args.train.cache_device=='cpu' else False,
        pin_memory=True if args.train.cache_device=='cpu' else False
    )
    data_valid = AudioDataset(
        args.data.valid_path,
        waveform_sec=args.data.duration,
        hop_size=args.data.block_size,
        sample_rate=args.data.sampling_rate,
        load_all_data=args.train.cache_all_data,
        whole_audio=True,
        n_spk=args.model.n_spk)
    loader_valid = torch.utils.data.DataLoader(
        data_valid,
        batch_size=1,
        shuffle=False,
        num_workers=0,
        pin_memory=True
    )
    return loader_train, loader_valid


class AudioDataset(Dataset):
    def __init__(
        self,
        path_root,
        waveform_sec,
        hop_size,
        sample_rate,
        load_all_data=True,
        whole_audio=False,
        n_spk=1,
        device='cpu',
        fp16=False
    ):
        super().__init__()

        self.waveform_sec = waveform_sec
        self.sample_rate = sample_rate
        self.hop_size = hop_size
        self.path_root = path_root
        self.paths = traverse_dir(
            os.path.join(path_root, 'audio'),
            extension='wav',
            is_pure=True,
            is_sort=True,
            is_ext=False
        )
        self.whole_audio = whole_audio
        self.data_buffer = {}
        if load_all_data:
            print('Load all the data from :', path_root)
        else:
            print('Load the f0, volume data from :', path_root)
        for name in tqdm(self.paths, total=len(self.paths)):
            path_audio = os.path.join(self.path_root, 'audio', name) + '.wav'
            duration = librosa.get_duration(filename=path_audio, sr=self.sample_rate)

            path_f0 = os.path.join(self.path_root, 'f0', name) + '.npy'
            f0 = np.load(path_f0)
            f0 = torch.from_numpy(f0).float().unsqueeze(-1).to(device)

            path_volume = os.path.join(self.path_root, 'volume', name) + '.npy'
            volume = np.load(path_volume)
            volume = torch.from_numpy(volume).float().unsqueeze(-1).to(device)

            if n_spk is not None and n_spk > 1:
                spk_id = int(os.path.dirname(name)) if str.isdigit(os.path.dirname(name)) else 0
                if spk_id < 1 or spk_id > n_spk:
                    raise ValueError(' [x] Multi-speaker training error: spk_id must be a positive integer from 1 to n_spk ')
            else:
                spk_id = 1
            spk_id = torch.LongTensor(np.array([spk_id])).to(device)

            if load_all_data:
                audio, sr = librosa.load(path_audio, sr=self.sample_rate)
                if len(audio.shape) > 1:
                    audio = librosa.to_mono(audio)
                audio = torch.from_numpy(audio).to(device)

                path_units = os.path.join(self.path_root, 'units', name) + '.npy'
                units = np.load(path_units)
                units = torch.from_numpy(units).to(device)

                if fp16:
                    audio = audio.half()
                    units = units.half()

                self.data_buffer[name] = {
                    'duration': duration,
                    'audio': audio,
                    'units': units,
                    'f0': f0,
                    'volume': volume,
                    'spk_id': spk_id
                }
            else:
                self.data_buffer[name] = {
                    'duration': duration,
                    'f0': f0,
                    'volume': volume,
                    'spk_id': spk_id
                }

    def __getitem__(self, file_idx):
        name = self.paths[file_idx]
        data_buffer = self.data_buffer[name]
        # check duration. if too short, then skip
        if data_buffer['duration'] < (self.waveform_sec + 0.1):
            return self.__getitem__((file_idx + 1) % len(self.paths))

        # get item
        return self.get_data(name, data_buffer)

    def get_data(self, name, data_buffer):
        frame_resolution = self.hop_size / self.sample_rate
        duration = data_buffer['duration']
        waveform_sec = duration if self.whole_audio else self.waveform_sec

        # load audio
        idx_from = 0 if self.whole_audio else random.uniform(0, duration - waveform_sec - 0.1)
        start_frame = int(idx_from / frame_resolution)
        units_frame_len = int(waveform_sec / frame_resolution)
        audio = data_buffer.get('audio')
        if audio is None:
            path_audio = os.path.join(self.path_root, 'audio', name) + '.wav'
            audio, sr = librosa.load(
                path_audio,
                sr=self.sample_rate,
                offset=start_frame * frame_resolution,
                duration=waveform_sec)
            if len(audio.shape) > 1:
                audio = librosa.to_mono(audio)
            # clip audio into N seconds
            audio = audio[: audio.shape[-1] // self.hop_size * self.hop_size]
            audio = torch.from_numpy(audio).float()
        else:
            audio = audio[start_frame * self.hop_size : (start_frame + units_frame_len) * self.hop_size]

        # load units
        units = data_buffer.get('units')
        if units is None:
            units = os.path.join(self.path_root, 'units', name) + '.npy'
            units = np.load(units)
            units = units[start_frame : start_frame + units_frame_len]
            units = torch.from_numpy(units).float()
        else:
            units = units[start_frame : start_frame + units_frame_len]

        # load f0
        f0 = data_buffer.get('f0')
        f0_frames = f0[start_frame : start_frame + units_frame_len]

        # load volume
        volume = data_buffer.get('volume')
        volume_frames = volume[start_frame : start_frame + units_frame_len]

        # load spk_id
        spk_id = data_buffer.get('spk_id')

        return dict(audio=audio, f0=f0_frames, volume=volume_frames, units=units, spk_id=spk_id, name=name)

    def __len__(self):
        return len(self.paths)
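A minimal sketch of using `AudioDataset` on its own (parameter values taken from `configs/combsub.yaml`); it assumes preprocessing has already created the `audio/`, `f0/`, `volume/` and `units/` folders under `data/train`:

```python
from torch.utils.data import DataLoader
from data_loaders import AudioDataset

dataset = AudioDataset(
    path_root='data/train',
    waveform_sec=2,        # data.duration
    hop_size=512,          # data.block_size
    sample_rate=44100,     # data.sampling_rate
    load_all_data=False,   # cache only f0/volume; load audio/units lazily
    n_spk=1)
loader = DataLoader(dataset, batch_size=24, shuffle=True, num_workers=2)
batch = next(iter(loader))
print(batch['audio'].shape, batch['units'].shape, batch['f0'].shape)
```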
ddsp/__init__.py ADDED
File without changes
ddsp/core.py ADDED
@@ -0,0 +1,242 @@
import torch
import torch.nn as nn
from torch.nn import functional as F

import math
import numpy as np

def get_fft_size(frame_size: int, ir_size: int, power_of_2: bool = True):
    """Calculate final size for efficient FFT.
    Args:
      frame_size: Size of the audio frame.
      ir_size: Size of the convolving impulse response.
      power_of_2: Constrain to be a power of 2. If False, allow other 5-smooth
        numbers. TPU requires power of 2, while GPU is more flexible.
    Returns:
      fft_size: Size for efficient FFT.
    """
    convolved_frame_size = ir_size + frame_size - 1
    if power_of_2:
        # Next power of 2.
        fft_size = int(2**np.ceil(np.log2(convolved_frame_size)))
    else:
        fft_size = convolved_frame_size
    return fft_size


def upsample(signal, factor):
    signal = signal.permute(0, 2, 1)
    signal = nn.functional.interpolate(torch.cat((signal, signal[:, :, -1:]), 2), size=signal.shape[-1] * factor + 1, mode='linear', align_corners=True)
    signal = signal[:, :, :-1]
    return signal.permute(0, 2, 1)


def remove_above_fmax(amplitudes, pitch, fmax, level_start=1):
    n_harm = amplitudes.shape[-1]
    pitches = pitch * torch.arange(level_start, n_harm + level_start).to(pitch)
    aa = (pitches < fmax).float() + 1e-7
    return amplitudes * aa


def crop_and_compensate_delay(audio, audio_size, ir_size,
                              padding='same',
                              delay_compensation=-1):
    """Crop audio output from convolution to compensate for group delay.
    Args:
      audio: Audio after convolution. Tensor of shape [batch, time_steps].
      audio_size: Initial size of the audio before convolution.
      ir_size: Size of the convolving impulse response.
      padding: Either 'valid' or 'same'. For 'same' the final output to be the
        same size as the input audio (audio_timesteps). For 'valid' the audio is
        extended to include the tail of the impulse response (audio_timesteps +
        ir_timesteps - 1).
      delay_compensation: Samples to crop from start of output audio to compensate
        for group delay of the impulse response. If delay_compensation < 0 it
        defaults to automatically calculating a constant group delay of the
        windowed linear phase filter from frequency_impulse_response().
    Returns:
      Tensor of cropped and shifted audio.
    Raises:
      ValueError: If padding is not either 'valid' or 'same'.
    """
    # Crop the output.
    if padding == 'valid':
        crop_size = ir_size + audio_size - 1
    elif padding == 'same':
        crop_size = audio_size
    else:
        raise ValueError('Padding must be \'valid\' or \'same\', instead '
                         'of {}.'.format(padding))

    # Compensate for the group delay of the filter by trimming the front.
    # For an impulse response produced by frequency_impulse_response(),
    # the group delay is constant because the filter is linear phase.
    total_size = int(audio.shape[-1])
    crop = total_size - crop_size
    start = (ir_size // 2 if delay_compensation < 0 else delay_compensation)
    end = crop - start
    return audio[:, start:-end]


def fft_convolve(audio,
                 impulse_response):  # B, n_frames, 2*(n_mags-1)
    """Filter audio with frames of time-varying impulse responses.
    Time-varying filter. Given audio [batch, n_samples], and a series of impulse
    responses [batch, n_frames, n_impulse_response], splits the audio into frames,
    applies filters, and then overlap-and-adds audio back together.
    Applies non-windowed non-overlapping STFT/ISTFT to efficiently compute
    convolution for large impulse response sizes.
    Args:
      audio: Input audio. Tensor of shape [batch, audio_timesteps].
      impulse_response: Finite impulse response to convolve. Can either be a 2-D
        Tensor of shape [batch, ir_size], or a 3-D Tensor of shape [batch,
        ir_frames, ir_size]. A 2-D tensor will apply a single linear
        time-invariant filter to the audio. A 3-D Tensor will apply a linear
        time-varying filter. Automatically chops the audio into equally shaped
        blocks to match ir_frames.
    Returns:
      audio_out: Convolved audio. Tensor of shape
        [batch, audio_timesteps].
    """
    # Add a frame dimension to impulse response if it doesn't have one.
    ir_shape = impulse_response.size()
    if len(ir_shape) == 2:
        impulse_response = impulse_response.unsqueeze(1)
        ir_shape = impulse_response.size()

    # Get shapes of audio and impulse response.
    batch_size_ir, n_ir_frames, ir_size = ir_shape
    batch_size, audio_size = audio.size()  # B, T

    # Validate that batch sizes match.
    if batch_size != batch_size_ir:
        raise ValueError('Batch size of audio ({}) and impulse response ({}) must '
                         'be the same.'.format(batch_size, batch_size_ir))

    # Cut audio into 50% overlapped frames (center padding).
    hop_size = int(audio_size / n_ir_frames)
    frame_size = 2 * hop_size
    audio_frames = F.pad(audio, (hop_size, hop_size)).unfold(1, frame_size, hop_size)

    # Apply Bartlett (triangular) window
    window = torch.bartlett_window(frame_size).to(audio_frames)
    audio_frames = audio_frames * window

    # Pad and FFT the audio and impulse responses.
    fft_size = get_fft_size(frame_size, ir_size, power_of_2=False)
    audio_fft = torch.fft.rfft(audio_frames, fft_size)
    ir_fft = torch.fft.rfft(torch.cat((impulse_response, impulse_response[:, -1:, :]), 1), fft_size)

    # Multiply the FFTs (same as convolution in time).
    audio_ir_fft = torch.multiply(audio_fft, ir_fft)

    # Take the IFFT to resynthesize audio.
    audio_frames_out = torch.fft.irfft(audio_ir_fft, fft_size)

    # Overlap Add
    batch_size, n_audio_frames, frame_size = audio_frames_out.size()  # B, n_frames+1, 2*(hop_size+n_mags-1)-1
    fold = torch.nn.Fold(output_size=(1, (n_audio_frames - 1) * hop_size + frame_size), kernel_size=(1, frame_size), stride=(1, hop_size))
    output_signal = fold(audio_frames_out.transpose(1, 2)).squeeze(1).squeeze(1)

    # Crop and shift the output audio.
    output_signal = crop_and_compensate_delay(output_signal[:, hop_size:], audio_size, ir_size)
    return output_signal


def apply_window_to_impulse_response(impulse_response,  # B, n_frames, 2*(n_mag-1)
                                     window_size: int = 0,
                                     causal: bool = False):
    """Apply a window to an impulse response and put in causal form.
    Args:
      impulse_response: A series of impulse responses frames to window, of shape
        [batch, n_frames, ir_size]. ---------> ir_size means size of filter_bank ??????

      window_size: Size of the window to apply in the time domain. If window_size
        is less than 1, it defaults to the impulse_response size.
      causal: Impulse response input is in causal form (peak in the middle).
    Returns:
      impulse_response: Windowed impulse response in causal form, with last
        dimension cropped to window_size if window_size is greater than 0 and less
        than ir_size.
    """

    # If IR is in causal form, put it in zero-phase form.
    if causal:
        impulse_response = torch.fft.fftshift(impulse_response, dim=-1)

    # Get a window for better time/frequency resolution than rectangular.
    # Window defaults to IR size, cannot be bigger.
    ir_size = int(impulse_response.size(-1))
    if (window_size <= 0) or (window_size > ir_size):
        window_size = ir_size
    window = nn.Parameter(torch.hann_window(window_size), requires_grad=False).to(impulse_response)

    # Zero pad the window and put in in zero-phase form.
    padding = ir_size - window_size
    if padding > 0:
        half_idx = (window_size + 1) // 2
        window = torch.cat([window[half_idx:],
                            torch.zeros([padding]),
                            window[:half_idx]], axis=0)
    else:
        window = window.roll(window.size(-1)//2, -1)

    # Apply the window, to get new IR (both in zero-phase form).
    window = window.unsqueeze(0)
    impulse_response = impulse_response * window

    # Put IR in causal form and trim zero padding.
    if padding > 0:
        first_half_start = (ir_size - (half_idx - 1)) + 1
        second_half_end = half_idx + 1
        impulse_response = torch.cat([impulse_response[..., first_half_start:],
                                      impulse_response[..., :second_half_end]],
                                     dim=-1)
    else:
        impulse_response = impulse_response.roll(impulse_response.size(-1)//2, -1)

    return impulse_response


def apply_dynamic_window_to_impulse_response(impulse_response,   # B, n_frames, 2*(n_mag-1) or 2*n_mag-1
                                             half_width_frames): # B, n_frames, 1
    ir_size = int(impulse_response.size(-1))  # 2*(n_mag-1) or 2*n_mag-1

    window = torch.arange(-(ir_size // 2), (ir_size + 1) // 2).to(impulse_response) / half_width_frames
    window[window > 1] = 0
    window = (1 + torch.cos(np.pi * window)) / 2  # B, n_frames, 2*(n_mag-1) or 2*n_mag-1

    impulse_response = impulse_response.roll(ir_size // 2, -1)
    impulse_response = impulse_response * window

    return impulse_response


def frequency_impulse_response(magnitudes,
                               hann_window = True,
                               half_width_frames = None):

    # Get the IR
    impulse_response = torch.fft.irfft(magnitudes)  # B, n_frames, 2*(n_mags-1)

    # Window and put in causal form.
    if hann_window:
        if half_width_frames is None:
            impulse_response = apply_window_to_impulse_response(impulse_response)
        else:
            impulse_response = apply_dynamic_window_to_impulse_response(impulse_response, half_width_frames)
    else:
        impulse_response = impulse_response.roll(impulse_response.size(-1) // 2, -1)

    return impulse_response


def frequency_filter(audio,
                     magnitudes,
                     hann_window=True,
                     half_width_frames=None):

    impulse_response = frequency_impulse_response(magnitudes, hann_window, half_width_frames)

    return fft_convolve(audio, impulse_response)
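As a small usage sketch of the LTV-FIR utilities above, white noise can be shaped by a frame-wise magnitude envelope (shapes follow the comments in `fft_convolve`):

```python
import torch
from ddsp.core import frequency_filter

batch, n_frames, n_mag, hop = 1, 100, 256, 512
audio = torch.randn(batch, n_frames * hop)        # [B, T] white noise
magnitudes = torch.rand(batch, n_frames, n_mag)   # frame-wise filter magnitudes
filtered = frequency_filter(audio, magnitudes)    # [B, T] filtered audio
print(filtered.shape)
```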
ddsp/loss.py ADDED
@@ -0,0 +1,57 @@
import numpy as np

import torch
import torch.nn as nn
import torchaudio
from torch.nn import functional as F
from .core import upsample

class SSSLoss(nn.Module):
    """
    Single-scale Spectral Loss.
    """

    def __init__(self, n_fft=111, alpha=1.0, overlap=0, eps=1e-7):
        super().__init__()
        self.n_fft = n_fft
        self.alpha = alpha
        self.eps = eps
        self.hop_length = int(n_fft * (1 - overlap))  # 25% of the length
        self.spec = torchaudio.transforms.Spectrogram(n_fft=self.n_fft, hop_length=self.hop_length, power=1, normalized=True, center=False)

    def forward(self, x_true, x_pred):
        S_true = self.spec(x_true) + self.eps
        S_pred = self.spec(x_pred) + self.eps

        converge_term = torch.mean(torch.linalg.norm(S_true - S_pred, dim = (1, 2)) / torch.linalg.norm(S_true + S_pred, dim = (1, 2)))

        log_term = F.l1_loss(S_true.log(), S_pred.log())

        loss = converge_term + self.alpha * log_term
        return loss


class RSSLoss(nn.Module):
    '''
    Random-scale Spectral Loss.
    '''

    def __init__(self, fft_min, fft_max, n_scale, alpha=1.0, overlap=0, eps=1e-7, device='cuda'):
        super().__init__()
        self.fft_min = fft_min
        self.fft_max = fft_max
        self.n_scale = n_scale
        self.lossdict = {}
        for n_fft in range(fft_min, fft_max):
            self.lossdict[n_fft] = SSSLoss(n_fft, alpha, overlap, eps).to(device)

    def forward(self, x_pred, x_true):
        value = 0.
        n_ffts = torch.randint(self.fft_min, self.fft_max, (self.n_scale,))
        for n_fft in n_ffts:
            loss_func = self.lossdict[int(n_fft)]
            value += loss_func(x_true, x_pred)
        return value / self.n_scale
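A minimal sketch of computing the random-scale spectral loss between a predicted and a reference waveform (FFT range as in the `loss` section of the configs; note the constructor builds one spectrogram transform per FFT size in that range):

```python
import torch
from ddsp.loss import RSSLoss

loss_fn = RSSLoss(fft_min=256, fft_max=2048, n_scale=4, device='cpu')
x_true = torch.randn(2, 44100)  # reference audio, [batch, samples]
x_pred = torch.randn(2, 44100)  # model output, [batch, samples]
print(loss_fn(x_pred, x_true).item())
```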
ddsp/pcmer.py ADDED
@@ -0,0 +1,380 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+
3
+ from torch import nn
4
+ import math
5
+ from functools import partial
6
+ from einops import rearrange, repeat
7
+
8
+ from local_attention import LocalAttention
9
+ import torch.nn.functional as F
10
+ #import fast_transformers.causal_product.causal_product_cuda
11
+
12
+ def softmax_kernel(data, *, projection_matrix, is_query, normalize_data=True, eps=1e-4, device = None):
13
+ b, h, *_ = data.shape
14
+ # (batch size, head, length, model_dim)
15
+
16
+ # normalize model dim
17
+ data_normalizer = (data.shape[-1] ** -0.25) if normalize_data else 1.
18
+
19
+ # what is ration?, projection_matrix.shape[0] --> 266
20
+
21
+ ratio = (projection_matrix.shape[0] ** -0.5)
22
+
23
+ projection = repeat(projection_matrix, 'j d -> b h j d', b = b, h = h)
24
+ projection = projection.type_as(data)
25
+
26
+ #data_dash = w^T x
27
+ data_dash = torch.einsum('...id,...jd->...ij', (data_normalizer * data), projection)
28
+
29
+
30
+ # diag_data = D**2
31
+ diag_data = data ** 2
32
+ diag_data = torch.sum(diag_data, dim=-1)
33
+ diag_data = (diag_data / 2.0) * (data_normalizer ** 2)
34
+ diag_data = diag_data.unsqueeze(dim=-1)
35
+
36
+ #print ()
37
+ if is_query:
38
+ data_dash = ratio * (
39
+ torch.exp(data_dash - diag_data -
40
+ torch.max(data_dash, dim=-1, keepdim=True).values) + eps)
41
+ else:
42
+ data_dash = ratio * (
43
+ torch.exp(data_dash - diag_data + eps))#- torch.max(data_dash)) + eps)
44
+
45
+ return data_dash.type_as(data)
46
+
47
+ def orthogonal_matrix_chunk(cols, qr_uniform_q = False, device = None):
48
+ unstructured_block = torch.randn((cols, cols), device = device)
49
+ q, r = torch.linalg.qr(unstructured_block.cpu(), mode='reduced')
50
+ q, r = map(lambda t: t.to(device), (q, r))
51
+
52
+ # proposed by @Parskatt
53
+ # to make sure Q is uniform https://arxiv.org/pdf/math-ph/0609050.pdf
54
+ if qr_uniform_q:
55
+ d = torch.diag(r, 0)
56
+ q *= d.sign()
57
+ return q.t()
58
+ def exists(val):
59
+ return val is not None
60
+
61
+ def empty(tensor):
62
+ return tensor.numel() == 0
63
+
64
+ def default(val, d):
65
+ return val if exists(val) else d
66
+
67
+ def cast_tuple(val):
68
+ return (val,) if not isinstance(val, tuple) else val
69
+
70
+ class PCmer(nn.Module):
71
+ """The encoder that is used in the Transformer model."""
72
+
73
+ def __init__(self,
74
+ num_layers,
75
+ num_heads,
76
+ dim_model,
77
+ dim_keys,
78
+ dim_values,
79
+ residual_dropout,
80
+ attention_dropout):
81
+ super().__init__()
82
+ self.num_layers = num_layers
83
+ self.num_heads = num_heads
84
+ self.dim_model = dim_model
85
+ self.dim_values = dim_values
86
+ self.dim_keys = dim_keys
87
+ self.residual_dropout = residual_dropout
88
+ self.attention_dropout = attention_dropout
89
+
90
+ self._layers = nn.ModuleList([_EncoderLayer(self) for _ in range(num_layers)])
91
+
92
+ # METHODS ########################################################################################################
93
+
94
+ def forward(self, phone, mask=None):
95
+
96
+ # apply all layers to the input
97
+ for (i, layer) in enumerate(self._layers):
98
+ phone = layer(phone, mask)
99
+ # provide the final sequence
100
+ return phone
101
+
102
+
103
+ # ==================================================================================================================== #
104
+ # CLASS _ E N C O D E R L A Y E R #
105
+ # ==================================================================================================================== #
106
+
107
+
108
+ class _EncoderLayer(nn.Module):
109
+ """One layer of the encoder.
110
+
111
+ Attributes:
112
+ attn: (:class:`mha.MultiHeadAttention`): The attention mechanism that is used to read the input sequence.
113
+ feed_forward (:class:`ffl.FeedForwardLayer`): The feed-forward layer on top of the attention mechanism.
114
+ """
115
+
116
+ def __init__(self, parent: PCmer):
117
+ """Creates a new instance of ``_EncoderLayer``.
118
+
119
+ Args:
120
+ parent (Encoder): The encoder that the layers is created for.
121
+ """
122
+ super().__init__()
123
+
124
+
125
+ self.conformer = ConformerConvModule(parent.dim_model)
126
+ self.norm = nn.LayerNorm(parent.dim_model)
127
+ self.dropout = nn.Dropout(parent.residual_dropout)
128
+
129
+ # selfatt -> fastatt: performer!
130
+ self.attn = SelfAttention(dim = parent.dim_model,
131
+ heads = parent.num_heads,
132
+ causal = False)
133
+
134
+ # METHODS ########################################################################################################
135
+
136
+ def forward(self, phone, mask=None):
137
+
138
+ # compute attention sub-layer
139
+ phone = phone + (self.attn(self.norm(phone), mask=mask))
140
+
141
+ phone = phone + (self.conformer(phone))
142
+
143
+ return phone
144
+
145
+ def calc_same_padding(kernel_size):
146
+ pad = kernel_size // 2
147
+ return (pad, pad - (kernel_size + 1) % 2)
148
+
149
+ # helper classes
150
+
151
+ class Swish(nn.Module):
152
+ def forward(self, x):
153
+ return x * x.sigmoid()
154
+
155
+ class Transpose(nn.Module):
156
+ def __init__(self, dims):
157
+ super().__init__()
158
+ assert len(dims) == 2, 'dims must be a tuple of two dimensions'
159
+ self.dims = dims
160
+
161
+ def forward(self, x):
162
+ return x.transpose(*self.dims)
163
+
164
+ class GLU(nn.Module):
165
+ def __init__(self, dim):
166
+ super().__init__()
167
+ self.dim = dim
168
+
169
+ def forward(self, x):
170
+ out, gate = x.chunk(2, dim=self.dim)
171
+ return out * gate.sigmoid()
172
+
173
+ class DepthWiseConv1d(nn.Module):
174
+ def __init__(self, chan_in, chan_out, kernel_size, padding):
175
+ super().__init__()
176
+ self.padding = padding
177
+ self.conv = nn.Conv1d(chan_in, chan_out, kernel_size, groups = chan_in)
178
+
179
+ def forward(self, x):
180
+ x = F.pad(x, self.padding)
181
+ return self.conv(x)
182
+
183
+ class ConformerConvModule(nn.Module):
184
+ def __init__(
185
+ self,
186
+ dim,
187
+ causal = False,
188
+ expansion_factor = 2,
189
+ kernel_size = 31,
190
+ dropout = 0.):
191
+ super().__init__()
192
+
193
+ inner_dim = dim * expansion_factor
194
+ padding = calc_same_padding(kernel_size) if not causal else (kernel_size - 1, 0)
195
+
196
+ self.net = nn.Sequential(
197
+ nn.LayerNorm(dim),
198
+ Transpose((1, 2)),
199
+ nn.Conv1d(dim, inner_dim * 2, 1),
200
+ GLU(dim=1),
201
+ DepthWiseConv1d(inner_dim, inner_dim, kernel_size = kernel_size, padding = padding),
202
+ #nn.BatchNorm1d(inner_dim) if not causal else nn.Identity(),
203
+ Swish(),
204
+ nn.Conv1d(inner_dim, dim, 1),
205
+ Transpose((1, 2)),
206
+ nn.Dropout(dropout)
207
+ )
208
+
209
+ def forward(self, x):
210
+ return self.net(x)
211
+
212
+ def linear_attention(q, k, v):
213
+ if v is None:
214
+ #print (k.size(), q.size())
215
+ out = torch.einsum('...ed,...nd->...ne', k, q)
216
+ return out
217
+
218
+ else:
219
+ k_cumsum = k.sum(dim = -2)
220
+ #k_cumsum = k.sum(dim = -2)
221
+ D_inv = 1. / (torch.einsum('...nd,...d->...n', q, k_cumsum.type_as(q)) + 1e-8)
222
+
223
+ context = torch.einsum('...nd,...ne->...de', k, v)
224
+ #print ("TRUEEE: ", context.size(), q.size(), D_inv.size())
225
+ out = torch.einsum('...de,...nd,...n->...ne', context, q, D_inv)
226
+ return out
227
+
228
+ def gaussian_orthogonal_random_matrix(nb_rows, nb_columns, scaling = 0, qr_uniform_q = False, device = None):
229
+ nb_full_blocks = int(nb_rows / nb_columns)
230
+ #print (nb_full_blocks)
231
+ block_list = []
232
+
233
+ for _ in range(nb_full_blocks):
234
+ q = orthogonal_matrix_chunk(nb_columns, qr_uniform_q = qr_uniform_q, device = device)
235
+ block_list.append(q)
236
+ # block_list[n] is a orthogonal matrix ... (model_dim * model_dim)
237
+ #print (block_list[0].size(), torch.einsum('...nd,...nd->...n', block_list[0], torch.roll(block_list[0],1,1)))
238
+ #print (nb_rows, nb_full_blocks, nb_columns)
239
+ remaining_rows = nb_rows - nb_full_blocks * nb_columns
240
+ #print (remaining_rows)
241
+ if remaining_rows > 0:
242
+ q = orthogonal_matrix_chunk(nb_columns, qr_uniform_q = qr_uniform_q, device = device)
243
+ #print (q[:remaining_rows].size())
244
+ block_list.append(q[:remaining_rows])
245
+
246
+ final_matrix = torch.cat(block_list)
247
+
248
+ if scaling == 0:
249
+ multiplier = torch.randn((nb_rows, nb_columns), device = device).norm(dim = 1)
250
+ elif scaling == 1:
251
+ multiplier = math.sqrt((float(nb_columns))) * torch.ones((nb_rows,), device = device)
252
+ else:
253
+ raise ValueError(f'Invalid scaling {scaling}')
254
+
255
+ return torch.diag(multiplier) @ final_matrix
256
+
257
+ class FastAttention(nn.Module):
258
+ def __init__(self, dim_heads, nb_features = None, ortho_scaling = 0, causal = False, generalized_attention = False, kernel_fn = nn.ReLU(), qr_uniform_q = False, no_projection = False):
259
+ super().__init__()
260
+ nb_features = default(nb_features, int(dim_heads * math.log(dim_heads)))
261
+
262
+ self.dim_heads = dim_heads
263
+ self.nb_features = nb_features
264
+ self.ortho_scaling = ortho_scaling
265
+
266
+ self.create_projection = partial(gaussian_orthogonal_random_matrix, nb_rows = self.nb_features, nb_columns = dim_heads, scaling = ortho_scaling, qr_uniform_q = qr_uniform_q)
267
+ projection_matrix = self.create_projection()
268
+ self.register_buffer('projection_matrix', projection_matrix)
269
+
270
+ self.generalized_attention = generalized_attention
271
+ self.kernel_fn = kernel_fn
272
+
273
+ # if this is turned on, no projection will be used
274
+ # queries and keys will be softmax-ed as in the original efficient attention paper
275
+ self.no_projection = no_projection
276
+
277
+ self.causal = causal
278
+ if causal:
279
+ try:
280
+ import fast_transformers.causal_product.causal_product_cuda
281
+ self.causal_linear_fn = partial(causal_linear_attention)
282
+ except ImportError:
283
+ print('unable to import cuda code for auto-regressive Performer. will default to the memory inefficient non-cuda version')
284
+ self.causal_linear_fn = causal_linear_attention_noncuda
285
+ @torch.no_grad()
286
+ def redraw_projection_matrix(self):
287
+ projections = self.create_projection()
288
+ self.projection_matrix.copy_(projections)
289
+ del projections
290
+
291
+ def forward(self, q, k, v):
292
+ device = q.device
293
+
294
+ if self.no_projection:
295
+ q = q.softmax(dim = -1)
296
+ k = torch.exp(k) if self.causal else k.softmax(dim = -2)
297
+
298
+ elif self.generalized_attention:
299
+ create_kernel = partial(generalized_kernel, kernel_fn = self.kernel_fn, projection_matrix = self.projection_matrix, device = device)
300
+ q, k = map(create_kernel, (q, k))
301
+
302
+ else:
303
+ create_kernel = partial(softmax_kernel, projection_matrix = self.projection_matrix, device = device)
304
+
305
+ q = create_kernel(q, is_query = True)
306
+ k = create_kernel(k, is_query = False)
307
+
308
+ attn_fn = linear_attention if not self.causal else self.causal_linear_fn
309
+ if v is None:
310
+ out = attn_fn(q, k, None)
311
+ return out
312
+ else:
313
+ out = attn_fn(q, k, v)
314
+ return out
315
+ class SelfAttention(nn.Module):
316
+ def __init__(self, dim, causal = False, heads = 8, dim_head = 64, local_heads = 0, local_window_size = 256, nb_features = None, feature_redraw_interval = 1000, generalized_attention = False, kernel_fn = nn.ReLU(), qr_uniform_q = False, dropout = 0., no_projection = False):
317
+ super().__init__()
318
+ assert dim % heads == 0, 'dimension must be divisible by number of heads'
319
+ dim_head = default(dim_head, dim // heads)
320
+ inner_dim = dim_head * heads
321
+ self.fast_attention = FastAttention(dim_head, nb_features, causal = causal, generalized_attention = generalized_attention, kernel_fn = kernel_fn, qr_uniform_q = qr_uniform_q, no_projection = no_projection)
322
+
323
+ self.heads = heads
324
+ self.global_heads = heads - local_heads
325
+ self.local_attn = LocalAttention(window_size = local_window_size, causal = causal, autopad = True, dropout = dropout, look_forward = int(not causal), rel_pos_emb_config = (dim_head, local_heads)) if local_heads > 0 else None
326
+
327
+ #print (heads, nb_features, dim_head)
328
+ #name_embedding = torch.zeros(110, heads, dim_head, dim_head)
329
+ #self.name_embedding = nn.Parameter(name_embedding, requires_grad=True)
330
+
331
+
332
+ self.to_q = nn.Linear(dim, inner_dim)
333
+ self.to_k = nn.Linear(dim, inner_dim)
334
+ self.to_v = nn.Linear(dim, inner_dim)
335
+ self.to_out = nn.Linear(inner_dim, dim)
336
+ self.dropout = nn.Dropout(dropout)
337
+
338
+ @torch.no_grad()
339
+ def redraw_projection_matrix(self):
340
+ self.fast_attention.redraw_projection_matrix()
341
+ #torch.nn.init.zeros_(self.name_embedding)
342
+ #print (torch.sum(self.name_embedding))
343
+ def forward(self, x, context = None, mask = None, context_mask = None, name=None, inference=False, **kwargs):
344
+ b, n, _, h, gh = *x.shape, self.heads, self.global_heads
345
+
346
+ cross_attend = exists(context)
347
+
348
+ context = default(context, x)
349
+ context_mask = default(context_mask, mask) if not cross_attend else context_mask
350
+ #print (torch.sum(self.name_embedding))
351
+ q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
352
+
353
+ q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), (q, k, v))
354
+ (q, lq), (k, lk), (v, lv) = map(lambda t: (t[:, :gh], t[:, gh:]), (q, k, v))
355
+
356
+ attn_outs = []
357
+ #print (name)
358
+ #print (self.name_embedding[name].size())
359
+ if not empty(q):
360
+ if exists(context_mask):
361
+ global_mask = context_mask[:, None, :, None]
362
+ v.masked_fill_(~global_mask, 0.)
363
+ if cross_attend:
364
+ pass
365
+ #print (torch.sum(self.name_embedding))
366
+ #out = self.fast_attention(q,self.name_embedding[name],None)
367
+ #print (torch.sum(self.name_embedding[...,-1:]))
368
+ else:
369
+ out = self.fast_attention(q, k, v)
370
+ attn_outs.append(out)
371
+
372
+ if not empty(lq):
373
+ assert not cross_attend, 'local attention is not compatible with cross attention'
374
+ out = self.local_attn(lq, lk, lv, input_mask = mask)
375
+ attn_outs.append(out)
376
+
377
+ out = torch.cat(attn_outs, dim = 1)
378
+ out = rearrange(out, 'b h n d -> b n (h d)')
379
+ out = self.to_out(out)
380
+ return self.dropout(out)
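The `SelfAttention` block above mixes Performer-style global heads (`FastAttention`) with optional windowed local heads. A minimal shape-check sketch, assuming the module is importable as `ddsp.pcmer` and that PyTorch plus the attention dependencies are installed (batch size, frame count and dimensions below are made up for illustration):

```python
import torch
from ddsp.pcmer import SelfAttention

attn = SelfAttention(dim=256, heads=8)   # global (Performer) heads only
x = torch.randn(2, 100, 256)             # B x n_frames x dim
with torch.no_grad():
    y = attn(x)                          # self-attention: context defaults to x
print(y.shape)                           # torch.Size([2, 100, 256])
```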
ddsp/unit2control.py ADDED
@@ -0,0 +1,86 @@
1
+ import gin
2
+
3
+ import numpy as np
4
+ import torch
5
+ import torch.nn as nn
6
+ from torch.nn.utils import weight_norm
7
+
8
+ from .pcmer import PCmer
9
+
10
+
11
+ def split_to_dict(tensor, tensor_splits):
12
+ """Split a tensor into a dictionary of multiple tensors."""
13
+ labels = []
14
+ sizes = []
15
+
16
+ for k, v in tensor_splits.items():
17
+ labels.append(k)
18
+ sizes.append(v)
19
+
20
+ tensors = torch.split(tensor, sizes, dim=-1)
21
+ return dict(zip(labels, tensors))
22
+
23
+
24
+ class Unit2Control(nn.Module):
25
+ def __init__(
26
+ self,
27
+ input_channel,
28
+ n_spk,
29
+ output_splits):
30
+ super().__init__()
31
+ self.output_splits = output_splits
32
+ self.f0_embed = nn.Linear(1, 256)
33
+ self.phase_embed = nn.Linear(1, 256)
34
+ self.volume_embed = nn.Linear(1, 256)
35
+ self.n_spk = n_spk
36
+ if n_spk is not None and n_spk > 1:
37
+ self.spk_embed = nn.Embedding(n_spk, 256)
38
+
39
+ # conv in stack
40
+ self.stack = nn.Sequential(
41
+ nn.Conv1d(input_channel, 256, 3, 1, 1),
42
+ nn.GroupNorm(4, 256),
43
+ nn.LeakyReLU(),
44
+ nn.Conv1d(256, 256, 3, 1, 1))
45
+
46
+ # transformer
47
+ self.decoder = PCmer(
48
+ num_layers=3,
49
+ num_heads=8,
50
+ dim_model=256,
51
+ dim_keys=256,
52
+ dim_values=256,
53
+ residual_dropout=0.1,
54
+ attention_dropout=0.1)
55
+ self.norm = nn.LayerNorm(256)
56
+
57
+ # out
58
+ self.n_out = sum([v for k, v in output_splits.items()])
59
+ self.dense_out = weight_norm(
60
+ nn.Linear(256, self.n_out))
61
+
62
+ def forward(self, units, f0, phase, volume, spk_id = None, spk_mix_dict = None):
63
+
64
+ '''
65
+ input:
66
+ B x n_frames x n_unit
67
+ return:
68
+ dict of B x n_frames x feat
69
+ '''
70
+
71
+ x = self.stack(units.transpose(1,2)).transpose(1,2)
72
+ x = x + self.f0_embed((1+ f0 / 700).log()) + self.phase_embed(phase / np.pi) + self.volume_embed(volume)
73
+ if self.n_spk is not None and self.n_spk > 1:
74
+ if spk_mix_dict is not None:
75
+ for k, v in spk_mix_dict.items():
76
+ spk_id_torch = torch.LongTensor(np.array([[k]])).to(units.device)
77
+ x = x + v * self.spk_embed(spk_id_torch - 1)
78
+ else:
79
+ x = x + self.spk_embed(spk_id - 1)
80
+ x = self.decoder(x)
81
+ x = self.norm(x)
82
+ e = self.dense_out(x)
83
+ controls = split_to_dict(e, self.output_splits)
84
+
85
+ return controls
86
+
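A minimal usage sketch for `Unit2Control` (the split sizes and frame counts here are illustrative; in this repository the `output_splits` dictionary is supplied by the synthesizer classes in `ddsp/vocoder.py`):

```python
import torch
from ddsp.unit2control import Unit2Control

u2c = Unit2Control(input_channel=256, n_spk=1,
                   output_splits={'harmonic_magnitude': 513, 'noise_magnitude': 513})
units  = torch.randn(1, 100, 256)         # B x n_frames x n_unit
f0     = torch.full((1, 100, 1), 440.0)   # Hz per frame
phase  = torch.zeros(1, 100, 1)           # radians per frame
volume = torch.rand(1, 100, 1)            # volume envelope per frame
ctrls = u2c(units, f0, phase, volume)     # dict of B x n_frames x feat tensors
print({k: v.shape for k, v in ctrls.items()})
```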
ddsp/vocoder.py ADDED
@@ -0,0 +1,515 @@
1
+ import os
2
+ import numpy as np
3
+ import yaml
4
+ import torch
5
+ import torch.nn.functional as F
6
+ import pyworld as pw
7
+ import parselmouth
8
+ import torchcrepe
9
+ import resampy
10
+ from fairseq import checkpoint_utils
11
+ from encoder.hubert.model import HubertSoft
12
+ from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present
13
+ from torchaudio.transforms import Resample
14
+ from .unit2control import Unit2Control
15
+ from .core import frequency_filter, upsample, remove_above_fmax
16
+
17
+ class F0_Extractor:
18
+ def __init__(self, f0_extractor, sample_rate = 44100, hop_size = 512, f0_min = 65, f0_max = 800):
19
+ self.f0_extractor = f0_extractor
20
+ self.sample_rate = sample_rate
21
+ self.hop_size = hop_size
22
+ self.f0_min = f0_min
23
+ self.f0_max = f0_max
24
+
25
+ def extract(self, audio, uv_interp = False, device = None, silence_front = 0): # audio: 1d numpy array
26
+ # extractor start time
27
+ n_frames = int(len(audio) // self.hop_size) + 1
28
+
29
+ start_frame = int(silence_front * self.sample_rate / self.hop_size)
30
+ real_silence_front = start_frame * self.hop_size / self.sample_rate
31
+ audio = audio[int(np.round(real_silence_front * self.sample_rate)) : ]
32
+
33
+ # extract f0 using parselmouth
34
+ if self.f0_extractor == 'parselmouth':
35
+ f0 = parselmouth.Sound(audio, self.sample_rate).to_pitch_ac(
36
+ time_step = self.hop_size / self.sample_rate,
37
+ voicing_threshold = 0.6,
38
+ pitch_floor = self.f0_min,
39
+ pitch_ceiling = self.f0_max).selected_array['frequency']
40
+ pad_size = start_frame + (int(len(audio) // self.hop_size) - len(f0) + 1) // 2
41
+ f0 = np.pad(f0,(pad_size, n_frames - len(f0) - pad_size))
42
+
43
+ # extract f0 using dio
44
+ elif self.f0_extractor == 'dio':
45
+ _f0, t = pw.dio(
46
+ audio.astype('double'),
47
+ self.sample_rate,
48
+ f0_floor = self.f0_min,
49
+ f0_ceil = self.f0_max,
50
+ channels_in_octave=2,
51
+ frame_period = (1000 * self.hop_size / self.sample_rate))
52
+ f0 = pw.stonemask(audio.astype('double'), _f0, t, self.sample_rate)
53
+ f0 = np.pad(f0.astype('float'), (start_frame, n_frames - len(f0) - start_frame))
54
+
55
+ # extract f0 using harvest
56
+ elif self.f0_extractor == 'harvest':
57
+ f0, _ = pw.harvest(
58
+ audio.astype('double'),
59
+ self.sample_rate,
60
+ f0_floor = self.f0_min,
61
+ f0_ceil = self.f0_max,
62
+ frame_period = (1000 * self.hop_size / self.sample_rate))
63
+ f0 = np.pad(f0.astype('float'), (start_frame, n_frames - len(f0) - start_frame))
64
+
65
+ # extract f0 using crepe
66
+ elif self.f0_extractor == 'crepe':
67
+ if device is None:
68
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
69
+ wav16k = resampy.resample(audio, self.sample_rate, 16000)
70
+ wav16k_torch = torch.FloatTensor(wav16k).unsqueeze(0).to(device)
71
+
72
+ f0, pd = torchcrepe.predict(wav16k_torch, 16000, 80, self.f0_min, self.f0_max, pad=True, model='full', batch_size=512, device=device, return_periodicity=True)
73
+
74
+ pd = torchcrepe.filter.median(pd, 4)
75
+ pd = torchcrepe.threshold.Silence(-60.)(pd, wav16k_torch, 16000, 80)
76
+ f0 = torchcrepe.threshold.At(0.05)(f0, pd)
77
+ f0 = torchcrepe.filter.mean(f0, 4)
78
+ f0 = torch.where(torch.isnan(f0), torch.full_like(f0, 0), f0)
79
+
80
+ f0 = f0.squeeze(0).cpu().numpy()
81
+ f0 = np.array([f0[int(min(int(np.round(n * self.hop_size / self.sample_rate / 0.005)), len(f0) - 1))] for n in range(n_frames - start_frame)])
82
+ f0 = np.pad(f0, (start_frame, 0))
83
+
84
+ else:
85
+ raise ValueError(f" [x] Unknown f0 extractor: {f0_extractor}")
86
+
87
+ # interpolate the unvoiced f0
88
+ if uv_interp:
89
+ uv = f0 == 0
90
+ if len(f0[~uv]) > 0:
91
+ f0[uv] = np.interp(np.where(uv)[0], np.where(~uv)[0], f0[~uv])
92
+ f0[f0 < self.f0_min] = self.f0_min
93
+ return f0
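A short usage sketch for `F0_Extractor` (the wav path is a placeholder; the 'dio'/'harvest' backends need pyworld, 'crepe' needs torchcrepe, 'parselmouth' needs praat-parselmouth):

```python
import librosa
from ddsp.vocoder import F0_Extractor

audio, sr = librosa.load('input.wav', sr=44100, mono=True)
f0_extractor = F0_Extractor('dio', sample_rate=sr, hop_size=512)
f0 = f0_extractor.extract(audio, uv_interp=True)   # one f0 value (Hz) per frame
print(f0.shape)                                    # (len(audio) // 512 + 1,)
```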
94
+
95
+
96
+ class Volume_Extractor:
97
+ def __init__(self, hop_size = 512):
98
+ self.hop_size = hop_size
99
+
100
+ def extract(self, audio): # audio: 1d numpy array
101
+ n_frames = int(len(audio) // self.hop_size) + 1
102
+ audio2 = audio ** 2
103
+ audio2 = np.pad(audio2, (int(self.hop_size // 2), int((self.hop_size + 1) // 2)), mode = 'reflect')
104
+ volume = np.array([np.mean(audio2[int(n * self.hop_size) : int((n + 1) * self.hop_size)]) for n in range(n_frames)])
105
+ volume = np.sqrt(volume)
106
+ return volume
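`Volume_Extractor` is simply a frame-wise RMS envelope over reflect-padded audio, one value per hop. A tiny sketch:

```python
import numpy as np
from ddsp.vocoder import Volume_Extractor

audio = np.random.randn(44100).astype('float32')   # placeholder signal
volume = Volume_Extractor(hop_size=512).extract(audio)
print(volume.shape)                                # (44100 // 512 + 1,)
```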
107
+
108
+
109
+ class Units_Encoder:
110
+ def __init__(self, encoder, encoder_ckpt, encoder_sample_rate = 16000, encoder_hop_size = 320, device = None):
111
+ if device is None:
112
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
113
+ self.device = device
114
+
115
+ is_loaded_encoder = False
116
+ if encoder == 'hubertsoft':
117
+ self.model = Audio2HubertSoft(encoder_ckpt).to(device)
118
+ is_loaded_encoder = True
119
+ if encoder == 'hubertbase':
120
+ self.model = Audio2HubertBase(encoder_ckpt, device=device)
121
+ is_loaded_encoder = True
122
+ if encoder == 'contentvec':
123
+ self.model = Audio2ContentVec(encoder_ckpt, device=device)
124
+ is_loaded_encoder = True
125
+ if not is_loaded_encoder:
126
+ raise ValueError(f" [x] Unknown units encoder: {encoder}")
127
+
128
+ self.resample_kernel = {}
129
+ self.encoder_sample_rate = encoder_sample_rate
130
+ self.encoder_hop_size = encoder_hop_size
131
+
132
+ def encode(self,
133
+ audio, # B, T
134
+ sample_rate,
135
+ hop_size):
136
+
137
+ # resample
138
+ if sample_rate == self.encoder_sample_rate:
139
+ audio_res = audio
140
+ else:
141
+ key_str = str(sample_rate)
142
+ if key_str not in self.resample_kernel:
143
+ self.resample_kernel[key_str] = Resample(sample_rate, self.encoder_sample_rate, lowpass_filter_width = 128).to(self.device)
144
+ audio_res = self.resample_kernel[key_str](audio)
145
+
146
+ # encode
147
+ if audio_res.size(-1) < self.encoder_hop_size:
148
+ audio_res = torch.nn.functional.pad(audio_res, (0, self.encoder_hop_size - audio_res.size(-1)))
149
+ units = self.model(audio_res)
150
+
151
+ # alignment
152
+ n_frames = audio.size(-1) // hop_size + 1
153
+ ratio = (hop_size / sample_rate) / (self.encoder_hop_size / self.encoder_sample_rate)
154
+ index = torch.clamp(torch.round(ratio * torch.arange(n_frames).to(self.device)).long(), max = units.size(1) - 1)
155
+ units_aligned = torch.gather(units, 1, index.unsqueeze(0).unsqueeze(-1).repeat([1, 1, units.size(-1)]))
156
+ return units_aligned
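A sketch of turning audio into frame-aligned units with `Units_Encoder`, assuming a HuBERT-Soft checkpoint is available locally (the path below is a placeholder; the file name matches the checkpoint published by the bshall/hubert release referenced elsewhere in this upload):

```python
import torch
from ddsp.vocoder import Units_Encoder

encoder = Units_Encoder('hubertsoft', 'pretrain/hubert/hubert-soft-0d54a1f4.pt',
                        encoder_sample_rate=16000, encoder_hop_size=320, device='cpu')
audio = torch.randn(1, 44100)                       # B x T at the source sample rate
units = encoder.encode(audio, sample_rate=44100, hop_size=512)
print(units.shape)                                  # 1 x (44100 // 512 + 1) x 256
```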
157
+
158
+ class Audio2HubertSoft(torch.nn.Module):
159
+ def __init__(self, path, h_sample_rate = 16000, h_hop_size = 320):
160
+ super().__init__()
161
+ print(' [Encoder Model] HuBERT Soft')
162
+ self.hubert = HubertSoft()
163
+ print(' [Loading] ' + path)
164
+ checkpoint = torch.load(path)
165
+ consume_prefix_in_state_dict_if_present(checkpoint, "module.")
166
+ self.hubert.load_state_dict(checkpoint)
167
+ self.hubert.eval()
168
+
169
+ def forward(self,
170
+ audio): # B, T
171
+ with torch.inference_mode():
172
+ units = self.hubert.units(audio.unsqueeze(1))
173
+ return units
174
+
175
+
176
+ class Audio2ContentVec():
177
+ def __init__(self, path, h_sample_rate=16000, h_hop_size=320, device='cpu'):
178
+ self.device = device
179
+ print(' [Encoder Model] Content Vec')
180
+ print(' [Loading] ' + path)
181
+ self.models, self.saved_cfg, self.task = checkpoint_utils.load_model_ensemble_and_task([path], suffix="", )
182
+ self.hubert = self.models[0]
183
+ self.hubert = self.hubert.to(self.device)
184
+ self.hubert.eval()
185
+
186
+ def __call__(self,
187
+ audio): # B, T
188
+ # wav_tensor = torch.from_numpy(audio).to(self.device)
189
+ wav_tensor = audio
190
+ feats = wav_tensor.view(1, -1)
191
+ padding_mask = torch.BoolTensor(feats.shape).fill_(False)
192
+ inputs = {
193
+ "source": feats.to(wav_tensor.device),
194
+ "padding_mask": padding_mask.to(wav_tensor.device),
195
+ "output_layer": 9, # layer 9
196
+ }
197
+ with torch.no_grad():
198
+ logits = self.hubert.extract_features(**inputs)
199
+ feats = self.hubert.final_proj(logits[0])
200
+ units = feats # .transpose(2, 1)
201
+ return units
202
+
203
+
204
+ class Audio2HubertBase():
205
+ def __init__(self, path, h_sample_rate=16000, h_hop_size=320, device='cpu'):
206
+ self.device = device
207
+ print(' [Encoder Model] HuBERT Base')
208
+ print(' [Loading] ' + path)
209
+ self.models, self.saved_cfg, self.task = checkpoint_utils.load_model_ensemble_and_task([path], suffix="", )
210
+ self.hubert = self.models[0]
211
+ self.hubert = self.hubert.to(self.device)
212
+ self.hubert = self.hubert.float()
213
+ self.hubert.eval()
214
+
215
+ def __call__(self,
216
+ audio): # B, T
217
+ with torch.no_grad():
218
+ padding_mask = torch.BoolTensor(audio.shape).fill_(False)
219
+ inputs = {
220
+ "source": audio.to(self.device),
221
+ "padding_mask": padding_mask.to(self.device),
222
+ "output_layer": 9, # layer 9
223
+ }
224
+ logits = self.hubert.extract_features(**inputs)
225
+ units = self.hubert.final_proj(logits[0])
226
+ return units
227
+
228
+
229
+ class DotDict(dict):
230
+ def __getattr__(*args):
231
+ val = dict.get(*args)
232
+ return DotDict(val) if type(val) is dict else val
233
+
234
+ __setattr__ = dict.__setitem__
235
+ __delattr__ = dict.__delitem__
236
+
237
+ def load_model(
238
+ model_path,
239
+ device='cpu'):
240
+ config_file = os.path.join(os.path.split(model_path)[0], 'config.yaml')
241
+ with open(config_file, "r") as config:
242
+ args = yaml.safe_load(config)
243
+ args = DotDict(args)
244
+
245
+ # load model
246
+ model = None
247
+
248
+ if args.model.type == 'Sins':
249
+ model = Sins(
250
+ sampling_rate=args.data.sampling_rate,
251
+ block_size=args.data.block_size,
252
+ n_harmonics=args.model.n_harmonics,
253
+ n_mag_allpass=args.model.n_mag_allpass,
254
+ n_mag_noise=args.model.n_mag_noise,
255
+ n_unit=args.data.encoder_out_channels,
256
+ n_spk=args.model.n_spk)
257
+
258
+ elif args.model.type == 'CombSub':
259
+ model = CombSub(
260
+ sampling_rate=args.data.sampling_rate,
261
+ block_size=args.data.block_size,
262
+ n_mag_allpass=args.model.n_mag_allpass,
263
+ n_mag_harmonic=args.model.n_mag_harmonic,
264
+ n_mag_noise=args.model.n_mag_noise,
265
+ n_unit=args.data.encoder_out_channels,
266
+ n_spk=args.model.n_spk)
267
+
268
+ elif args.model.type == 'CombSubFast':
269
+ model = CombSubFast(
270
+ sampling_rate=args.data.sampling_rate,
271
+ block_size=args.data.block_size,
272
+ n_unit=args.data.encoder_out_channels,
273
+ n_spk=args.model.n_spk)
274
+
275
+ else:
276
+ raise ValueError(f" [x] Unknown Model: {args.model.type}")
277
+
278
+ print(' [Loading] ' + model_path)
279
+ ckpt = torch.load(model_path, map_location=torch.device(device))
280
+ model.to(device)
281
+ model.load_state_dict(ckpt['model'])
282
+ model.eval()
283
+ return model, args
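`load_model` expects a `config.yaml` next to the checkpoint and returns the DDSP model together with the parsed config as a `DotDict`. A minimal sketch (the checkpoint path is a placeholder):

```python
from ddsp.vocoder import load_model

model, args = load_model('exp/multi_speaker/model_300000.pt', device='cpu')
print(args.model.type, args.data.sampling_rate, args.data.block_size)
```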
284
+
285
+
286
+ class Sins(torch.nn.Module):
287
+ def __init__(self,
288
+ sampling_rate,
289
+ block_size,
290
+ n_harmonics,
291
+ n_mag_allpass,
292
+ n_mag_noise,
293
+ n_unit=256,
294
+ n_spk=1):
295
+ super().__init__()
296
+
297
+ print(' [DDSP Model] Sinusoids Additive Synthesiser')
298
+
299
+ # params
300
+ self.register_buffer("sampling_rate", torch.tensor(sampling_rate))
301
+ self.register_buffer("block_size", torch.tensor(block_size))
302
+ # Unit2Control
303
+ split_map = {
304
+ 'amplitudes': n_harmonics,
305
+ 'group_delay': n_mag_allpass,
306
+ 'noise_magnitude': n_mag_noise,
307
+ }
308
+ self.unit2ctrl = Unit2Control(n_unit, n_spk, split_map)
309
+
310
+ def forward(self, units_frames, f0_frames, volume_frames, spk_id=None, spk_mix_dict=None, initial_phase=None, infer=True, max_upsample_dim=32):
311
+ '''
312
+ units_frames: B x n_frames x n_unit
313
+ f0_frames: B x n_frames x 1
314
+ volume_frames: B x n_frames x 1
315
+ spk_id: B x 1
316
+ '''
317
+ # exciter phase
318
+ f0 = upsample(f0_frames, self.block_size)
319
+ if infer:
320
+ x = torch.cumsum(f0.double() / self.sampling_rate, axis=1)
321
+ else:
322
+ x = torch.cumsum(f0 / self.sampling_rate, axis=1)
323
+ if initial_phase is not None:
324
+ x += initial_phase.to(x) / 2 / np.pi
325
+ x = x - torch.round(x)
326
+ x = x.to(f0)
327
+
328
+ phase = 2 * np.pi * x
329
+ phase_frames = phase[:, ::self.block_size, :]
330
+
331
+ # parameter prediction
332
+ ctrls = self.unit2ctrl(units_frames, f0_frames, phase_frames, volume_frames, spk_id=spk_id, spk_mix_dict=spk_mix_dict)
333
+
334
+ amplitudes_frames = torch.exp(ctrls['amplitudes'])/ 128
335
+ group_delay = np.pi * torch.tanh(ctrls['group_delay'])
336
+ noise_param = torch.exp(ctrls['noise_magnitude']) / 128
337
+
338
+ # sinusoids exciter signal
339
+ amplitudes_frames = remove_above_fmax(amplitudes_frames, f0_frames, self.sampling_rate / 2, level_start = 1)
340
+ n_harmonic = amplitudes_frames.shape[-1]
341
+ level_harmonic = torch.arange(1, n_harmonic + 1).to(phase)
342
+ sinusoids = 0.
343
+ for n in range(( n_harmonic - 1) // max_upsample_dim + 1):
344
+ start = n * max_upsample_dim
345
+ end = (n + 1) * max_upsample_dim
346
+ phases = phase * level_harmonic[start:end]
347
+ amplitudes = upsample(amplitudes_frames[:,:,start:end], self.block_size)
348
+ sinusoids += (torch.sin(phases) * amplitudes).sum(-1)
349
+
350
+ # harmonic part filter (apply group-delay)
351
+ harmonic = frequency_filter(
352
+ sinusoids,
353
+ torch.exp(1.j * torch.cumsum(group_delay, axis = -1)),
354
+ hann_window = False)
355
+
356
+ # noise part filter
357
+ noise = torch.rand_like(harmonic) * 2 - 1
358
+ noise = frequency_filter(
359
+ noise,
360
+ torch.complex(noise_param, torch.zeros_like(noise_param)),
361
+ hann_window = True)
362
+
363
+ signal = harmonic + noise
364
+
365
+ return signal, phase, (harmonic, noise) #, (noise_param, noise_param)
366
+
367
+ class CombSubFast(torch.nn.Module):
368
+ def __init__(self,
369
+ sampling_rate,
370
+ block_size,
371
+ n_unit=256,
372
+ n_spk=1):
373
+ super().__init__()
374
+
375
+ print(' [DDSP Model] Combtooth Subtractive Synthesiser')
376
+ # params
377
+ self.register_buffer("sampling_rate", torch.tensor(sampling_rate))
378
+ self.register_buffer("block_size", torch.tensor(block_size))
379
+ self.register_buffer("window", torch.sqrt(torch.hann_window(2 * block_size)))
380
+ #Unit2Control
381
+ split_map = {
382
+ 'harmonic_magnitude': block_size + 1,
383
+ 'harmonic_phase': block_size + 1,
384
+ 'noise_magnitude': block_size + 1
385
+ }
386
+ self.unit2ctrl = Unit2Control(n_unit, n_spk, split_map)
387
+
388
+ def forward(self, units_frames, f0_frames, volume_frames, spk_id=None, spk_mix_dict=None, initial_phase=None, infer=True, **kwargs):
389
+ '''
390
+ units_frames: B x n_frames x n_unit
391
+ f0_frames: B x n_frames x 1
392
+ volume_frames: B x n_frames x 1
393
+ spk_id: B x 1
394
+ '''
395
+ # exciter phase
396
+ f0 = upsample(f0_frames, self.block_size)
397
+ if infer:
398
+ x = torch.cumsum(f0.double() / self.sampling_rate, axis=1)
399
+ else:
400
+ x = torch.cumsum(f0 / self.sampling_rate, axis=1)
401
+ if initial_phase is not None:
402
+ x += initial_phase.to(x) / 2 / np.pi
403
+ x = x - torch.round(x)
404
+ x = x.to(f0)
405
+
406
+ phase_frames = 2 * np.pi * x[:, ::self.block_size, :]
407
+
408
+ # parameter prediction
409
+ ctrls = self.unit2ctrl(units_frames, f0_frames, phase_frames, volume_frames, spk_id=spk_id, spk_mix_dict=spk_mix_dict)
410
+
411
+ src_filter = torch.exp(ctrls['harmonic_magnitude'] + 1.j * np.pi * ctrls['harmonic_phase'])
412
+ src_filter = torch.cat((src_filter, src_filter[:,-1:,:]), 1)
413
+ noise_filter= torch.exp(ctrls['noise_magnitude']) / 128
414
+ noise_filter = torch.cat((noise_filter, noise_filter[:,-1:,:]), 1)
415
+
416
+ # combtooth exciter signal
417
+ combtooth = torch.sinc(self.sampling_rate * x / (f0 + 1e-3))
418
+ combtooth = combtooth.squeeze(-1)
419
+ combtooth_frames = F.pad(combtooth, (self.block_size, self.block_size)).unfold(1, 2 * self.block_size, self.block_size)
420
+ combtooth_frames = combtooth_frames * self.window
421
+ combtooth_fft = torch.fft.rfft(combtooth_frames, 2 * self.block_size)
422
+
423
+ # noise exciter signal
424
+ noise = torch.rand_like(combtooth) * 2 - 1
425
+ noise_frames = F.pad(noise, (self.block_size, self.block_size)).unfold(1, 2 * self.block_size, self.block_size)
426
+ noise_frames = noise_frames * self.window
427
+ noise_fft = torch.fft.rfft(noise_frames, 2 * self.block_size)
428
+
429
+ # apply the filters
430
+ signal_fft = combtooth_fft * src_filter + noise_fft * noise_filter
431
+
432
+ # take the ifft to resynthesize audio.
433
+ signal_frames_out = torch.fft.irfft(signal_fft, 2 * self.block_size) * self.window
434
+
435
+ # overlap add
436
+ fold = torch.nn.Fold(output_size=(1, (signal_frames_out.size(1) + 1) * self.block_size), kernel_size=(1, 2 * self.block_size), stride=(1, self.block_size))
437
+ signal = fold(signal_frames_out.transpose(1, 2))[:, 0, 0, self.block_size : -self.block_size]
438
+
439
+ return signal, phase_frames, (signal, signal)
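`CombSubFast` windows each 2×block frame with a square-root Hann window before the FFT and again after the inverse FFT, so overlap-adding at a hop of `block_size` reconstructs the filtered signal: the squared window satisfies the constant-overlap-add condition. A quick numeric check of that property, as a sketch:

```python
import torch

block_size = 512
window = torch.sqrt(torch.hann_window(2 * block_size))
# the analysis and synthesis windows multiply to a Hann window,
# and hann(2N) shifted by N sums to one at every sample
cola = window[:block_size] ** 2 + window[block_size:] ** 2
print(torch.allclose(cola, torch.ones(block_size), atol=1e-6))   # True
```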
440
+
441
+ class CombSub(torch.nn.Module):
442
+ def __init__(self,
443
+ sampling_rate,
444
+ block_size,
445
+ n_mag_allpass,
446
+ n_mag_harmonic,
447
+ n_mag_noise,
448
+ n_unit=256,
449
+ n_spk=1):
450
+ super().__init__()
451
+
452
+ print(' [DDSP Model] Combtooth Subtractive Synthesiser (Old Version)')
453
+ # params
454
+ self.register_buffer("sampling_rate", torch.tensor(sampling_rate))
455
+ self.register_buffer("block_size", torch.tensor(block_size))
456
+ #Unit2Control
457
+ split_map = {
458
+ 'group_delay': n_mag_allpass,
459
+ 'harmonic_magnitude': n_mag_harmonic,
460
+ 'noise_magnitude': n_mag_noise
461
+ }
462
+ self.unit2ctrl = Unit2Control(n_unit, n_spk, split_map)
463
+
464
+ def forward(self, units_frames, f0_frames, volume_frames, spk_id=None, spk_mix_dict=None, initial_phase=None, infer=True, **kwargs):
465
+ '''
466
+ units_frames: B x n_frames x n_unit
467
+ f0_frames: B x n_frames x 1
468
+ volume_frames: B x n_frames x 1
469
+ spk_id: B x 1
470
+ '''
471
+ # exciter phase
472
+ f0 = upsample(f0_frames, self.block_size)
473
+ if infer:
474
+ x = torch.cumsum(f0.double() / self.sampling_rate, axis=1)
475
+ else:
476
+ x = torch.cumsum(f0 / self.sampling_rate, axis=1)
477
+ if initial_phase is not None:
478
+ x += initial_phase.to(x) / 2 / np.pi
479
+ x = x - torch.round(x)
480
+ x = x.to(f0)
481
+
482
+ phase_frames = 2 * np.pi * x[:, ::self.block_size, :]
483
+
484
+ # parameter prediction
485
+ ctrls = self.unit2ctrl(units_frames, f0_frames, phase_frames, volume_frames, spk_id=spk_id, spk_mix_dict=spk_mix_dict)
486
+
487
+ group_delay = np.pi * torch.tanh(ctrls['group_delay'])
488
+ src_param = torch.exp(ctrls['harmonic_magnitude'])
489
+ noise_param = torch.exp(ctrls['noise_magnitude']) / 128
490
+
491
+ # combtooth exciter signal
492
+ combtooth = torch.sinc(self.sampling_rate * x / (f0 + 1e-3))
493
+ combtooth = combtooth.squeeze(-1)
494
+
495
+ # harmonic part filter (using dynamic-windowed LTV-FIR, with group-delay prediction)
496
+ harmonic = frequency_filter(
497
+ combtooth,
498
+ torch.exp(1.j * torch.cumsum(group_delay, axis = -1)),
499
+ hann_window = False)
500
+ harmonic = frequency_filter(
501
+ harmonic,
502
+ torch.complex(src_param, torch.zeros_like(src_param)),
503
+ hann_window = True,
504
+ half_width_frames = 1.5 * self.sampling_rate / (f0_frames + 1e-3))
505
+
506
+ # noise part filter (using constant-windowed LTV-FIR, without group-delay)
507
+ noise = torch.rand_like(harmonic) * 2 - 1
508
+ noise = frequency_filter(
509
+ noise,
510
+ torch.complex(noise_param, torch.zeros_like(noise_param)),
511
+ hann_window = True)
512
+
513
+ signal = harmonic + noise
514
+
515
+ return signal, phase_frames, (harmonic, noise)
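Putting the pieces of this module together, offline inference roughly follows the pattern below (a condensed sketch of what the Flask and GUI front ends in this upload do; the paths and speaker id are placeholders):

```python
import librosa
import torch
from ddsp.vocoder import load_model, F0_Extractor, Volume_Extractor, Units_Encoder

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, args = load_model('exp/multi_speaker/model_300000.pt', device=device)

audio, sr = librosa.load('input.wav', sr=None, mono=True)
hop = args.data.block_size * sr / args.data.sampling_rate

f0 = F0_Extractor('dio', sr, hop).extract(audio, uv_interp=True, device=device)
f0 = torch.from_numpy(f0).float().to(device)[None, :, None]
volume = Volume_Extractor(hop).extract(audio)
volume = torch.from_numpy(volume).float().to(device)[None, :, None]
encoder = Units_Encoder(args.data.encoder, args.data.encoder_ckpt,
                        args.data.encoder_sample_rate, args.data.encoder_hop_size,
                        device=device)
units = encoder.encode(torch.from_numpy(audio).float()[None, :].to(device), sr, hop)

with torch.no_grad():
    output, _, _ = model(units, f0, volume, spk_id=torch.LongTensor([[1]]).to(device))
```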
draw.py ADDED
@@ -0,0 +1,101 @@
+ import numpy as np
+ import tqdm
+ import matplotlib.pyplot as plt
+ import os
+ import shutil
+ import wave
+
+ WAV_MIN_LENGTH = 2 # The minimum duration of wav files
+ SAMPLE_RATE = 1 # The percentage of files to be extracted
+ SAMPLE_MIN = 2 # The lower limit of the number of files to be extracted
+ SAMPLE_MAX = 10 # The upper limit of the number of files to be extracted
+
+
+ # Check whether a wav file is longer than the minimum duration
+ def check_duration(wav_file):
+     # open the wav file
+     f = wave.open(wav_file, "rb")
+     # get the number of frames and the frame rate
+     frames = f.getnframes()
+     rate = f.getframerate()
+     # compute the duration in seconds
+     duration = frames / float(rate)
+     # close the file
+     f.close()
+     # return whether the duration is longer than the minimum duration
+     return duration > WAV_MIN_LENGTH
+
+ # Randomly pick a given proportion of wav files from a directory and move them
+ # to another directory, keeping the directory structure
+ def split_data(src_dir, dst_dir, ratio):
+     # create the target directory if it does not exist
+     if not os.path.exists(dst_dir):
+         os.makedirs(dst_dir)
+
+     # collect the sub-directories and wav files directly under the source directory
+     subdirs, files = [], []
+     for item in os.listdir(src_dir):
+         item_path = os.path.join(src_dir, item)
+         if os.path.isdir(item_path):
+             subdirs.append(item)
+         elif os.path.isfile(item_path) and item.endswith(".wav"):
+             files.append(item)
+
+     # if the source directory contains no wav files, report an error and return
+     if len(files) == 0:
+         print(f"Error: No wav files found in {src_dir}")
+         return
+
+     # compute how many wav files to extract
+     num_files = int(len(files) * ratio)
+     num_files = max(SAMPLE_MIN, min(SAMPLE_MAX, num_files))
+
+     # shuffle the file list and take the first num_files entries
+     np.random.shuffle(files)
+     selected_files = files[:num_files]
+
+     # progress bar for the copy loop
+     pbar = tqdm.tqdm(total=num_files)
+
+     # iterate over the selected files
+     for file in selected_files:
+         # build the full source and target paths
+         src_file = os.path.join(src_dir, file)
+         dst_file = os.path.join(dst_dir, file)
+         # check that the source file is longer than 2 seconds
+         if check_duration(src_file):
+             # if so, move it to the target directory
+             shutil.move(src_file, dst_file)
+             # update the progress bar
+             pbar.update(1)
+         else:
+             # otherwise, report the file name and skip it
+             print(f"Skipped {src_file} because its duration is less than 2 seconds.")
+
+     # close the progress bar
+     pbar.close()
+
+     # recurse into the sub-directories (if any)
+     for subdir in subdirs:
+         # build the sub-directory paths in the source and target directories
+         src_subdir = os.path.join(src_dir, subdir)
+         dst_subdir = os.path.join(dst_dir, subdir)
+         # call this function recursively on the sub-directory,
+         # keeping the directory structure
+         split_data(src_subdir, dst_subdir, ratio)
+
+ # Main function: set the paths and call split_data
+
+ def main():
+     root_dir = os.path.abspath('.')
+     dst_dir = root_dir + "/data/val/audio"
+     # extraction ratio, 1 (percent) by default
+     ratio = float(SAMPLE_RATE) / 100
+
+     # the source directory is fixed to ./data/train/audio
+     src_dir = root_dir + "/data/train/audio"
+
+     # extract wav files from the source directory and move them to the target
+     # directory, keeping the directory structure
+     split_data(src_dir, dst_dir, ratio)
+
+ # Run the main function when executed as a script
+ if __name__ == "__main__":
+     main()
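In short, `draw.py` carves a small validation set out of `data/train/audio` by moving a few wavs (1 % of the files, clamped to between 2 and 10, each at least 2 s long) into `data/val/audio`. A sketch of calling the helper directly with a custom ratio:

```python
from draw import split_data

# move roughly 2 % of the training wavs (still clamped to 2-10 files) to validation
split_data("data/train/audio", "data/val/audio", ratio=0.02)
```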
encoder/hubert/model.py ADDED
@@ -0,0 +1,293 @@
1
+ import copy
2
+ from typing import Optional, Tuple
3
+ import random
4
+
5
+ from sklearn.cluster import KMeans
6
+
7
+ import torch
8
+ import torch.nn as nn
9
+ import torch.nn.functional as F
10
+ from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present
11
+
12
+ URLS = {
13
+ "hubert-discrete": "https://github.com/bshall/hubert/releases/download/v0.1/hubert-discrete-e9416457.pt",
14
+ "hubert-soft": "https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt",
15
+ "kmeans100": "https://github.com/bshall/hubert/releases/download/v0.1/kmeans100-50f36a95.pt",
16
+ }
17
+
18
+
19
+ class Hubert(nn.Module):
20
+ def __init__(self, num_label_embeddings: int = 100, mask: bool = True):
21
+ super().__init__()
22
+ self._mask = mask
23
+ self.feature_extractor = FeatureExtractor()
24
+ self.feature_projection = FeatureProjection()
25
+ self.positional_embedding = PositionalConvEmbedding()
26
+ self.norm = nn.LayerNorm(768)
27
+ self.dropout = nn.Dropout(0.1)
28
+ self.encoder = TransformerEncoder(
29
+ nn.TransformerEncoderLayer(
30
+ 768, 12, 3072, activation="gelu", batch_first=True
31
+ ),
32
+ 12,
33
+ )
34
+ self.proj = nn.Linear(768, 256)
35
+
36
+ self.masked_spec_embed = nn.Parameter(torch.FloatTensor(768).uniform_())
37
+ self.label_embedding = nn.Embedding(num_label_embeddings, 256)
38
+
39
+ def mask(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
40
+ mask = None
41
+ if self.training and self._mask:
42
+ mask = _compute_mask((x.size(0), x.size(1)), 0.8, 10, x.device, 2)
43
+ x[mask] = self.masked_spec_embed.to(x.dtype)
44
+ return x, mask
45
+
46
+ def encode(
47
+ self, x: torch.Tensor, layer: Optional[int] = None
48
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
49
+ x = self.feature_extractor(x)
50
+ x = self.feature_projection(x.transpose(1, 2))
51
+ x, mask = self.mask(x)
52
+ x = x + self.positional_embedding(x)
53
+ x = self.dropout(self.norm(x))
54
+ x = self.encoder(x, output_layer=layer)
55
+ return x, mask
56
+
57
+ def logits(self, x: torch.Tensor) -> torch.Tensor:
58
+ logits = torch.cosine_similarity(
59
+ x.unsqueeze(2),
60
+ self.label_embedding.weight.unsqueeze(0).unsqueeze(0),
61
+ dim=-1,
62
+ )
63
+ return logits / 0.1
64
+
65
+ def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
66
+ x, mask = self.encode(x)
67
+ x = self.proj(x)
68
+ logits = self.logits(x)
69
+ return logits, mask
70
+
71
+
72
+ class HubertSoft(Hubert):
73
+ def __init__(self):
74
+ super().__init__()
75
+
76
+ @torch.inference_mode()
77
+ def units(self, wav: torch.Tensor) -> torch.Tensor:
78
+ wav = F.pad(wav, ((400 - 320) // 2, (400 - 320) // 2))
79
+ x, _ = self.encode(wav)
80
+ return self.proj(x)
81
+
82
+
83
+ class HubertDiscrete(Hubert):
84
+ def __init__(self, kmeans):
85
+ super().__init__(504)
86
+ self.kmeans = kmeans
87
+
88
+ @torch.inference_mode()
89
+ def units(self, wav: torch.Tensor) -> torch.LongTensor:
90
+ wav = F.pad(wav, ((400 - 320) // 2, (400 - 320) // 2))
91
+ x, _ = self.encode(wav, layer=7)
92
+ x = self.kmeans.predict(x.squeeze().cpu().numpy())
93
+ return torch.tensor(x, dtype=torch.long, device=wav.device)
94
+
95
+
96
+ class FeatureExtractor(nn.Module):
97
+ def __init__(self):
98
+ super().__init__()
99
+ self.conv0 = nn.Conv1d(1, 512, 10, 5, bias=False)
100
+ self.norm0 = nn.GroupNorm(512, 512)
101
+ self.conv1 = nn.Conv1d(512, 512, 3, 2, bias=False)
102
+ self.conv2 = nn.Conv1d(512, 512, 3, 2, bias=False)
103
+ self.conv3 = nn.Conv1d(512, 512, 3, 2, bias=False)
104
+ self.conv4 = nn.Conv1d(512, 512, 3, 2, bias=False)
105
+ self.conv5 = nn.Conv1d(512, 512, 2, 2, bias=False)
106
+ self.conv6 = nn.Conv1d(512, 512, 2, 2, bias=False)
107
+
108
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
109
+ x = F.gelu(self.norm0(self.conv0(x)))
110
+ x = F.gelu(self.conv1(x))
111
+ x = F.gelu(self.conv2(x))
112
+ x = F.gelu(self.conv3(x))
113
+ x = F.gelu(self.conv4(x))
114
+ x = F.gelu(self.conv5(x))
115
+ x = F.gelu(self.conv6(x))
116
+ return x
117
+
118
+
119
+ class FeatureProjection(nn.Module):
120
+ def __init__(self):
121
+ super().__init__()
122
+ self.norm = nn.LayerNorm(512)
123
+ self.projection = nn.Linear(512, 768)
124
+ self.dropout = nn.Dropout(0.1)
125
+
126
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
127
+ x = self.norm(x)
128
+ x = self.projection(x)
129
+ x = self.dropout(x)
130
+ return x
131
+
132
+
133
+ class PositionalConvEmbedding(nn.Module):
134
+ def __init__(self):
135
+ super().__init__()
136
+ self.conv = nn.Conv1d(
137
+ 768,
138
+ 768,
139
+ kernel_size=128,
140
+ padding=128 // 2,
141
+ groups=16,
142
+ )
143
+ self.conv = nn.utils.weight_norm(self.conv, name="weight", dim=2)
144
+
145
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
146
+ x = self.conv(x.transpose(1, 2))
147
+ x = F.gelu(x[:, :, :-1])
148
+ return x.transpose(1, 2)
149
+
150
+
151
+ class TransformerEncoder(nn.Module):
152
+ def __init__(
153
+ self, encoder_layer: nn.TransformerEncoderLayer, num_layers: int
154
+ ) -> None:
155
+ super(TransformerEncoder, self).__init__()
156
+ self.layers = nn.ModuleList(
157
+ [copy.deepcopy(encoder_layer) for _ in range(num_layers)]
158
+ )
159
+ self.num_layers = num_layers
160
+
161
+ def forward(
162
+ self,
163
+ src: torch.Tensor,
164
+ mask: torch.Tensor = None,
165
+ src_key_padding_mask: torch.Tensor = None,
166
+ output_layer: Optional[int] = None,
167
+ ) -> torch.Tensor:
168
+ output = src
169
+ for layer in self.layers[:output_layer]:
170
+ output = layer(
171
+ output, src_mask=mask, src_key_padding_mask=src_key_padding_mask
172
+ )
173
+ return output
174
+
175
+
176
+ def _compute_mask(
177
+ shape: Tuple[int, int],
178
+ mask_prob: float,
179
+ mask_length: int,
180
+ device: torch.device,
181
+ min_masks: int = 0,
182
+ ) -> torch.Tensor:
183
+ batch_size, sequence_length = shape
184
+
185
+ if mask_length < 1:
186
+ raise ValueError("`mask_length` has to be bigger than 0.")
187
+
188
+ if mask_length > sequence_length:
189
+ raise ValueError(
190
+ f"`mask_length` has to be smaller than `sequence_length`, but got `mask_length`: {mask_length} and `sequence_length`: {sequence_length}`"
191
+ )
192
+
193
+ # compute number of masked spans in batch
194
+ num_masked_spans = int(mask_prob * sequence_length / mask_length + random.random())
195
+ num_masked_spans = max(num_masked_spans, min_masks)
196
+
197
+ # make sure num masked indices <= sequence_length
198
+ if num_masked_spans * mask_length > sequence_length:
199
+ num_masked_spans = sequence_length // mask_length
200
+
201
+ # SpecAugment mask to fill
202
+ mask = torch.zeros((batch_size, sequence_length), device=device, dtype=torch.bool)
203
+
204
+ # uniform distribution to sample from, make sure that offset samples are < sequence_length
205
+ uniform_dist = torch.ones(
206
+ (batch_size, sequence_length - (mask_length - 1)), device=device
207
+ )
208
+
209
+ # get random indices to mask
210
+ mask_indices = torch.multinomial(uniform_dist, num_masked_spans)
211
+
212
+ # expand masked indices to masked spans
213
+ mask_indices = (
214
+ mask_indices.unsqueeze(dim=-1)
215
+ .expand((batch_size, num_masked_spans, mask_length))
216
+ .reshape(batch_size, num_masked_spans * mask_length)
217
+ )
218
+ offsets = (
219
+ torch.arange(mask_length, device=device)[None, None, :]
220
+ .expand((batch_size, num_masked_spans, mask_length))
221
+ .reshape(batch_size, num_masked_spans * mask_length)
222
+ )
223
+ mask_idxs = mask_indices + offsets
224
+
225
+ # scatter indices to mask
226
+ mask = mask.scatter(1, mask_idxs, True)
227
+
228
+ return mask
229
+
230
+
231
+ def hubert_discrete(
232
+ pretrained: bool = True,
233
+ progress: bool = True,
234
+ ) -> HubertDiscrete:
235
+ r"""HuBERT-Discrete from `"A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion"`.
236
+ Args:
237
+ pretrained (bool): load pretrained weights into the model
238
+ progress (bool): show progress bar when downloading model
239
+ """
240
+ kmeans = kmeans100(pretrained=pretrained, progress=progress)
241
+ hubert = HubertDiscrete(kmeans)
242
+ if pretrained:
243
+ checkpoint = torch.hub.load_state_dict_from_url(
244
+ URLS["hubert-discrete"], progress=progress
245
+ )
246
+ consume_prefix_in_state_dict_if_present(checkpoint, "module.")
247
+ hubert.load_state_dict(checkpoint)
248
+ hubert.eval()
249
+ return hubert
250
+
251
+
252
+ def hubert_soft(
253
+ pretrained: bool = True,
254
+ progress: bool = True,
255
+ ) -> HubertSoft:
256
+ r"""HuBERT-Soft from `"A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion"`.
257
+ Args:
258
+ pretrained (bool): load pretrained weights into the model
259
+ progress (bool): show progress bar when downloading model
260
+ """
261
+ hubert = HubertSoft()
262
+ if pretrained:
263
+ checkpoint = torch.hub.load_state_dict_from_url(
264
+ URLS["hubert-soft"], progress=progress
265
+ )
266
+ consume_prefix_in_state_dict_if_present(checkpoint, "module.")
267
+ hubert.load_state_dict(checkpoint)
268
+ hubert.eval()
269
+ return hubert
270
+
271
+
272
+ def _kmeans(
273
+ num_clusters: int, pretrained: bool = True, progress: bool = True
274
+ ) -> KMeans:
275
+ kmeans = KMeans(num_clusters)
276
+ if pretrained:
277
+ checkpoint = torch.hub.load_state_dict_from_url(
278
+ URLS[f"kmeans{num_clusters}"], progress=progress
279
+ )
280
+ kmeans.__dict__["n_features_in_"] = checkpoint["n_features_in_"]
281
+ kmeans.__dict__["_n_threads"] = checkpoint["_n_threads"]
282
+ kmeans.__dict__["cluster_centers_"] = checkpoint["cluster_centers_"].numpy()
283
+ return kmeans
284
+
285
+
286
+ def kmeans100(pretrained: bool = True, progress: bool = True) -> KMeans:
287
+ r"""
288
+ k-means checkpoint for HuBERT-Discrete with 100 clusters.
289
+ Args:
290
+ pretrained (bool): load pretrained weights into the model
291
+ progress (bool): show progress bar when downloading model
292
+ """
293
+ return _kmeans(100, pretrained, progress)
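A short sketch of extracting soft speech units directly from this module; `hubert_soft(pretrained=True)` downloads the published checkpoint on first use, and the wav path is a placeholder for a 16 kHz mono recording:

```python
import librosa
import torch
from encoder.hubert.model import hubert_soft

hubert = hubert_soft(pretrained=True)
wav, _ = librosa.load('input_16k.wav', sr=16000, mono=True)
wav = torch.from_numpy(wav).float()[None, None, :]   # batch, channel, time
units = hubert.units(wav)                             # 1 x n_frames x 256
print(units.shape)
```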
enhancer.py ADDED
@@ -0,0 +1,105 @@
1
+ import numpy as np
2
+ import torch
3
+ import torch.nn.functional as F
4
+ from nsf_hifigan.nvSTFT import STFT
5
+ from nsf_hifigan.models import load_model
6
+ from torchaudio.transforms import Resample
7
+
8
+ class Enhancer:
9
+ def __init__(self, enhancer_type, enhancer_ckpt, device=None):
10
+ if device is None:
11
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
12
+ self.device = device
13
+
14
+ if enhancer_type == 'nsf-hifigan':
15
+ self.enhancer = NsfHifiGAN(enhancer_ckpt, device=self.device)
16
+ else:
17
+ raise ValueError(f" [x] Unknown enhancer: {enhancer_type}")
18
+
19
+ self.resample_kernel = {}
20
+ self.enhancer_sample_rate = self.enhancer.sample_rate()
21
+ self.enhancer_hop_size = self.enhancer.hop_size()
22
+
23
+ def enhance(self,
24
+ audio, # 1, T
25
+ sample_rate,
26
+ f0, # 1, n_frames, 1
27
+ hop_size,
28
+ adaptive_key = 0,
29
+ silence_front = 0
30
+ ):
31
+ # enhancer start time
32
+ start_frame = int(silence_front * sample_rate / hop_size)
33
+ real_silence_front = start_frame * hop_size / sample_rate
34
+ audio = audio[:, int(np.round(real_silence_front * sample_rate)) : ]
35
+ f0 = f0[: , start_frame :, :]
36
+
37
+ # adaptive parameters
38
+ adaptive_factor = 2 ** ( -adaptive_key / 12)
39
+ adaptive_sample_rate = 100 * int(np.round(self.enhancer_sample_rate / adaptive_factor / 100))
40
+ real_factor = self.enhancer_sample_rate / adaptive_sample_rate
41
+
42
+ # resample the ddsp output
43
+ if sample_rate == adaptive_sample_rate:
44
+ audio_res = audio
45
+ else:
46
+ key_str = str(sample_rate) + str(adaptive_sample_rate)
47
+ if key_str not in self.resample_kernel:
48
+ self.resample_kernel[key_str] = Resample(sample_rate, adaptive_sample_rate, lowpass_filter_width = 128).to(self.device)
49
+ audio_res = self.resample_kernel[key_str](audio)
50
+
51
+ n_frames = int(audio_res.size(-1) // self.enhancer_hop_size + 1)
52
+
53
+ # resample f0
54
+ f0_np = f0.squeeze(0).squeeze(-1).cpu().numpy()
55
+ f0_np *= real_factor
56
+ time_org = (hop_size / sample_rate) * np.arange(len(f0_np)) / real_factor
57
+ time_frame = (self.enhancer_hop_size / self.enhancer_sample_rate) * np.arange(n_frames)
58
+ f0_res = np.interp(time_frame, time_org, f0_np, left=f0_np[0], right=f0_np[-1])
59
+ f0_res = torch.from_numpy(f0_res).unsqueeze(0).float().to(self.device) # 1, n_frames
60
+
61
+ # enhance
62
+ enhanced_audio, enhancer_sample_rate = self.enhancer(audio_res, f0_res)
63
+
64
+ # resample the enhanced output
65
+ if adaptive_factor != 0:
66
+ key_str = str(adaptive_sample_rate) + str(enhancer_sample_rate)
67
+ if key_str not in self.resample_kernel:
68
+ self.resample_kernel[key_str] = Resample(adaptive_sample_rate, enhancer_sample_rate, lowpass_filter_width = 128).to(self.device)
69
+ enhanced_audio = self.resample_kernel[key_str](enhanced_audio)
70
+
71
+ # pad the silence frames
72
+ if start_frame > 0:
73
+ enhanced_audio = F.pad(enhanced_audio, (int(np.round(enhancer_sample_rate * real_silence_front)), 0))
74
+
75
+ return enhanced_audio, enhancer_sample_rate
76
+
77
+
78
+ class NsfHifiGAN(torch.nn.Module):
79
+ def __init__(self, model_path, device=None):
80
+ super().__init__()
81
+ if device is None:
82
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
83
+ self.device = device
84
+ print('| Load HifiGAN: ', model_path)
85
+ self.model, self.h = load_model(model_path, device=self.device)
86
+
87
+ def sample_rate(self):
88
+ return self.h.sampling_rate
89
+
90
+ def hop_size(self):
91
+ return self.h.hop_size
92
+
93
+ def forward(self, audio, f0):
94
+ stft = STFT(
95
+ self.h.sampling_rate,
96
+ self.h.num_mels,
97
+ self.h.n_fft,
98
+ self.h.win_size,
99
+ self.h.hop_size,
100
+ self.h.fmin,
101
+ self.h.fmax)
102
+ with torch.no_grad():
103
+ mel = stft.get_mel(audio)
104
+ enhanced_audio = self.model(mel, f0[:,:mel.size(-1)]).view(-1)
105
+ return enhanced_audio, self.h.sampling_rate
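A sketch of running the enhancer on a chunk of DDSP output (the checkpoint path is a placeholder; `audio` is 1 x T at the DDSP sampling rate and `f0` is 1 x n_frames x 1):

```python
import torch
from enhancer import Enhancer

enhancer = Enhancer('nsf-hifigan', 'pretrain/nsf_hifigan/model', device='cpu')
audio = torch.zeros(1, 44100)                     # DDSP output, 1 x T
f0 = torch.full((1, 44100 // 512 + 1, 1), 220.0)  # Hz per frame
enhanced, enhanced_sr = enhancer.enhance(audio, 44100, f0, hop_size=512)
print(enhanced.shape, enhanced_sr)
```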
exp/gitkeep ADDED
File without changes
flask_api.py ADDED
@@ -0,0 +1,173 @@
1
+ import io
2
+ import logging
3
+ import torch
4
+ import numpy as np
5
+ import slicer
6
+ import soundfile as sf
7
+ import librosa
8
+ from flask import Flask, request, send_file
9
+ from flask_cors import CORS
10
+
11
+ from ddsp.vocoder import load_model, F0_Extractor, Volume_Extractor, Units_Encoder
12
+ from ddsp.core import upsample
13
+ from enhancer import Enhancer
14
+
15
+
16
+ app = Flask(__name__)
17
+
18
+ CORS(app)
19
+
20
+ logging.getLogger("numba").setLevel(logging.WARNING)
21
+
22
+
23
+ @app.route("/voiceChangeModel", methods=["POST"])
24
+ def voice_change_model():
25
+ request_form = request.form
26
+ wave_file = request.files.get("sample", None)
27
+ # get fSafePrefixPadLength
28
+ f_safe_prefix_pad_length = float(request_form.get("fSafePrefixPadLength", 0))
29
+ print("f_safe_prefix_pad_length:"+str(f_safe_prefix_pad_length))
30
+ # 变调信息
31
+ f_pitch_change = float(request_form.get("fPitchChange", 0))
32
+ # 获取spk_id
33
+ int_speak_id = int(request_form.get("sSpeakId", 0))
34
+ if enable_spk_id_cover:
35
+ int_speak_id = spk_id
36
+ # print("说话人:" + str(int_speak_id))
37
+ # DAW所需的采样率
38
+ daw_sample = int(float(request_form.get("sampleRate", 0)))
39
+ # http获得wav文件并转换
40
+ input_wav_read = io.BytesIO(wave_file.read())
41
+ # 模型推理
42
+ _audio, _model_sr = svc_model.infer(input_wav_read, f_pitch_change, int_speak_id, f_safe_prefix_pad_length)
43
+ tar_audio = librosa.resample(_audio, _model_sr, daw_sample)
44
+ # 返回音频
45
+ out_wav_path = io.BytesIO()
46
+ sf.write(out_wav_path, tar_audio, daw_sample, format="wav")
47
+ out_wav_path.seek(0)
48
+ return send_file(out_wav_path, download_name="temp.wav", as_attachment=True)
49
+
50
+
51
+ class SvcDDSP:
52
+ def __init__(self, model_path, vocoder_based_enhancer, enhancer_adaptive_key, input_pitch_extractor,
53
+ f0_min, f0_max, threhold, spk_id, spk_mix_dict, enable_spk_id_cover):
54
+ self.model_path = model_path
55
+ self.vocoder_based_enhancer = vocoder_based_enhancer
56
+ self.enhancer_adaptive_key = enhancer_adaptive_key
57
+ self.input_pitch_extractor = input_pitch_extractor
58
+ self.f0_min = f0_min
59
+ self.f0_max = f0_max
60
+ self.threhold = threhold
61
+ self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
62
+ self.spk_id = spk_id
63
+ self.spk_mix_dict = spk_mix_dict
64
+ self.enable_spk_id_cover = enable_spk_id_cover
65
+
66
+ # load ddsp model
67
+ self.model, self.args = load_model(self.model_path, device=self.device)
68
+
69
+ # load units encoder
70
+ self.units_encoder = Units_Encoder(
71
+ self.args.data.encoder,
72
+ self.args.data.encoder_ckpt,
73
+ self.args.data.encoder_sample_rate,
74
+ self.args.data.encoder_hop_size,
75
+ device=self.device)
76
+
77
+ # load enhancer
78
+ if self.vocoder_based_enhancer:
79
+ self.enhancer = Enhancer(self.args.enhancer.type, self.args.enhancer.ckpt, device=self.device)
80
+
81
+ def infer(self, input_wav, pitch_adjust, speaker_id, safe_prefix_pad_length):
82
+ print("Infer!")
83
+ # load input
84
+ audio, sample_rate = librosa.load(input_wav, sr=None, mono=True)
85
+ if len(audio.shape) > 1:
86
+ audio = librosa.to_mono(audio)
87
+ hop_size = self.args.data.block_size * sample_rate / self.args.data.sampling_rate
88
+
89
+ # safe front silence
90
+ if safe_prefix_pad_length > 0.03:
91
+ silence_front = safe_prefix_pad_length - 0.03
92
+ else:
93
+ silence_front = 0
94
+
95
+ # extract f0
96
+ pitch_extractor = F0_Extractor(
97
+ self.input_pitch_extractor,
98
+ sample_rate,
99
+ hop_size,
100
+ float(self.f0_min),
101
+ float(self.f0_max))
102
+ f0 = pitch_extractor.extract(audio, uv_interp=True, device=self.device, silence_front=silence_front)
103
+ f0 = torch.from_numpy(f0).float().to(self.device).unsqueeze(-1).unsqueeze(0)
104
+ f0 = f0 * 2 ** (float(pitch_adjust) / 12)
105
+
106
+ # extract volume
107
+ volume_extractor = Volume_Extractor(hop_size)
108
+ volume = volume_extractor.extract(audio)
109
+ mask = (volume > 10 ** (float(self.threhold) / 20)).astype('float')
110
+ mask = np.pad(mask, (4, 4), constant_values=(mask[0], mask[-1]))
111
+ mask = np.array([np.max(mask[n : n + 9]) for n in range(len(mask) - 8)])
112
+ mask = torch.from_numpy(mask).float().to(self.device).unsqueeze(-1).unsqueeze(0)
113
+ mask = upsample(mask, self.args.data.block_size).squeeze(-1)
114
+ volume = torch.from_numpy(volume).float().to(self.device).unsqueeze(-1).unsqueeze(0)
115
+
116
+ # extract units
117
+ audio_t = torch.from_numpy(audio).float().unsqueeze(0).to(self.device)
118
+ units = self.units_encoder.encode(audio_t, sample_rate, hop_size)
119
+
120
+ # spk_id or spk_mix_dict
121
+ if self.enable_spk_id_cover:
122
+ spk_id = self.spk_id
123
+ else:
124
+ spk_id = speaker_id
125
+ spk_id = torch.LongTensor(np.array([[spk_id]])).to(self.device)
126
+
127
+ # forward and return the output
128
+ with torch.no_grad():
129
+ output, _, (s_h, s_n) = self.model(units, f0, volume, spk_id = spk_id, spk_mix_dict = self.spk_mix_dict)
130
+ output *= mask
131
+ if self.vocoder_based_enhancer:
132
+ output, output_sample_rate = self.enhancer.enhance(
133
+ output,
134
+ self.args.data.sampling_rate,
135
+ f0,
136
+ self.args.data.block_size,
137
+ adaptive_key = self.enhancer_adaptive_key,
138
+ silence_front = silence_front)
139
+ else:
140
+ output_sample_rate = self.args.data.sampling_rate
141
+
142
+ output = output.squeeze().cpu().numpy()
143
+ return output, output_sample_rate
144
+
145
+
146
+ if __name__ == "__main__":
147
+ # ddsp-svc下只需传入下列参数。
148
+ # 对接的是串串香火锅大佬https://github.com/zhaohui8969/VST_NetProcess-。建议使用最新版本。
149
+ # flask部分来自diffsvc小狼大佬编写的代码。
150
+ # config和模型得同一目录。
151
+ checkpoint_path = "exp/multi_speaker/model_300000.pt"
152
+ # 是否使用预训练的基于声码器的增强器增强输出,但对硬件要求更高。
153
+ use_vocoder_based_enhancer = True
154
+ # 结合增强器使用,0为正常音域范围(最高G5)内的高音频质量,大于0则可以防止超高音破音
155
+ enhancer_adaptive_key = 0
156
+ # f0提取器,有parselmouth, dio, harvest, crepe
157
+ select_pitch_extractor = 'crepe'
158
+ # f0范围限制(Hz)
159
+ limit_f0_min = 50
160
+ limit_f0_max = 1100
161
+ # 音量响应阈值(dB)
162
+ threhold = -60
163
+ # 默认说话人。以及是否优先使用默认说话人覆盖vst传入的参数。
164
+ spk_id = 1
165
+ enable_spk_id_cover = True
166
+ # 混合说话人字典(捏音色功能)
167
+ # 设置为非 None 字典会覆盖 spk_id
168
+ spk_mix_dict = None # {1:0.5, 2:0.5} 表示1号说话人和2号说话人的音色按照0.5:0.5的比例混合
169
+ svc_model = SvcDDSP(checkpoint_path, use_vocoder_based_enhancer, enhancer_adaptive_key, select_pitch_extractor,
170
+ limit_f0_min, limit_f0_max, threhold, spk_id, spk_mix_dict, enable_spk_id_cover)
171
+
172
+ # 此处与vst插件对应,端口必须接上。
173
+ app.run(port=6844, host="0.0.0.0", debug=False, threaded=False)
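For reference, the endpoint can also be exercised without the VST plugin. A sketch of a test client using `requests` (the form field names match the handler above; the wav path is a placeholder):

```python
import requests

with open('input.wav', 'rb') as f:
    resp = requests.post(
        'http://127.0.0.1:6844/voiceChangeModel',
        files={'sample': f},
        data={'fPitchChange': 0, 'sSpeakId': 1,
              'sampleRate': 44100, 'fSafePrefixPadLength': 0})

with open('output.wav', 'wb') as f:
    f.write(resp.content)
```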
gui.py ADDED
@@ -0,0 +1,299 @@
1
+ import PySimpleGUI as sg
2
+ import sounddevice as sd
3
+ import torch,librosa,threading,time
4
+ from enhancer import Enhancer
5
+ import numpy as np
6
+ from ddsp.vocoder import load_model, F0_Extractor, Volume_Extractor, Units_Encoder
7
+ from ddsp.core import upsample
8
+
9
+
10
+ class SvcDDSP:
11
+ def __init__(self, model_path, vocoder_based_enhancer, enhancer_adaptive_key, input_pitch_extractor,
12
+ f0_min, f0_max, threhold, spk_id, spk_mix_dict, enable_spk_id_cover):
13
+ self.model_path = model_path
14
+ self.vocoder_based_enhancer = vocoder_based_enhancer
15
+ self.enhancer_adaptive_key = enhancer_adaptive_key
16
+ self.input_pitch_extractor = input_pitch_extractor
17
+ self.f0_min = f0_min
18
+ self.f0_max = f0_max
19
+ self.threhold = threhold
20
+ self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
21
+ self.spk_id = spk_id
22
+ self.spk_mix_dict = spk_mix_dict
23
+ self.enable_spk_id_cover = enable_spk_id_cover
24
+
25
+ # load ddsp model
26
+ self.model, self.args = load_model(self.model_path, device=self.device)
27
+
28
+ # load units encoder
29
+ self.units_encoder = Units_Encoder(
30
+ self.args.data.encoder,
31
+ self.args.data.encoder_ckpt,
32
+ self.args.data.encoder_sample_rate,
33
+ self.args.data.encoder_hop_size,
34
+ device=self.device)
35
+
36
+ # load enhancer
37
+ if self.vocoder_based_enhancer:
38
+ self.enhancer = Enhancer(self.args.enhancer.type, self.args.enhancer.ckpt, device=self.device)
39
+
40
+ def infer(self, pitch_adjust, speaker_id, safe_prefix_pad_length,audio,sample_rate):
41
+ print("Infering...")
42
+ # load input
43
+ #audio, sample_rate = librosa.load(input_wav, sr=None, mono=True)
44
+ hop_size = self.args.data.block_size * sample_rate / self.args.data.sampling_rate
45
+ # safe front silence
46
+ if safe_prefix_pad_length > 0.03:
47
+ silence_front = safe_prefix_pad_length - 0.03
48
+ else:
49
+ silence_front = 0
50
+
51
+ # extract f0
52
+ pitch_extractor = F0_Extractor(
53
+ self.input_pitch_extractor,
54
+ sample_rate,
55
+ hop_size,
56
+ float(self.f0_min),
57
+ float(self.f0_max))
58
+ f0 = pitch_extractor.extract(audio, uv_interp=True, device=self.device, silence_front=silence_front)
59
+ f0 = torch.from_numpy(f0).float().to(self.device).unsqueeze(-1).unsqueeze(0)
60
+ f0 = f0 * 2 ** (float(pitch_adjust) / 12)
61
+
62
+ # extract volume
63
+ volume_extractor = Volume_Extractor(hop_size)
64
+ volume = volume_extractor.extract(audio)
65
+ mask = (volume > 10 ** (float(self.threhold) / 20)).astype('float')
66
+ mask = np.pad(mask, (4, 4), constant_values=(mask[0], mask[-1]))
67
+ mask = np.array([np.max(mask[n : n + 9]) for n in range(len(mask) - 8)])
68
+ mask = torch.from_numpy(mask).float().to(self.device).unsqueeze(-1).unsqueeze(0)
69
+ mask = upsample(mask, self.args.data.block_size).squeeze(-1)
70
+ volume = torch.from_numpy(volume).float().to(self.device).unsqueeze(-1).unsqueeze(0)
71
+
72
+ # extract units
73
+ audio_t = torch.from_numpy(audio).float().unsqueeze(0).to(self.device)
74
+ units = self.units_encoder.encode(audio_t, sample_rate, hop_size)
75
+
76
+ # spk_id or spk_mix_dict
77
+ if self.enable_spk_id_cover:
78
+ spk_id = self.spk_id
79
+ else:
80
+ spk_id = speaker_id
81
+ spk_id = torch.LongTensor(np.array([[spk_id]])).to(self.device)
82
+
83
+ # forward and return the output
84
+ with torch.no_grad():
85
+ output, _, (s_h, s_n) = self.model(units, f0, volume, spk_id = spk_id, spk_mix_dict = self.spk_mix_dict)
86
+ output *= mask
87
+ if self.vocoder_based_enhancer:
88
+ output, output_sample_rate = self.enhancer.enhance(
89
+ output,
90
+ self.args.data.sampling_rate,
91
+ f0,
92
+ self.args.data.block_size,
93
+ adaptive_key = self.enhancer_adaptive_key,
94
+ silence_front = silence_front)
95
+ else:
96
+ output_sample_rate = self.args.data.sampling_rate
97
+
98
+ output = output.squeeze().cpu().numpy()
99
+ return output, output_sample_rate
100
+
101
+
102
+
103
+
104
+ class GUI:
105
+ def __init__(self) -> None:
+     self.flag_vc:bool=False # flag for the voice conversion thread
+     self.samplerate=44100 # Hz
+     self.block_time=1.5 # s
+     self.block_frame=0
+     self.crossfade_frame=0
+     self.fade_in_window:np.ndarray=None # numpy array used for the crossfade
+     self.fade_out_window:np.ndarray=None # numpy array used for the crossfade
+     self.f_safe_prefix_pad_length:float = 1.0
+     self.input_wav:np.ndarray=None # buffer holding the normalized input audio
+     self.output_wav:np.ndarray=None # buffer holding the normalized output audio
+     self.temp_wav:np.ndarray=None # buffer holding the crossfade and the output audio
+     self.f_pitch_change:float = 0.0 # float(request_form.get("fPitchChange", 0))
+     self.crossfade_last:np.ndarray=None # crossfade tail of the previous output
+     self.f0_mode=["parselmouth", "dio", "harvest", "crepe"] # f0 predictors
+     self.spk_id = 1 # default speaker id
+     self.svc_model:SvcDDSP = None
+     # speaker mix dictionary (timbre blending feature)
+     # a non-None dictionary overrides spk_id
+     self.spk_mix_dict = None # {1:0.5, 2:0.5} mixes speaker 1 and speaker 2 at a 0.5:0.5 ratio
+     self.use_vocoder_based_enhancer = True
+     self.launcher() # start (called last, after all attributes above are initialized)
127
+
128
+
129
+ def launcher(self):
130
+ '''Window setup'''
131
+ input_devices,output_devices,_, _=self.get_devices()
132
+ sg.theme('DarkAmber') # set the theme
133
+ # window layout
134
+ layout = [
135
+ [ sg.Frame(layout=[
136
+ [sg.Input(key='sg_model',default_text='exp\\model_chino.pt'),sg.FileBrowse('选择模型文件')]
137
+ ],title='模型.pt格式(自动识别同目录下config.yaml)')
138
+ ],
139
+ [ sg.Frame(layout=[
140
+ [sg.Text("输入设备"),sg.Combo(input_devices,key='sg_input_device',default_value=input_devices[sd.default.device[0]])],
141
+ [sg.Text("输出设备"),sg.Combo(output_devices,key='sg_output_device',default_value=output_devices[sd.default.device[1]])]
142
+ ],title='音频设备')
143
+ ],
144
+ [ sg.Frame(layout=[
145
+ [sg.Text("说话人id"),sg.Input(key='spk_id',default_text='1')],
146
+ [sg.Text("响应阈值"),sg.Slider(range=(-60,0),orientation='h',key='noise',resolution=1,default_value=-35)],
147
+ [sg.Text("变调"),sg.Slider(range=(-24,24),orientation='h',key='pitch',resolution=1,default_value=12)],
148
+ [sg.Text("采样率"),sg.Input(key='samplerate',default_text='44100')],
149
+ [sg.Checkbox(text='启用捏音色功能',default=False,key='spk_mix'),sg.Button("设置混合音色",key='set_spk_mix')]
150
+ ],title='普通设置'),
151
+ sg.Frame(layout=[
152
+ [sg.Text("音频切分大小"),sg.Slider(range=(0.1,3.0),orientation='h',key='block',resolution=0.05,default_value=0.5)],
153
+ [sg.Text("交叉淡化时长"),sg.Slider(range=(0.02,0.1),orientation='h',key='crossfade',resolution=0.01)],
154
+ [sg.Text("使用历史区块数量"),sg.Slider(range=(1,10),orientation='h',key='buffernum',resolution=1,default_value=2)],
155
+ [sg.Text("f0预测模式"),sg.Combo(values=self.f0_mode,key='f0_mode',default_value=self.f0_mode[2])],
156
+ [sg.Checkbox(text='启用增强器',default=True,key='use_enhancer')]
157
+ ],title='性能设置'),
158
+ ],
159
+ [sg.Button("开始音频转换",key="start_vc"),sg.Button("停止音频转换",key="stop_vc")]
160
+ ]
161
+
162
+ # create the window
163
+ window = sg.Window('DDSP - GUI by INT16', layout)
164
+ self.event_handler(window=window)
165
+
166
+
167
+ def event_handler(self,window):
168
+ '''Event handling'''
169
+ while True: # event handling loop
170
+ event, values = window.read()
171
+ if event ==sg.WINDOW_CLOSED: # the user closed the window
172
+ self.flag_vc=False
173
+ exit()
174
+ if event=='start_vc' and self.flag_vc==False:
175
+ # set values; they correspond one-to-one with the layout above
176
+ checkpoint_path = values['sg_model']
177
+ self.set_devices(values["sg_input_device"],values['sg_output_device'])
178
+ self.spk_id=int(values['spk_id'])
179
+ threhold = values['noise']
180
+ self.f_pitch_change = values['pitch']
181
+ self.samplerate=int(values['samplerate'])
182
+ block_time = float(values['block'])
183
+ crossfade_time = values['crossfade']
184
+ buffer_num = int(values['buffernum'])
185
+ select_pitch_extractor=values['f0_mode']
186
+ self.use_vocoder_based_enhancer=values['use_enhancer']
187
+ if not values['spk_mix']:
188
+ self.spk_mix_dict=None
189
+ self.block_frame=int(block_time*self.samplerate)
190
+ self.crossfade_frame=int(crossfade_time*self.samplerate)
191
+ self.f_safe_prefix_pad_length=block_time*(buffer_num)-crossfade_time*2
192
+ print('crossfade_time:'+str(crossfade_time))
193
+ print("buffer_num:"+str(buffer_num))
194
+ print("samplerate:"+str(self.samplerate))
195
+ print('block_time:'+str(block_time))
196
+ print("prefix_pad_length:"+str(self.f_safe_prefix_pad_length))
197
+ print("mix_mode:"+str(self.spk_mix_dict))
198
+ print("enhancer:"+str(self.use_vocoder_based_enhancer))
199
+ self.start_vc(checkpoint_path,select_pitch_extractor,threhold,buffer_num)
200
+ if event=='stop_vc'and self.flag_vc==True:
201
+ self.flag_vc = False
202
+ if event=='set_spk_mix' and self.flag_vc==False:
203
+ spk_mix = sg.popup_get_text(message='示例:1:0.3,2:0.5,3:0.2',title="设置混合音色,支持多人")
204
+ if spk_mix != None:
205
+ self.spk_mix_dict=eval("{"+spk_mix.replace(',',',').replace(':',':')+"}")
206
+
207
+
208
+ def start_vc(self,checkpoint_path,select_pitch_extractor,threhold,buffer_num):
209
+ '''Start voice conversion'''
210
+ self.flag_vc = True
211
+ # whether to enhance the output with the pretrained vocoder-based enhancer (higher hardware requirements)
212
+
213
+ enhancer_adaptive_key = 0
214
+ # f0 range limit (Hz)
215
+ limit_f0_min = 50
216
+ limit_f0_max = 1100
217
+ enable_spk_id_cover = True
218
+ # initialize the ndarrays
219
+ self.input_wav=np.zeros(int((1+buffer_num)*self.block_frame),dtype='float32')
220
+ self.output_wav=np.zeros(self.block_frame,dtype='float32')
221
+ self.temp_wav=np.zeros(self.block_frame+self.crossfade_frame,dtype='float32')
222
+ self.crossfade_last=np.zeros(self.crossfade_frame,dtype='float32')
223
+ self.fade_in_window = np.linspace(0, 1,self.crossfade_frame)
224
+ self.fade_out_window = np.linspace(1, 0,self.crossfade_frame)
225
+ self.svc_model = SvcDDSP(checkpoint_path, self.use_vocoder_based_enhancer, enhancer_adaptive_key, select_pitch_extractor,limit_f0_min, limit_f0_max, threhold, self.spk_id, self.spk_mix_dict, enable_spk_id_cover)
226
+ thread_vc=threading.Thread(target=self.soundinput)
227
+ thread_vc.start()
228
+
229
+
230
+ def soundinput(self):
231
+ '''
232
+ Receive audio input
233
+ '''
234
+ with sd.Stream(callback=self.audio_callback, blocksize=self.block_frame,samplerate=self.samplerate,dtype='float32'):
235
+ while self.flag_vc:
236
+ time.sleep(self.block_time)
237
+ print('Audio block passed.')
238
+ print('ENDing VC')
239
+
240
+
241
+ def audio_callback(self,indata,outdata, frames, time, status):
242
+ '''
243
+ Audio processing
244
+ '''
245
+ print("Realtime VCing...")
246
+ self.input_wav[:]=np.roll(self.input_wav,-self.block_frame)
247
+ self.input_wav[-self.block_frame:]=librosa.to_mono(indata.T)
248
+ print('input_wav.shape:'+str(self.input_wav.shape))
249
+ _audio, _model_sr = self.svc_model.infer( self.f_pitch_change, self.spk_id, self.f_safe_prefix_pad_length,self.input_wav,self.samplerate)
250
+ self.temp_wav[:] = librosa.resample(_audio, orig_sr=_model_sr, target_sr=self.samplerate)[-self.block_frame-self.crossfade_frame:]
251
+ #cross-fade output_wav's start with last crossfade
252
+ self.output_wav[:]=self.temp_wav[:self.block_frame]
253
+ self.output_wav[:self.crossfade_frame]*=self.fade_in_window
254
+ self.output_wav[:self.crossfade_frame]+=self.crossfade_last
255
+ self.crossfade_last[:]=self.temp_wav[-self.crossfade_frame:]
256
+ self.crossfade_last[:]*=self.fade_out_window
257
+ print("infered _audio.shape:"+str(_audio.shape))
258
+ outdata[:] = np.array([self.output_wav, self.output_wav]).T
259
+ print('Outputed.')
260
+
261
+
262
+ def get_devices(self,update: bool = True):
263
+ '''Get the lists of audio devices'''
264
+ if update:
265
+ sd._terminate()
266
+ sd._initialize()
267
+ devices = sd.query_devices()
268
+ hostapis = sd.query_hostapis()
269
+ for hostapi in hostapis:
270
+ for device_idx in hostapi["devices"]:
271
+ devices[device_idx]["hostapi_name"] = hostapi["name"]
272
+ input_devices = [
273
+ f"{d['name']} ({d['hostapi_name']})"
274
+ for d in devices
275
+ if d["max_input_channels"] > 0
276
+ ]
277
+ output_devices = [
278
+ f"{d['name']} ({d['hostapi_name']})"
279
+ for d in devices
280
+ if d["max_output_channels"] > 0
281
+ ]
282
+ input_devices_indices = [d["index"] for d in devices if d["max_input_channels"] > 0]
283
+ output_devices_indices = [
284
+ d["index"] for d in devices if d["max_output_channels"] > 0
285
+ ]
286
+ return input_devices, output_devices, input_devices_indices, output_devices_indices
287
+
288
+ def set_devices(self,input_device,output_device):
289
+ '''Set the input and output devices'''
290
+ input_devices,output_devices,input_device_indices, output_device_indices=self.get_devices()
291
+ sd.default.device[0]=input_device_indices[input_devices.index(input_device)]
292
+ sd.default.device[1]=output_device_indices[output_devices.index(output_device)]
293
+ print("input device:"+str(sd.default.device[0])+":"+str(input_device))
294
+ print("output device:"+str(sd.default.device[1])+":"+str(output_device))
295
+
296
+
297
+
298
+ if __name__ == "__main__":
299
+ gui=GUI()
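The callback above stitches consecutive inference blocks by fading in the head of each new block while adding the faded-out tail kept from the previous one. A minimal sketch of that stitching, using toy buffer sizes rather than the GUI defaults:

```python
# Minimal sketch of the block crossfade used in audio_callback above.
# Sizes are illustrative only.
import numpy as np

block_frame = 8
crossfade_frame = 4
fade_in = np.linspace(0, 1, crossfade_frame)
fade_out = np.linspace(1, 0, crossfade_frame)

crossfade_last = np.zeros(crossfade_frame, dtype='float32')
stitched = []

for _ in range(3):
    # temp_wav stands in for the model output covering block + crossfade samples
    temp_wav = np.random.randn(block_frame + crossfade_frame).astype('float32')
    output_wav = temp_wav[:block_frame].copy()
    output_wav[:crossfade_frame] *= fade_in          # fade in the head of the new block
    output_wav[:crossfade_frame] += crossfade_last   # add the faded-out tail of the previous block
    crossfade_last = temp_wav[-crossfade_frame:] * fade_out
    stitched.append(output_wav)

print(np.concatenate(stitched).shape)
```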
logger/__init__.py ADDED
File without changes
logger/saver.py ADDED
@@ -0,0 +1,123 @@
1
+ '''
2
+ author: wayn391@mastertones
3
+ '''
4
+
5
+ import os
6
+ import json
7
+ import time
8
+ import yaml
9
+ import datetime
10
+ import torch
11
+
12
+ from . import utils
13
+ from torch.utils.tensorboard import SummaryWriter
14
+
15
+ class Saver(object):
16
+ def __init__(
17
+ self,
18
+ args,
19
+ initial_global_step=-1):
20
+
21
+ self.expdir = args.env.expdir
22
+ self.sample_rate = args.data.sampling_rate
23
+
24
+ # cold start
25
+ self.global_step = initial_global_step
26
+ self.init_time = time.time()
27
+ self.last_time = time.time()
28
+
29
+ # makedirs
30
+ os.makedirs(self.expdir, exist_ok=True)
31
+
32
+ # path
33
+ self.path_log_info = os.path.join(self.expdir, 'log_info.txt')
34
+
35
+ # ckpt
36
+ os.makedirs(self.expdir, exist_ok=True)
37
+
38
+ # writer
39
+ self.writer = SummaryWriter(os.path.join(self.expdir, 'logs'))
40
+
41
+ # save config
42
+ path_config = os.path.join(self.expdir, 'config.yaml')
43
+ with open(path_config, "w") as out_config:
44
+ yaml.dump(dict(args), out_config)
45
+
46
+
47
+ def log_info(self, msg):
48
+ '''log method'''
49
+ if isinstance(msg, dict):
50
+ msg_list = []
51
+ for k, v in msg.items():
52
+ tmp_str = ''
53
+ if isinstance(v, int):
54
+ tmp_str = '{}: {:,}'.format(k, v)
55
+ else:
56
+ tmp_str = '{}: {}'.format(k, v)
57
+
58
+ msg_list.append(tmp_str)
59
+ msg_str = '\n'.join(msg_list)
60
+ else:
61
+ msg_str = msg
62
+
63
+ # display
64
+ print(msg_str)
65
+
66
+ # save
67
+ with open(self.path_log_info, 'a') as fp:
68
+ fp.write(msg_str+'\n')
69
+
70
+ def log_value(self, dict):
71
+ for k, v in dict.items():
72
+ self.writer.add_scalar(k, v, self.global_step)
73
+
74
+ def log_audio(self, dict):
75
+ for k, v in dict.items():
76
+ self.writer.add_audio(k, v, global_step=self.global_step, sample_rate=self.sample_rate)
77
+
78
+ def get_interval_time(self, update=True):
79
+ cur_time = time.time()
80
+ time_interval = cur_time - self.last_time
81
+ if update:
82
+ self.last_time = cur_time
83
+ return time_interval
84
+
85
+ def get_total_time(self, to_str=True):
86
+ total_time = time.time() - self.init_time
87
+ if to_str:
88
+ total_time = str(datetime.timedelta(
89
+ seconds=total_time))[:-5]
90
+ return total_time
91
+
92
+ def save_model(
93
+ self,
94
+ model,
95
+ optimizer,
96
+ name='model',
97
+ postfix='',
98
+ to_json=False):
99
+ # path
100
+ if postfix:
101
+ postfix = '_' + postfix
102
+ path_pt = os.path.join(
103
+ self.expdir , name+postfix+'.pt')
104
+
105
+ # check
106
+ print(' [*] model checkpoint saved: {}'.format(path_pt))
107
+
108
+ # save
109
+ torch.save({
110
+ 'global_step': self.global_step,
111
+ 'model': model.state_dict(),
112
+ 'optimizer': optimizer.state_dict()}, path_pt)
113
+
114
+ # to json
115
+ if to_json:
116
+ path_json = os.path.join(
117
+ self.expdir , name+'.json')
118
+ utils.to_json(path_pt, path_json)
119
+
120
+ def global_step_increment(self):
121
+ self.global_step += 1
122
+
123
+
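For reference, a minimal sketch of how Saver is typically driven from a training loop; the config fields below (env.expdir and data.sampling_rate) are the ones Saver actually reads, while the concrete values are placeholders:

```python
# A minimal sketch, assuming the repository root is on PYTHONPATH.
from logger.utils import DotDict
from logger.saver import Saver

args = DotDict({'env': {'expdir': 'exp/demo'}, 'data': {'sampling_rate': 44100}})
saver = Saver(args, initial_global_step=0)

saver.log_info({'epoch': 1, 'loss': 0.123})   # printed and appended to log_info.txt
saver.log_value({'train/loss': 0.123})        # written to the tensorboard logs
saver.global_step_increment()
```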
logger/utils.py ADDED
@@ -0,0 +1,121 @@
1
+ import os
2
+ import yaml
3
+ import json
4
+ import pickle
5
+ import torch
6
+
7
+ def traverse_dir(
8
+ root_dir,
9
+ extension,
10
+ amount=None,
11
+ str_include=None,
12
+ str_exclude=None,
13
+ is_pure=False,
14
+ is_sort=False,
15
+ is_ext=True):
16
+
17
+ file_list = []
18
+ cnt = 0
19
+ for root, _, files in os.walk(root_dir):
20
+ for file in files:
21
+ if file.endswith(extension):
22
+ # path
23
+ mix_path = os.path.join(root, file)
24
+ pure_path = mix_path[len(root_dir)+1:] if is_pure else mix_path
25
+
26
+ # amount
27
+ if (amount is not None) and (cnt == amount):
28
+ if is_sort:
29
+ file_list.sort()
30
+ return file_list
31
+
32
+ # check string
33
+ if (str_include is not None) and (str_include not in pure_path):
34
+ continue
35
+ if (str_exclude is not None) and (str_exclude in pure_path):
36
+ continue
37
+
38
+ if not is_ext:
39
+ ext = pure_path.split('.')[-1]
40
+ pure_path = pure_path[:-(len(ext)+1)]
41
+ file_list.append(pure_path)
42
+ cnt += 1
43
+ if is_sort:
44
+ file_list.sort()
45
+ return file_list
46
+
47
+
48
+
49
+ class DotDict(dict):
50
+ def __getattr__(*args):
51
+ val = dict.get(*args)
52
+ return DotDict(val) if type(val) is dict else val
53
+
54
+ __setattr__ = dict.__setitem__
55
+ __delattr__ = dict.__delitem__
56
+
57
+
58
+ def get_network_paras_amount(model_dict):
59
+ info = dict()
60
+ for model_name, model in model_dict.items():
61
+ # all_params = sum(p.numel() for p in model.parameters())
62
+ trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
63
+
64
+ info[model_name] = trainable_params
65
+ return info
66
+
67
+
68
+ def load_config(path_config):
69
+ with open(path_config, "r") as config:
70
+ args = yaml.safe_load(config)
71
+ args = DotDict(args)
72
+ # print(args)
73
+ return args
74
+
75
+
76
+ def to_json(path_params, path_json):
77
+ params = torch.load(path_params, map_location=torch.device('cpu'))
78
+ raw_state_dict = {}
79
+ for k, v in params.items():
80
+ val = v.flatten().numpy().tolist()
81
+ raw_state_dict[k] = val
82
+
83
+ with open(path_json, 'w') as outfile:
84
+ json.dump(raw_state_dict, outfile,indent= "\t")
85
+
86
+
87
+ def convert_tensor_to_numpy(tensor, is_squeeze=True):
88
+ if is_squeeze:
89
+ tensor = tensor.squeeze()
90
+ if tensor.requires_grad:
91
+ tensor = tensor.detach()
92
+ if tensor.is_cuda:
93
+ tensor = tensor.cpu()
94
+ return tensor.numpy()
95
+
96
+
97
+ def load_model(
98
+ expdir,
99
+ model,
100
+ optimizer,
101
+ name='model',
102
+ postfix='',
103
+ device='cpu'):
104
+ if postfix == '':
105
+ postfix = '_' + postfix
106
+ path = os.path.join(expdir, name+postfix)
107
+ path_pt = traverse_dir(expdir, '.pt', is_ext=False)
108
+ global_step = 0
109
+ if len(path_pt) > 0:
110
+ steps = [s[len(path):] for s in path_pt]
111
+ maxstep = max([int(s) if s.isdigit() else 0 for s in steps])
112
+ if maxstep > 0:
113
+ path_pt = path+str(maxstep)+'.pt'
114
+ else:
115
+ path_pt = path+'best.pt'
116
+ print(' [*] restoring model from', path_pt)
117
+ ckpt = torch.load(path_pt, map_location=torch.device(device))
118
+ global_step = ckpt['global_step']
119
+ model.load_state_dict(ckpt['model'])
120
+ optimizer.load_state_dict(ckpt['optimizer'])
121
+ return global_step, model, optimizer
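A short sketch of the two helpers used throughout the project, load_config and traverse_dir; the paths are examples, not values fixed by this module:

```python
# A minimal sketch, assuming the repository root is on PYTHONPATH.
from logger.utils import load_config, traverse_dir

args = load_config('configs/example.yaml')     # example path, not fixed by this file
print(args.data.sampling_rate)                 # DotDict allows attribute-style access

wav_files = traverse_dir('data/train/audio', extension='wav', is_pure=True, is_sort=True)
print(len(wav_files))
```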
main.py ADDED
@@ -0,0 +1,245 @@
1
+ import os
2
+ import torch
3
+ import librosa
4
+ import argparse
5
+ import numpy as np
6
+ import soundfile as sf
7
+ import pyworld as pw
8
+ import parselmouth
9
+ from ast import literal_eval
10
+ from slicer import Slicer
11
+ from ddsp.vocoder import load_model, F0_Extractor, Volume_Extractor, Units_Encoder
12
+ from ddsp.core import upsample
13
+ from enhancer import Enhancer
14
+ from tqdm import tqdm
15
+
16
+ def parse_args(args=None, namespace=None):
17
+ """Parse command-line arguments."""
18
+ parser = argparse.ArgumentParser()
19
+ parser.add_argument(
20
+ "-m",
21
+ "--model_path",
22
+ type=str,
23
+ required=True,
24
+ help="path to the model file",
25
+ )
26
+ parser.add_argument(
27
+ "-i",
28
+ "--input",
29
+ type=str,
30
+ required=True,
31
+ help="path to the input audio file",
32
+ )
33
+ parser.add_argument(
34
+ "-o",
35
+ "--output",
36
+ type=str,
37
+ required=True,
38
+ help="path to the output audio file",
39
+ )
40
+ parser.add_argument(
41
+ "-id",
42
+ "--spk_id",
43
+ type=str,
44
+ required=False,
45
+ default=1,
46
+ help="speaker id (for multi-speaker model) | default: 1",
47
+ )
48
+ parser.add_argument(
49
+ "-mix",
50
+ "--spk_mix_dict",
51
+ type=str,
52
+ required=False,
53
+ default="None",
54
+ help="mix-speaker dictionary (for multi-speaker model) | default: None",
55
+ )
56
+ parser.add_argument(
57
+ "-k",
58
+ "--key",
59
+ type=str,
60
+ required=False,
61
+ default=0,
62
+ help="key changed (number of semitones) | default: 0",
63
+ )
64
+ parser.add_argument(
65
+ "-e",
66
+ "--enhance",
67
+ type=str,
68
+ required=False,
69
+ default='true',
70
+ help="true or false | default: true",
71
+ )
72
+ parser.add_argument(
73
+ "-pe",
74
+ "--pitch_extractor",
75
+ type=str,
76
+ required=False,
77
+ default='crepe',
78
+ help="pitch extrator type: parselmouth, dio, harvest, crepe (default)",
79
+ )
80
+ parser.add_argument(
81
+ "-fmin",
82
+ "--f0_min",
83
+ type=str,
84
+ required=False,
85
+ default=50,
86
+ help="min f0 (Hz) | default: 50",
87
+ )
88
+ parser.add_argument(
89
+ "-fmax",
90
+ "--f0_max",
91
+ type=str,
92
+ required=False,
93
+ default=1100,
94
+ help="max f0 (Hz) | default: 1100",
95
+ )
96
+ parser.add_argument(
97
+ "-th",
98
+ "--threhold",
99
+ type=str,
100
+ required=False,
101
+ default=-60,
102
+ help="response threhold (dB) | default: -60",
103
+ )
104
+ parser.add_argument(
105
+ "-eak",
106
+ "--enhancer_adaptive_key",
107
+ type=str,
108
+ required=False,
109
+ default=0,
110
+ help="adapt the enhancer to a higher vocal range (number of semitones) | default: 0",
111
+ )
112
+ return parser.parse_args(args=args, namespace=namespace)
113
+
114
+
115
+ def split(audio, sample_rate, hop_size, db_thresh = -40, min_len = 5000):
116
+ slicer = Slicer(
117
+ sr=sample_rate,
118
+ threshold=db_thresh,
119
+ min_length=min_len)
120
+ chunks = dict(slicer.slice(audio))
121
+ result = []
122
+ for k, v in chunks.items():
123
+ tag = v["split_time"].split(",")
124
+ if tag[0] != tag[1]:
125
+ start_frame = int(int(tag[0]) // hop_size)
126
+ end_frame = int(int(tag[1]) // hop_size)
127
+ if end_frame > start_frame:
128
+ result.append((
129
+ start_frame,
130
+ audio[int(start_frame * hop_size) : int(end_frame * hop_size)]))
131
+ return result
132
+
133
+
134
+ def cross_fade(a: np.ndarray, b: np.ndarray, idx: int):
135
+ result = np.zeros(idx + b.shape[0])
136
+ fade_len = a.shape[0] - idx
137
+ np.copyto(dst=result[:idx], src=a[:idx])
138
+ k = np.linspace(0, 1.0, num=fade_len, endpoint=True)
139
+ result[idx: a.shape[0]] = (1 - k) * a[idx:] + k * b[: fade_len]
140
+ np.copyto(dst=result[a.shape[0]:], src=b[fade_len:])
141
+ return result
142
+
143
+
144
+ if __name__ == '__main__':
145
+ #device = 'cpu'
146
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
147
+
148
+ # parse commands
149
+ cmd = parse_args()
150
+
151
+ # load ddsp model
152
+ model, args = load_model(cmd.model_path, device=device)
153
+
154
+ # load input
155
+ audio, sample_rate = librosa.load(cmd.input, sr=None)
156
+ if len(audio.shape) > 1:
157
+ audio = librosa.to_mono(audio)
158
+ hop_size = args.data.block_size * sample_rate / args.data.sampling_rate
159
+
160
+ # extract f0
161
+ print('Pitch extractor type: ' + cmd.pitch_extractor)
162
+ pitch_extractor = F0_Extractor(
163
+ cmd.pitch_extractor,
164
+ sample_rate,
165
+ hop_size,
166
+ float(cmd.f0_min),
167
+ float(cmd.f0_max))
168
+ print('Extracting the pitch curve of the input audio...')
169
+ f0 = pitch_extractor.extract(audio, uv_interp = True, device = device)
170
+ f0 = torch.from_numpy(f0).float().to(device).unsqueeze(-1).unsqueeze(0)
171
+
172
+ # key change
173
+ f0 = f0 * 2 ** (float(cmd.key) / 12)
174
+
175
+ # extract volume
176
+ print('Extracting the volume envelope of the input audio...')
177
+ volume_extractor = Volume_Extractor(hop_size)
178
+ volume = volume_extractor.extract(audio)
179
+ mask = (volume > 10 ** (float(cmd.threhold) / 20)).astype('float')
180
+ mask = np.pad(mask, (4, 4), constant_values=(mask[0], mask[-1]))
181
+ mask = np.array([np.max(mask[n : n + 9]) for n in range(len(mask) - 8)])
182
+ mask = torch.from_numpy(mask).float().to(device).unsqueeze(-1).unsqueeze(0)
183
+ mask = upsample(mask, args.data.block_size).squeeze(-1)
184
+ volume = torch.from_numpy(volume).float().to(device).unsqueeze(-1).unsqueeze(0)
185
+
186
+ # load units encoder
187
+ units_encoder = Units_Encoder(
188
+ args.data.encoder,
189
+ args.data.encoder_ckpt,
190
+ args.data.encoder_sample_rate,
191
+ args.data.encoder_hop_size,
192
+ device = device)
193
+
194
+ # load enhancer
195
+ if cmd.enhance == 'true':
196
+ print('Enhancer type: ' + args.enhancer.type)
197
+ enhancer = Enhancer(args.enhancer.type, args.enhancer.ckpt, device=device)
198
+ else:
199
+ print('Enhancer type: none (using raw output of ddsp)')
200
+
201
+ # speaker id or mix-speaker dictionary
202
+ spk_mix_dict = literal_eval(cmd.spk_mix_dict)
203
+ if spk_mix_dict is not None:
204
+ print('Mix-speaker mode')
205
+ else:
206
+ print('Speaker ID: '+ str(int(cmd.spk_id)))
207
+ spk_id = torch.LongTensor(np.array([[int(cmd.spk_id)]])).to(device)
208
+ # forward and save the output
209
+ result = np.zeros(0)
210
+ current_length = 0
211
+ segments = split(audio, sample_rate, hop_size)
212
+ print('Cut the input audio into ' + str(len(segments)) + ' slices')
213
+ with torch.no_grad():
214
+ for segment in tqdm(segments):
215
+ start_frame = segment[0]
216
+ seg_input = torch.from_numpy(segment[1]).float().unsqueeze(0).to(device)
217
+ seg_units = units_encoder.encode(seg_input, sample_rate, hop_size)
218
+
219
+ seg_f0 = f0[:, start_frame : start_frame + seg_units.size(1), :]
220
+ seg_volume = volume[:, start_frame : start_frame + seg_units.size(1), :]
221
+
222
+ seg_output, _, (s_h, s_n) = model(seg_units, seg_f0, seg_volume, spk_id = spk_id, spk_mix_dict = spk_mix_dict)
223
+ seg_output *= mask[:, start_frame * args.data.block_size : (start_frame + seg_units.size(1)) * args.data.block_size]
224
+
225
+ if cmd.enhance == 'true':
226
+ seg_output, output_sample_rate = enhancer.enhance(
227
+ seg_output,
228
+ args.data.sampling_rate,
229
+ seg_f0,
230
+ args.data.block_size,
231
+ adaptive_key = float(cmd.enhancer_adaptive_key))
232
+ else:
233
+ output_sample_rate = args.data.sampling_rate
234
+
235
+ seg_output = seg_output.squeeze().cpu().numpy()
236
+
237
+ silent_length = round(start_frame * args.data.block_size * output_sample_rate / args.data.sampling_rate) - current_length
238
+ if silent_length >= 0:
239
+ result = np.append(result, np.zeros(silent_length))
240
+ result = np.append(result, seg_output)
241
+ else:
242
+ result = cross_fade(result, seg_output, current_length + silent_length)
243
+ current_length = current_length + silent_length + len(seg_output)
244
+ sf.write(cmd.output, result, output_sample_rate)
245
+
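The cross_fade helper above joins two overlapping segments with a linear ramp over the overlap region. A tiny numpy check of its behaviour, assuming the repository root is on PYTHONPATH so that main is importable:

```python
# A minimal sketch; the arrays are toy data, not real audio.
import numpy as np
from main import cross_fade

a = np.ones(10)
b = np.zeros(8)
out = cross_fade(a, b, idx=6)   # b starts at sample 6, so samples 6..9 are blended
print(out.shape)                # (14,)
print(out[6:10])                # ramps from 1 towards 0 across the 4-sample overlap
```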
nsf_hifigan/env.py ADDED
@@ -0,0 +1,15 @@
1
+ import os
2
+ import shutil
3
+
4
+
5
+ class AttrDict(dict):
6
+ def __init__(self, *args, **kwargs):
7
+ super(AttrDict, self).__init__(*args, **kwargs)
8
+ self.__dict__ = self
9
+
10
+
11
+ def build_env(config, config_name, path):
12
+ t_path = os.path.join(path, config_name)
13
+ if config != t_path:
14
+ os.makedirs(path, exist_ok=True)
15
+ shutil.copyfile(config, os.path.join(path, config_name))
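AttrDict simply exposes dictionary keys as attributes, which is how the vocoder's config.json is consumed elsewhere in this package. A two-line sketch with placeholder values:

```python
from nsf_hifigan.env import AttrDict

h = AttrDict({'sampling_rate': 44100, 'num_mels': 128})
print(h.sampling_rate, h['num_mels'])   # both access styles work
```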
nsf_hifigan/models.py ADDED
@@ -0,0 +1,435 @@
1
+ import os
2
+ import json
3
+ from .env import AttrDict
4
+ import numpy as np
5
+ import torch
6
+ import torch.nn.functional as F
7
+ import torch.nn as nn
8
+ from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
9
+ from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
10
+ from .utils import init_weights, get_padding
11
+
12
+ LRELU_SLOPE = 0.1
13
+
14
+
15
+ def load_model(model_path, device='cuda'):
16
+ config_file = os.path.join(os.path.split(model_path)[0], 'config.json')
17
+ with open(config_file) as f:
18
+ data = f.read()
19
+
20
+ json_config = json.loads(data)
21
+ h = AttrDict(json_config)
22
+
23
+ generator = Generator(h).to(device)
24
+
25
+ cp_dict = torch.load(model_path, map_location=device)
26
+ generator.load_state_dict(cp_dict['generator'])
27
+ generator.eval()
28
+ generator.remove_weight_norm()
29
+ del cp_dict
30
+ return generator, h
31
+
32
+
33
+ class ResBlock1(torch.nn.Module):
34
+ def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5)):
35
+ super(ResBlock1, self).__init__()
36
+ self.h = h
37
+ self.convs1 = nn.ModuleList([
38
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
39
+ padding=get_padding(kernel_size, dilation[0]))),
40
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
41
+ padding=get_padding(kernel_size, dilation[1]))),
42
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
43
+ padding=get_padding(kernel_size, dilation[2])))
44
+ ])
45
+ self.convs1.apply(init_weights)
46
+
47
+ self.convs2 = nn.ModuleList([
48
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
49
+ padding=get_padding(kernel_size, 1))),
50
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
51
+ padding=get_padding(kernel_size, 1))),
52
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
53
+ padding=get_padding(kernel_size, 1)))
54
+ ])
55
+ self.convs2.apply(init_weights)
56
+
57
+ def forward(self, x):
58
+ for c1, c2 in zip(self.convs1, self.convs2):
59
+ xt = F.leaky_relu(x, LRELU_SLOPE)
60
+ xt = c1(xt)
61
+ xt = F.leaky_relu(xt, LRELU_SLOPE)
62
+ xt = c2(xt)
63
+ x = xt + x
64
+ return x
65
+
66
+ def remove_weight_norm(self):
67
+ for l in self.convs1:
68
+ remove_weight_norm(l)
69
+ for l in self.convs2:
70
+ remove_weight_norm(l)
71
+
72
+
73
+ class ResBlock2(torch.nn.Module):
74
+ def __init__(self, h, channels, kernel_size=3, dilation=(1, 3)):
75
+ super(ResBlock2, self).__init__()
76
+ self.h = h
77
+ self.convs = nn.ModuleList([
78
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
79
+ padding=get_padding(kernel_size, dilation[0]))),
80
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
81
+ padding=get_padding(kernel_size, dilation[1])))
82
+ ])
83
+ self.convs.apply(init_weights)
84
+
85
+ def forward(self, x):
86
+ for c in self.convs:
87
+ xt = F.leaky_relu(x, LRELU_SLOPE)
88
+ xt = c(xt)
89
+ x = xt + x
90
+ return x
91
+
92
+ def remove_weight_norm(self):
93
+ for l in self.convs:
94
+ remove_weight_norm(l)
95
+
96
+
97
+ class SineGen(torch.nn.Module):
98
+ """ Definition of sine generator
99
+ SineGen(samp_rate, harmonic_num = 0,
100
+ sine_amp = 0.1, noise_std = 0.003,
101
+ voiced_threshold = 0,
102
+ flag_for_pulse=False)
103
+ samp_rate: sampling rate in Hz
104
+ harmonic_num: number of harmonic overtones (default 0)
105
+ sine_amp: amplitude of sine-wavefrom (default 0.1)
106
+ noise_std: std of Gaussian noise (default 0.003)
107
+ voiced_thoreshold: F0 threshold for U/V classification (default 0)
108
+ flag_for_pulse: this SinGen is used inside PulseGen (default False)
109
+ Note: when flag_for_pulse is True, the first time step of a voiced
110
+ segment is always sin(np.pi) or cos(0)
111
+ """
112
+
113
+ def __init__(self, samp_rate, harmonic_num=0,
114
+ sine_amp=0.1, noise_std=0.003,
115
+ voiced_threshold=0):
116
+ super(SineGen, self).__init__()
117
+ self.sine_amp = sine_amp
118
+ self.noise_std = noise_std
119
+ self.harmonic_num = harmonic_num
120
+ self.dim = self.harmonic_num + 1
121
+ self.sampling_rate = samp_rate
122
+ self.voiced_threshold = voiced_threshold
123
+
124
+ def _f02uv(self, f0):
125
+ # generate uv signal
126
+ uv = torch.ones_like(f0)
127
+ uv = uv * (f0 > self.voiced_threshold)
128
+ return uv
129
+
130
+ @torch.no_grad()
131
+ def forward(self, f0, upp):
132
+ """ sine_tensor, uv = forward(f0)
133
+ input F0: tensor(batchsize=1, length, dim=1)
134
+ f0 for unvoiced steps should be 0
135
+ output sine_tensor: tensor(batchsize=1, length, dim)
136
+ output uv: tensor(batchsize=1, length, 1)
137
+ """
138
+ f0 = f0.unsqueeze(-1)
139
+ fn = torch.multiply(f0, torch.arange(1, self.dim + 1, device=f0.device).reshape((1, 1, -1)))
140
+ rad_values = (fn / self.sampling_rate) % 1 ### the % 1 here means the n_har product cannot be optimized away in post-processing
141
+ rand_ini = torch.rand(fn.shape[0], fn.shape[2], device=fn.device)
142
+ rand_ini[:, 0] = 0
143
+ rad_values[:, 0, :] = rad_values[:, 0, :] + rand_ini
144
+ is_half = rad_values.dtype is not torch.float32
145
+ tmp_over_one = torch.cumsum(rad_values.double(), 1) # % 1 ##### taking % 1 here would prevent the following cumsum from being optimized
146
+ if is_half:
147
+ tmp_over_one = tmp_over_one.half()
148
+ else:
149
+ tmp_over_one = tmp_over_one.float()
150
+ tmp_over_one *= upp
151
+ tmp_over_one = F.interpolate(
152
+ tmp_over_one.transpose(2, 1), scale_factor=upp,
153
+ mode='linear', align_corners=True
154
+ ).transpose(2, 1)
155
+ rad_values = F.interpolate(rad_values.transpose(2, 1), scale_factor=upp, mode='nearest').transpose(2, 1)
156
+ tmp_over_one %= 1
157
+ tmp_over_one_idx = (tmp_over_one[:, 1:, :] - tmp_over_one[:, :-1, :]) < 0
158
+ cumsum_shift = torch.zeros_like(rad_values)
159
+ cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0
160
+ rad_values = rad_values.double()
161
+ cumsum_shift = cumsum_shift.double()
162
+ sine_waves = torch.sin(torch.cumsum(rad_values + cumsum_shift, dim=1) * 2 * np.pi)
163
+ if is_half:
164
+ sine_waves = sine_waves.half()
165
+ else:
166
+ sine_waves = sine_waves.float()
167
+ sine_waves = sine_waves * self.sine_amp
168
+ uv = self._f02uv(f0)
169
+ uv = F.interpolate(uv.transpose(2, 1), scale_factor=upp, mode='nearest').transpose(2, 1)
170
+ noise_amp = uv * self.noise_std + (1 - uv) * self.sine_amp / 3
171
+ noise = noise_amp * torch.randn_like(sine_waves)
172
+ sine_waves = sine_waves * uv + noise
173
+ return sine_waves, uv, noise
174
+
175
+
176
+ class SourceModuleHnNSF(torch.nn.Module):
177
+ """ SourceModule for hn-nsf
178
+ SourceModule(sampling_rate, harmonic_num=0, sine_amp=0.1,
179
+ add_noise_std=0.003, voiced_threshod=0)
180
+ sampling_rate: sampling_rate in Hz
181
+ harmonic_num: number of harmonic above F0 (default: 0)
182
+ sine_amp: amplitude of sine source signal (default: 0.1)
183
+ add_noise_std: std of additive Gaussian noise (default: 0.003)
184
+ note that amplitude of noise in unvoiced is decided
185
+ by sine_amp
186
+ voiced_threshold: threhold to set U/V given F0 (default: 0)
187
+ Sine_source, noise_source = SourceModuleHnNSF(F0_sampled)
188
+ F0_sampled (batchsize, length, 1)
189
+ Sine_source (batchsize, length, 1)
190
+ noise_source (batchsize, length 1)
191
+ uv (batchsize, length, 1)
192
+ """
193
+
194
+ def __init__(self, sampling_rate, harmonic_num=0, sine_amp=0.1,
195
+ add_noise_std=0.003, voiced_threshod=0):
196
+ super(SourceModuleHnNSF, self).__init__()
197
+
198
+ self.sine_amp = sine_amp
199
+ self.noise_std = add_noise_std
200
+
201
+ # to produce sine waveforms
202
+ self.l_sin_gen = SineGen(sampling_rate, harmonic_num,
203
+ sine_amp, add_noise_std, voiced_threshod)
204
+
205
+ # to merge source harmonics into a single excitation
206
+ self.l_linear = torch.nn.Linear(harmonic_num + 1, 1)
207
+ self.l_tanh = torch.nn.Tanh()
208
+
209
+ def forward(self, x, upp):
210
+ sine_wavs, uv, _ = self.l_sin_gen(x, upp)
211
+ sine_merge = self.l_tanh(self.l_linear(sine_wavs))
212
+ return sine_merge
213
+
214
+
215
+ class Generator(torch.nn.Module):
216
+ def __init__(self, h):
217
+ super(Generator, self).__init__()
218
+ self.h = h
219
+ self.num_kernels = len(h.resblock_kernel_sizes)
220
+ self.num_upsamples = len(h.upsample_rates)
221
+ self.m_source = SourceModuleHnNSF(
222
+ sampling_rate=h.sampling_rate,
223
+ harmonic_num=8
224
+ )
225
+ self.noise_convs = nn.ModuleList()
226
+ self.conv_pre = weight_norm(Conv1d(h.num_mels, h.upsample_initial_channel, 7, 1, padding=3))
227
+ resblock = ResBlock1 if h.resblock == '1' else ResBlock2
228
+
229
+ self.ups = nn.ModuleList()
230
+ for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
231
+ c_cur = h.upsample_initial_channel // (2 ** (i + 1))
232
+ self.ups.append(weight_norm(
233
+ ConvTranspose1d(h.upsample_initial_channel // (2 ** i), h.upsample_initial_channel // (2 ** (i + 1)),
234
+ k, u, padding=(k - u) // 2)))
235
+ if i + 1 < len(h.upsample_rates): #
236
+ stride_f0 = int(np.prod(h.upsample_rates[i + 1:]))
237
+ self.noise_convs.append(Conv1d(
238
+ 1, c_cur, kernel_size=stride_f0 * 2, stride=stride_f0, padding=stride_f0 // 2))
239
+ else:
240
+ self.noise_convs.append(Conv1d(1, c_cur, kernel_size=1))
241
+ self.resblocks = nn.ModuleList()
242
+ ch = h.upsample_initial_channel
243
+ for i in range(len(self.ups)):
244
+ ch //= 2
245
+ for j, (k, d) in enumerate(zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)):
246
+ self.resblocks.append(resblock(h, ch, k, d))
247
+
248
+ self.conv_post = weight_norm(Conv1d(ch, 1, 7, 1, padding=3))
249
+ self.ups.apply(init_weights)
250
+ self.conv_post.apply(init_weights)
251
+ self.upp = int(np.prod(h.upsample_rates))
252
+
253
+ def forward(self, x, f0):
254
+ har_source = self.m_source(f0, self.upp).transpose(1, 2)
255
+ x = self.conv_pre(x)
256
+ for i in range(self.num_upsamples):
257
+ x = F.leaky_relu(x, LRELU_SLOPE)
258
+ x = self.ups[i](x)
259
+ x_source = self.noise_convs[i](har_source)
260
+ x = x + x_source
261
+ xs = None
262
+ for j in range(self.num_kernels):
263
+ if xs is None:
264
+ xs = self.resblocks[i * self.num_kernels + j](x)
265
+ else:
266
+ xs += self.resblocks[i * self.num_kernels + j](x)
267
+ x = xs / self.num_kernels
268
+ x = F.leaky_relu(x)
269
+ x = self.conv_post(x)
270
+ x = torch.tanh(x)
271
+
272
+ return x
273
+
274
+ def remove_weight_norm(self):
275
+ print('Removing weight norm...')
276
+ for l in self.ups:
277
+ remove_weight_norm(l)
278
+ for l in self.resblocks:
279
+ l.remove_weight_norm()
280
+ remove_weight_norm(self.conv_pre)
281
+ remove_weight_norm(self.conv_post)
282
+
283
+
284
+ class DiscriminatorP(torch.nn.Module):
285
+ def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
286
+ super(DiscriminatorP, self).__init__()
287
+ self.period = period
288
+ norm_f = weight_norm if use_spectral_norm == False else spectral_norm
289
+ self.convs = nn.ModuleList([
290
+ norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
291
+ norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
292
+ norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
293
+ norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
294
+ norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(2, 0))),
295
+ ])
296
+ self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
297
+
298
+ def forward(self, x):
299
+ fmap = []
300
+
301
+ # 1d to 2d
302
+ b, c, t = x.shape
303
+ if t % self.period != 0: # pad first
304
+ n_pad = self.period - (t % self.period)
305
+ x = F.pad(x, (0, n_pad), "reflect")
306
+ t = t + n_pad
307
+ x = x.view(b, c, t // self.period, self.period)
308
+
309
+ for l in self.convs:
310
+ x = l(x)
311
+ x = F.leaky_relu(x, LRELU_SLOPE)
312
+ fmap.append(x)
313
+ x = self.conv_post(x)
314
+ fmap.append(x)
315
+ x = torch.flatten(x, 1, -1)
316
+
317
+ return x, fmap
318
+
319
+
320
+ class MultiPeriodDiscriminator(torch.nn.Module):
321
+ def __init__(self, periods=None):
322
+ super(MultiPeriodDiscriminator, self).__init__()
323
+ self.periods = periods if periods is not None else [2, 3, 5, 7, 11]
324
+ self.discriminators = nn.ModuleList()
325
+ for period in self.periods:
326
+ self.discriminators.append(DiscriminatorP(period))
327
+
328
+ def forward(self, y, y_hat):
329
+ y_d_rs = []
330
+ y_d_gs = []
331
+ fmap_rs = []
332
+ fmap_gs = []
333
+ for i, d in enumerate(self.discriminators):
334
+ y_d_r, fmap_r = d(y)
335
+ y_d_g, fmap_g = d(y_hat)
336
+ y_d_rs.append(y_d_r)
337
+ fmap_rs.append(fmap_r)
338
+ y_d_gs.append(y_d_g)
339
+ fmap_gs.append(fmap_g)
340
+
341
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
342
+
343
+
344
+ class DiscriminatorS(torch.nn.Module):
345
+ def __init__(self, use_spectral_norm=False):
346
+ super(DiscriminatorS, self).__init__()
347
+ norm_f = weight_norm if use_spectral_norm == False else spectral_norm
348
+ self.convs = nn.ModuleList([
349
+ norm_f(Conv1d(1, 128, 15, 1, padding=7)),
350
+ norm_f(Conv1d(128, 128, 41, 2, groups=4, padding=20)),
351
+ norm_f(Conv1d(128, 256, 41, 2, groups=16, padding=20)),
352
+ norm_f(Conv1d(256, 512, 41, 4, groups=16, padding=20)),
353
+ norm_f(Conv1d(512, 1024, 41, 4, groups=16, padding=20)),
354
+ norm_f(Conv1d(1024, 1024, 41, 1, groups=16, padding=20)),
355
+ norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
356
+ ])
357
+ self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
358
+
359
+ def forward(self, x):
360
+ fmap = []
361
+ for l in self.convs:
362
+ x = l(x)
363
+ x = F.leaky_relu(x, LRELU_SLOPE)
364
+ fmap.append(x)
365
+ x = self.conv_post(x)
366
+ fmap.append(x)
367
+ x = torch.flatten(x, 1, -1)
368
+
369
+ return x, fmap
370
+
371
+
372
+ class MultiScaleDiscriminator(torch.nn.Module):
373
+ def __init__(self):
374
+ super(MultiScaleDiscriminator, self).__init__()
375
+ self.discriminators = nn.ModuleList([
376
+ DiscriminatorS(use_spectral_norm=True),
377
+ DiscriminatorS(),
378
+ DiscriminatorS(),
379
+ ])
380
+ self.meanpools = nn.ModuleList([
381
+ AvgPool1d(4, 2, padding=2),
382
+ AvgPool1d(4, 2, padding=2)
383
+ ])
384
+
385
+ def forward(self, y, y_hat):
386
+ y_d_rs = []
387
+ y_d_gs = []
388
+ fmap_rs = []
389
+ fmap_gs = []
390
+ for i, d in enumerate(self.discriminators):
391
+ if i != 0:
392
+ y = self.meanpools[i - 1](y)
393
+ y_hat = self.meanpools[i - 1](y_hat)
394
+ y_d_r, fmap_r = d(y)
395
+ y_d_g, fmap_g = d(y_hat)
396
+ y_d_rs.append(y_d_r)
397
+ fmap_rs.append(fmap_r)
398
+ y_d_gs.append(y_d_g)
399
+ fmap_gs.append(fmap_g)
400
+
401
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
402
+
403
+
404
+ def feature_loss(fmap_r, fmap_g):
405
+ loss = 0
406
+ for dr, dg in zip(fmap_r, fmap_g):
407
+ for rl, gl in zip(dr, dg):
408
+ loss += torch.mean(torch.abs(rl - gl))
409
+
410
+ return loss * 2
411
+
412
+
413
+ def discriminator_loss(disc_real_outputs, disc_generated_outputs):
414
+ loss = 0
415
+ r_losses = []
416
+ g_losses = []
417
+ for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
418
+ r_loss = torch.mean((1 - dr) ** 2)
419
+ g_loss = torch.mean(dg ** 2)
420
+ loss += (r_loss + g_loss)
421
+ r_losses.append(r_loss.item())
422
+ g_losses.append(g_loss.item())
423
+
424
+ return loss, r_losses, g_losses
425
+
426
+
427
+ def generator_loss(disc_outputs):
428
+ loss = 0
429
+ gen_losses = []
430
+ for dg in disc_outputs:
431
+ l = torch.mean((1 - dg) ** 2)
432
+ gen_losses.append(l)
433
+ loss += l
434
+
435
+ return loss, gen_losses
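A hedged sketch of loading and running the generator defined above; the checkpoint path is an assumption (load_model expects a config.json next to the .pt file), and the tensor shapes follow Generator.forward(mel, f0):

```python
# A minimal sketch, assuming a pretrained NSF-HiFiGAN checkpoint at this example path.
import torch
from nsf_hifigan.models import load_model

generator, h = load_model('pretrain/nsf_hifigan/model', device='cpu')

frames = 100
mel = torch.randn(1, h.num_mels, frames)   # (batch, num_mels, frames)
f0 = 220.0 * torch.ones(1, frames)         # frame-level f0 in Hz
with torch.no_grad():
    wav = generator(mel, f0)               # (batch, 1, frames * hop)
print(wav.shape)
```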
nsf_hifigan/nvSTFT.py ADDED
@@ -0,0 +1,134 @@
1
+ import math
2
+ import os
3
+ os.environ["LRU_CACHE_CAPACITY"] = "3"
4
+ import random
5
+ import torch
6
+ import torch.utils.data
7
+ import numpy as np
8
+ import librosa
9
+ from librosa.util import normalize
10
+ from librosa.filters import mel as librosa_mel_fn
11
+ from scipy.io.wavfile import read
12
+ import soundfile as sf
13
+ import torch.nn.functional as F
14
+
15
+ def load_wav_to_torch(full_path, target_sr=None, return_empty_on_exception=False):
16
+ sampling_rate = None
17
+ try:
18
+ data, sampling_rate = sf.read(full_path, always_2d=True) # load with soundfile
19
+ except Exception as ex:
20
+ print(f"'{full_path}' failed to load.\nException:")
21
+ print(ex)
22
+ if return_empty_on_exception:
23
+ return [], sampling_rate or target_sr or 48000
24
+ else:
25
+ raise Exception(ex)
26
+
27
+ if len(data.shape) > 1:
28
+ data = data[:, 0]
29
+ assert len(data) > 2# check duration of audio file is > 2 samples (because otherwise the slice operation was on the wrong dimension)
30
+
31
+ if np.issubdtype(data.dtype, np.integer): # if audio data is type int
32
+ max_mag = -np.iinfo(data.dtype).min # maximum magnitude = min possible value of intXX
33
+ else: # if audio data is type fp32
34
+ max_mag = max(np.amax(data), -np.amin(data))
35
+ max_mag = (2**31)+1 if max_mag > (2**15) else ((2**15)+1 if max_mag > 1.01 else 1.0) # data should be either 16-bit INT, 32-bit INT or [-1 to 1] float32
36
+
37
+ data = torch.FloatTensor(data.astype(np.float32))/max_mag
38
+
39
+ if (torch.isinf(data) | torch.isnan(data)).any() and return_empty_on_exception:# resample will crash with inf/NaN inputs. return_empty_on_exception will return empty arr instead of except
40
+ return [], sampling_rate or target_sr or 48000
41
+ if target_sr is not None and sampling_rate != target_sr:
42
+ data = torch.from_numpy(librosa.core.resample(data.numpy(), orig_sr=sampling_rate, target_sr=target_sr))
43
+ sampling_rate = target_sr
44
+
45
+ return data, sampling_rate
46
+
47
+ def dynamic_range_compression(x, C=1, clip_val=1e-5):
48
+ return np.log(np.clip(x, a_min=clip_val, a_max=None) * C)
49
+
50
+ def dynamic_range_decompression(x, C=1):
51
+ return np.exp(x) / C
52
+
53
+ def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
54
+ return torch.log(torch.clamp(x, min=clip_val) * C)
55
+
56
+ def dynamic_range_decompression_torch(x, C=1):
57
+ return torch.exp(x) / C
58
+
59
+ class STFT():
60
+ def __init__(self, sr=22050, n_mels=80, n_fft=1024, win_size=1024, hop_length=256, fmin=20, fmax=11025, clip_val=1e-5):
61
+ self.target_sr = sr
62
+
63
+ self.n_mels = n_mels
64
+ self.n_fft = n_fft
65
+ self.win_size = win_size
66
+ self.hop_length = hop_length
67
+ self.fmin = fmin
68
+ self.fmax = fmax
69
+ self.clip_val = clip_val
70
+ self.mel_basis = {}
71
+ self.hann_window = {}
72
+
73
+ def get_mel(self, y, keyshift=0, speed=1, center=False):
74
+ sampling_rate = self.target_sr
75
+ n_mels = self.n_mels
76
+ n_fft = self.n_fft
77
+ win_size = self.win_size
78
+ hop_length = self.hop_length
79
+ fmin = self.fmin
80
+ fmax = self.fmax
81
+ clip_val = self.clip_val
82
+
83
+ factor = 2 ** (keyshift / 12)
84
+ n_fft_new = int(np.round(n_fft * factor))
85
+ win_size_new = int(np.round(win_size * factor))
86
+ hop_length_new = int(np.round(hop_length * speed))
87
+
88
+ if torch.min(y) < -1.:
89
+ print('min value is ', torch.min(y))
90
+ if torch.max(y) > 1.:
91
+ print('max value is ', torch.max(y))
92
+
93
+ mel_basis_key = str(fmax)+'_'+str(y.device)
94
+ if mel_basis_key not in self.mel_basis:
95
+ mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)
96
+ self.mel_basis[mel_basis_key] = torch.from_numpy(mel).float().to(y.device)
97
+
98
+ keyshift_key = str(keyshift)+'_'+str(y.device)
99
+ if keyshift_key not in self.hann_window:
100
+ self.hann_window[keyshift_key] = torch.hann_window(win_size_new).to(y.device)
101
+
102
+ pad_left = (win_size_new - hop_length_new) //2
103
+ pad_right = max((win_size_new- hop_length_new + 1) //2, win_size_new - y.size(-1) - pad_left)
104
+ if pad_right < y.size(-1):
105
+ mode = 'reflect'
106
+ else:
107
+ mode = 'constant'
108
+ y = torch.nn.functional.pad(y.unsqueeze(1), (pad_left, pad_right), mode = mode)
109
+ y = y.squeeze(1)
110
+
111
+ spec = torch.stft(y, n_fft_new, hop_length=hop_length_new, win_length=win_size_new, window=self.hann_window[keyshift_key],
112
+ center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
113
+ # print(111,spec)
114
+ spec = torch.sqrt(spec.pow(2).sum(-1)+(1e-9))
115
+ if keyshift != 0:
116
+ size = n_fft // 2 + 1
117
+ resize = spec.size(1)
118
+ if resize < size:
119
+ spec = F.pad(spec, (0, 0, 0, size-resize))
120
+ spec = spec[:, :size, :] * win_size / win_size_new
121
+
122
+ # print(222,spec)
123
+ spec = torch.matmul(self.mel_basis[mel_basis_key], spec)
124
+ # print(333,spec)
125
+ spec = dynamic_range_compression_torch(spec, clip_val=clip_val)
126
+ # print(444,spec)
127
+ return spec
128
+
129
+ def __call__(self, audiopath):
130
+ audio, sr = load_wav_to_torch(audiopath, target_sr=self.target_sr)
131
+ spect = self.get_mel(audio.unsqueeze(0)).squeeze(0)
132
+ return spect
133
+
134
+ stft = STFT()
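A short sketch of the STFT helper above, which caches mel filterbanks and Hann windows per device and returns log-mel spectrograms; the parameters are illustrative, not the settings of any particular pretrained vocoder:

```python
# A minimal sketch with example parameters.
import torch
from nsf_hifigan.nvSTFT import STFT

stft = STFT(sr=44100, n_mels=128, n_fft=2048, win_size=2048, hop_length=512, fmin=40, fmax=16000)
audio = torch.randn(1, 44100)     # one second of audio, shape (batch, samples)
mel = stft.get_mel(audio)         # (batch, n_mels, frames)
print(mel.shape)
```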
nsf_hifigan/utils.py ADDED
@@ -0,0 +1,68 @@
1
+ import glob
2
+ import os
3
+ import matplotlib
4
+ import torch
5
+ from torch.nn.utils import weight_norm
6
+ matplotlib.use("Agg")
7
+ import matplotlib.pylab as plt
8
+
9
+
10
+ def plot_spectrogram(spectrogram):
11
+ fig, ax = plt.subplots(figsize=(10, 2))
12
+ im = ax.imshow(spectrogram, aspect="auto", origin="lower",
13
+ interpolation='none')
14
+ plt.colorbar(im, ax=ax)
15
+
16
+ fig.canvas.draw()
17
+ plt.close()
18
+
19
+ return fig
20
+
21
+
22
+ def init_weights(m, mean=0.0, std=0.01):
23
+ classname = m.__class__.__name__
24
+ if classname.find("Conv") != -1:
25
+ m.weight.data.normal_(mean, std)
26
+
27
+
28
+ def apply_weight_norm(m):
29
+ classname = m.__class__.__name__
30
+ if classname.find("Conv") != -1:
31
+ weight_norm(m)
32
+
33
+
34
+ def get_padding(kernel_size, dilation=1):
35
+ return int((kernel_size*dilation - dilation)/2)
36
+
37
+
38
+ def load_checkpoint(filepath, device):
39
+ assert os.path.isfile(filepath)
40
+ print("Loading '{}'".format(filepath))
41
+ checkpoint_dict = torch.load(filepath, map_location=device)
42
+ print("Complete.")
43
+ return checkpoint_dict
44
+
45
+
46
+ def save_checkpoint(filepath, obj):
47
+ print("Saving checkpoint to {}".format(filepath))
48
+ torch.save(obj, filepath)
49
+ print("Complete.")
50
+
51
+
52
+ def del_old_checkpoints(cp_dir, prefix, n_models=2):
53
+ pattern = os.path.join(cp_dir, prefix + '????????')
54
+ cp_list = glob.glob(pattern) # get checkpoint paths
55
+ cp_list = sorted(cp_list)# sort by iter
56
+ if len(cp_list) > n_models: # if more than n_models models are found
57
+ for cp in cp_list[:-n_models]:# delete the oldest models other than lastest n_models
58
+ open(cp, 'w').close()# empty file contents
59
+ os.unlink(cp)# delete file (move to trash when using Colab)
60
+
61
+
62
+ def scan_checkpoint(cp_dir, prefix):
63
+ pattern = os.path.join(cp_dir, prefix + '????????')
64
+ cp_list = glob.glob(pattern)
65
+ if len(cp_list) == 0:
66
+ return None
67
+ return sorted(cp_list)[-1]
68
+
preprocess.py ADDED
@@ -0,0 +1,133 @@
1
+ import os
2
+ import numpy as np
3
+ import librosa
4
+ import torch
5
+ import pyworld as pw
6
+ import parselmouth
7
+ import argparse
8
+ import shutil
9
+ from logger import utils
10
+ from tqdm import tqdm
11
+ from ddsp.vocoder import F0_Extractor, Volume_Extractor, Units_Encoder
12
+ from logger.utils import traverse_dir
13
+ import concurrent.futures
14
+
15
+ def parse_args(args=None, namespace=None):
16
+ """Parse command-line arguments."""
17
+ parser = argparse.ArgumentParser()
18
+ parser.add_argument(
19
+ "-c",
20
+ "--config",
21
+ type=str,
22
+ required=True,
23
+ help="path to the config file")
24
+ return parser.parse_args(args=args, namespace=namespace)
25
+
26
+ def preprocess(path, f0_extractor, volume_extractor, units_encoder, sample_rate, hop_size, device = 'cuda'):
27
+
28
+ path_srcdir = os.path.join(path, 'audio')
29
+ path_unitsdir = os.path.join(path, 'units')
30
+ path_f0dir = os.path.join(path, 'f0')
31
+ path_volumedir = os.path.join(path, 'volume')
32
+ path_skipdir = os.path.join(path, 'skip')
33
+
34
+ # list files
35
+ filelist = traverse_dir(
36
+ path_srcdir,
37
+ extension='wav',
38
+ is_pure=True,
39
+ is_sort=True,
40
+ is_ext=True)
41
+
42
+ # run
43
+ def process(file):
44
+ ext = file.split('.')[-1]
45
+ binfile = file[:-(len(ext)+1)]+'.npy'
46
+ path_srcfile = os.path.join(path_srcdir, file)
47
+ path_unitsfile = os.path.join(path_unitsdir, binfile)
48
+ path_f0file = os.path.join(path_f0dir, binfile)
49
+ path_volumefile = os.path.join(path_volumedir, binfile)
50
+ path_skipfile = os.path.join(path_skipdir, file)
51
+
52
+ # load audio
53
+ audio, _ = librosa.load(path_srcfile, sr=sample_rate)
54
+ if len(audio.shape) > 1:
55
+ audio = librosa.to_mono(audio)
56
+ audio_t = torch.from_numpy(audio).float().to(device)
57
+ audio_t = audio_t.unsqueeze(0)
58
+
59
+ # extract volume
60
+ volume = volume_extractor.extract(audio)
61
+
62
+ # units encode
63
+ units_t = units_encoder.encode(audio_t, sample_rate, hop_size)
64
+ units = units_t.squeeze().to('cpu').numpy()
65
+
66
+ # extract f0
67
+ f0 = f0_extractor.extract(audio, uv_interp = False)
68
+
69
+ uv = f0 == 0
70
+ if len(f0[~uv]) > 0:
71
+ # interpolate the unvoiced f0
72
+ f0[uv] = np.interp(np.where(uv)[0], np.where(~uv)[0], f0[~uv])
73
+
74
+ # save npy
75
+ os.makedirs(os.path.dirname(path_unitsfile), exist_ok=True)
76
+ np.save(path_unitsfile, units)
77
+ os.makedirs(os.path.dirname(path_f0file), exist_ok=True)
78
+ np.save(path_f0file, f0)
79
+ os.makedirs(os.path.dirname(path_volumefile), exist_ok=True)
80
+ np.save(path_volumefile, volume)
81
+ else:
82
+ print('\n[Error] F0 extraction failed: ' + path_srcfile)
83
+ os.makedirs(os.path.dirname(path_skipfile), exist_ok=True)
84
+ shutil.move(path_srcfile, os.path.dirname(path_skipfile))
85
+ print('This file has been moved to ' + path_skipfile)
86
+ print('Preprocess the audio clips in :', path_srcdir)
87
+
88
+ # single process
89
+ for file in tqdm(filelist, total=len(filelist)):
90
+ process(file)
91
+
92
+ # multi-process (have bugs)
93
+ '''
94
+ with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
95
+ list(tqdm(executor.map(process, filelist), total=len(filelist)))
96
+ '''
97
+
98
+ if __name__ == '__main__':
99
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
100
+
101
+ # parse commands
102
+ cmd = parse_args()
103
+
104
+ # load config
105
+ args = utils.load_config(cmd.config)
106
+ sample_rate = args.data.sampling_rate
107
+ hop_size = args.data.block_size
108
+
109
+ # initialize f0 extractor
110
+ f0_extractor = F0_Extractor(
111
+ args.data.f0_extractor,
112
+ args.data.sampling_rate,
113
+ args.data.block_size,
114
+ args.data.f0_min,
115
+ args.data.f0_max)
116
+
117
+ # initialize volume extractor
118
+ volume_extractor = Volume_Extractor(args.data.block_size)
119
+
120
+ # initialize units encoder
121
+ units_encoder = Units_Encoder(
122
+ args.data.encoder,
123
+ args.data.encoder_ckpt,
124
+ args.data.encoder_sample_rate,
125
+ args.data.encoder_hop_size,
126
+ device = device)
127
+
128
+ # preprocess training set
129
+ preprocess(args.data.train_path, f0_extractor, volume_extractor, units_encoder, sample_rate, hop_size, device = device)
130
+
131
+ # preprocess validation set
132
+ preprocess(args.data.valid_path, f0_extractor, volume_extractor, units_encoder, sample_rate, hop_size, device = device)
133
+
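The per-file work done by process() boils down to extracting f0, volume and units for each clip. A hedged sketch of the f0/volume part on a single file, with an example path and extractor settings that would normally come from the config YAML:

```python
# A minimal sketch; the wav path, sample rate, hop size and f0 range are assumptions.
import librosa
import numpy as np
from ddsp.vocoder import F0_Extractor, Volume_Extractor

sample_rate, hop_size = 44100, 512
audio, _ = librosa.load('data/train/audio/1/example.wav', sr=sample_rate)

f0_extractor = F0_Extractor('parselmouth', sample_rate, hop_size, 65, 800)
f0 = f0_extractor.extract(audio, uv_interp=False)
uv = f0 == 0
if len(f0[~uv]) > 0:
    f0[uv] = np.interp(np.where(uv)[0], np.where(~uv)[0], f0[~uv])  # fill unvoiced frames

volume = Volume_Extractor(hop_size).extract(audio)
print(f0.shape, volume.shape)
```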
pretrain/gitkeep ADDED
File without changes
requirements.txt ADDED
@@ -0,0 +1,23 @@
1
+ einops
2
+ fairseq
3
+ flask
4
+ flask_cors
5
+ gin
6
+ gin_config
7
+ librosa
8
+ local_attention
9
+ matplotlib
10
+ numpy
11
+ praat-parselmouth
12
+ pyworld
13
+ PyYAML
14
+ resampy
15
+ scikit_learn
16
+ scipy
17
+ SoundFile
18
+ tensorboard
19
+ torchcrepe
20
+ tqdm
21
+ wave
22
+ pysimplegui
23
+ sounddevice
samples/source.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:22422561a72d7bcb588503be9a1188057f5ebd910c796f7c77c268f484de9115
3
+ size 3087746
samples/svc-kiritan+12key.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f009b74e725aade5acc0902f72085e2a6cb63e3ff7db21e8662f8521ebca18c1
3
+ size 2830380
samples/svc-opencpop+12key.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e41b18c94d12ef0a1b1d0cfcce87ccf3db5da931f4b3718a26c0a2c018d19ba1
3
+ size 2830380
samples/svc-opencpop_kiritan_mix+12key.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4f531ec9601371358da2b83eeb5326482ba534aba9e97d66a5aad5602118ce09
3
+ size 2830380
slicer.py ADDED
@@ -0,0 +1,146 @@
1
+ import librosa
2
+ import torch
3
+ import torchaudio
4
+
5
+
6
+ class Slicer:
7
+ def __init__(self,
8
+ sr: int,
9
+ threshold: float = -40.,
10
+ min_length: int = 5000,
11
+ min_interval: int = 300,
12
+ hop_size: int = 20,
13
+ max_sil_kept: int = 5000):
14
+ if not min_length >= min_interval >= hop_size:
15
+ raise ValueError('The following condition must be satisfied: min_length >= min_interval >= hop_size')
16
+ if not max_sil_kept >= hop_size:
17
+ raise ValueError('The following condition must be satisfied: max_sil_kept >= hop_size')
18
+ min_interval = sr * min_interval / 1000
19
+ self.threshold = 10 ** (threshold / 20.)
20
+ self.hop_size = round(sr * hop_size / 1000)
21
+ self.win_size = min(round(min_interval), 4 * self.hop_size)
22
+ self.min_length = round(sr * min_length / 1000 / self.hop_size)
23
+ self.min_interval = round(min_interval / self.hop_size)
24
+ self.max_sil_kept = round(sr * max_sil_kept / 1000 / self.hop_size)
25
+
26
+ def _apply_slice(self, waveform, begin, end):
27
+ if len(waveform.shape) > 1:
28
+ return waveform[:, begin * self.hop_size: min(waveform.shape[1], end * self.hop_size)]
29
+ else:
30
+ return waveform[begin * self.hop_size: min(waveform.shape[0], end * self.hop_size)]
31
+
32
+ # @timeit
33
+ def slice(self, waveform):
34
+ if len(waveform.shape) > 1:
35
+ samples = librosa.to_mono(waveform)
36
+ else:
37
+ samples = waveform
38
+ if samples.shape[0] <= self.min_length:
39
+ return {"0": {"slice": False, "split_time": f"0,{len(waveform)}"}}
40
+ rms_list = librosa.feature.rms(y=samples, frame_length=self.win_size, hop_length=self.hop_size).squeeze(0)
41
+ sil_tags = []
42
+ silence_start = None
43
+ clip_start = 0
44
+ for i, rms in enumerate(rms_list):
45
+ # Keep looping while frame is silent.
46
+ if rms < self.threshold:
47
+ # Record start of silent frames.
48
+ if silence_start is None:
49
+ silence_start = i
50
+ continue
51
+ # Keep looping while frame is not silent and silence start has not been recorded.
52
+ if silence_start is None:
53
+ continue
54
+ # Clear recorded silence start if interval is not enough or clip is too short
55
+ is_leading_silence = silence_start == 0 and i > self.max_sil_kept
56
+ need_slice_middle = i - silence_start >= self.min_interval and i - clip_start >= self.min_length
57
+ if not is_leading_silence and not need_slice_middle:
58
+ silence_start = None
59
+ continue
60
+ # Need slicing. Record the range of silent frames to be removed.
61
+ if i - silence_start <= self.max_sil_kept:
62
+ pos = rms_list[silence_start: i + 1].argmin() + silence_start
63
+ if silence_start == 0:
64
+ sil_tags.append((0, pos))
65
+ else:
66
+ sil_tags.append((pos, pos))
67
+ clip_start = pos
68
+ elif i - silence_start <= self.max_sil_kept * 2:
69
+ pos = rms_list[i - self.max_sil_kept: silence_start + self.max_sil_kept + 1].argmin()
70
+ pos += i - self.max_sil_kept
71
+ pos_l = rms_list[silence_start: silence_start + self.max_sil_kept + 1].argmin() + silence_start
72
+ pos_r = rms_list[i - self.max_sil_kept: i + 1].argmin() + i - self.max_sil_kept
73
+ if silence_start == 0:
74
+ sil_tags.append((0, pos_r))
75
+ clip_start = pos_r
76
+ else:
77
+ sil_tags.append((min(pos_l, pos), max(pos_r, pos)))
78
+ clip_start = max(pos_r, pos)
79
+ else:
80
+ pos_l = rms_list[silence_start: silence_start + self.max_sil_kept + 1].argmin() + silence_start
81
+ pos_r = rms_list[i - self.max_sil_kept: i + 1].argmin() + i - self.max_sil_kept
82
+ if silence_start == 0:
83
+ sil_tags.append((0, pos_r))
84
+ else:
85
+ sil_tags.append((pos_l, pos_r))
86
+ clip_start = pos_r
87
+ silence_start = None
88
+ # Deal with trailing silence.
89
+ total_frames = rms_list.shape[0]
90
+ if silence_start is not None and total_frames - silence_start >= self.min_interval:
91
+ silence_end = min(total_frames, silence_start + self.max_sil_kept)
92
+ pos = rms_list[silence_start: silence_end + 1].argmin() + silence_start
93
+ sil_tags.append((pos, total_frames + 1))
94
+ # Apply and return slices.
95
+ if len(sil_tags) == 0:
96
+ return {"0": {"slice": False, "split_time": f"0,{len(waveform)}"}}
97
+ else:
98
+ chunks = []
99
+ # the first silent segment does not start at the very beginning, so prepend the leading voiced segment
100
+ if sil_tags[0][0]:
101
+ chunks.append(
102
+ {"slice": False, "split_time": f"0,{min(waveform.shape[0], sil_tags[0][0] * self.hop_size)}"})
103
+ for i in range(0, len(sil_tags)):
104
+ # mark the voiced segments (skipping the first one)
105
+ if i:
106
+ chunks.append({"slice": False,
107
+ "split_time": f"{sil_tags[i - 1][1] * self.hop_size},{min(waveform.shape[0], sil_tags[i][0] * self.hop_size)}"})
108
+ # mark every silent segment
109
+ chunks.append({"slice": True,
110
+ "split_time": f"{sil_tags[i][0] * self.hop_size},{min(waveform.shape[0], sil_tags[i][1] * self.hop_size)}"})
111
+ # the last silent segment does not reach the end, so append the trailing segment
112
+ if sil_tags[-1][1] * self.hop_size < len(waveform):
113
+ chunks.append({"slice": False, "split_time": f"{sil_tags[-1][1] * self.hop_size},{len(waveform)}"})
114
+ chunk_dict = {}
115
+ for i in range(len(chunks)):
116
+ chunk_dict[str(i)] = chunks[i]
117
+ return chunk_dict
118
+
119
+
120
+ def cut(audio_path, db_thresh=-30, min_len=5000, flask_mode=False, flask_sr=None):
121
+ if not flask_mode:
122
+ audio, sr = librosa.load(audio_path, sr=None)
123
+ else:
124
+ audio = audio_path
125
+ sr = flask_sr
126
+ slicer = Slicer(
127
+ sr=sr,
128
+ threshold=db_thresh,
129
+ min_length=min_len
130
+ )
131
+ chunks = slicer.slice(audio)
132
+ return chunks
133
+
134
+
135
+ def chunks2audio(audio_path, chunks):
136
+ chunks = dict(chunks)
137
+ audio, sr = torchaudio.load(audio_path)
138
+ if len(audio.shape) == 2 and audio.shape[1] >= 2:
139
+ audio = torch.mean(audio, dim=0).unsqueeze(0)
140
+ audio = audio.cpu().numpy()[0]
141
+ result = []
142
+ for k, v in chunks.items():
143
+ tag = v["split_time"].split(",")
144
+ if tag[0] != tag[1]:
145
+ result.append((v["slice"], audio[int(tag[0]):int(tag[1])]))
146
+ return result, sr
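For orientation, here is a minimal usage sketch of the two helpers above: `cut` returns the split-point dictionary produced by `Slicer.slice`, and `chunks2audio` turns it back into `(is_silence, samples)` pairs. The file name and settings below are illustrative placeholders, not values taken from this repository.

```python
# Minimal sketch (illustrative path and settings): slice a recording and
# keep only the voiced segments, e.g. before feature extraction.
chunks = cut("input.wav", db_thresh=-30, min_len=5000)
segments, sr = chunks2audio("input.wav", chunks)
voiced = [samples for is_silence, samples in segments if not is_silence]
```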
solver.py ADDED
@@ -0,0 +1,151 @@
+ import os
+ import time
+ import numpy as np
+ import torch
+
+ from logger.saver import Saver
+ from logger import utils
+
+
+ def test(args, model, loss_func, loader_test, saver):
+     print(' [*] testing...')
+     model.eval()
+
+     # losses
+     test_loss = 0.
+     test_loss_rss = 0.
+     test_loss_uv = 0.
+
+     # initialization
+     num_batches = len(loader_test)
+     rtf_all = []
+
+     # run
+     with torch.no_grad():
+         for bidx, data in enumerate(loader_test):
+             fn = data['name'][0]
+             print('--------')
+             print('{}/{} - {}'.format(bidx, num_batches, fn))
+
+             # unpack data
+             for k in data.keys():
+                 if k != 'name':
+                     data[k] = data[k].to(args.device)
+             print('>>', data['name'][0])
+
+             # forward
+             st_time = time.time()
+             signal, _, (s_h, s_n) = model(data['units'], data['f0'], data['volume'], data['spk_id'])
+             ed_time = time.time()
+
+             # crop both signals to the same length before computing the loss
+             min_len = np.min([signal.shape[1], data['audio'].shape[1]])
+             signal = signal[:, :min_len]
+             data['audio'] = data['audio'][:, :min_len]
+
+             # RTF (real-time factor): synthesis time divided by audio duration
+             run_time = ed_time - st_time
+             song_time = data['audio'].shape[-1] / args.data.sampling_rate
+             rtf = run_time / song_time
+             print('RTF: {} | {} / {}'.format(rtf, run_time, song_time))
+             rtf_all.append(rtf)
+
+             # loss
+             loss = loss_func(signal, data['audio'])
+             test_loss += loss.item()
+
+             # log
+             saver.log_audio({fn + '/gt.wav': data['audio'], fn + '/pred.wav': signal})
+
+     # report the average loss over the validation set
+     test_loss /= num_batches
+
+     print(' [test_loss] test_loss:', test_loss)
+     print(' Real Time Factor:', np.mean(rtf_all))
+     return test_loss
+
+
+ def train(args, initial_global_step, model, optimizer, loss_func, loader_train, loader_test):
+     # saver
+     saver = Saver(args, initial_global_step=initial_global_step)
+
+     # model size
+     params_count = utils.get_network_paras_amount({'model': model})
+     saver.log_info('--- model size ---')
+     saver.log_info(params_count)
+
+     # run
+     best_loss = np.inf
+     num_batches = len(loader_train)
+     model.train()
+     saver.log_info('======= start training =======')
+     for epoch in range(args.train.epochs):
+         for batch_idx, data in enumerate(loader_train):
+             saver.global_step_increment()
+             optimizer.zero_grad()
+
+             # unpack data
+             for k in data.keys():
+                 if k != 'name':
+                     data[k] = data[k].to(args.device)
+
+             # forward
+             signal, _, (s_h, s_n) = model(data['units'].float(), data['f0'], data['volume'], data['spk_id'], infer=False)
+
+             # loss
+             loss = loss_func(signal, data['audio'])
+
+             # handle nan loss
+             if torch.isnan(loss):
+                 raise ValueError(' [x] nan loss ')
+             else:
+                 # backpropagate
+                 loss.backward()
+                 optimizer.step()
+
+             # log loss
+             if saver.global_step % args.train.interval_log == 0:
+                 saver.log_info(
+                     'epoch: {} | {:3d}/{:3d} | {} | batch/s: {:.2f} | loss: {:.3f} | time: {} | step: {}'.format(
+                         epoch,
+                         batch_idx,
+                         num_batches,
+                         args.env.expdir,
+                         args.train.interval_log / saver.get_interval_time(),
+                         loss.item(),
+                         saver.get_total_time(),
+                         saver.global_step
+                     )
+                 )
+
+                 saver.log_value({
+                     'train/loss': loss.item()
+                 })
+
+             # validation
+             if saver.global_step % args.train.interval_val == 0:
+                 # save the latest checkpoint
+                 saver.save_model(model, optimizer, postfix=f'{saver.global_step}')
+
+                 # run the validation set
+                 test_loss = test(args, model, loss_func, loader_test, saver)
+
+                 saver.log_info(
+                     ' --- <validation> --- \nloss: {:.3f}. '.format(
+                         test_loss,
+                     )
+                 )
+
+                 saver.log_value({
+                     'validation/loss': test_loss
+                 })
+                 model.train()
+
+                 # save the best model
+                 if test_loss < best_loss:
+                     saver.log_info(' [V] best model updated.')
+                     saver.save_model(model, optimizer, postfix='best')
+                     best_loss = test_loss
+
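A note on dependencies: `solver.py` relies on a `Saver` object from `logger/saver.py`, which is not part of this diff. The stub below is only a sketch of the interface implied by the calls above; the method names and arguments come from the code, but the bodies are placeholders rather than the real implementation.

```python
# Interface sketch inferred from solver.py above; the real class lives in
# logger/saver.py and is not shown in this diff.
class SaverStub:
    def __init__(self, args, initial_global_step=0):
        self.global_step = initial_global_step

    def global_step_increment(self):          # called once per training batch
        self.global_step += 1

    def log_info(self, msg):                  # plain-text training log
        print(msg)

    def log_value(self, scalars):             # e.g. {'train/loss': 0.123}
        pass

    def log_audio(self, audios):              # e.g. {'name/pred.wav': tensor}
        pass

    def save_model(self, model, optimizer, postfix=''):  # write a checkpoint
        pass

    def get_interval_time(self):              # seconds since the last log interval
        return 1.0

    def get_total_time(self):                 # formatted total elapsed time
        return '0:00:00'
```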
train.py ADDED
@@ -0,0 +1,93 @@
+ import os
+ import argparse
+ import torch
+
+ from logger import utils
+ from data_loaders import get_data_loaders
+ from solver import train
+ from ddsp.vocoder import Sins, CombSub, CombSubFast
+ from ddsp.loss import RSSLoss
+
+
+ def parse_args(args=None, namespace=None):
+     """Parse command-line arguments."""
+     parser = argparse.ArgumentParser()
+     parser.add_argument(
+         "-c",
+         "--config",
+         type=str,
+         required=True,
+         help="path to the config file")
+     return parser.parse_args(args=args, namespace=namespace)
+
+
+ if __name__ == '__main__':
+     # parse commands
+     cmd = parse_args()
+
+     # load config
+     args = utils.load_config(cmd.config)
+     print(' > config:', cmd.config)
+     print(' > exp:', args.env.expdir)
+
+     # build the DDSP model selected in the config
+     model = None
+
+     if args.model.type == 'Sins':
+         model = Sins(
+             sampling_rate=args.data.sampling_rate,
+             block_size=args.data.block_size,
+             n_harmonics=args.model.n_harmonics,
+             n_mag_allpass=args.model.n_mag_allpass,
+             n_mag_noise=args.model.n_mag_noise,
+             n_unit=args.data.encoder_out_channels,
+             n_spk=args.model.n_spk)
+
+     elif args.model.type == 'CombSub':
+         model = CombSub(
+             sampling_rate=args.data.sampling_rate,
+             block_size=args.data.block_size,
+             n_mag_allpass=args.model.n_mag_allpass,
+             n_mag_harmonic=args.model.n_mag_harmonic,
+             n_mag_noise=args.model.n_mag_noise,
+             n_unit=args.data.encoder_out_channels,
+             n_spk=args.model.n_spk)
+
+     elif args.model.type == 'CombSubFast':
+         model = CombSubFast(
+             sampling_rate=args.data.sampling_rate,
+             block_size=args.data.block_size,
+             n_unit=args.data.encoder_out_channels,
+             n_spk=args.model.n_spk)
+
+     else:
+         raise ValueError(f" [x] Unknown Model: {args.model.type}")
+
+     # set up the optimizer and restore any existing checkpoint from expdir
+     optimizer = torch.optim.AdamW(model.parameters())
+     initial_global_step, model, optimizer = utils.load_model(args.env.expdir, model, optimizer, device=args.device)
+     for param_group in optimizer.param_groups:
+         param_group['lr'] = args.train.lr
+         param_group['weight_decay'] = args.train.weight_decay
+
+     # loss: spectral loss over FFT scales between fft_min and fft_max
+     loss_func = RSSLoss(args.loss.fft_min, args.loss.fft_max, args.loss.n_scale, device=args.device)
+
+     # device
+     if args.device == 'cuda':
+         torch.cuda.set_device(args.env.gpu_id)
+     model.to(args.device)
+
+     for state in optimizer.state.values():
+         for k, v in state.items():
+             if torch.is_tensor(v):
+                 state[k] = v.to(args.device)
+
+     loss_func.to(args.device)
+
+     # data loaders
+     loader_train, loader_valid = get_data_loaders(args, whole_audio=False)
+
+     # run training
+     train(args, initial_global_step, model, optimizer, loss_func, loader_train, loader_valid)
+
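Finally, a sketch of the configuration that `train.py` and `solver.py` expect. Training is launched as `python train.py -c <config file>`; the key names below are exactly the attributes read in the code above, while the values are illustrative placeholders and the file name `configs/combsub.yaml` is only an assumed example of where such a config would live.

```python
# Illustrative config structure for train.py / solver.py (placeholder values).
# In the repository these fields come from the file passed via -c,
# e.g. something like configs/combsub.yaml, loaded by utils.load_config.
config = {
    'env':    {'expdir': 'exp/my-experiment', 'gpu_id': 0},
    'device': 'cuda',
    'data':   {'sampling_rate': 44100, 'block_size': 512, 'encoder_out_channels': 256},
    'model':  {'type': 'CombSubFast', 'n_spk': 1},   # 'Sins' / 'CombSub' also need n_harmonics / n_mag_* fields (see above)
    'loss':   {'fft_min': 256, 'fft_max': 2048, 'n_scale': 4},
    'train':  {'lr': 0.0005, 'weight_decay': 0, 'epochs': 100000,
               'interval_log': 10, 'interval_val': 2000},
}
```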