baibaibai committed
Commit
2072d0c
1 Parent(s): f6881d0

Upload 39 files

.gitattributes CHANGED
@@ -32,3 +32,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ samples/source.wav filter=lfs diff=lfs merge=lfs -text
+ samples/svc-kiritan+12key.wav filter=lfs diff=lfs merge=lfs -text
+ samples/svc-opencpop_kiritan_mix+12key.wav filter=lfs diff=lfs merge=lfs -text
+ samples/svc-opencpop+12key.wav filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 yxlllc

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md CHANGED
@@ -1,12 +1,150 @@
- ---
- title: DDSP
- emoji: 👁
- colorFrom: pink
- colorTo: purple
- sdk: gradio
- sdk_version: 3.23.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Language: **English** [简体中文](./cn_README.md)
# DDSP-SVC
<div align="center">
<img src="https://storage.googleapis.com/ddsp/github_images/ddsp_logo.png" width="200px" alt="logo"></img>
</div>
End-to-end singing voice conversion system based on DDSP (Differentiable Digital Signal Processing).

## 0. Introduction
DDSP-SVC is a new open-source singing voice conversion project dedicated to developing free AI voice-changer software that can run on ordinary personal computers.

Compared with the better-known [Diff-SVC](https://github.com/prophesier/diff-svc) and [SO-VITS-SVC](https://github.com/svc-develop-team/so-vits-svc), its training and synthesis have much lower hardware requirements, and the training time can be shorter by orders of magnitude.

Although the raw synthesis quality of DDSP is not ideal (the raw output can be heard in TensorBoard during training), after applying the pretrained vocoder-based enhancer, the sound quality for some datasets can reach a level close to SO-VITS-SVC.

If the quality of the training data is very high, Diff-SVC will probably still have the best sound quality. The demo outputs are in the `samples` folder, and the related model checkpoints can be downloaded from the release page.

Disclaimer: Please make sure to only train DDSP-SVC models with **legally obtained authorized data**, and do not use these models or any audio they synthesize for illegal purposes. The author of this repository is not responsible for any infringement, fraud or other illegal acts caused by the use of these model checkpoints and audio.

Update log: not translated here; please see the Chinese version of the README.

## 1. Installing the dependencies
We recommend first installing PyTorch from the [**official website**](https://pytorch.org/), then run:
```bash
pip install -r requirements.txt
```
NOTE: The code has only been tested with Python 3.8 (Windows) + PyTorch 1.9.1 + torchaudio 0.6.0; dependencies that are much newer or older may not work.
## 2. Configuring the pretrained model
UPDATE: The ContentVec encoder is now supported. You can download the pretrained [ContentVec](https://ibm.ent.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr) encoder instead of the HubertSoft encoder and modify the configuration file to use it (a sketch follows this list).
- **(Required)** Download the pretrained [**HubertSoft**](https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt) encoder and put it under the `pretrain/hubert` folder.
- Get the pretrained vocoder-based enhancer from the [DiffSinger Community Vocoders Project](https://openvpi.github.io/vocoders) and unzip it into the `pretrain/` folder.
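For reference, here is a minimal sketch of what the encoder-related fields in `configs/combsub.yaml` might look like after switching to ContentVec. The checkpoint path and the output dimension are assumptions and must match the file you actually downloaded:
```yaml
data:
  encoder: 'contentvec'                 # instead of 'hubertsoft'
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 256             # assumption: set this to the dimension of your ContentVec variant
  encoder_ckpt: pretrain/contentvec/checkpoint_best_legacy_500.pt   # hypothetical path/filename
```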
## 3. Preprocessing

Put all the training data (.wav format audio clips) in the following directory:
`data/train/audio`.
Put all the validation data (.wav format audio clips) in the following directory:
`data/val/audio`.
You can also run
```bash
python draw.py
```
to help you select validation data (you can adjust the parameters in `draw.py` to modify the number of extracted files and other parameters).

Then run
```bash
python preprocess.py -c configs/combsub.yaml
```
for a model using the comb-tooth subtractive synthesiser (**recommended**), or run
```bash
python preprocess.py -c configs/sins.yaml
```
for a model using the sinusoidal additive synthesiser.

You can modify the configuration file `configs/<model_name>.yaml` before preprocessing. The default configuration is suitable for training a 44.1 kHz high-sampling-rate synthesiser on a GTX 1660 graphics card.

NOTE 1: Please keep the sampling rate of all audio clips consistent with the sampling rate in the yaml configuration file! If they are not consistent, the program can still run, but the resampling during training will be very slow.

NOTE 2: About 1000 audio clips are recommended for the training dataset; especially long clips can be cut into short segments, which speeds up training, but every clip should be at least 2 seconds long. If there are too many audio clips, you need a lot of RAM, or you can set the 'cache_all_data' option to false in the configuration file.

NOTE 3: About 10 audio clips are recommended for the validation dataset; please don't put in too many, or validation will be very slow.

NOTE 4: If your dataset is not of very high quality, set 'f0_extractor' to 'crepe' in the config file. The crepe algorithm has the best noise immunity, but at the cost of greatly increasing the time required for data preprocessing.

UPDATE: Multi-speaker training is now supported. The 'n_spk' parameter in the configuration file controls whether it is a multi-speaker model. If you want to train a **multi-speaker** model, the audio folders need to be named with **positive integers not greater than 'n_spk'** to represent speaker ids; the directory structure looks like this:
```bash
# training dataset
# the 1st speaker
data/train/audio/1/aaa.wav
data/train/audio/1/bbb.wav
...
# the 2nd speaker
data/train/audio/2/ccc.wav
data/train/audio/2/ddd.wav
...

# validation dataset
# the 1st speaker
data/val/audio/1/eee.wav
data/val/audio/1/fff.wav
...
# the 2nd speaker
data/val/audio/2/ggg.wav
data/val/audio/2/hhh.wav
...
```
If 'n_spk' = 1, the directory structure of the **single-speaker** model is still supported, i.e.:
```bash
# training dataset
data/train/audio/aaa.wav
data/train/audio/bbb.wav
...
# validation dataset
data/val/audio/ccc.wav
data/val/audio/ddd.wav
...
```

## 4. Training
```bash
# train a combsub model as an example
python train.py -c configs/combsub.yaml
```
The command line for training other models is similar.

You can safely interrupt training; running the same command line again will resume it.

You can also finetune the model: interrupt training first, re-preprocess the new dataset or change the training parameters (batch size, lr, etc.), and then run the same command line.

## 5. Visualization
```bash
# check the training status using tensorboard
tensorboard --logdir=exp
```
Test audio samples will be visible in TensorBoard after the first validation.

NOTE: The test audio samples in TensorBoard are the raw outputs of your DDSP-SVC model, not enhanced by an enhancer. If you want to test the synthesis quality after using the enhancer (which may be higher), please use the method described in the following chapter.
## 6. Testing
(**Recommended**) Enhance the output using the pretrained vocoder-based enhancer:
```bash
# high audio quality in the normal vocal range if enhancer_adaptive_key = 0 (default)
# set enhancer_adaptive_key > 0 to adapt the enhancer to a higher vocal range
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -eak <enhancer_adaptive_key (semitones)>
```
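As a concrete (hypothetical) illustration, a conversion like the `samples/svc-opencpop+12key.wav` demo added in this commit could be produced with a 12-semitone key change; the model path below is only an example and should point to your own trained checkpoint:
```bash
python main.py -i samples/source.wav -m exp/combsub-test/model_best.pt -o svc-out.wav -k 12 -id 1 -eak 0
```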
Raw output of DDSP:
```bash
# fast, but relatively low audio quality (like what you hear in tensorboard)
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -e false
```
For other options about the f0 extractor and the response threshold, see:
```bash
python main.py -h
```
(UPDATE) Speaker mixing is now supported. You can use the "-mix" option to design your own vocal timbre; below is an example:
```bash
# Mix the timbres of the 1st and 2nd speakers in a 0.5 : 0.5 ratio
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -mix "{1:0.5, 2:0.5}" -eak 0
```
## 7. HTTP server and VST support
Start the server with the following command:
```bash
# configs are in this python file, see the comments (Chinese only)
python flask_api.py
```
Currently supported VST client:
https://github.com/zhaohui8969/VST_NetProcess-

## 8. Acknowledgement
* [ddsp](https://github.com/magenta/ddsp)
* [pc-ddsp](https://github.com/yxlllc/pc-ddsp)
* [soft-vc](https://github.com/bshall/soft-vc)
* [DiffSinger (OpenVPI version)](https://github.com/openvpi/DiffSinger)
cn_README.md ADDED
@@ -0,0 +1,158 @@
Language: [English](./README.md) **简体中文**
# DDSP-SVC
<div align="center">
<img src="https://storage.googleapis.com/ddsp/github_images/ddsp_logo.png" width="200px" alt="logo"></img>
</div>
End-to-end singing voice conversion system based on DDSP (Differentiable Digital Signal Processing).

## 0. Introduction
DDSP-SVC is a new open-source singing voice conversion project dedicated to developing free AI voice-changer software that can run on ordinary personal computers.

Compared with the better-known [Diff-SVC](https://github.com/prophesier/diff-svc) and [SO-VITS-SVC](https://github.com/svc-develop-team/so-vits-svc), its training and synthesis have much lower hardware requirements, and the training time is shorter by orders of magnitude.

Although the raw synthesis quality of DDSP is not ideal (the raw output can be heard in TensorBoard during training), after enhancing the audio with the pretrained vocoder-based enhancer, the synthesis quality for some datasets can come close to SO-VITS-SVC.

If the quality of the training data is very high, Diff-SVC will probably still have the best synthesis quality. The `samples` folder contains synthesis examples, and the related model checkpoints can be downloaded from the repository's release page.

Disclaimer: Please make sure to only train DDSP-SVC models with **legally obtained authorized data**, and do not use these models or any audio they synthesize for illegal purposes. The author of this repository is not responsible for any infringement, fraud or other illegal acts caused by the use of these model checkpoints and audio.

1.1 update: multi-speaker training and timbre mixing are supported.

2.0 update: real-time VST plugin support has been started, and the combsub model has been optimized with a large speed-up in training. The old combsub model is still compatible and can be trained with combsub-old.yaml; the sins model is unaffected, but since its training is much slower than combsub, it is no longer recommended in the current version.

## 1. Installing the dependencies
We recommend downloading PyTorch from the [**official PyTorch website**](https://pytorch.org/) first.

Then run
```bash
pip install -r requirements.txt
```
NOTE: The code has only been tested with Python 3.8 (Windows) + PyTorch 1.9.1 + torchaudio 0.6.0; dependencies that are too old or too new may raise errors.
## 2. Configuring the pretrained model
UPDATE: The ContentVec encoder is now supported. You can download the pretrained [ContentVec](https://ibm.ent.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr) encoder instead of the HubertSoft encoder and modify the configuration file to use it.
- **(Required)** Download the pretrained [**HubertSoft**](https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt) encoder and put it in the `pretrain/hubert` folder.
- Download the pretrained vocoder-based enhancer from the [DiffSinger Community Vocoders Project](https://openvpi.github.io/vocoders) and unzip it into the `pretrain/` folder.
## 3. Preprocessing

Put all the training data (.wav format audio clips) in `data/train/audio`.

Put all the validation data (.wav format audio clips) in `data/val/audio`.

You can also run
```bash
python draw.py
```
to help you pick validation data (you can adjust the parameters in `draw.py`, such as the number of extracted files).

Then run
```bash
python preprocess.py -c configs/combsub.yaml
```

to train a model based on the comb-tooth subtractive synthesiser (**recommended**), or run

```bash
python preprocess.py -c configs/sins.yaml
```
to train a model based on the sinusoidal additive synthesiser.

You can modify the configuration file `configs/<model_name>.yaml` before preprocessing.

The default configuration is suitable for training a 44.1 kHz high-sampling-rate synthesiser on a GTX 1660 graphics card.

NOTE 1: Please keep the sampling rate of all audio clips consistent with the sampling rate in the yaml configuration file! If they are inconsistent, the program can still run, but resampling during training will be very slow.

NOTE 2: About 1000 audio clips are recommended for the training dataset; cutting long audio into short segments can speed up training, but no clip should be shorter than 2 seconds. If there are too many clips, a lot of RAM is needed; setting the `cache_all_data` option in the configuration file to false can solve this.

NOTE 3: About 10 audio clips are recommended for the validation dataset; do not put in too many, or validation will be very slow.

NOTE 4: If your dataset is not of very high quality, set 'f0_extractor' to 'crepe' in the configuration file. The crepe algorithm has the best noise immunity, but at the cost of greatly increasing the time required for data preprocessing.

UPDATE: Multi-speaker training is now supported. The 'n_spk' parameter in the configuration file controls whether a multi-speaker model is trained. If you want to train a **multi-speaker** model, the audio folders must be named with **positive integers not greater than 'n_spk'** to number the speakers; the directory structure looks like this:
```bash
# training dataset
# the 1st speaker
data/train/audio/1/aaa.wav
data/train/audio/1/bbb.wav
...
# the 2nd speaker
data/train/audio/2/ccc.wav
data/train/audio/2/ddd.wav
...

# validation dataset
# the 1st speaker
data/val/audio/1/eee.wav
data/val/audio/1/fff.wav
...
# the 2nd speaker
data/val/audio/2/ggg.wav
data/val/audio/2/hhh.wav
...
```
When 'n_spk' = 1, the previous **single-speaker** directory structure is still supported, i.e.:

```bash
# training dataset
data/train/audio/aaa.wav
data/train/audio/bbb.wav
...
# validation dataset
data/val/audio/ccc.wav
data/val/audio/ddd.wav
...
```
## 4. Training
```bash
# train a combsub model as an example
python train.py -c configs/combsub.yaml
```
Training other models is similar.

You can interrupt training at any time and then run the same command to resume it.

You can also finetune the model: after interrupting training, re-preprocess a new dataset or change the training parameters (batchsize, lr, etc.) and then run the same command.
## 5. Visualization
```bash
# check the training status using tensorboard
tensorboard --logdir=exp
```
After the first validation, the synthesized test audio can be heard in TensorBoard.

NOTE: The test audio in TensorBoard is the raw output of the DDSP-SVC model, not enhanced by the enhancer. If you want to test the synthesis quality with the enhancer (which may be higher), please use the method described in the next chapter.
## 6. Testing
(**Recommended**) Enhance the DDSP output with the pretrained vocoder-based enhancer:
```bash
# with the default enhancer_adaptive_key = 0, audio quality is higher within the normal vocal range
# set enhancer_adaptive_key > 0 to adapt the enhancer to a higher vocal range
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -e true -eak <enhancer_adaptive_key (semitones)>
```
Raw output of DDSP:
```bash
# fast, but relatively low audio quality (like what you hear in tensorboard)
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -e false -id <speaker_id>
```
For other options about the f0 extractor and the response threshold, see:

```bash
python main.py -h
```
UPDATE: Speaker mixing (timbre blending) is now supported. You can use the "-mix" option to design your own timbre; below is an example:
```bash
# mix the timbres of speaker 1 and speaker 2 in a 0.5 : 0.5 ratio
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -mix "{1:0.5, 2:0.5}" -e true -eak 0
```
## 7. HTTP server and VST support
Start the server with the following command:
```bash
# the configuration is inside this python file, see the comments
python flask_api.py
```
Currently supported VST front-end:
https://github.com/zhaohui8969/VST_NetProcess-

## 8. Acknowledgements
* [ddsp](https://github.com/magenta/ddsp)
* [pc-ddsp](https://github.com/yxlllc/pc-ddsp)
* [soft-vc](https://github.com/bshall/soft-vc)
* [DiffSinger (OpenVPI version)](https://github.com/openvpi/DiffSinger)
configs/combsub-old.yaml ADDED
@@ -0,0 +1,42 @@
data:
  f0_extractor: 'parselmouth' # 'parselmouth', 'dio', 'harvest', or 'crepe'
  f0_min: 65 # about C2
  f0_max: 800 # about G5
  sampling_rate: 44100
  block_size: 512 # Equal to hop_length
  duration: 2 # Audio duration during training, must be less than the duration of the shortest audio clip
  encoder: 'hubertsoft'
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 256
  encoder_ckpt: pretrain/hubert/hubert-soft-0d54a1f4.pt
  train_path: data/train # Create a folder named "audio" under this path and put the audio clips in it
  valid_path: data/val # Create a folder named "audio" under this path and put the audio clips in it
model:
  type: 'CombSub'
  n_mag_allpass: 256
  n_mag_harmonic: 512
  n_mag_noise: 256
  n_spk: 1 # max number of different speakers
enhancer:
  type: 'nsf-hifigan'
  ckpt: 'pretrain/nsf_hifigan/model'
loss:
  fft_min: 256
  fft_max: 2048
  n_scale: 4 # rss kernel numbers
device: cuda
env:
  expdir: exp/combsub-test
  gpu_id: 0
train:
  num_workers: 2 # If your cpu and gpu are both very strong, setting this to 0 may be faster!
  batch_size: 24
  cache_all_data: true # Set to false to save RAM or VRAM, but it may be slow
  cache_device: 'cpu' # Set to 'cuda' to cache the data in VRAM, fastest speed for a strong gpu
  cache_fp16: true
  epochs: 100000
  interval_log: 10
  interval_val: 2000
  lr: 0.0005
  weight_decay: 0
configs/combsub.yaml ADDED
@@ -0,0 +1,39 @@
data:
  f0_extractor: 'parselmouth' # 'parselmouth', 'dio', 'harvest', or 'crepe'
  f0_min: 65 # about C2
  f0_max: 800 # about G5
  sampling_rate: 44100
  block_size: 512 # Equal to hop_length
  duration: 2 # Audio duration during training, must be less than the duration of the shortest audio clip
  encoder: 'hubertsoft' # 'hubertsoft', 'hubertbase' or 'contentvec'
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 256
  encoder_ckpt: pretrain/hubert/hubert-soft-0d54a1f4.pt
  train_path: data/train # Create a folder named "audio" under this path and put the audio clips in it
  valid_path: data/val # Create a folder named "audio" under this path and put the audio clips in it
model:
  type: 'CombSubFast'
  n_spk: 1 # max number of different speakers
enhancer:
  type: 'nsf-hifigan'
  ckpt: 'pretrain/nsf_hifigan/model'
loss:
  fft_min: 256
  fft_max: 2048
  n_scale: 4 # rss kernel numbers
device: cuda
env:
  expdir: exp/combsub-test
  gpu_id: 0
train:
  num_workers: 2 # If your cpu and gpu are both very strong, setting this to 0 may be faster!
  batch_size: 24
  cache_all_data: true # Set to false to save RAM or VRAM, but it may be slow
  cache_device: 'cpu' # Set to 'cuda' to cache the data in VRAM, fastest speed for a strong gpu
  cache_fp16: true
  epochs: 100000
  interval_log: 10
  interval_val: 2000
  lr: 0.0005
  weight_decay: 0
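These YAML files are consumed by the training and preprocessing scripts as a nested, attribute-style object (e.g. `args.data.sampling_rate`, `args.train.batch_size`). The repository has its own loader; the snippet below is only a minimal sketch of the idea using PyYAML:

```python
import yaml
from types import SimpleNamespace

def to_namespace(node):
    # recursively turn nested dicts loaded from the YAML file into attribute-style objects
    if isinstance(node, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in node.items()})
    return node

with open('configs/combsub.yaml') as f:
    args = to_namespace(yaml.safe_load(f))

print(args.data.sampling_rate)  # 44100
print(args.train.batch_size)    # 24
```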
configs/sins.yaml ADDED
@@ -0,0 +1,42 @@
data:
  f0_extractor: 'parselmouth' # 'parselmouth', 'dio', 'harvest', or 'crepe'
  f0_min: 65 # about C2
  f0_max: 800 # about G5
  sampling_rate: 44100
  block_size: 512 # Equal to hop_length
  duration: 2 # Audio duration during training, must be less than the duration of the shortest audio clip
  encoder: 'hubertsoft' # 'hubertsoft', 'hubertbase' or 'contentvec'
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 256
  encoder_ckpt: pretrain/hubert/hubert-soft-0d54a1f4.pt
  train_path: data/train # Create a folder named "audio" under this path and put the audio clips in it
  valid_path: data/val # Create a folder named "audio" under this path and put the audio clips in it
model:
  type: 'Sins'
  n_harmonics: 128
  n_mag_allpass: 256
  n_mag_noise: 256
  n_spk: 1 # max number of different speakers
enhancer:
  type: 'nsf-hifigan'
  ckpt: 'pretrain/nsf_hifigan/model'
loss:
  fft_min: 256
  fft_max: 2048
  n_scale: 4 # rss kernel numbers
device: cuda
env:
  expdir: exp/sins-test
  gpu_id: 0
train:
  num_workers: 2 # If your cpu and gpu are both very strong, setting this to 0 may be faster!
  batch_size: 24
  cache_all_data: true # Set to false to save RAM or VRAM, but it may be slow
  cache_device: 'cpu' # Set to 'cuda' to cache the data in VRAM, fastest speed for a strong gpu
  cache_fp16: true
  epochs: 100000
  interval_log: 10
  interval_val: 2000
  lr: 0.0005
  weight_decay: 0
data/train/audio/gitkeep ADDED
File without changes
data/val/audio/gitkeep ADDED
File without changes
data_loaders.py ADDED
@@ -0,0 +1,230 @@
import os
import random
import numpy as np
import librosa
import torch
from tqdm import tqdm
from torch.utils.data import Dataset

def traverse_dir(
        root_dir,
        extension,
        amount=None,
        str_include=None,
        str_exclude=None,
        is_pure=False,
        is_sort=False,
        is_ext=True):

    file_list = []
    cnt = 0
    for root, _, files in os.walk(root_dir):
        for file in files:
            if file.endswith(extension):
                # path
                mix_path = os.path.join(root, file)
                pure_path = mix_path[len(root_dir)+1:] if is_pure else mix_path

                # amount
                if (amount is not None) and (cnt == amount):
                    if is_sort:
                        file_list.sort()
                    return file_list

                # check string
                if (str_include is not None) and (str_include not in pure_path):
                    continue
                if (str_exclude is not None) and (str_exclude in pure_path):
                    continue

                if not is_ext:
                    ext = pure_path.split('.')[-1]
                    pure_path = pure_path[:-(len(ext)+1)]
                file_list.append(pure_path)
                cnt += 1
    if is_sort:
        file_list.sort()
    return file_list


def get_data_loaders(args, whole_audio=False):
    data_train = AudioDataset(
        args.data.train_path,
        waveform_sec=args.data.duration,
        hop_size=args.data.block_size,
        sample_rate=args.data.sampling_rate,
        load_all_data=args.train.cache_all_data,
        whole_audio=whole_audio,
        n_spk=args.model.n_spk,
        device=args.train.cache_device,
        fp16=args.train.cache_fp16)
    loader_train = torch.utils.data.DataLoader(
        data_train,
        batch_size=args.train.batch_size if not whole_audio else 1,
        shuffle=True,
        num_workers=args.train.num_workers if args.train.cache_device=='cpu' else 0,
        persistent_workers=(args.train.num_workers > 0) if args.train.cache_device=='cpu' else False,
        pin_memory=True if args.train.cache_device=='cpu' else False
    )
    data_valid = AudioDataset(
        args.data.valid_path,
        waveform_sec=args.data.duration,
        hop_size=args.data.block_size,
        sample_rate=args.data.sampling_rate,
        load_all_data=args.train.cache_all_data,
        whole_audio=True,
        n_spk=args.model.n_spk)
    loader_valid = torch.utils.data.DataLoader(
        data_valid,
        batch_size=1,
        shuffle=False,
        num_workers=0,
        pin_memory=True
    )
    return loader_train, loader_valid


class AudioDataset(Dataset):
    def __init__(
        self,
        path_root,
        waveform_sec,
        hop_size,
        sample_rate,
        load_all_data=True,
        whole_audio=False,
        n_spk=1,
        device='cpu',
        fp16=False
    ):
        super().__init__()

        self.waveform_sec = waveform_sec
        self.sample_rate = sample_rate
        self.hop_size = hop_size
        self.path_root = path_root
        self.paths = traverse_dir(
            os.path.join(path_root, 'audio'),
            extension='wav',
            is_pure=True,
            is_sort=True,
            is_ext=False
        )
        self.whole_audio = whole_audio
        self.data_buffer = {}
        if load_all_data:
            print('Load all the data from :', path_root)
        else:
            print('Load the f0, volume data from :', path_root)
        for name in tqdm(self.paths, total=len(self.paths)):
            path_audio = os.path.join(self.path_root, 'audio', name) + '.wav'
            duration = librosa.get_duration(filename=path_audio, sr=self.sample_rate)

            path_f0 = os.path.join(self.path_root, 'f0', name) + '.npy'
            f0 = np.load(path_f0)
            f0 = torch.from_numpy(f0).float().unsqueeze(-1).to(device)

            path_volume = os.path.join(self.path_root, 'volume', name) + '.npy'
            volume = np.load(path_volume)
            volume = torch.from_numpy(volume).float().unsqueeze(-1).to(device)

            if n_spk is not None and n_spk > 1:
                spk_id = int(os.path.dirname(name)) if str.isdigit(os.path.dirname(name)) else 0
                if spk_id < 1 or spk_id > n_spk:
                    raise ValueError(' [x] Multi-speaker training error: spk_id must be a positive integer from 1 to n_spk ')
            else:
                spk_id = 1
            spk_id = torch.LongTensor(np.array([spk_id])).to(device)

            if load_all_data:
                audio, sr = librosa.load(path_audio, sr=self.sample_rate)
                if len(audio.shape) > 1:
                    audio = librosa.to_mono(audio)
                audio = torch.from_numpy(audio).to(device)

                path_units = os.path.join(self.path_root, 'units', name) + '.npy'
                units = np.load(path_units)
                units = torch.from_numpy(units).to(device)

                if fp16:
                    audio = audio.half()
                    units = units.half()

                self.data_buffer[name] = {
                    'duration': duration,
                    'audio': audio,
                    'units': units,
                    'f0': f0,
                    'volume': volume,
                    'spk_id': spk_id
                }
            else:
                self.data_buffer[name] = {
                    'duration': duration,
                    'f0': f0,
                    'volume': volume,
                    'spk_id': spk_id
                }

    def __getitem__(self, file_idx):
        name = self.paths[file_idx]
        data_buffer = self.data_buffer[name]
        # check duration. if too short, then skip
        if data_buffer['duration'] < (self.waveform_sec + 0.1):
            return self.__getitem__((file_idx + 1) % len(self.paths))

        # get item
        return self.get_data(name, data_buffer)

    def get_data(self, name, data_buffer):
        frame_resolution = self.hop_size / self.sample_rate
        duration = data_buffer['duration']
        waveform_sec = duration if self.whole_audio else self.waveform_sec

        # load audio
        idx_from = 0 if self.whole_audio else random.uniform(0, duration - waveform_sec - 0.1)
        start_frame = int(idx_from / frame_resolution)
        units_frame_len = int(waveform_sec / frame_resolution)
        audio = data_buffer.get('audio')
        if audio is None:
            path_audio = os.path.join(self.path_root, 'audio', name) + '.wav'
            audio, sr = librosa.load(
                path_audio,
                sr=self.sample_rate,
                offset=start_frame * frame_resolution,
                duration=waveform_sec)
            if len(audio.shape) > 1:
                audio = librosa.to_mono(audio)
            # clip audio into N seconds
            audio = audio[: audio.shape[-1] // self.hop_size * self.hop_size]
            audio = torch.from_numpy(audio).float()
        else:
            audio = audio[start_frame * self.hop_size : (start_frame + units_frame_len) * self.hop_size]

        # load units
        units = data_buffer.get('units')
        if units is None:
            units = os.path.join(self.path_root, 'units', name) + '.npy'
            units = np.load(units)
            units = units[start_frame : start_frame + units_frame_len]
            units = torch.from_numpy(units).float()
        else:
            units = units[start_frame : start_frame + units_frame_len]

        # load f0
        f0 = data_buffer.get('f0')
        f0_frames = f0[start_frame : start_frame + units_frame_len]

        # load volume
        volume = data_buffer.get('volume')
        volume_frames = volume[start_frame : start_frame + units_frame_len]

        # load spk_id
        spk_id = data_buffer.get('spk_id')

        return dict(audio=audio, f0=f0_frames, volume=volume_frames, units=units, spk_id=spk_id, name=name)

    def __len__(self):
        return len(self.paths)
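A minimal sketch of using `AudioDataset` on its own (parameter values taken from `configs/combsub.yaml`); it assumes preprocessing has already created the `audio/`, `f0/`, `volume/` and `units/` folders under `data/train`:

```python
from torch.utils.data import DataLoader
from data_loaders import AudioDataset

dataset = AudioDataset(
    path_root='data/train',
    waveform_sec=2,        # data.duration
    hop_size=512,          # data.block_size
    sample_rate=44100,     # data.sampling_rate
    load_all_data=False,   # cache only f0/volume; load audio/units lazily
    n_spk=1)
loader = DataLoader(dataset, batch_size=24, shuffle=True, num_workers=2)
batch = next(iter(loader))
print(batch['audio'].shape, batch['units'].shape, batch['f0'].shape)
```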
ddsp/__init__.py ADDED
File without changes
ddsp/core.py ADDED
@@ -0,0 +1,242 @@
import torch
import torch.nn as nn
from torch.nn import functional as F

import math
import numpy as np

def get_fft_size(frame_size: int, ir_size: int, power_of_2: bool = True):
    """Calculate final size for efficient FFT.
    Args:
      frame_size: Size of the audio frame.
      ir_size: Size of the convolving impulse response.
      power_of_2: Constrain to be a power of 2. If False, allow other 5-smooth
        numbers. TPU requires power of 2, while GPU is more flexible.
    Returns:
      fft_size: Size for efficient FFT.
    """
    convolved_frame_size = ir_size + frame_size - 1
    if power_of_2:
        # Next power of 2.
        fft_size = int(2**np.ceil(np.log2(convolved_frame_size)))
    else:
        fft_size = convolved_frame_size
    return fft_size


def upsample(signal, factor):
    signal = signal.permute(0, 2, 1)
    signal = nn.functional.interpolate(torch.cat((signal, signal[:, :, -1:]), 2), size=signal.shape[-1] * factor + 1, mode='linear', align_corners=True)
    signal = signal[:, :, :-1]
    return signal.permute(0, 2, 1)


def remove_above_fmax(amplitudes, pitch, fmax, level_start=1):
    n_harm = amplitudes.shape[-1]
    pitches = pitch * torch.arange(level_start, n_harm + level_start).to(pitch)
    aa = (pitches < fmax).float() + 1e-7
    return amplitudes * aa


def crop_and_compensate_delay(audio, audio_size, ir_size,
                              padding='same',
                              delay_compensation=-1):
    """Crop audio output from convolution to compensate for group delay.
    Args:
      audio: Audio after convolution. Tensor of shape [batch, time_steps].
      audio_size: Initial size of the audio before convolution.
      ir_size: Size of the convolving impulse response.
      padding: Either 'valid' or 'same'. For 'same' the final output to be the
        same size as the input audio (audio_timesteps). For 'valid' the audio is
        extended to include the tail of the impulse response (audio_timesteps +
        ir_timesteps - 1).
      delay_compensation: Samples to crop from start of output audio to compensate
        for group delay of the impulse response. If delay_compensation < 0 it
        defaults to automatically calculating a constant group delay of the
        windowed linear phase filter from frequency_impulse_response().
    Returns:
      Tensor of cropped and shifted audio.
    Raises:
      ValueError: If padding is not either 'valid' or 'same'.
    """
    # Crop the output.
    if padding == 'valid':
        crop_size = ir_size + audio_size - 1
    elif padding == 'same':
        crop_size = audio_size
    else:
        raise ValueError('Padding must be \'valid\' or \'same\', instead '
                         'of {}.'.format(padding))

    # Compensate for the group delay of the filter by trimming the front.
    # For an impulse response produced by frequency_impulse_response(),
    # the group delay is constant because the filter is linear phase.
    total_size = int(audio.shape[-1])
    crop = total_size - crop_size
    start = (ir_size // 2 if delay_compensation < 0 else delay_compensation)
    end = crop - start
    return audio[:, start:-end]


def fft_convolve(audio,
                 impulse_response):  # B, n_frames, 2*(n_mags-1)
    """Filter audio with frames of time-varying impulse responses.
    Time-varying filter. Given audio [batch, n_samples], and a series of impulse
    responses [batch, n_frames, n_impulse_response], splits the audio into frames,
    applies filters, and then overlap-and-adds audio back together.
    Applies non-windowed non-overlapping STFT/ISTFT to efficiently compute
    convolution for large impulse response sizes.
    Args:
      audio: Input audio. Tensor of shape [batch, audio_timesteps].
      impulse_response: Finite impulse response to convolve. Can either be a 2-D
        Tensor of shape [batch, ir_size], or a 3-D Tensor of shape [batch,
        ir_frames, ir_size]. A 2-D tensor will apply a single linear
        time-invariant filter to the audio. A 3-D Tensor will apply a linear
        time-varying filter. Automatically chops the audio into equally shaped
        blocks to match ir_frames.
    Returns:
      audio_out: Convolved audio. Tensor of shape
        [batch, audio_timesteps].
    """
    # Add a frame dimension to impulse response if it doesn't have one.
    ir_shape = impulse_response.size()
    if len(ir_shape) == 2:
        impulse_response = impulse_response.unsqueeze(1)
        ir_shape = impulse_response.size()

    # Get shapes of audio and impulse response.
    batch_size_ir, n_ir_frames, ir_size = ir_shape
    batch_size, audio_size = audio.size()  # B, T

    # Validate that batch sizes match.
    if batch_size != batch_size_ir:
        raise ValueError('Batch size of audio ({}) and impulse response ({}) must '
                         'be the same.'.format(batch_size, batch_size_ir))

    # Cut audio into 50% overlapped frames (center padding).
    hop_size = int(audio_size / n_ir_frames)
    frame_size = 2 * hop_size
    audio_frames = F.pad(audio, (hop_size, hop_size)).unfold(1, frame_size, hop_size)

    # Apply Bartlett (triangular) window
    window = torch.bartlett_window(frame_size).to(audio_frames)
    audio_frames = audio_frames * window

    # Pad and FFT the audio and impulse responses.
    fft_size = get_fft_size(frame_size, ir_size, power_of_2=False)
    audio_fft = torch.fft.rfft(audio_frames, fft_size)
    ir_fft = torch.fft.rfft(torch.cat((impulse_response, impulse_response[:, -1:, :]), 1), fft_size)

    # Multiply the FFTs (same as convolution in time).
    audio_ir_fft = torch.multiply(audio_fft, ir_fft)

    # Take the IFFT to resynthesize audio.
    audio_frames_out = torch.fft.irfft(audio_ir_fft, fft_size)

    # Overlap Add
    batch_size, n_audio_frames, frame_size = audio_frames_out.size()  # B, n_frames+1, 2*(hop_size+n_mags-1)-1
    fold = torch.nn.Fold(output_size=(1, (n_audio_frames - 1) * hop_size + frame_size), kernel_size=(1, frame_size), stride=(1, hop_size))
    output_signal = fold(audio_frames_out.transpose(1, 2)).squeeze(1).squeeze(1)

    # Crop and shift the output audio.
    output_signal = crop_and_compensate_delay(output_signal[:, hop_size:], audio_size, ir_size)
    return output_signal


def apply_window_to_impulse_response(impulse_response,  # B, n_frames, 2*(n_mag-1)
                                     window_size: int = 0,
                                     causal: bool = False):
    """Apply a window to an impulse response and put in causal form.
    Args:
      impulse_response: A series of impulse responses frames to window, of shape
        [batch, n_frames, ir_size]. ---------> ir_size means size of filter_bank ??????

      window_size: Size of the window to apply in the time domain. If window_size
        is less than 1, it defaults to the impulse_response size.
      causal: Impulse response input is in causal form (peak in the middle).
    Returns:
      impulse_response: Windowed impulse response in causal form, with last
        dimension cropped to window_size if window_size is greater than 0 and less
        than ir_size.
    """

    # If IR is in causal form, put it in zero-phase form.
    if causal:
        impulse_response = torch.fft.fftshift(impulse_response, dim=-1)

    # Get a window for better time/frequency resolution than rectangular.
    # Window defaults to IR size, cannot be bigger.
    ir_size = int(impulse_response.size(-1))
    if (window_size <= 0) or (window_size > ir_size):
        window_size = ir_size
    window = nn.Parameter(torch.hann_window(window_size), requires_grad=False).to(impulse_response)

    # Zero pad the window and put in in zero-phase form.
    padding = ir_size - window_size
    if padding > 0:
        half_idx = (window_size + 1) // 2
        window = torch.cat([window[half_idx:],
                            torch.zeros([padding]),
                            window[:half_idx]], axis=0)
    else:
        window = window.roll(window.size(-1)//2, -1)

    # Apply the window, to get new IR (both in zero-phase form).
    window = window.unsqueeze(0)
    impulse_response = impulse_response * window

    # Put IR in causal form and trim zero padding.
    if padding > 0:
        first_half_start = (ir_size - (half_idx - 1)) + 1
        second_half_end = half_idx + 1
        impulse_response = torch.cat([impulse_response[..., first_half_start:],
                                      impulse_response[..., :second_half_end]],
                                     dim=-1)
    else:
        impulse_response = impulse_response.roll(impulse_response.size(-1)//2, -1)

    return impulse_response


def apply_dynamic_window_to_impulse_response(impulse_response,   # B, n_frames, 2*(n_mag-1) or 2*n_mag-1
                                             half_width_frames): # B, n_frames, 1
    ir_size = int(impulse_response.size(-1))  # 2*(n_mag-1) or 2*n_mag-1

    window = torch.arange(-(ir_size // 2), (ir_size + 1) // 2).to(impulse_response) / half_width_frames
    window[window > 1] = 0
    window = (1 + torch.cos(np.pi * window)) / 2  # B, n_frames, 2*(n_mag-1) or 2*n_mag-1

    impulse_response = impulse_response.roll(ir_size // 2, -1)
    impulse_response = impulse_response * window

    return impulse_response


def frequency_impulse_response(magnitudes,
                               hann_window = True,
                               half_width_frames = None):

    # Get the IR
    impulse_response = torch.fft.irfft(magnitudes)  # B, n_frames, 2*(n_mags-1)

    # Window and put in causal form.
    if hann_window:
        if half_width_frames is None:
            impulse_response = apply_window_to_impulse_response(impulse_response)
        else:
            impulse_response = apply_dynamic_window_to_impulse_response(impulse_response, half_width_frames)
    else:
        impulse_response = impulse_response.roll(impulse_response.size(-1) // 2, -1)

    return impulse_response


def frequency_filter(audio,
                     magnitudes,
                     hann_window=True,
                     half_width_frames=None):

    impulse_response = frequency_impulse_response(magnitudes, hann_window, half_width_frames)

    return fft_convolve(audio, impulse_response)
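As a small usage sketch of the LTV-FIR utilities above, white noise can be shaped by a frame-wise magnitude envelope (shapes follow the comments in `fft_convolve`):

```python
import torch
from ddsp.core import frequency_filter

batch, n_frames, n_mag, hop = 1, 100, 256, 512
audio = torch.randn(batch, n_frames * hop)        # [B, T] white noise
magnitudes = torch.rand(batch, n_frames, n_mag)   # frame-wise filter magnitudes
filtered = frequency_filter(audio, magnitudes)    # [B, T] filtered audio
print(filtered.shape)
```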
ddsp/loss.py ADDED
@@ -0,0 +1,57 @@
import numpy as np

import torch
import torch.nn as nn
import torchaudio
from torch.nn import functional as F
from .core import upsample

class SSSLoss(nn.Module):
    """
    Single-scale Spectral Loss.
    """

    def __init__(self, n_fft=111, alpha=1.0, overlap=0, eps=1e-7):
        super().__init__()
        self.n_fft = n_fft
        self.alpha = alpha
        self.eps = eps
        self.hop_length = int(n_fft * (1 - overlap))  # 25% of the length
        self.spec = torchaudio.transforms.Spectrogram(n_fft=self.n_fft, hop_length=self.hop_length, power=1, normalized=True, center=False)

    def forward(self, x_true, x_pred):
        S_true = self.spec(x_true) + self.eps
        S_pred = self.spec(x_pred) + self.eps

        converge_term = torch.mean(torch.linalg.norm(S_true - S_pred, dim = (1, 2)) / torch.linalg.norm(S_true + S_pred, dim = (1, 2)))

        log_term = F.l1_loss(S_true.log(), S_pred.log())

        loss = converge_term + self.alpha * log_term
        return loss


class RSSLoss(nn.Module):
    '''
    Random-scale Spectral Loss.
    '''

    def __init__(self, fft_min, fft_max, n_scale, alpha=1.0, overlap=0, eps=1e-7, device='cuda'):
        super().__init__()
        self.fft_min = fft_min
        self.fft_max = fft_max
        self.n_scale = n_scale
        self.lossdict = {}
        for n_fft in range(fft_min, fft_max):
            self.lossdict[n_fft] = SSSLoss(n_fft, alpha, overlap, eps).to(device)

    def forward(self, x_pred, x_true):
        value = 0.
        n_ffts = torch.randint(self.fft_min, self.fft_max, (self.n_scale,))
        for n_fft in n_ffts:
            loss_func = self.lossdict[int(n_fft)]
            value += loss_func(x_true, x_pred)
        return value / self.n_scale
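A minimal sketch of computing the random-scale spectral loss between a predicted and a reference waveform (FFT range as in the `loss` section of the configs; note the constructor builds one spectrogram transform per FFT size in that range):

```python
import torch
from ddsp.loss import RSSLoss

loss_fn = RSSLoss(fft_min=256, fft_max=2048, n_scale=4, device='cpu')
x_true = torch.randn(2, 44100)  # reference audio, [batch, samples]
x_pred = torch.randn(2, 44100)  # model output, [batch, samples]
print(loss_fn(x_pred, x_true).item())
```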
ddsp/pcmer.py ADDED
@@ -0,0 +1,380 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+
3
+ from torch import nn
4
+ import math
5
+ from functools import partial
6
+ from einops import rearrange, repeat
7
+
8
+ from local_attention import LocalAttention
9
+ import torch.nn.functional as F
10
+ #import fast_transformers.causal_product.causal_product_cuda
11
+
12
+ def softmax_kernel(data, *, projection_matrix, is_query, normalize_data=True, eps=1e-4, device = None):
13
+ b, h, *_ = data.shape
14
+ # (batch size, head, length, model_dim)
15
+
16
+ # normalize model dim
17
+ data_normalizer = (data.shape[-1] ** -0.25) if normalize_data else 1.
18
+
19
+ # what is ration?, projection_matrix.shape[0] --> 266
20
+
21
+ ratio = (projection_matrix.shape[0] ** -0.5)
22
+
23
+ projection = repeat(projection_matrix, 'j d -> b h j d', b = b, h = h)
24
+ projection = projection.type_as(data)
25
+
26
+ #data_dash = w^T x
27
+ data_dash = torch.einsum('...id,...jd->...ij', (data_normalizer * data), projection)
28
+
29
+
30
+ # diag_data = D**2
31
+ diag_data = data ** 2
32
+ diag_data = torch.sum(diag_data, dim=-1)
33
+ diag_data = (diag_data / 2.0) * (data_normalizer ** 2)
34
+ diag_data = diag_data.unsqueeze(dim=-1)
35
+
36
+ #print ()
37
+ if is_query:
38
+ data_dash = ratio * (
39
+ torch.exp(data_dash - diag_data -
40
+ torch.max(data_dash, dim=-1, keepdim=True).values) + eps)
41
+ else:
42
+ data_dash = ratio * (
43
+ torch.exp(data_dash - diag_data + eps))#- torch.max(data_dash)) + eps)
44
+
45
+ return data_dash.type_as(data)
46
+
47
+ def orthogonal_matrix_chunk(cols, qr_uniform_q = False, device = None):
48
+ unstructured_block = torch.randn((cols, cols), device = device)
49
+ q, r = torch.linalg.qr(unstructured_block.cpu(), mode='reduced')
50
+ q, r = map(lambda t: t.to(device), (q, r))
51
+
52
+ # proposed by @Parskatt
53
+ # to make sure Q is uniform https://arxiv.org/pdf/math-ph/0609050.pdf
54
+ if qr_uniform_q:
55
+ d = torch.diag(r, 0)
56
+ q *= d.sign()
57
+ return q.t()
58
+ def exists(val):
59
+ return val is not None
60
+
61
+ def empty(tensor):
62
+ return tensor.numel() == 0
63
+
64
+ def default(val, d):
65
+ return val if exists(val) else d
66
+
67
+ def cast_tuple(val):
68
+ return (val,) if not isinstance(val, tuple) else val
69
+
70
+ class PCmer(nn.Module):
71
+ """The encoder that is used in the Transformer model."""
72
+
73
+ def __init__(self,
74
+ num_layers,
75
+ num_heads,
76
+ dim_model,
77
+ dim_keys,
78
+ dim_values,
79
+ residual_dropout,
80
+ attention_dropout):
81
+ super().__init__()
82
+ self.num_layers = num_layers
83
+ self.num_heads = num_heads
84
+ self.dim_model = dim_model
85
+ self.dim_values = dim_values
86
+ self.dim_keys = dim_keys
87
+ self.residual_dropout = residual_dropout
88
+ self.attention_dropout = attention_dropout
89
+
90
+ self._layers = nn.ModuleList([_EncoderLayer(self) for _ in range(num_layers)])
91
+
92
+ # METHODS ########################################################################################################
93
+
94
+ def forward(self, phone, mask=None):
95
+
96
+ # apply all layers to the input
97
+ for (i, layer) in enumerate(self._layers):
98
+ phone = layer(phone, mask)
99
+ # provide the final sequence
100
+ return phone
101
+
102
+
103
+ # ==================================================================================================================== #
104
+ # CLASS _ E N C O D E R L A Y E R #
105
+ # ==================================================================================================================== #
106
+
107
+
108
+ class _EncoderLayer(nn.Module):
109
+ """One layer of the encoder.
110
+
111
+ Attributes:
112
+ attn: (:class:`mha.MultiHeadAttention`): The attention mechanism that is used to read the input sequence.
113
+ feed_forward (:class:`ffl.FeedForwardLayer`): The feed-forward layer on top of the attention mechanism.
114
+ """
115
+
116
+ def __init__(self, parent: PCmer):
117
+ """Creates a new instance of ``_EncoderLayer``.
118
+
119
+ Args:
120
+ parent (Encoder): The encoder that the layers is created for.
121
+ """
122
+ super().__init__()
123
+
124
+
125
+ self.conformer = ConformerConvModule(parent.dim_model)
126
+ self.norm = nn.LayerNorm(parent.dim_model)
127
+ self.dropout = nn.Dropout(parent.residual_dropout)
128
+
129
+ # selfatt -> fastatt: performer!
130
+ self.attn = SelfAttention(dim = parent.dim_model,
131
+ heads = parent.num_heads,
132
+ causal = False)
133
+
134
+ # METHODS ########################################################################################################
135
+
136
+ def forward(self, phone, mask=None):
137
+
138
+ # compute attention sub-layer
139
+ phone = phone + (self.attn(self.norm(phone), mask=mask))
140
+
141
+ phone = phone + (self.conformer(phone))
142
+
143
+ return phone
144
+
145
+ def calc_same_padding(kernel_size):
146
+ pad = kernel_size // 2
147
+ return (pad, pad - (kernel_size + 1) % 2)
148
+
149
+ # helper classes
150
+
151
+ class Swish(nn.Module):
152
+ def forward(self, x):
153
+ return x * x.sigmoid()
154
+
155
+ class Transpose(nn.Module):
156
+ def __init__(self, dims):
157
+ super().__init__()
158
+ assert len(dims) == 2, 'dims must be a tuple of two dimensions'
159
+ self.dims = dims
160
+
161
+ def forward(self, x):
162
+ return x.transpose(*self.dims)
163
+
164
+ class GLU(nn.Module):
165
+ def __init__(self, dim):
166
+ super().__init__()
167
+ self.dim = dim
168
+
169
+ def forward(self, x):
170
+ out, gate = x.chunk(2, dim=self.dim)
171
+ return out * gate.sigmoid()
172
+
173
+ class DepthWiseConv1d(nn.Module):
174
+ def __init__(self, chan_in, chan_out, kernel_size, padding):
175
+ super().__init__()
176
+ self.padding = padding
177
+ self.conv = nn.Conv1d(chan_in, chan_out, kernel_size, groups = chan_in)
178
+
179
+ def forward(self, x):
180
+ x = F.pad(x, self.padding)
181
+ return self.conv(x)
182
+
183
+ class ConformerConvModule(nn.Module):
184
+ def __init__(
185
+ self,
186
+ dim,
187
+ causal = False,
188
+ expansion_factor = 2,
189
+ kernel_size = 31,
190
+ dropout = 0.):
191
+ super().__init__()
192
+
193
+ inner_dim = dim * expansion_factor
194
+ padding = calc_same_padding(kernel_size) if not causal else (kernel_size - 1, 0)
195
+
196
+ self.net = nn.Sequential(
197
+ nn.LayerNorm(dim),
198
+ Transpose((1, 2)),
199
+ nn.Conv1d(dim, inner_dim * 2, 1),
200
+ GLU(dim=1),
201
+ DepthWiseConv1d(inner_dim, inner_dim, kernel_size = kernel_size, padding = padding),
202
+ #nn.BatchNorm1d(inner_dim) if not causal else nn.Identity(),
203
+ Swish(),
204
+ nn.Conv1d(inner_dim, dim, 1),
205
+ Transpose((1, 2)),
206
+ nn.Dropout(dropout)
207
+ )
208
+
209
+ def forward(self, x):
210
+ return self.net(x)
211
+
212
+ def linear_attention(q, k, v):
213
+ if v is None:
214
+ #print (k.size(), q.size())
215
+ out = torch.einsum('...ed,...nd->...ne', k, q)
216
+ return out
217
+
218
+ else:
219
+ k_cumsum = k.sum(dim = -2)
220
+ #k_cumsum = k.sum(dim = -2)
221
+ D_inv = 1. / (torch.einsum('...nd,...d->...n', q, k_cumsum.type_as(q)) + 1e-8)
222
+
223
+ context = torch.einsum('...nd,...ne->...de', k, v)
224
+ #print ("TRUEEE: ", context.size(), q.size(), D_inv.size())
225
+ out = torch.einsum('...de,...nd,...n->...ne', context, q, D_inv)
226
+ return out
227
+
228
+ def gaussian_orthogonal_random_matrix(nb_rows, nb_columns, scaling = 0, qr_uniform_q = False, device = None):
229
+ nb_full_blocks = int(nb_rows / nb_columns)
230
+ #print (nb_full_blocks)
231
+ block_list = []
232
+
233
+ for _ in range(nb_full_blocks):
234
+ q = orthogonal_matrix_chunk(nb_columns, qr_uniform_q = qr_uniform_q, device = device)
235
+ block_list.append(q)
236
+ # block_list[n] is a orthogonal matrix ... (model_dim * model_dim)
237
+ #print (block_list[0].size(), torch.einsum('...nd,...nd->...n', block_list[0], torch.roll(block_list[0],1,1)))
238
+ #print (nb_rows, nb_full_blocks, nb_columns)
239
+ remaining_rows = nb_rows - nb_full_blocks * nb_columns
240
+ #print (remaining_rows)
241
+ if remaining_rows > 0:
242
+ q = orthogonal_matrix_chunk(nb_columns, qr_uniform_q = qr_uniform_q, device = device)
243
+ #print (q[:remaining_rows].size())
244
+ block_list.append(q[:remaining_rows])
245
+
246
+ final_matrix = torch.cat(block_list)
247
+
248
+ if scaling == 0:
249
+ multiplier = torch.randn((nb_rows, nb_columns), device = device).norm(dim = 1)
250
+ elif scaling == 1:
251
+ multiplier = math.sqrt((float(nb_columns))) * torch.ones((nb_rows,), device = device)
252
+ else:
253
+ raise ValueError(f'Invalid scaling {scaling}')
254
+
255
+ return torch.diag(multiplier) @ final_matrix
256
+
257
+ class FastAttention(nn.Module):
258
+ def __init__(self, dim_heads, nb_features = None, ortho_scaling = 0, causal = False, generalized_attention = False, kernel_fn = nn.ReLU(), qr_uniform_q = False, no_projection = False):
259
+ super().__init__()
260
+ nb_features = default(nb_features, int(dim_heads * math.log(dim_heads)))
261
+
262
+ self.dim_heads = dim_heads
263
+ self.nb_features = nb_features
264
+ self.ortho_scaling = ortho_scaling
265
+
266
+ self.create_projection = partial(gaussian_orthogonal_random_matrix, nb_rows = self.nb_features, nb_columns = dim_heads, scaling = ortho_scaling, qr_uniform_q = qr_uniform_q)
267
+ projection_matrix = self.create_projection()
268
+ self.register_buffer('projection_matrix', projection_matrix)
269
+
270
+ self.generalized_attention = generalized_attention
271
+ self.kernel_fn = kernel_fn
272
+
273
+ # if this is turned on, no projection will be used
274
+ # queries and keys will be softmax-ed as in the original efficient attention paper
275
+ self.no_projection = no_projection
276
+
277
+ self.causal = causal
278
+ if causal:
279
+ try:
280
+ import fast_transformers.causal_product.causal_product_cuda
281
+ self.causal_linear_fn = partial(causal_linear_attention)
282
+ except ImportError:
283
+ print('unable to import cuda code for auto-regressive Performer. will default to the memory inefficient non-cuda version')
284
+ self.causal_linear_fn = causal_linear_attention_noncuda
285
+ @torch.no_grad()
286
+ def redraw_projection_matrix(self):
287
+ projections = self.create_projection()
288
+ self.projection_matrix.copy_(projections)
289
+ del projections
290
+
291
+ def forward(self, q, k, v):
292
+ device = q.device
293
+
294
+ if self.no_projection:
295
+ q = q.softmax(dim = -1)
296
+ k = torch.exp(k) if self.causal else k.softmax(dim = -2)
297
+
298
+ elif self.generalized_attention:
299
+ create_kernel = partial(generalized_kernel, kernel_fn = self.kernel_fn, projection_matrix = self.projection_matrix, device = device)
300
+ q, k = map(create_kernel, (q, k))
301
+
302
+ else:
303
+ create_kernel = partial(softmax_kernel, projection_matrix = self.projection_matrix, device = device)
304
+
305
+ q = create_kernel(q, is_query = True)
306
+ k = create_kernel(k, is_query = False)
307
+
308
+ attn_fn = linear_attention if not self.causal else self.causal_linear_fn
309
+ if v is None:
310
+ out = attn_fn(q, k, None)
311
+ return out
312
+ else:
313
+ out = attn_fn(q, k, v)
314
+ return out
315
+ class SelfAttention(nn.Module):
316
+ def __init__(self, dim, causal = False, heads = 8, dim_head = 64, local_heads = 0, local_window_size = 256, nb_features = None, feature_redraw_interval = 1000, generalized_attention = False, kernel_fn = nn.ReLU(), qr_uniform_q = False, dropout = 0., no_projection = False):
317
+ super().__init__()
318
+ assert dim % heads == 0, 'dimension must be divisible by number of heads'
319
+ dim_head = default(dim_head, dim // heads)
320
+ inner_dim = dim_head * heads
321
+ self.fast_attention = FastAttention(dim_head, nb_features, causal = causal, generalized_attention = generalized_attention, kernel_fn = kernel_fn, qr_uniform_q = qr_uniform_q, no_projection = no_projection)
322
+
323
+ self.heads = heads
324
+ self.global_heads = heads - local_heads
325
+ self.local_attn = LocalAttention(window_size = local_window_size, causal = causal, autopad = True, dropout = dropout, look_forward = int(not causal), rel_pos_emb_config = (dim_head, local_heads)) if local_heads > 0 else None
326
+
327
+ #print (heads, nb_features, dim_head)
328
+ #name_embedding = torch.zeros(110, heads, dim_head, dim_head)
329
+ #self.name_embedding = nn.Parameter(name_embedding, requires_grad=True)
330
+
331
+
332
+ self.to_q = nn.Linear(dim, inner_dim)
333
+ self.to_k = nn.Linear(dim, inner_dim)
334
+ self.to_v = nn.Linear(dim, inner_dim)
335
+ self.to_out = nn.Linear(inner_dim, dim)
336
+ self.dropout = nn.Dropout(dropout)
337
+
338
+ @torch.no_grad()
339
+ def redraw_projection_matrix(self):
340
+ self.fast_attention.redraw_projection_matrix()
341
+ #torch.nn.init.zeros_(self.name_embedding)
342
+ #print (torch.sum(self.name_embedding))
343
+ def forward(self, x, context = None, mask = None, context_mask = None, name=None, inference=False, **kwargs):
344
+ b, n, _, h, gh = *x.shape, self.heads, self.global_heads
345
+
346
+ cross_attend = exists(context)
347
+
348
+ context = default(context, x)
349
+ context_mask = default(context_mask, mask) if not cross_attend else context_mask
350
+ #print (torch.sum(self.name_embedding))
351
+ q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
352
+
353
+ q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), (q, k, v))
354
+ (q, lq), (k, lk), (v, lv) = map(lambda t: (t[:, :gh], t[:, gh:]), (q, k, v))
355
+
356
+ attn_outs = []
357
+ #print (name)
358
+ #print (self.name_embedding[name].size())
359
+ if not empty(q):
360
+ if exists(context_mask):
361
+ global_mask = context_mask[:, None, :, None]
362
+ v.masked_fill_(~global_mask, 0.)
363
+ if cross_attend:
364
+ pass
365
+ #print (torch.sum(self.name_embedding))
366
+ #out = self.fast_attention(q,self.name_embedding[name],None)
367
+ #print (torch.sum(self.name_embedding[...,-1:]))
368
+ else:
369
+ out = self.fast_attention(q, k, v)
370
+ attn_outs.append(out)
371
+
372
+ if not empty(lq):
373
+ assert not cross_attend, 'local attention is not compatible with cross attention'
374
+ out = self.local_attn(lq, lk, lv, input_mask = mask)
375
+ attn_outs.append(out)
376
+
377
+ out = torch.cat(attn_outs, dim = 1)
378
+ out = rearrange(out, 'b h n d -> b n (h d)')
379
+ out = self.to_out(out)
380
+ return self.dropout(out)
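The `SelfAttention` block above mixes Performer-style global heads (`FastAttention`) with optional windowed local heads. A minimal shape-check sketch, assuming the module is importable as `ddsp.pcmer` and that PyTorch plus the attention dependencies are installed (batch size, frame count and dimensions below are made up for illustration):

```python
import torch
from ddsp.pcmer import SelfAttention

attn = SelfAttention(dim=256, heads=8)   # global (Performer) heads only
x = torch.randn(2, 100, 256)             # B x n_frames x dim
with torch.no_grad():
    y = attn(x)                          # self-attention: context defaults to x
print(y.shape)                           # torch.Size([2, 100, 256])
```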
ddsp/unit2control.py ADDED
@@ -0,0 +1,86 @@
1
+ import gin
2
+
3
+ import numpy as np
4
+ import torch
5
+ import torch.nn as nn
6
+ from torch.nn.utils import weight_norm
7
+
8
+ from .pcmer import PCmer
9
+
10
+
11
+ def split_to_dict(tensor, tensor_splits):
12
+ """Split a tensor into a dictionary of multiple tensors."""
13
+ labels = []
14
+ sizes = []
15
+
16
+ for k, v in tensor_splits.items():
17
+ labels.append(k)
18
+ sizes.append(v)
19
+
20
+ tensors = torch.split(tensor, sizes, dim=-1)
21
+ return dict(zip(labels, tensors))
22
+
23
+
24
+ class Unit2Control(nn.Module):
25
+ def __init__(
26
+ self,
27
+ input_channel,
28
+ n_spk,
29
+ output_splits):
30
+ super().__init__()
31
+ self.output_splits = output_splits
32
+ self.f0_embed = nn.Linear(1, 256)
33
+ self.phase_embed = nn.Linear(1, 256)
34
+ self.volume_embed = nn.Linear(1, 256)
35
+ self.n_spk = n_spk
36
+ if n_spk is not None and n_spk > 1:
37
+ self.spk_embed = nn.Embedding(n_spk, 256)
38
+
39
+ # conv in stack
40
+ self.stack = nn.Sequential(
41
+ nn.Conv1d(input_channel, 256, 3, 1, 1),
42
+ nn.GroupNorm(4, 256),
43
+ nn.LeakyReLU(),
44
+ nn.Conv1d(256, 256, 3, 1, 1))
45
+
46
+ # transformer
47
+ self.decoder = PCmer(
48
+ num_layers=3,
49
+ num_heads=8,
50
+ dim_model=256,
51
+ dim_keys=256,
52
+ dim_values=256,
53
+ residual_dropout=0.1,
54
+ attention_dropout=0.1)
55
+ self.norm = nn.LayerNorm(256)
56
+
57
+ # out
58
+ self.n_out = sum([v for k, v in output_splits.items()])
59
+ self.dense_out = weight_norm(
60
+ nn.Linear(256, self.n_out))
61
+
62
+ def forward(self, units, f0, phase, volume, spk_id = None, spk_mix_dict = None):
63
+
64
+ '''
65
+ input:
66
+ B x n_frames x n_unit
67
+ return:
68
+ dict of B x n_frames x feat
69
+ '''
70
+
71
+ x = self.stack(units.transpose(1,2)).transpose(1,2)
72
+ x = x + self.f0_embed((1+ f0 / 700).log()) + self.phase_embed(phase / np.pi) + self.volume_embed(volume)
73
+ if self.n_spk is not None and self.n_spk > 1:
74
+ if spk_mix_dict is not None:
75
+ for k, v in spk_mix_dict.items():
76
+ spk_id_torch = torch.LongTensor(np.array([[k]])).to(units.device)
77
+ x = x + v * self.spk_embed(spk_id_torch - 1)
78
+ else:
79
+ x = x + self.spk_embed(spk_id - 1)
80
+ x = self.decoder(x)
81
+ x = self.norm(x)
82
+ e = self.dense_out(x)
83
+ controls = split_to_dict(e, self.output_splits)
84
+
85
+ return controls
86
+
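A minimal usage sketch for `Unit2Control` (the split sizes and frame counts here are illustrative; in this repository the `output_splits` dictionary is supplied by the synthesizer classes in `ddsp/vocoder.py`):

```python
import torch
from ddsp.unit2control import Unit2Control

u2c = Unit2Control(input_channel=256, n_spk=1,
                   output_splits={'harmonic_magnitude': 513, 'noise_magnitude': 513})
units  = torch.randn(1, 100, 256)         # B x n_frames x n_unit
f0     = torch.full((1, 100, 1), 440.0)   # Hz per frame
phase  = torch.zeros(1, 100, 1)           # radians per frame
volume = torch.rand(1, 100, 1)            # volume envelope per frame
ctrls = u2c(units, f0, phase, volume)     # dict of B x n_frames x feat tensors
print({k: v.shape for k, v in ctrls.items()})
```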
ddsp/vocoder.py ADDED
@@ -0,0 +1,515 @@
1
+ import os
2
+ import numpy as np
3
+ import yaml
4
+ import torch
5
+ import torch.nn.functional as F
6
+ import pyworld as pw
7
+ import parselmouth
8
+ import torchcrepe
9
+ import resampy
10
+ from fairseq import checkpoint_utils
11
+ from encoder.hubert.model import HubertSoft
12
+ from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present
13
+ from torchaudio.transforms import Resample
14
+ from .unit2control import Unit2Control
15
+ from .core import frequency_filter, upsample, remove_above_fmax
16
+
17
+ class F0_Extractor:
18
+ def __init__(self, f0_extractor, sample_rate = 44100, hop_size = 512, f0_min = 65, f0_max = 800):
19
+ self.f0_extractor = f0_extractor
20
+ self.sample_rate = sample_rate
21
+ self.hop_size = hop_size
22
+ self.f0_min = f0_min
23
+ self.f0_max = f0_max
24
+
25
+ def extract(self, audio, uv_interp = False, device = None, silence_front = 0): # audio: 1d numpy array
26
+ # extractor start time
27
+ n_frames = int(len(audio) // self.hop_size) + 1
28
+
29
+ start_frame = int(silence_front * self.sample_rate / self.hop_size)
30
+ real_silence_front = start_frame * self.hop_size / self.sample_rate
31
+ audio = audio[int(np.round(real_silence_front * self.sample_rate)) : ]
32
+
33
+ # extract f0 using parselmouth
34
+ if self.f0_extractor == 'parselmouth':
35
+ f0 = parselmouth.Sound(audio, self.sample_rate).to_pitch_ac(
36
+ time_step = self.hop_size / self.sample_rate,
37
+ voicing_threshold = 0.6,
38
+ pitch_floor = self.f0_min,
39
+ pitch_ceiling = self.f0_max).selected_array['frequency']
40
+ pad_size = start_frame + (int(len(audio) // self.hop_size) - len(f0) + 1) // 2
41
+ f0 = np.pad(f0,(pad_size, n_frames - len(f0) - pad_size))
42
+
43
+ # extract f0 using dio
44
+ elif self.f0_extractor == 'dio':
45
+ _f0, t = pw.dio(
46
+ audio.astype('double'),
47
+ self.sample_rate,
48
+ f0_floor = self.f0_min,
49
+ f0_ceil = self.f0_max,
50
+ channels_in_octave=2,
51
+ frame_period = (1000 * self.hop_size / self.sample_rate))
52
+ f0 = pw.stonemask(audio.astype('double'), _f0, t, self.sample_rate)
53
+ f0 = np.pad(f0.astype('float'), (start_frame, n_frames - len(f0) - start_frame))
54
+
55
+ # extract f0 using harvest
56
+ elif self.f0_extractor == 'harvest':
57
+ f0, _ = pw.harvest(
58
+ audio.astype('double'),
59
+ self.sample_rate,
60
+ f0_floor = self.f0_min,
61
+ f0_ceil = self.f0_max,
62
+ frame_period = (1000 * self.hop_size / self.sample_rate))
63
+ f0 = np.pad(f0.astype('float'), (start_frame, n_frames - len(f0) - start_frame))
64
+
65
+ # extract f0 using crepe
66
+ elif self.f0_extractor == 'crepe':
67
+ if device is None:
68
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
69
+ wav16k = resampy.resample(audio, self.sample_rate, 16000)
70
+ wav16k_torch = torch.FloatTensor(wav16k).unsqueeze(0).to(device)
71
+
72
+ f0, pd = torchcrepe.predict(wav16k_torch, 16000, 80, self.f0_min, self.f0_max, pad=True, model='full', batch_size=512, device=device, return_periodicity=True)
73
+
74
+ pd = torchcrepe.filter.median(pd, 4)
75
+ pd = torchcrepe.threshold.Silence(-60.)(pd, wav16k_torch, 16000, 80)
76
+ f0 = torchcrepe.threshold.At(0.05)(f0, pd)
77
+ f0 = torchcrepe.filter.mean(f0, 4)
78
+ f0 = torch.where(torch.isnan(f0), torch.full_like(f0, 0), f0)
79
+
80
+ f0 = f0.squeeze(0).cpu().numpy()
81
+ f0 = np.array([f0[int(min(int(np.round(n * self.hop_size / self.sample_rate / 0.005)), len(f0) - 1))] for n in range(n_frames - start_frame)])
82
+ f0 = np.pad(f0, (start_frame, 0))
83
+
84
+ else:
85
+ raise ValueError(f" [x] Unknown f0 extractor: {f0_extractor}")
86
+
87
+ # interpolate the unvoiced f0
88
+ if uv_interp:
89
+ uv = f0 == 0
90
+ if len(f0[~uv]) > 0:
91
+ f0[uv] = np.interp(np.where(uv)[0], np.where(~uv)[0], f0[~uv])
92
+ f0[f0 < self.f0_min] = self.f0_min
93
+ return f0
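A short usage sketch for `F0_Extractor` (the wav path is a placeholder; the 'dio'/'harvest' backends need pyworld, 'crepe' needs torchcrepe, 'parselmouth' needs praat-parselmouth):

```python
import librosa
from ddsp.vocoder import F0_Extractor

audio, sr = librosa.load('input.wav', sr=44100, mono=True)
f0_extractor = F0_Extractor('dio', sample_rate=sr, hop_size=512)
f0 = f0_extractor.extract(audio, uv_interp=True)   # one f0 value (Hz) per frame
print(f0.shape)                                    # (len(audio) // 512 + 1,)
```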
94
+
95
+
96
+ class Volume_Extractor:
97
+ def __init__(self, hop_size = 512):
98
+ self.hop_size = hop_size
99
+
100
+ def extract(self, audio): # audio: 1d numpy array
101
+ n_frames = int(len(audio) // self.hop_size) + 1
102
+ audio2 = audio ** 2
103
+ audio2 = np.pad(audio2, (int(self.hop_size // 2), int((self.hop_size + 1) // 2)), mode = 'reflect')
104
+ volume = np.array([np.mean(audio2[int(n * self.hop_size) : int((n + 1) * self.hop_size)]) for n in range(n_frames)])
105
+ volume = np.sqrt(volume)
106
+ return volume
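`Volume_Extractor` is simply a frame-wise RMS envelope over reflect-padded audio, one value per hop. A tiny sketch:

```python
import numpy as np
from ddsp.vocoder import Volume_Extractor

audio = np.random.randn(44100).astype('float32')   # placeholder signal
volume = Volume_Extractor(hop_size=512).extract(audio)
print(volume.shape)                                # (44100 // 512 + 1,)
```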
107
+
108
+
109
+ class Units_Encoder:
110
+ def __init__(self, encoder, encoder_ckpt, encoder_sample_rate = 16000, encoder_hop_size = 320, device = None):
111
+ if device is None:
112
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
113
+ self.device = device
114
+
115
+ is_loaded_encoder = False
116
+ if encoder == 'hubertsoft':
117
+ self.model = Audio2HubertSoft(encoder_ckpt).to(device)
118
+ is_loaded_encoder = True
119
+ if encoder == 'hubertbase':
120
+ self.model = Audio2HubertBase(encoder_ckpt, device=device)
121
+ is_loaded_encoder = True
122
+ if encoder == 'contentvec':
123
+ self.model = Audio2ContentVec(encoder_ckpt, device=device)
124
+ is_loaded_encoder = True
125
+ if not is_loaded_encoder:
126
+ raise ValueError(f" [x] Unknown units encoder: {encoder}")
127
+
128
+ self.resample_kernel = {}
129
+ self.encoder_sample_rate = encoder_sample_rate
130
+ self.encoder_hop_size = encoder_hop_size
131
+
132
+ def encode(self,
133
+ audio, # B, T
134
+ sample_rate,
135
+ hop_size):
136
+
137
+ # resample
138
+ if sample_rate == self.encoder_sample_rate:
139
+ audio_res = audio
140
+ else:
141
+ key_str = str(sample_rate)
142
+ if key_str not in self.resample_kernel:
143
+ self.resample_kernel[key_str] = Resample(sample_rate, self.encoder_sample_rate, lowpass_filter_width = 128).to(self.device)
144
+ audio_res = self.resample_kernel[key_str](audio)
145
+
146
+ # encode
147
+ if audio_res.size(-1) < self.encoder_hop_size:
148
+ audio_res = torch.nn.functional.pad(audio_res, (0, self.encoder_hop_size - audio_res.size(-1)))
149
+ units = self.model(audio_res)
150
+
151
+ # alignment
152
+ n_frames = audio.size(-1) // hop_size + 1
153
+ ratio = (hop_size / sample_rate) / (self.encoder_hop_size / self.encoder_sample_rate)
154
+ index = torch.clamp(torch.round(ratio * torch.arange(n_frames).to(self.device)).long(), max = units.size(1) - 1)
155
+ units_aligned = torch.gather(units, 1, index.unsqueeze(0).unsqueeze(-1).repeat([1, 1, units.size(-1)]))
156
+ return units_aligned
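A sketch of turning audio into frame-aligned units with `Units_Encoder`, assuming a HuBERT-Soft checkpoint is available locally (the path below is a placeholder; the file name matches the checkpoint published by the bshall/hubert release referenced elsewhere in this upload):

```python
import torch
from ddsp.vocoder import Units_Encoder

encoder = Units_Encoder('hubertsoft', 'pretrain/hubert/hubert-soft-0d54a1f4.pt',
                        encoder_sample_rate=16000, encoder_hop_size=320, device='cpu')
audio = torch.randn(1, 44100)                       # B x T at the source sample rate
units = encoder.encode(audio, sample_rate=44100, hop_size=512)
print(units.shape)                                  # 1 x (44100 // 512 + 1) x 256
```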
157
+
158
+ class Audio2HubertSoft(torch.nn.Module):
159
+ def __init__(self, path, h_sample_rate = 16000, h_hop_size = 320):
160
+ super().__init__()
161
+ print(' [Encoder Model] HuBERT Soft')
162
+ self.hubert = HubertSoft()
163
+ print(' [Loading] ' + path)
164
+ checkpoint = torch.load(path)
165
+ consume_prefix_in_state_dict_if_present(checkpoint, "module.")
166
+ self.hubert.load_state_dict(checkpoint)
167
+ self.hubert.eval()
168
+
169
+ def forward(self,
170
+ audio): # B, T
171
+ with torch.inference_mode():
172
+ units = self.hubert.units(audio.unsqueeze(1))
173
+ return units
174
+
175
+
176
+ class Audio2ContentVec():
177
+ def __init__(self, path, h_sample_rate=16000, h_hop_size=320, device='cpu'):
178
+ self.device = device
179
+ print(' [Encoder Model] Content Vec')
180
+ print(' [Loading] ' + path)
181
+ self.models, self.saved_cfg, self.task = checkpoint_utils.load_model_ensemble_and_task([path], suffix="", )
182
+ self.hubert = self.models[0]
183
+ self.hubert = self.hubert.to(self.device)
184
+ self.hubert.eval()
185
+
186
+ def __call__(self,
187
+ audio): # B, T
188
+ # wav_tensor = torch.from_numpy(audio).to(self.device)
189
+ wav_tensor = audio
190
+ feats = wav_tensor.view(1, -1)
191
+ padding_mask = torch.BoolTensor(feats.shape).fill_(False)
192
+ inputs = {
193
+ "source": feats.to(wav_tensor.device),
194
+ "padding_mask": padding_mask.to(wav_tensor.device),
195
+ "output_layer": 9, # layer 9
196
+ }
197
+ with torch.no_grad():
198
+ logits = self.hubert.extract_features(**inputs)
199
+ feats = self.hubert.final_proj(logits[0])
200
+ units = feats # .transpose(2, 1)
201
+ return units
202
+
203
+
204
+ class Audio2HubertBase():
205
+ def __init__(self, path, h_sample_rate=16000, h_hop_size=320, device='cpu'):
206
+ self.device = device
207
+ print(' [Encoder Model] HuBERT Base')
208
+ print(' [Loading] ' + path)
209
+ self.models, self.saved_cfg, self.task = checkpoint_utils.load_model_ensemble_and_task([path], suffix="", )
210
+ self.hubert = self.models[0]
211
+ self.hubert = self.hubert.to(self.device)
212
+ self.hubert = self.hubert.float()
213
+ self.hubert.eval()
214
+
215
+ def __call__(self,
216
+ audio): # B, T
217
+ with torch.no_grad():
218
+ padding_mask = torch.BoolTensor(audio.shape).fill_(False)
219
+ inputs = {
220
+ "source": audio.to(self.device),
221
+ "padding_mask": padding_mask.to(self.device),
222
+ "output_layer": 9, # layer 9
223
+ }
224
+ logits = self.hubert.extract_features(**inputs)
225
+ units = self.hubert.final_proj(logits[0])
226
+ return units
227
+
228
+
229
+ class DotDict(dict):
230
+ def __getattr__(*args):
231
+ val = dict.get(*args)
232
+ return DotDict(val) if type(val) is dict else val
233
+
234
+ __setattr__ = dict.__setitem__
235
+ __delattr__ = dict.__delitem__
236
+
237
+ def load_model(
238
+ model_path,
239
+ device='cpu'):
240
+ config_file = os.path.join(os.path.split(model_path)[0], 'config.yaml')
241
+ with open(config_file, "r") as config:
242
+ args = yaml.safe_load(config)
243
+ args = DotDict(args)
244
+
245
+ # load model
246
+ model = None
247
+
248
+ if args.model.type == 'Sins':
249
+ model = Sins(
250
+ sampling_rate=args.data.sampling_rate,
251
+ block_size=args.data.block_size,
252
+ n_harmonics=args.model.n_harmonics,
253
+ n_mag_allpass=args.model.n_mag_allpass,
254
+ n_mag_noise=args.model.n_mag_noise,
255
+ n_unit=args.data.encoder_out_channels,
256
+ n_spk=args.model.n_spk)
257
+
258
+ elif args.model.type == 'CombSub':
259
+ model = CombSub(
260
+ sampling_rate=args.data.sampling_rate,
261
+ block_size=args.data.block_size,
262
+ n_mag_allpass=args.model.n_mag_allpass,
263
+ n_mag_harmonic=args.model.n_mag_harmonic,
264
+ n_mag_noise=args.model.n_mag_noise,
265
+ n_unit=args.data.encoder_out_channels,
266
+ n_spk=args.model.n_spk)
267
+
268
+ elif args.model.type == 'CombSubFast':
269
+ model = CombSubFast(
270
+ sampling_rate=args.data.sampling_rate,
271
+ block_size=args.data.block_size,
272
+ n_unit=args.data.encoder_out_channels,
273
+ n_spk=args.model.n_spk)
274
+
275
+ else:
276
+ raise ValueError(f" [x] Unknown Model: {args.model.type}")
277
+
278
+ print(' [Loading] ' + model_path)
279
+ ckpt = torch.load(model_path, map_location=torch.device(device))
280
+ model.to(device)
281
+ model.load_state_dict(ckpt['model'])
282
+ model.eval()
283
+ return model, args
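`load_model` expects a `config.yaml` next to the checkpoint and returns the DDSP model together with the parsed config as a `DotDict`. A minimal sketch (the checkpoint path is a placeholder):

```python
from ddsp.vocoder import load_model

model, args = load_model('exp/multi_speaker/model_300000.pt', device='cpu')
print(args.model.type, args.data.sampling_rate, args.data.block_size)
```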
284
+
285
+
286
+ class Sins(torch.nn.Module):
287
+ def __init__(self,
288
+ sampling_rate,
289
+ block_size,
290
+ n_harmonics,
291
+ n_mag_allpass,
292
+ n_mag_noise,
293
+ n_unit=256,
294
+ n_spk=1):
295
+ super().__init__()
296
+
297
+ print(' [DDSP Model] Sinusoids Additive Synthesiser')
298
+
299
+ # params
300
+ self.register_buffer("sampling_rate", torch.tensor(sampling_rate))
301
+ self.register_buffer("block_size", torch.tensor(block_size))
302
+ # Unit2Control
303
+ split_map = {
304
+ 'amplitudes': n_harmonics,
305
+ 'group_delay': n_mag_allpass,
306
+ 'noise_magnitude': n_mag_noise,
307
+ }
308
+ self.unit2ctrl = Unit2Control(n_unit, n_spk, split_map)
309
+
310
+ def forward(self, units_frames, f0_frames, volume_frames, spk_id=None, spk_mix_dict=None, initial_phase=None, infer=True, max_upsample_dim=32):
311
+ '''
312
+ units_frames: B x n_frames x n_unit
313
+ f0_frames: B x n_frames x 1
314
+ volume_frames: B x n_frames x 1
315
+ spk_id: B x 1
316
+ '''
317
+ # exciter phase
318
+ f0 = upsample(f0_frames, self.block_size)
319
+ if infer:
320
+ x = torch.cumsum(f0.double() / self.sampling_rate, axis=1)
321
+ else:
322
+ x = torch.cumsum(f0 / self.sampling_rate, axis=1)
323
+ if initial_phase is not None:
324
+ x += initial_phase.to(x) / 2 / np.pi
325
+ x = x - torch.round(x)
326
+ x = x.to(f0)
327
+
328
+ phase = 2 * np.pi * x
329
+ phase_frames = phase[:, ::self.block_size, :]
330
+
331
+ # parameter prediction
332
+ ctrls = self.unit2ctrl(units_frames, f0_frames, phase_frames, volume_frames, spk_id=spk_id, spk_mix_dict=spk_mix_dict)
333
+
334
+ amplitudes_frames = torch.exp(ctrls['amplitudes'])/ 128
335
+ group_delay = np.pi * torch.tanh(ctrls['group_delay'])
336
+ noise_param = torch.exp(ctrls['noise_magnitude']) / 128
337
+
338
+ # sinusoids exciter signal
339
+ amplitudes_frames = remove_above_fmax(amplitudes_frames, f0_frames, self.sampling_rate / 2, level_start = 1)
340
+ n_harmonic = amplitudes_frames.shape[-1]
341
+ level_harmonic = torch.arange(1, n_harmonic + 1).to(phase)
342
+ sinusoids = 0.
343
+ for n in range(( n_harmonic - 1) // max_upsample_dim + 1):
344
+ start = n * max_upsample_dim
345
+ end = (n + 1) * max_upsample_dim
346
+ phases = phase * level_harmonic[start:end]
347
+ amplitudes = upsample(amplitudes_frames[:,:,start:end], self.block_size)
348
+ sinusoids += (torch.sin(phases) * amplitudes).sum(-1)
349
+
350
+ # harmonic part filter (apply group-delay)
351
+ harmonic = frequency_filter(
352
+ sinusoids,
353
+ torch.exp(1.j * torch.cumsum(group_delay, axis = -1)),
354
+ hann_window = False)
355
+
356
+ # noise part filter
357
+ noise = torch.rand_like(harmonic) * 2 - 1
358
+ noise = frequency_filter(
359
+ noise,
360
+ torch.complex(noise_param, torch.zeros_like(noise_param)),
361
+ hann_window = True)
362
+
363
+ signal = harmonic + noise
364
+
365
+ return signal, phase, (harmonic, noise) #, (noise_param, noise_param)
366
+
367
+ class CombSubFast(torch.nn.Module):
368
+ def __init__(self,
369
+ sampling_rate,
370
+ block_size,
371
+ n_unit=256,
372
+ n_spk=1):
373
+ super().__init__()
374
+
375
+ print(' [DDSP Model] Combtooth Subtractive Synthesiser')
376
+ # params
377
+ self.register_buffer("sampling_rate", torch.tensor(sampling_rate))
378
+ self.register_buffer("block_size", torch.tensor(block_size))
379
+ self.register_buffer("window", torch.sqrt(torch.hann_window(2 * block_size)))
380
+ #Unit2Control
381
+ split_map = {
382
+ 'harmonic_magnitude': block_size + 1,
383
+ 'harmonic_phase': block_size + 1,
384
+ 'noise_magnitude': block_size + 1
385
+ }
386
+ self.unit2ctrl = Unit2Control(n_unit, n_spk, split_map)
387
+
388
+ def forward(self, units_frames, f0_frames, volume_frames, spk_id=None, spk_mix_dict=None, initial_phase=None, infer=True, **kwargs):
389
+ '''
390
+ units_frames: B x n_frames x n_unit
391
+ f0_frames: B x n_frames x 1
392
+ volume_frames: B x n_frames x 1
393
+ spk_id: B x 1
394
+ '''
395
+ # exciter phase
396
+ f0 = upsample(f0_frames, self.block_size)
397
+ if infer:
398
+ x = torch.cumsum(f0.double() / self.sampling_rate, axis=1)
399
+ else:
400
+ x = torch.cumsum(f0 / self.sampling_rate, axis=1)
401
+ if initial_phase is not None:
402
+ x += initial_phase.to(x) / 2 / np.pi
403
+ x = x - torch.round(x)
404
+ x = x.to(f0)
405
+
406
+ phase_frames = 2 * np.pi * x[:, ::self.block_size, :]
407
+
408
+ # parameter prediction
409
+ ctrls = self.unit2ctrl(units_frames, f0_frames, phase_frames, volume_frames, spk_id=spk_id, spk_mix_dict=spk_mix_dict)
410
+
411
+ src_filter = torch.exp(ctrls['harmonic_magnitude'] + 1.j * np.pi * ctrls['harmonic_phase'])
412
+ src_filter = torch.cat((src_filter, src_filter[:,-1:,:]), 1)
413
+ noise_filter= torch.exp(ctrls['noise_magnitude']) / 128
414
+ noise_filter = torch.cat((noise_filter, noise_filter[:,-1:,:]), 1)
415
+
416
+ # combtooth exciter signal
417
+ combtooth = torch.sinc(self.sampling_rate * x / (f0 + 1e-3))
418
+ combtooth = combtooth.squeeze(-1)
419
+ combtooth_frames = F.pad(combtooth, (self.block_size, self.block_size)).unfold(1, 2 * self.block_size, self.block_size)
420
+ combtooth_frames = combtooth_frames * self.window
421
+ combtooth_fft = torch.fft.rfft(combtooth_frames, 2 * self.block_size)
422
+
423
+ # noise exciter signal
424
+ noise = torch.rand_like(combtooth) * 2 - 1
425
+ noise_frames = F.pad(noise, (self.block_size, self.block_size)).unfold(1, 2 * self.block_size, self.block_size)
426
+ noise_frames = noise_frames * self.window
427
+ noise_fft = torch.fft.rfft(noise_frames, 2 * self.block_size)
428
+
429
+ # apply the filters
430
+ signal_fft = combtooth_fft * src_filter + noise_fft * noise_filter
431
+
432
+ # take the ifft to resynthesize audio.
433
+ signal_frames_out = torch.fft.irfft(signal_fft, 2 * self.block_size) * self.window
434
+
435
+ # overlap add
436
+ fold = torch.nn.Fold(output_size=(1, (signal_frames_out.size(1) + 1) * self.block_size), kernel_size=(1, 2 * self.block_size), stride=(1, self.block_size))
437
+ signal = fold(signal_frames_out.transpose(1, 2))[:, 0, 0, self.block_size : -self.block_size]
438
+
439
+ return signal, phase_frames, (signal, signal)
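`CombSubFast` windows each 2×block frame with a square-root Hann window before the FFT and again after the inverse FFT, so overlap-adding at a hop of `block_size` reconstructs the filtered signal: the squared window satisfies the constant-overlap-add condition. A quick numeric check of that property, as a sketch:

```python
import torch

block_size = 512
window = torch.sqrt(torch.hann_window(2 * block_size))
# the analysis and synthesis windows multiply to a Hann window,
# and hann(2N) shifted by N sums to one at every sample
cola = window[:block_size] ** 2 + window[block_size:] ** 2
print(torch.allclose(cola, torch.ones(block_size), atol=1e-6))   # True
```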
440
+
441
+ class CombSub(torch.nn.Module):
442
+ def __init__(self,
443
+ sampling_rate,
444
+ block_size,
445
+ n_mag_allpass,
446
+ n_mag_harmonic,
447
+ n_mag_noise,
448
+ n_unit=256,
449
+ n_spk=1):
450
+ super().__init__()
451
+
452
+ print(' [DDSP Model] Combtooth Subtractive Synthesiser (Old Version)')
453
+ # params
454
+ self.register_buffer("sampling_rate", torch.tensor(sampling_rate))
455
+ self.register_buffer("block_size", torch.tensor(block_size))
456
+ #Unit2Control
457
+ split_map = {
458
+ 'group_delay': n_mag_allpass,
459
+ 'harmonic_magnitude': n_mag_harmonic,
460
+ 'noise_magnitude': n_mag_noise
461
+ }
462
+ self.unit2ctrl = Unit2Control(n_unit, n_spk, split_map)
463
+
464
+ def forward(self, units_frames, f0_frames, volume_frames, spk_id=None, spk_mix_dict=None, initial_phase=None, infer=True, **kwargs):
465
+ '''
466
+ units_frames: B x n_frames x n_unit
467
+ f0_frames: B x n_frames x 1
468
+ volume_frames: B x n_frames x 1
469
+ spk_id: B x 1
470
+ '''
471
+ # exciter phase
472
+ f0 = upsample(f0_frames, self.block_size)
473
+ if infer:
474
+ x = torch.cumsum(f0.double() / self.sampling_rate, axis=1)
475
+ else:
476
+ x = torch.cumsum(f0 / self.sampling_rate, axis=1)
477
+ if initial_phase is not None:
478
+ x += initial_phase.to(x) / 2 / np.pi
479
+ x = x - torch.round(x)
480
+ x = x.to(f0)
481
+
482
+ phase_frames = 2 * np.pi * x[:, ::self.block_size, :]
483
+
484
+ # parameter prediction
485
+ ctrls = self.unit2ctrl(units_frames, f0_frames, phase_frames, volume_frames, spk_id=spk_id, spk_mix_dict=spk_mix_dict)
486
+
487
+ group_delay = np.pi * torch.tanh(ctrls['group_delay'])
488
+ src_param = torch.exp(ctrls['harmonic_magnitude'])
489
+ noise_param = torch.exp(ctrls['noise_magnitude']) / 128
490
+
491
+ # combtooth exciter signal
492
+ combtooth = torch.sinc(self.sampling_rate * x / (f0 + 1e-3))
493
+ combtooth = combtooth.squeeze(-1)
494
+
495
+ # harmonic part filter (using dynamic-windowed LTV-FIR, with group-delay prediction)
496
+ harmonic = frequency_filter(
497
+ combtooth,
498
+ torch.exp(1.j * torch.cumsum(group_delay, axis = -1)),
499
+ hann_window = False)
500
+ harmonic = frequency_filter(
501
+ harmonic,
502
+ torch.complex(src_param, torch.zeros_like(src_param)),
503
+ hann_window = True,
504
+ half_width_frames = 1.5 * self.sampling_rate / (f0_frames + 1e-3))
505
+
506
+ # noise part filter (using constant-windowed LTV-FIR, without group-delay)
507
+ noise = torch.rand_like(harmonic) * 2 - 1
508
+ noise = frequency_filter(
509
+ noise,
510
+ torch.complex(noise_param, torch.zeros_like(noise_param)),
511
+ hann_window = True)
512
+
513
+ signal = harmonic + noise
514
+
515
+ return signal, phase_frames, (harmonic, noise)
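Putting the pieces of this module together, offline inference roughly follows the pattern below (a condensed sketch of what the Flask and GUI front ends in this upload do; the paths and speaker id are placeholders):

```python
import librosa
import torch
from ddsp.vocoder import load_model, F0_Extractor, Volume_Extractor, Units_Encoder

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, args = load_model('exp/multi_speaker/model_300000.pt', device=device)

audio, sr = librosa.load('input.wav', sr=None, mono=True)
hop = args.data.block_size * sr / args.data.sampling_rate

f0 = F0_Extractor('dio', sr, hop).extract(audio, uv_interp=True, device=device)
f0 = torch.from_numpy(f0).float().to(device)[None, :, None]
volume = Volume_Extractor(hop).extract(audio)
volume = torch.from_numpy(volume).float().to(device)[None, :, None]
encoder = Units_Encoder(args.data.encoder, args.data.encoder_ckpt,
                        args.data.encoder_sample_rate, args.data.encoder_hop_size,
                        device=device)
units = encoder.encode(torch.from_numpy(audio).float()[None, :].to(device), sr, hop)

with torch.no_grad():
    output, _, _ = model(units, f0, volume, spk_id=torch.LongTensor([[1]]).to(device))
```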
draw.py ADDED
@@ -0,0 +1,101 @@
+ import numpy as np
+ import tqdm
+ import matplotlib.pyplot as plt
+ import os
+ import shutil
+ import wave
+
+ WAV_MIN_LENGTH = 2 # The minimum duration of wav files
+ SAMPLE_RATE = 1 # The percentage of files to be extracted
+ SAMPLE_MIN = 2 # The lower limit of the number of files to be extracted
+ SAMPLE_MAX = 10 # The upper limit of the number of files to be extracted
+
+
+ # Check whether a wav file is longer than the minimum duration
+ def check_duration(wav_file):
+     # open the wav file
+     f = wave.open(wav_file, "rb")
+     # get the number of frames and the frame rate
+     frames = f.getnframes()
+     rate = f.getframerate()
+     # compute the duration in seconds
+     duration = frames / float(rate)
+     # close the file
+     f.close()
+     # return whether the duration is longer than the minimum duration
+     return duration > WAV_MIN_LENGTH
+
+ # Randomly pick a given proportion of wav files from a directory and move them
+ # to another directory, keeping the directory structure
+ def split_data(src_dir, dst_dir, ratio):
+     # create the target directory if it does not exist
+     if not os.path.exists(dst_dir):
+         os.makedirs(dst_dir)
+
+     # collect the sub-directories and wav files directly under the source directory
+     subdirs, files = [], []
+     for item in os.listdir(src_dir):
+         item_path = os.path.join(src_dir, item)
+         if os.path.isdir(item_path):
+             subdirs.append(item)
+         elif os.path.isfile(item_path) and item.endswith(".wav"):
+             files.append(item)
+
+     # if the source directory contains no wav files, report an error and return
+     if len(files) == 0:
+         print(f"Error: No wav files found in {src_dir}")
+         return
+
+     # compute how many wav files to extract
+     num_files = int(len(files) * ratio)
+     num_files = max(SAMPLE_MIN, min(SAMPLE_MAX, num_files))
+
+     # shuffle the file list and take the first num_files entries
+     np.random.shuffle(files)
+     selected_files = files[:num_files]
+
+     # progress bar for the copy loop
+     pbar = tqdm.tqdm(total=num_files)
+
+     # iterate over the selected files
+     for file in selected_files:
+         # build the full source and target paths
+         src_file = os.path.join(src_dir, file)
+         dst_file = os.path.join(dst_dir, file)
+         # check that the source file is longer than 2 seconds
+         if check_duration(src_file):
+             # if so, move it to the target directory
+             shutil.move(src_file, dst_file)
+             # update the progress bar
+             pbar.update(1)
+         else:
+             # otherwise, report the file name and skip it
+             print(f"Skipped {src_file} because its duration is less than 2 seconds.")
+
+     # close the progress bar
+     pbar.close()
+
+     # recurse into the sub-directories (if any)
+     for subdir in subdirs:
+         # build the sub-directory paths in the source and target directories
+         src_subdir = os.path.join(src_dir, subdir)
+         dst_subdir = os.path.join(dst_dir, subdir)
+         # call this function recursively on the sub-directory,
+         # keeping the directory structure
+         split_data(src_subdir, dst_subdir, ratio)
+
+ # Main function: set the paths and call split_data
+
+ def main():
+     root_dir = os.path.abspath('.')
+     dst_dir = root_dir + "/data/val/audio"
+     # extraction ratio, 1 (percent) by default
+     ratio = float(SAMPLE_RATE) / 100
+
+     # the source directory is fixed to ./data/train/audio
+     src_dir = root_dir + "/data/train/audio"
+
+     # extract wav files from the source directory and move them to the target
+     # directory, keeping the directory structure
+     split_data(src_dir, dst_dir, ratio)
+
+ # Run the main function when executed as a script
+ if __name__ == "__main__":
+     main()
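In short, `draw.py` carves a small validation set out of `data/train/audio` by moving a few wavs (1 % of the files, clamped to between 2 and 10, each at least 2 s long) into `data/val/audio`. A sketch of calling the helper directly with a custom ratio:

```python
from draw import split_data

# move roughly 2 % of the training wavs (still clamped to 2-10 files) to validation
split_data("data/train/audio", "data/val/audio", ratio=0.02)
```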
encoder/hubert/model.py ADDED
@@ -0,0 +1,293 @@
1
+ import copy
2
+ from typing import Optional, Tuple
3
+ import random
4
+
5
+ from sklearn.cluster import KMeans
6
+
7
+ import torch
8
+ import torch.nn as nn
9
+ import torch.nn.functional as F
10
+ from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present
11
+
12
+ URLS = {
13
+ "hubert-discrete": "https://github.com/bshall/hubert/releases/download/v0.1/hubert-discrete-e9416457.pt",
14
+ "hubert-soft": "https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt",
15
+ "kmeans100": "https://github.com/bshall/hubert/releases/download/v0.1/kmeans100-50f36a95.pt",
16
+ }
17
+
18
+
19
+ class Hubert(nn.Module):
20
+ def __init__(self, num_label_embeddings: int = 100, mask: bool = True):
21
+ super().__init__()
22
+ self._mask = mask
23
+ self.feature_extractor = FeatureExtractor()
24
+ self.feature_projection = FeatureProjection()
25
+ self.positional_embedding = PositionalConvEmbedding()
26
+ self.norm = nn.LayerNorm(768)
27
+ self.dropout = nn.Dropout(0.1)
28
+ self.encoder = TransformerEncoder(
29
+ nn.TransformerEncoderLayer(
30
+ 768, 12, 3072, activation="gelu", batch_first=True
31
+ ),
32
+ 12,
33
+ )
34
+ self.proj = nn.Linear(768, 256)
35
+
36
+ self.masked_spec_embed = nn.Parameter(torch.FloatTensor(768).uniform_())
37
+ self.label_embedding = nn.Embedding(num_label_embeddings, 256)
38
+
39
+ def mask(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
40
+ mask = None
41
+ if self.training and self._mask:
42
+ mask = _compute_mask((x.size(0), x.size(1)), 0.8, 10, x.device, 2)
43
+ x[mask] = self.masked_spec_embed.to(x.dtype)
44
+ return x, mask
45
+
46
+ def encode(
47
+ self, x: torch.Tensor, layer: Optional[int] = None
48
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
49
+ x = self.feature_extractor(x)
50
+ x = self.feature_projection(x.transpose(1, 2))
51
+ x, mask = self.mask(x)
52
+ x = x + self.positional_embedding(x)
53
+ x = self.dropout(self.norm(x))
54
+ x = self.encoder(x, output_layer=layer)
55
+ return x, mask
56
+
57
+ def logits(self, x: torch.Tensor) -> torch.Tensor:
58
+ logits = torch.cosine_similarity(
59
+ x.unsqueeze(2),
60
+ self.label_embedding.weight.unsqueeze(0).unsqueeze(0),
61
+ dim=-1,
62
+ )
63
+ return logits / 0.1
64
+
65
+ def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
66
+ x, mask = self.encode(x)
67
+ x = self.proj(x)
68
+ logits = self.logits(x)
69
+ return logits, mask
70
+
71
+
72
+ class HubertSoft(Hubert):
73
+ def __init__(self):
74
+ super().__init__()
75
+
76
+ @torch.inference_mode()
77
+ def units(self, wav: torch.Tensor) -> torch.Tensor:
78
+ wav = F.pad(wav, ((400 - 320) // 2, (400 - 320) // 2))
79
+ x, _ = self.encode(wav)
80
+ return self.proj(x)
81
+
82
+
83
+ class HubertDiscrete(Hubert):
84
+ def __init__(self, kmeans):
85
+ super().__init__(504)
86
+ self.kmeans = kmeans
87
+
88
+ @torch.inference_mode()
89
+ def units(self, wav: torch.Tensor) -> torch.LongTensor:
90
+ wav = F.pad(wav, ((400 - 320) // 2, (400 - 320) // 2))
91
+ x, _ = self.encode(wav, layer=7)
92
+ x = self.kmeans.predict(x.squeeze().cpu().numpy())
93
+ return torch.tensor(x, dtype=torch.long, device=wav.device)
94
+
95
+
96
+ class FeatureExtractor(nn.Module):
97
+ def __init__(self):
98
+ super().__init__()
99
+ self.conv0 = nn.Conv1d(1, 512, 10, 5, bias=False)
100
+ self.norm0 = nn.GroupNorm(512, 512)
101
+ self.conv1 = nn.Conv1d(512, 512, 3, 2, bias=False)
102
+ self.conv2 = nn.Conv1d(512, 512, 3, 2, bias=False)
103
+ self.conv3 = nn.Conv1d(512, 512, 3, 2, bias=False)
104
+ self.conv4 = nn.Conv1d(512, 512, 3, 2, bias=False)
105
+ self.conv5 = nn.Conv1d(512, 512, 2, 2, bias=False)
106
+ self.conv6 = nn.Conv1d(512, 512, 2, 2, bias=False)
107
+
108
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
109
+ x = F.gelu(self.norm0(self.conv0(x)))
110
+ x = F.gelu(self.conv1(x))
111
+ x = F.gelu(self.conv2(x))
112
+ x = F.gelu(self.conv3(x))
113
+ x = F.gelu(self.conv4(x))
114
+ x = F.gelu(self.conv5(x))
115
+ x = F.gelu(self.conv6(x))
116
+ return x
117
+
118
+
119
+ class FeatureProjection(nn.Module):
120
+ def __init__(self):
121
+ super().__init__()
122
+ self.norm = nn.LayerNorm(512)
123
+ self.projection = nn.Linear(512, 768)
124
+ self.dropout = nn.Dropout(0.1)
125
+
126
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
127
+ x = self.norm(x)
128
+ x = self.projection(x)
129
+ x = self.dropout(x)
130
+ return x
131
+
132
+
133
+ class PositionalConvEmbedding(nn.Module):
134
+ def __init__(self):
135
+ super().__init__()
136
+ self.conv = nn.Conv1d(
137
+ 768,
138
+ 768,
139
+ kernel_size=128,
140
+ padding=128 // 2,
141
+ groups=16,
142
+ )
143
+ self.conv = nn.utils.weight_norm(self.conv, name="weight", dim=2)
144
+
145
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
146
+ x = self.conv(x.transpose(1, 2))
147
+ x = F.gelu(x[:, :, :-1])
148
+ return x.transpose(1, 2)
149
+
150
+
151
+ class TransformerEncoder(nn.Module):
152
+ def __init__(
153
+ self, encoder_layer: nn.TransformerEncoderLayer, num_layers: int
154
+ ) -> None:
155
+ super(TransformerEncoder, self).__init__()
156
+ self.layers = nn.ModuleList(
157
+ [copy.deepcopy(encoder_layer) for _ in range(num_layers)]
158
+ )
159
+ self.num_layers = num_layers
160
+
161
+ def forward(
162
+ self,
163
+ src: torch.Tensor,
164
+ mask: torch.Tensor = None,
165
+ src_key_padding_mask: torch.Tensor = None,
166
+ output_layer: Optional[int] = None,
167
+ ) -> torch.Tensor:
168
+ output = src
169
+ for layer in self.layers[:output_layer]:
170
+ output = layer(
171
+ output, src_mask=mask, src_key_padding_mask=src_key_padding_mask
172
+ )
173
+ return output
174
+
175
+
176
+ def _compute_mask(
177
+ shape: Tuple[int, int],
178
+ mask_prob: float,
179
+ mask_length: int,
180
+ device: torch.device,
181
+ min_masks: int = 0,
182
+ ) -> torch.Tensor:
183
+ batch_size, sequence_length = shape
184
+
185
+ if mask_length < 1:
186
+ raise ValueError("`mask_length` has to be bigger than 0.")
187
+
188
+ if mask_length > sequence_length:
189
+ raise ValueError(
190
+ f"`mask_length` has to be smaller than `sequence_length`, but got `mask_length`: {mask_length} and `sequence_length`: {sequence_length}`"
191
+ )
192
+
193
+ # compute number of masked spans in batch
194
+ num_masked_spans = int(mask_prob * sequence_length / mask_length + random.random())
195
+ num_masked_spans = max(num_masked_spans, min_masks)
196
+
197
+ # make sure num masked indices <= sequence_length
198
+ if num_masked_spans * mask_length > sequence_length:
199
+ num_masked_spans = sequence_length // mask_length
200
+
201
+ # SpecAugment mask to fill
202
+ mask = torch.zeros((batch_size, sequence_length), device=device, dtype=torch.bool)
203
+
204
+ # uniform distribution to sample from, make sure that offset samples are < sequence_length
205
+ uniform_dist = torch.ones(
206
+ (batch_size, sequence_length - (mask_length - 1)), device=device
207
+ )
208
+
209
+ # get random indices to mask
210
+ mask_indices = torch.multinomial(uniform_dist, num_masked_spans)
211
+
212
+ # expand masked indices to masked spans
213
+ mask_indices = (
214
+ mask_indices.unsqueeze(dim=-1)
215
+ .expand((batch_size, num_masked_spans, mask_length))
216
+ .reshape(batch_size, num_masked_spans * mask_length)
217
+ )
218
+ offsets = (
219
+ torch.arange(mask_length, device=device)[None, None, :]
220
+ .expand((batch_size, num_masked_spans, mask_length))
221
+ .reshape(batch_size, num_masked_spans * mask_length)
222
+ )
223
+ mask_idxs = mask_indices + offsets
224
+
225
+ # scatter indices to mask
226
+ mask = mask.scatter(1, mask_idxs, True)
227
+
228
+ return mask
229
+
230
+
231
+ def hubert_discrete(
232
+ pretrained: bool = True,
233
+ progress: bool = True,
234
+ ) -> HubertDiscrete:
235
+ r"""HuBERT-Discrete from `"A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion"`.
236
+ Args:
237
+ pretrained (bool): load pretrained weights into the model
238
+ progress (bool): show progress bar when downloading model
239
+ """
240
+ kmeans = kmeans100(pretrained=pretrained, progress=progress)
241
+ hubert = HubertDiscrete(kmeans)
242
+ if pretrained:
243
+ checkpoint = torch.hub.load_state_dict_from_url(
244
+ URLS["hubert-discrete"], progress=progress
245
+ )
246
+ consume_prefix_in_state_dict_if_present(checkpoint, "module.")
247
+ hubert.load_state_dict(checkpoint)
248
+ hubert.eval()
249
+ return hubert
250
+
251
+
252
+ def hubert_soft(
253
+ pretrained: bool = True,
254
+ progress: bool = True,
255
+ ) -> HubertSoft:
256
+ r"""HuBERT-Soft from `"A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion"`.
257
+ Args:
258
+ pretrained (bool): load pretrained weights into the model
259
+ progress (bool): show progress bar when downloading model
260
+ """
261
+ hubert = HubertSoft()
262
+ if pretrained:
263
+ checkpoint = torch.hub.load_state_dict_from_url(
264
+ URLS["hubert-soft"], progress=progress
265
+ )
266
+ consume_prefix_in_state_dict_if_present(checkpoint, "module.")
267
+ hubert.load_state_dict(checkpoint)
268
+ hubert.eval()
269
+ return hubert
270
+
271
+
272
+ def _kmeans(
273
+ num_clusters: int, pretrained: bool = True, progress: bool = True
274
+ ) -> KMeans:
275
+ kmeans = KMeans(num_clusters)
276
+ if pretrained:
277
+ checkpoint = torch.hub.load_state_dict_from_url(
278
+ URLS[f"kmeans{num_clusters}"], progress=progress
279
+ )
280
+ kmeans.__dict__["n_features_in_"] = checkpoint["n_features_in_"]
281
+ kmeans.__dict__["_n_threads"] = checkpoint["_n_threads"]
282
+ kmeans.__dict__["cluster_centers_"] = checkpoint["cluster_centers_"].numpy()
283
+ return kmeans
284
+
285
+
286
+ def kmeans100(pretrained: bool = True, progress: bool = True) -> KMeans:
287
+ r"""
288
+ k-means checkpoint for HuBERT-Discrete with 100 clusters.
289
+ Args:
290
+ pretrained (bool): load pretrained weights into the model
291
+ progress (bool): show progress bar when downloading model
292
+ """
293
+ return _kmeans(100, pretrained, progress)
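A short sketch of extracting soft speech units directly from this module; `hubert_soft(pretrained=True)` downloads the published checkpoint on first use, and the wav path is a placeholder for a 16 kHz mono recording:

```python
import librosa
import torch
from encoder.hubert.model import hubert_soft

hubert = hubert_soft(pretrained=True)
wav, _ = librosa.load('input_16k.wav', sr=16000, mono=True)
wav = torch.from_numpy(wav).float()[None, None, :]   # batch, channel, time
units = hubert.units(wav)                             # 1 x n_frames x 256
print(units.shape)
```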
enhancer.py ADDED
@@ -0,0 +1,105 @@
1
+ import numpy as np
2
+ import torch
3
+ import torch.nn.functional as F
4
+ from nsf_hifigan.nvSTFT import STFT
5
+ from nsf_hifigan.models import load_model
6
+ from torchaudio.transforms import Resample
7
+
8
+ class Enhancer:
9
+ def __init__(self, enhancer_type, enhancer_ckpt, device=None):
10
+ if device is None:
11
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
12
+ self.device = device
13
+
14
+ if enhancer_type == 'nsf-hifigan':
15
+ self.enhancer = NsfHifiGAN(enhancer_ckpt, device=self.device)
16
+ else:
17
+ raise ValueError(f" [x] Unknown enhancer: {enhancer_type}")
18
+
19
+ self.resample_kernel = {}
20
+ self.enhancer_sample_rate = self.enhancer.sample_rate()
21
+ self.enhancer_hop_size = self.enhancer.hop_size()
22
+
23
+ def enhance(self,
24
+ audio, # 1, T
25
+ sample_rate,
26
+ f0, # 1, n_frames, 1
27
+ hop_size,
28
+ adaptive_key = 0,
29
+ silence_front = 0
30
+ ):
31
+ # enhancer start time
32
+ start_frame = int(silence_front * sample_rate / hop_size)
33
+ real_silence_front = start_frame * hop_size / sample_rate
34
+ audio = audio[:, int(np.round(real_silence_front * sample_rate)) : ]
35
+ f0 = f0[: , start_frame :, :]
36
+
37
+ # adaptive parameters
38
+ adaptive_factor = 2 ** ( -adaptive_key / 12)
39
+ adaptive_sample_rate = 100 * int(np.round(self.enhancer_sample_rate / adaptive_factor / 100))
40
+ real_factor = self.enhancer_sample_rate / adaptive_sample_rate
41
+
42
+ # resample the ddsp output
43
+ if sample_rate == adaptive_sample_rate:
44
+ audio_res = audio
45
+ else:
46
+ key_str = str(sample_rate) + str(adaptive_sample_rate)
47
+ if key_str not in self.resample_kernel:
48
+ self.resample_kernel[key_str] = Resample(sample_rate, adaptive_sample_rate, lowpass_filter_width = 128).to(self.device)
49
+ audio_res = self.resample_kernel[key_str](audio)
50
+
51
+ n_frames = int(audio_res.size(-1) // self.enhancer_hop_size + 1)
52
+
53
+ # resample f0
54
+ f0_np = f0.squeeze(0).squeeze(-1).cpu().numpy()
55
+ f0_np *= real_factor
56
+ time_org = (hop_size / sample_rate) * np.arange(len(f0_np)) / real_factor
57
+ time_frame = (self.enhancer_hop_size / self.enhancer_sample_rate) * np.arange(n_frames)
58
+ f0_res = np.interp(time_frame, time_org, f0_np, left=f0_np[0], right=f0_np[-1])
59
+ f0_res = torch.from_numpy(f0_res).unsqueeze(0).float().to(self.device) # 1, n_frames
60
+
61
+ # enhance
62
+ enhanced_audio, enhancer_sample_rate = self.enhancer(audio_res, f0_res)
63
+
64
+ # resample the enhanced output
65
+ if adaptive_factor != 0:
66
+ key_str = str(adaptive_sample_rate) + str(enhancer_sample_rate)
67
+ if key_str not in self.resample_kernel:
68
+ self.resample_kernel[key_str] = Resample(adaptive_sample_rate, enhancer_sample_rate, lowpass_filter_width = 128).to(self.device)
69
+ enhanced_audio = self.resample_kernel[key_str](enhanced_audio)
70
+
71
+ # pad the silence frames
72
+ if start_frame > 0:
73
+ enhanced_audio = F.pad(enhanced_audio, (int(np.round(enhancer_sample_rate * real_silence_front)), 0))
74
+
75
+ return enhanced_audio, enhancer_sample_rate
76
+
77
+
78
+ class NsfHifiGAN(torch.nn.Module):
79
+ def __init__(self, model_path, device=None):
80
+ super().__init__()
81
+ if device is None:
82
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
83
+ self.device = device
84
+ print('| Load HifiGAN: ', model_path)
85
+ self.model, self.h = load_model(model_path, device=self.device)
86
+
87
+ def sample_rate(self):
88
+ return self.h.sampling_rate
89
+
90
+ def hop_size(self):
91
+ return self.h.hop_size
92
+
93
+ def forward(self, audio, f0):
94
+ stft = STFT(
95
+ self.h.sampling_rate,
96
+ self.h.num_mels,
97
+ self.h.n_fft,
98
+ self.h.win_size,
99
+ self.h.hop_size,
100
+ self.h.fmin,
101
+ self.h.fmax)
102
+ with torch.no_grad():
103
+ mel = stft.get_mel(audio)
104
+ enhanced_audio = self.model(mel, f0[:,:mel.size(-1)]).view(-1)
105
+ return enhanced_audio, self.h.sampling_rate
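A sketch of running the enhancer on a chunk of DDSP output (the checkpoint path is a placeholder; `audio` is 1 x T at the DDSP sampling rate and `f0` is 1 x n_frames x 1):

```python
import torch
from enhancer import Enhancer

enhancer = Enhancer('nsf-hifigan', 'pretrain/nsf_hifigan/model', device='cpu')
audio = torch.zeros(1, 44100)                     # DDSP output, 1 x T
f0 = torch.full((1, 44100 // 512 + 1, 1), 220.0)  # Hz per frame
enhanced, enhanced_sr = enhancer.enhance(audio, 44100, f0, hop_size=512)
print(enhanced.shape, enhanced_sr)
```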
exp/gitkeep ADDED
File without changes
flask_api.py ADDED
@@ -0,0 +1,173 @@
1
+ import io
2
+ import logging
3
+ import torch
4
+ import numpy as np
5
+ import slicer
6
+ import soundfile as sf
7
+ import librosa
8
+ from flask import Flask, request, send_file
9
+ from flask_cors import CORS
10
+
11
+ from ddsp.vocoder import load_model, F0_Extractor, Volume_Extractor, Units_Encoder
12
+ from ddsp.core import upsample
13
+ from enhancer import Enhancer
14
+
15
+
16
+ app = Flask(__name__)
17
+
18
+ CORS(app)
19
+
20
+ logging.getLogger("numba").setLevel(logging.WARNING)
21
+
22
+
23
+ @app.route("/voiceChangeModel", methods=["POST"])
24
+ def voice_change_model():
25
+ request_form = request.form
26
+ wave_file = request.files.get("sample", None)
27
+ # get fSafePrefixPadLength
28
+ f_safe_prefix_pad_length = float(request_form.get("fSafePrefixPadLength", 0))
29
+ print("f_safe_prefix_pad_length:"+str(f_safe_prefix_pad_length))
30
+ # 变调信息
31
+ f_pitch_change = float(request_form.get("fPitchChange", 0))
32
+ # 获取spk_id
33
+ int_speak_id = int(request_form.get("sSpeakId", 0))
34
+ if enable_spk_id_cover:
35
+ int_speak_id = spk_id
36
+ # print("说话人:" + str(int_speak_id))
37
+ # DAW所需的采样率
38
+ daw_sample = int(float(request_form.get("sampleRate", 0)))
39
+ # http获得wav文件并转换
40
+ input_wav_read = io.BytesIO(wave_file.read())
41
+ # 模型推理
42
+ _audio, _model_sr = svc_model.infer(input_wav_read, f_pitch_change, int_speak_id, f_safe_prefix_pad_length)
43
+ tar_audio = librosa.resample(_audio, _model_sr, daw_sample)
44
+ # 返回音频
45
+ out_wav_path = io.BytesIO()
46
+ sf.write(out_wav_path, tar_audio, daw_sample, format="wav")
47
+ out_wav_path.seek(0)
48
+ return send_file(out_wav_path, download_name="temp.wav", as_attachment=True)
49
+
50
+
51
+ class SvcDDSP:
52
+ def __init__(self, model_path, vocoder_based_enhancer, enhancer_adaptive_key, input_pitch_extractor,
53
+ f0_min, f0_max, threhold, spk_id, spk_mix_dict, enable_spk_id_cover):
54
+ self.model_path = model_path
55
+ self.vocoder_based_enhancer = vocoder_based_enhancer
56
+ self.enhancer_adaptive_key = enhancer_adaptive_key
57
+ self.input_pitch_extractor = input_pitch_extractor
58
+ self.f0_min = f0_min
59
+ self.f0_max = f0_max
60
+ self.threhold = threhold
61
+ self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
62
+ self.spk_id = spk_id
63
+ self.spk_mix_dict = spk_mix_dict
64
+ self.enable_spk_id_cover = enable_spk_id_cover
65
+
66
+ # load ddsp model
67
+ self.model, self.args = load_model(self.model_path, device=self.device)
68
+
69
+ # load units encoder
70
+ self.units_encoder = Units_Encoder(
71
+ self.args.data.encoder,
72
+ self.args.data.encoder_ckpt,
73
+ self.args.data.encoder_sample_rate,
74
+ self.args.data.encoder_hop_size,
75
+ device=self.device)
76
+
77
+ # load enhancer
78
+ if self.vocoder_based_enhancer:
79
+ self.enhancer = Enhancer(self.args.enhancer.type, self.args.enhancer.ckpt, device=self.device)
80
+
81
+ def infer(self, input_wav, pitch_adjust, speaker_id, safe_prefix_pad_length):
82
+ print("Infer!")
83
+ # load input
84
+ audio, sample_rate = librosa.load(input_wav, sr=None, mono=True)
85
+ if len(audio.shape) > 1:
86
+ audio = librosa.to_mono(audio)
87
+ hop_size = self.args.data.block_size * sample_rate / self.args.data.sampling_rate
88
+
89
+ # safe front silence
90
+ if safe_prefix_pad_length > 0.03:
91
+ silence_front = safe_prefix_pad_length - 0.03
92
+ else:
93
+ silence_front = 0
94
+
95
+ # extract f0
96
+ pitch_extractor = F0_Extractor(
97
+ self.input_pitch_extractor,
98
+ sample_rate,
99
+ hop_size,
100
+ float(self.f0_min),
101
+ float(self.f0_max))
102
+ f0 = pitch_extractor.extract(audio, uv_interp=True, device=self.device, silence_front=silence_front)
103
+ f0 = torch.from_numpy(f0).float().to(self.device).unsqueeze(-1).unsqueeze(0)
104
+ f0 = f0 * 2 ** (float(pitch_adjust) / 12)
105
+
106
+ # extract volume
107
+ volume_extractor = Volume_Extractor(hop_size)
108
+ volume = volume_extractor.extract(audio)
109
+ mask = (volume > 10 ** (float(self.threhold) / 20)).astype('float')
110
+ mask = np.pad(mask, (4, 4), constant_values=(mask[0], mask[-1]))
111
+ mask = np.array([np.max(mask[n : n + 9]) for n in range(len(mask) - 8)])
112
+ mask = torch.from_numpy(mask).float().to(self.device).unsqueeze(-1).unsqueeze(0)
113
+ mask = upsample(mask, self.args.data.block_size).squeeze(-1)
114
+ volume = torch.from_numpy(volume).float().to(self.device).unsqueeze(-1).unsqueeze(0)
115
+
116
+ # extract units
117
+ audio_t = torch.from_numpy(audio).float().unsqueeze(0).to(self.device)
118
+ units = self.units_encoder.encode(audio_t, sample_rate, hop_size)
119
+
120
+ # spk_id or spk_mix_dict
121
+ if self.enable_spk_id_cover:
122
+ spk_id = self.spk_id
123
+ else:
124
+ spk_id = speaker_id
125
+ spk_id = torch.LongTensor(np.array([[spk_id]])).to(self.device)
126
+
127
+ # forward and return the output
128
+ with torch.no_grad():
129
+ output, _, (s_h, s_n) = self.model(units, f0, volume, spk_id = spk_id, spk_mix_dict = self.spk_mix_dict)
130
+ output *= mask
131
+ if self.vocoder_based_enhancer:
132
+ output, output_sample_rate = self.enhancer.enhance(
133
+ output,
134
+ self.args.data.sampling_rate,
135
+ f0,
136
+ self.args.data.block_size,
137
+ adaptive_key = self.enhancer_adaptive_key,
138
+ silence_front = silence_front)
139
+ else:
140
+ output_sample_rate = self.args.data.sampling_rate
141
+
142
+ output = output.squeeze().cpu().numpy()
143
+ return output, output_sample_rate
144
+
145
+
146
+ if __name__ == "__main__":
147
+ # ddsp-svc下只需传入下列参数。
148
+ # 对接的是串串香火锅大佬https://github.com/zhaohui8969/VST_NetProcess-。建议使用最新版本。
149
+ # flask部分来自diffsvc小狼大佬编写的代码。
150
+ # config和模型得同一目录。
151
+ checkpoint_path = "exp/multi_speaker/model_300000.pt"
152
+ # 是否使用预训练的基于声码器的增强器增强输出,但对硬件要求更高。
153
+ use_vocoder_based_enhancer = True
154
+ # 结合增强器使用,0为正常音域范围(最高G5)内的高音频质量,大于0则可以防止超高音破音
155
+ enhancer_adaptive_key = 0
156
+ # f0提取器,有parselmouth, dio, harvest, crepe
157
+ select_pitch_extractor = 'crepe'
158
+ # f0范围限制(Hz)
159
+ limit_f0_min = 50
160
+ limit_f0_max = 1100
161
+ # 音量响应阈值(dB)
162
+ threhold = -60
163
+ # 默认说话人。以及是否优先使用默认说话人覆盖vst传入的参数。
164
+ spk_id = 1
165
+ enable_spk_id_cover = True
166
+ # 混合说话人字典(捏音色功能)
167
+ # 设置为非 None 字典会覆盖 spk_id
168
+ spk_mix_dict = None # {1:0.5, 2:0.5} 表示1号说话人和2号说话人的音色按照0.5:0.5的比例混合
169
+ svc_model = SvcDDSP(checkpoint_path, use_vocoder_based_enhancer, enhancer_adaptive_key, select_pitch_extractor,
170
+ limit_f0_min, limit_f0_max, threhold, spk_id, spk_mix_dict, enable_spk_id_cover)
171
+
172
+ # 此处与vst插件对应,端口必须接上。
173
+ app.run(port=6844, host="0.0.0.0", debug=False, threaded=False)
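For reference, the endpoint can also be exercised without the VST plugin. A sketch of a test client using `requests` (the form field names match the handler above; the wav path is a placeholder):

```python
import requests

with open('input.wav', 'rb') as f:
    resp = requests.post(
        'http://127.0.0.1:6844/voiceChangeModel',
        files={'sample': f},
        data={'fPitchChange': 0, 'sSpeakId': 1,
              'sampleRate': 44100, 'fSafePrefixPadLength': 0})

with open('output.wav', 'wb') as f:
    f.write(resp.content)
```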
gui.py ADDED
@@ -0,0 +1,299 @@
1
+ import PySimpleGUI as sg
2
+ import sounddevice as sd
3
+ import torch,librosa,threading,time
4
+ from enhancer import Enhancer
5
+ import numpy as np
6
+ from ddsp.vocoder import load_model, F0_Extractor, Volume_Extractor, Units_Encoder
7
+ from ddsp.core import upsample
8
+
9
+
10
+ class SvcDDSP:
11
+ def __init__(self, model_path, vocoder_based_enhancer, enhancer_adaptive_key, input_pitch_extractor,
12
+ f0_min, f0_max, threhold, spk_id, spk_mix_dict, enable_spk_id_cover):
13
+ self.model_path = model_path
14
+ self.vocoder_based_enhancer = vocoder_based_enhancer
15
+ self.enhancer_adaptive_key = enhancer_adaptive_key
16
+ self.input_pitch_extractor = input_pitch_extractor
17
+ self.f0_min = f0_min
18
+ self.f0_max = f0_max
19
+ self.threhold = threhold
20
+ self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
21
+ self.spk_id = spk_id
22
+ self.spk_mix_dict = spk_mix_dict
23
+ self.enable_spk_id_cover = enable_spk_id_cover
24
+
25
+ # load ddsp model
26
+ self.model, self.args = load_model(self.model_path, device=self.device)
27
+
28
+ # load units encoder
29
+ self.units_encoder = Units_Encoder(
30
+ self.args.data.encoder,
31
+ self.args.data.encoder_ckpt,
32
+ self.args.data.encoder_sample_rate,
33
+ self.args.data.encoder_hop_size,
34
+ device=self.device)
35
+
36
+ # load enhancer
37
+ if self.vocoder_based_enhancer:
38
+ self.enhancer = Enhancer(self.args.enhancer.type, self.args.enhancer.ckpt, device=self.device)
39
+
40
+ def infer(self, pitch_adjust, speaker_id, safe_prefix_pad_length,audio,sample_rate):
41
+ print("Infering...")
42
+ # load input
43
+ #audio, sample_rate = librosa.load(input_wav, sr=None, mono=True)
44
+ hop_size = self.args.data.block_size * sample_rate / self.args.data.sampling_rate
45
+ # safe front silence
46
+ if safe_prefix_pad_length > 0.03:
47
+ silence_front = safe_prefix_pad_length - 0.03
48
+ else:
49
+ silence_front = 0
50
+
51
+ # extract f0
52
+ pitch_extractor = F0_Extractor(
53
+ self.input_pitch_extractor,
54
+ sample_rate,
55
+ hop_size,
56
+ float(self.f0_min),
57
+ float(self.f0_max))
58
+ f0 = pitch_extractor.extract(audio, uv_interp=True, device=self.device, silence_front=silence_front)
59
+ f0 = torch.from_numpy(f0).float().to(self.device).unsqueeze(-1).unsqueeze(0)
60
+ f0 = f0 * 2 ** (float(pitch_adjust) / 12)
61
+
62
+ # extract volume
63
+ volume_extractor = Volume_Extractor(hop_size)
64
+ volume = volume_extractor.extract(audio)
65
+ mask = (volume > 10 ** (float(self.threhold) / 20)).astype('float')
66
+ mask = np.pad(mask, (4, 4), constant_values=(mask[0], mask[-1]))
67
+ mask = np.array([np.max(mask[n : n + 9]) for n in range(len(mask) - 8)])
68
+ mask = torch.from_numpy(mask).float().to(self.device).unsqueeze(-1).unsqueeze(0)
69
+ mask = upsample(mask, self.args.data.block_size).squeeze(-1)
70
+ volume = torch.from_numpy(volume).float().to(self.device).unsqueeze(-1).unsqueeze(0)
71
+
72
+ # extract units
73
+ audio_t = torch.from_numpy(audio).float().unsqueeze(0).to(self.device)
74
+ units = self.units_encoder.encode(audio_t, sample_rate, hop_size)
75
+
76
+ # spk_id or spk_mix_dict
77
+ if self.enable_spk_id_cover:
78
+ spk_id = self.spk_id
79
+ else:
80
+ spk_id = speaker_id
81
+ spk_id = torch.LongTensor(np.array([[spk_id]])).to(self.device)
82
+
83
+ # forward and return the output
84
+ with torch.no_grad():
85
+ output, _, (s_h, s_n) = self.model(units, f0, volume, spk_id = spk_id, spk_mix_dict = self.spk_mix_dict)
86
+ output *= mask
87
+ if self.vocoder_based_enhancer:
88
+ output, output_sample_rate = self.enhancer.enhance(
89
+ output,
90
+ self.args.data.sampling_rate,
91
+ f0,
92
+ self.args.data.block_size,
93
+ adaptive_key = self.enhancer_adaptive_key,
94
+ silence_front = silence_front)
95
+ else:
96
+ output_sample_rate = self.args.data.sampling_rate
97
+
98
+ output = output.squeeze().cpu().numpy()
99
+ return output, output_sample_rate
100
+
101
+
102
+
103
+
104
+ class GUI:
105
+ def __init__(self) -> None:
+     self.flag_vc:bool=False # flag for the voice conversion thread
+     self.samplerate=44100 # Hz
+     self.block_time=1.5 # s
+     self.block_frame=0
+     self.crossfade_frame=0
+     self.fade_in_window:np.ndarray=None # numpy array used for the crossfade
+     self.fade_out_window:np.ndarray=None # numpy array used for the crossfade
+     self.f_safe_prefix_pad_length:float = 1.0
+     self.input_wav:np.ndarray=None # buffer holding the normalized input audio
+     self.output_wav:np.ndarray=None # buffer holding the normalized output audio
+     self.temp_wav:np.ndarray=None # buffer holding the crossfade and the output audio
+     self.f_pitch_change:float = 0.0 # float(request_form.get("fPitchChange", 0))
+     self.crossfade_last:np.ndarray=None # crossfade tail of the previous output
+     self.f0_mode=["parselmouth", "dio", "harvest", "crepe"] # f0 predictors
+     self.spk_id = 1 # default speaker id
+     self.svc_model:SvcDDSP = None
+     # speaker mix dictionary (timbre blending feature)
+     # a non-None dictionary overrides spk_id
+     self.spk_mix_dict = None # {1:0.5, 2:0.5} mixes speaker 1 and speaker 2 at a 0.5:0.5 ratio
+     self.use_vocoder_based_enhancer = True
+     self.launcher() # start (called last, after all attributes above are initialized)
127
+
128
+
129
+ def launcher(self):
130
+ '''Window setup'''
131
+ input_devices,output_devices,_, _=self.get_devices()
132
+ sg.theme('DarkAmber') # set the theme
133
+ # window layout
134
+ layout = [
135
+ [ sg.Frame(layout=[
136
+ [sg.Input(key='sg_model',default_text='exp\\model_chino.pt'),sg.FileBrowse('选择模型文件')]
137
+ ],title='模型.pt格式(自动识别同目录下config.yaml)')
138
+ ],
139
+ [ sg.Frame(layout=[
140
+ [sg.Text("输入设备"),sg.Combo(input_devices,key='sg_input_device',default_value=input_devices[sd.default.device[0]])],
141
+ [sg.Text("输出设备"),sg.Combo(output_devices,key='sg_output_device',default_value=output_devices[sd.default.device[1]])]
142
+ ],title='音频设备')
143
+ ],
144
+ [ sg.Frame(layout=[
145
+ [sg.Text("说话人id"),sg.Input(key='spk_id',default_text='1')],
146
+ [sg.Text("响应阈值"),sg.Slider(range=(-60,0),orientation='h',key='noise',resolution=1,default_value=-35)],
147
+ [sg.Text("变调"),sg.Slider(range=(-24,24),orientation='h',key='pitch',resolution=1,default_value=12)],
148
+ [sg.Text("采样率"),sg.Input(key='samplerate',default_text='44100')],
149
+ [sg.Checkbox(text='启用捏音色功能',default=False,key='spk_mix'),sg.Button("设置混合音色",key='set_spk_mix')]
150
+ ],title='普通设置'),
151
+ sg.Frame(layout=[
152
+ [sg.Text("音频切分大小"),sg.Slider(range=(0.1,3.0),orientation='h',key='block',resolution=0.05,default_value=0.5)],
153
+ [sg.Text("交叉淡化时长"),sg.Slider(range=(0.02,0.1),orientation='h',key='crossfade',resolution=0.01)],
154
+ [sg.Text("使用历史区块数量"),sg.Slider(range=(1,10),orientation='h',key='buffernum',resolution=1,default_value=2)],
155
+ [sg.Text("f0预测模式"),sg.Combo(values=self.f0_mode,key='f0_mode',default_value=self.f0_mode[2])],
156
+ [sg.Checkbox(text='启用增强器',default=True,key='use_enhancer')]
157
+ ],title='性能设置'),
158
+ ],
159
+ [sg.Button("开始音频转换",key="start_vc"),sg.Button("停止音频转换",key="stop_vc")]
160
+ ]
161
+
162
+ # create the window
163
+ window = sg.Window('DDSP - GUI by INT16', layout)
164
+ self.event_handler(window=window)
165
+
166
+
167
+ def event_handler(self,window):
168
+ '''Event handling'''
169
+ while True: # event handling loop
170
+ event, values = window.read()
171
+ if event ==sg.WINDOW_CLOSED: # the user closed the window
172
+ self.flag_vc=False
173
+ exit()
174
+ if event=='start_vc' and self.flag_vc==False:
175
+ # set values; they correspond one-to-one with the layout above
176
+ checkpoint_path = values['sg_model']
177
+ self.set_devices(values["sg_input_device"],values['sg_output_device'])
178
+ self.spk_id=int(values['spk_id'])
179
+ threhold = values['noise']
180
+ self.f_pitch_change = values['pitch']
181
+ self.samplerate=int(values['samplerate'])
182
+ block_time = float(values['block'])
183
+ crossfade_time = values['crossfade']
184
+ buffer_num = int(values['buffernum'])
185
+ select_pitch_extractor=values['f0_mode']
186
+ self.use_vocoder_based_enhancer=values['use_enhancer']
187
+ if not values['spk_mix']:
188
+ self.spk_mix_dict=None
189
+ self.block_frame=int(block_time*self.samplerate)
190
+ self.crossfade_frame=int(crossfade_time*self.samplerate)
191
+ self.f_safe_prefix_pad_length=block_time*(buffer_num)-crossfade_time*2
192
+ print('crossfade_time:'+str(crossfade_time))
193
+ print("buffer_num:"+str(buffer_num))
194
+ print("samplerate:"+str(self.samplerate))
195
+ print('block_time:'+str(block_time))
196
+ print("prefix_pad_length:"+str(self.f_safe_prefix_pad_length))
197
+ print("mix_mode:"+str(self.spk_mix_dict))
198
+ print("enhancer:"+str(self.use_vocoder_based_enhancer))
199
+ self.start_vc(checkpoint_path,select_pitch_extractor,threhold,buffer_num)
200
+ if event=='stop_vc'and self.flag_vc==True:
201
+ self.flag_vc = False
202
+ if event=='set_spk_mix' and self.flag_vc==False:
203
+ spk_mix = sg.popup_get_text(message='示例:1:0.3,2:0.5,3:0.2',title="设置混合音色,支持多人")
204
+ if spk_mix != None:
205
+ self.spk_mix_dict=eval("{"+spk_mix.replace(',',',').replace(':',':')+"}")
206
+
207
+
208
+ def start_vc(self,checkpoint_path,select_pitch_extractor,threhold,buffer_num):
209
+ '''Start voice conversion'''
210
+ self.flag_vc = True
211
+ # whether to enhance the output with the pretrained vocoder-based enhancer (higher hardware requirements)
212
+
213
+ enhancer_adaptive_key = 0
214
+ # f0 range limit (Hz)
215
+ limit_f0_min = 50
216
+ limit_f0_max = 1100
217
+ enable_spk_id_cover = True
218
+ # initialize the ndarrays
219
+ self.input_wav=np.zeros(int((1+buffer_num)*self.block_frame),dtype='float32')
220
+ self.output_wav=np.zeros(self.block_frame,dtype='float32')
221
+ self.temp_wav=np.zeros(self.block_frame+self.crossfade_frame,dtype='float32')
222
+ self.crossfade_last=np.zeros(self.crossfade_frame,dtype='float32')
223
+ self.fade_in_window = np.linspace(0, 1,self.crossfade_frame)
224
+ self.fade_out_window = np.linspace(1, 0,self.crossfade_frame)
225
+ self.svc_model = SvcDDSP(checkpoint_path, self.use_vocoder_based_enhancer, enhancer_adaptive_key, select_pitch_extractor,limit_f0_min, limit_f0_max, threhold, self.spk_id, self.spk_mix_dict, enable_spk_id_cover)
226
+ thread_vc=threading.Thread(target=self.soundinput)
227
+ thread_vc.start()
228
+
229
+
230
+ def soundinput(self):
231
+ '''
232
+ Receive audio input
233
+ '''
234
+ with sd.Stream(callback=self.audio_callback, blocksize=self.block_frame,samplerate=self.samplerate,dtype='float32'):
235
+ while self.flag_vc:
236
+ time.sleep(self.block_time)
237
+ print('Audio block passed.')
238
+ print('ENDing VC')
239
+
240
+
241
+ def audio_callback(self,indata,outdata, frames, time, status):
242
+ '''
243
+ Audio processing
244
+ '''
245
+ print("Realtime VCing...")
246
+ self.input_wav[:]=np.roll(self.input_wav,-self.block_frame)
247
+ self.input_wav[-self.block_frame:]=librosa.to_mono(indata.T)
248
+ print('input_wav.shape:'+str(self.input_wav.shape))
249
+ _audio, _model_sr = self.svc_model.infer( self.f_pitch_change, self.spk_id, self.f_safe_prefix_pad_length,self.input_wav,self.samplerate)
250
+ self.temp_wav[:] = librosa.resample(_audio, orig_sr=_model_sr, target_sr=self.samplerate)[-self.block_frame-self.crossfade_frame:]
251
+ #cross-fade output_wav's start with last crossfade
252
+ self.output_wav[:]=self.temp_wav[:self.block_frame]
253
+ self.output_wav[:self.crossfade_frame]*=self.fade_in_window
254
+ self.output_wav[:self.crossfade_frame]+=self.crossfade_last
255
+ self.crossfade_last[:]=self.temp_wav[-self.crossfade_frame:]
256
+ self.crossfade_last[:]*=self.fade_out_window
257
+ print("infered _audio.shape:"+str(_audio.shape))
258
+ outdata[:] = np.array([self.output_wav, self.output_wav]).T
259
+ print('Outputed.')
260
+
261
+
262
+ def get_devices(self,update: bool = True):
263
+ '''Get the lists of audio devices'''
264
+ if update:
265
+ sd._terminate()
266
+ sd._initialize()
267
+ devices = sd.query_devices()
268
+ hostapis = sd.query_hostapis()
269
+ for hostapi in hostapis:
270
+ for device_idx in hostapi["devices"]:
271
+ devices[device_idx]["hostapi_name"] = hostapi["name"]
272
+ input_devices = [
273
+ f"{d['name']} ({d['hostapi_name']})"
274
+ for d in devices
275
+ if d["max_input_channels"] > 0
276
+ ]
277
+ output_devices = [
278
+ f"{d['name']} ({d['hostapi_name']})"
279
+ for d in devices
280
+ if d["max_output_channels"] > 0
281
+ ]
282
+ input_devices_indices = [d["index"] for d in devices if d["max_input_channels"] > 0]
283
+ output_devices_indices = [
284
+ d["index"] for d in devices if d["max_output_channels"] > 0
285
+ ]
286
+ return input_devices, output_devices, input_devices_indices, output_devices_indices
287
+
288
+ def set_devices(self,input_device,output_device):
289
+ '''Set the input and output devices'''
290
+ input_devices,output_devices,input_device_indices, output_device_indices=self.get_devices()
291
+ sd.default.device[0]=input_device_indices[input_devices.index(input_device)]
292
+ sd.default.device[1]=output_device_indices[output_devices.index(output_device)]
293
+ print("input device:"+str(sd.default.device[0])+":"+str(input_device))
294
+ print("output device:"+str(sd.default.device[1])+":"+str(output_device))
295
+
296
+
297
+
298
+ if __name__ == "__main__":
299
+ gui=GUI()
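The callback above stitches consecutive inference blocks by fading in the head of each new block while adding the faded-out tail kept from the previous one. A minimal sketch of that stitching, using toy buffer sizes rather than the GUI defaults:

```python
# Minimal sketch of the block crossfade used in audio_callback above.
# Sizes are illustrative only.
import numpy as np

block_frame = 8
crossfade_frame = 4
fade_in = np.linspace(0, 1, crossfade_frame)
fade_out = np.linspace(1, 0, crossfade_frame)

crossfade_last = np.zeros(crossfade_frame, dtype='float32')
stitched = []

for _ in range(3):
    # temp_wav stands in for the model output covering block + crossfade samples
    temp_wav = np.random.randn(block_frame + crossfade_frame).astype('float32')
    output_wav = temp_wav[:block_frame].copy()
    output_wav[:crossfade_frame] *= fade_in          # fade in the head of the new block
    output_wav[:crossfade_frame] += crossfade_last   # add the faded-out tail of the previous block
    crossfade_last = temp_wav[-crossfade_frame:] * fade_out
    stitched.append(output_wav)

print(np.concatenate(stitched).shape)
```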
logger/__init__.py ADDED
File without changes
logger/saver.py ADDED
@@ -0,0 +1,123 @@
1
+ '''
2
+ author: wayn391@mastertones
3
+ '''
4
+
5
+ import os
6
+ import json
7
+ import time
8
+ import yaml
9
+ import datetime
10
+ import torch
11
+
12
+ from . import utils
13
+ from torch.utils.tensorboard import SummaryWriter
14
+
15
+ class Saver(object):
16
+ def __init__(
17
+ self,
18
+ args,
19
+ initial_global_step=-1):
20
+
21
+ self.expdir = args.env.expdir
22
+ self.sample_rate = args.data.sampling_rate
23
+
24
+ # cold start
25
+ self.global_step = initial_global_step
26
+ self.init_time = time.time()
27
+ self.last_time = time.time()
28
+
29
+ # makedirs
30
+ os.makedirs(self.expdir, exist_ok=True)
31
+
32
+ # path
33
+ self.path_log_info = os.path.join(self.expdir, 'log_info.txt')
34
+
35
+ # ckpt
36
+ os.makedirs(self.expdir, exist_ok=True)
37
+
38
+ # writer
39
+ self.writer = SummaryWriter(os.path.join(self.expdir, 'logs'))
40
+
41
+ # save config
42
+ path_config = os.path.join(self.expdir, 'config.yaml')
43
+ with open(path_config, "w") as out_config:
44
+ yaml.dump(dict(args), out_config)
45
+
46
+
47
+ def log_info(self, msg):
48
+ '''log method'''
49
+ if isinstance(msg, dict):
50
+ msg_list = []
51
+ for k, v in msg.items():
52
+ tmp_str = ''
53
+ if isinstance(v, int):
54
+ tmp_str = '{}: {:,}'.format(k, v)
55
+ else:
56
+ tmp_str = '{}: {}'.format(k, v)
57
+
58
+ msg_list.append(tmp_str)
59
+ msg_str = '\n'.join(msg_list)
60
+ else:
61
+ msg_str = msg
62
+
63
+ # display
64
+ print(msg_str)
65
+
66
+ # save
67
+ with open(self.path_log_info, 'a') as fp:
68
+ fp.write(msg_str+'\n')
69
+
70
+ def log_value(self, dict):
71
+ for k, v in dict.items():
72
+ self.writer.add_scalar(k, v, self.global_step)
73
+
74
+ def log_audio(self, dict):
75
+ for k, v in dict.items():
76
+ self.writer.add_audio(k, v, global_step=self.global_step, sample_rate=self.sample_rate)
77
+
78
+ def get_interval_time(self, update=True):
79
+ cur_time = time.time()
80
+ time_interval = cur_time - self.last_time
81
+ if update:
82
+ self.last_time = cur_time
83
+ return time_interval
84
+
85
+ def get_total_time(self, to_str=True):
86
+ total_time = time.time() - self.init_time
87
+ if to_str:
88
+ total_time = str(datetime.timedelta(
89
+ seconds=total_time))[:-5]
90
+ return total_time
91
+
92
+ def save_model(
93
+ self,
94
+ model,
95
+ optimizer,
96
+ name='model',
97
+ postfix='',
98
+ to_json=False):
99
+ # path
100
+ if postfix:
101
+ postfix = '_' + postfix
102
+ path_pt = os.path.join(
103
+ self.expdir , name+postfix+'.pt')
104
+
105
+ # check
106
+ print(' [*] model checkpoint saved: {}'.format(path_pt))
107
+
108
+ # save
109
+ torch.save({
110
+ 'global_step': self.global_step,
111
+ 'model': model.state_dict(),
112
+ 'optimizer': optimizer.state_dict()}, path_pt)
113
+
114
+ # to json
115
+ if to_json:
116
+ path_json = os.path.join(
117
+ self.expdir , name+'.json')
118
+ utils.to_json(path_pt, path_json)
119
+
120
+ def global_step_increment(self):
121
+ self.global_step += 1
122
+
123
+
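For reference, a minimal sketch of how Saver is typically driven from a training loop; the config fields below (env.expdir and data.sampling_rate) are the ones Saver actually reads, while the concrete values are placeholders:

```python
# A minimal sketch, assuming the repository root is on PYTHONPATH.
from logger.utils import DotDict
from logger.saver import Saver

args = DotDict({'env': {'expdir': 'exp/demo'}, 'data': {'sampling_rate': 44100}})
saver = Saver(args, initial_global_step=0)

saver.log_info({'epoch': 1, 'loss': 0.123})   # printed and appended to log_info.txt
saver.log_value({'train/loss': 0.123})        # written to the tensorboard logs
saver.global_step_increment()
```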
logger/utils.py ADDED
@@ -0,0 +1,121 @@
1
+ import os
2
+ import yaml
3
+ import json
4
+ import pickle
5
+ import torch
6
+
7
+ def traverse_dir(
8
+ root_dir,
9
+ extension,
10
+ amount=None,
11
+ str_include=None,
12
+ str_exclude=None,
13
+ is_pure=False,
14
+ is_sort=False,
15
+ is_ext=True):
16
+
17
+ file_list = []
18
+ cnt = 0
19
+ for root, _, files in os.walk(root_dir):
20
+ for file in files:
21
+ if file.endswith(extension):
22
+ # path
23
+ mix_path = os.path.join(root, file)
24
+ pure_path = mix_path[len(root_dir)+1:] if is_pure else mix_path
25
+
26
+ # amount
27
+ if (amount is not None) and (cnt == amount):
28
+ if is_sort:
29
+ file_list.sort()
30
+ return file_list
31
+
32
+ # check string
33
+ if (str_include is not None) and (str_include not in pure_path):
34
+ continue
35
+ if (str_exclude is not None) and (str_exclude in pure_path):
36
+ continue
37
+
38
+ if not is_ext:
39
+ ext = pure_path.split('.')[-1]
40
+ pure_path = pure_path[:-(len(ext)+1)]
41
+ file_list.append(pure_path)
42
+ cnt += 1
43
+ if is_sort:
44
+ file_list.sort()
45
+ return file_list
46
+
47
+
48
+
49
+ class DotDict(dict):
50
+ def __getattr__(*args):
51
+ val = dict.get(*args)
52
+ return DotDict(val) if type(val) is dict else val
53
+
54
+ __setattr__ = dict.__setitem__
55
+ __delattr__ = dict.__delitem__
56
+
57
+
58
+ def get_network_paras_amount(model_dict):
59
+ info = dict()
60
+ for model_name, model in model_dict.items():
61
+ # all_params = sum(p.numel() for p in model.parameters())
62
+ trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
63
+
64
+ info[model_name] = trainable_params
65
+ return info
66
+
67
+
68
+ def load_config(path_config):
69
+ with open(path_config, "r") as config:
70
+ args = yaml.safe_load(config)
71
+ args = DotDict(args)
72
+ # print(args)
73
+ return args
74
+
75
+
76
+ def to_json(path_params, path_json):
77
+ params = torch.load(path_params, map_location=torch.device('cpu'))
78
+ raw_state_dict = {}
79
+ for k, v in params.items():
80
+ val = v.flatten().numpy().tolist()
81
+ raw_state_dict[k] = val
82
+
83
+ with open(path_json, 'w') as outfile:
84
+ json.dump(raw_state_dict, outfile,indent= "\t")
85
+
86
+
87
+ def convert_tensor_to_numpy(tensor, is_squeeze=True):
88
+ if is_squeeze:
89
+ tensor = tensor.squeeze()
90
+ if tensor.requires_grad:
91
+ tensor = tensor.detach()
92
+ if tensor.is_cuda:
93
+ tensor = tensor.cpu()
94
+ return tensor.numpy()
95
+
96
+
97
+ def load_model(
98
+ expdir,
99
+ model,
100
+ optimizer,
101
+ name='model',
102
+ postfix='',
103
+ device='cpu'):
104
+ if postfix == '':
105
+ postfix = '_' + postfix
106
+ path = os.path.join(expdir, name+postfix)
107
+ path_pt = traverse_dir(expdir, '.pt', is_ext=False)
108
+ global_step = 0
109
+ if len(path_pt) > 0:
110
+ steps = [s[len(path):] for s in path_pt]
111
+ maxstep = max([int(s) if s.isdigit() else 0 for s in steps])
112
+ if maxstep > 0:
113
+ path_pt = path+str(maxstep)+'.pt'
114
+ else:
115
+ path_pt = path+'best.pt'
116
+ print(' [*] restoring model from', path_pt)
117
+ ckpt = torch.load(path_pt, map_location=torch.device(device))
118
+ global_step = ckpt['global_step']
119
+ model.load_state_dict(ckpt['model'])
120
+ optimizer.load_state_dict(ckpt['optimizer'])
121
+ return global_step, model, optimizer
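A short sketch of the two helpers used throughout the project, load_config and traverse_dir; the paths are examples, not values fixed by this module:

```python
# A minimal sketch, assuming the repository root is on PYTHONPATH.
from logger.utils import load_config, traverse_dir

args = load_config('configs/example.yaml')     # example path, not fixed by this file
print(args.data.sampling_rate)                 # DotDict allows attribute-style access

wav_files = traverse_dir('data/train/audio', extension='wav', is_pure=True, is_sort=True)
print(len(wav_files))
```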
main.py ADDED
@@ -0,0 +1,245 @@
1
+ import os
2
+ import torch
3
+ import librosa
4
+ import argparse
5
+ import numpy as np
6
+ import soundfile as sf
7
+ import pyworld as pw
8
+ import parselmouth
9
+ from ast import literal_eval
10
+ from slicer import Slicer
11
+ from ddsp.vocoder import load_model, F0_Extractor, Volume_Extractor, Units_Encoder
12
+ from ddsp.core import upsample
13
+ from enhancer import Enhancer
14
+ from tqdm import tqdm
15
+
16
+ def parse_args(args=None, namespace=None):
17
+ """Parse command-line arguments."""
18
+ parser = argparse.ArgumentParser()
19
+ parser.add_argument(
20
+ "-m",
21
+ "--model_path",
22
+ type=str,
23
+ required=True,
24
+ help="path to the model file",
25
+ )
26
+ parser.add_argument(
27
+ "-i",
28
+ "--input",
29
+ type=str,
30
+ required=True,
31
+ help="path to the input audio file",
32
+ )
33
+ parser.add_argument(
34
+ "-o",
35
+ "--output",
36
+ type=str,
37
+ required=True,
38
+ help="path to the output audio file",
39
+ )
40
+ parser.add_argument(
41
+ "-id",
42
+ "--spk_id",
43
+ type=str,
44
+ required=False,
45
+ default=1,
46
+ help="speaker id (for multi-speaker model) | default: 1",
47
+ )
48
+ parser.add_argument(
49
+ "-mix",
50
+ "--spk_mix_dict",
51
+ type=str,
52
+ required=False,
53
+ default="None",
54
+ help="mix-speaker dictionary (for multi-speaker model) | default: None",
55
+ )
56
+ parser.add_argument(
57
+ "-k",
58
+ "--key",
59
+ type=str,
60
+ required=False,
61
+ default=0,
62
+ help="key changed (number of semitones) | default: 0",
63
+ )
64
+ parser.add_argument(
65
+ "-e",
66
+ "--enhance",
67
+ type=str,
68
+ required=False,
69
+ default='true',
70
+ help="true or false | default: true",
71
+ )
72
+ parser.add_argument(
73
+ "-pe",
74
+ "--pitch_extractor",
75
+ type=str,
76
+ required=False,
77
+ default='crepe',
78
+ help="pitch extrator type: parselmouth, dio, harvest, crepe (default)",
79
+ )
80
+ parser.add_argument(
81
+ "-fmin",
82
+ "--f0_min",
83
+ type=str,
84
+ required=False,
85
+ default=50,
86
+ help="min f0 (Hz) | default: 50",
87
+ )
88
+ parser.add_argument(
89
+ "-fmax",
90
+ "--f0_max",
91
+ type=str,
92
+ required=False,
93
+ default=1100,
94
+ help="max f0 (Hz) | default: 1100",
95
+ )
96
+ parser.add_argument(
97
+ "-th",
98
+ "--threhold",
99
+ type=str,
100
+ required=False,
101
+ default=-60,
102
+ help="response threhold (dB) | default: -60",
103
+ )
104
+ parser.add_argument(
105
+ "-eak",
106
+ "--enhancer_adaptive_key",
107
+ type=str,
108
+ required=False,
109
+ default=0,
110
+ help="adapt the enhancer to a higher vocal range (number of semitones) | default: 0",
111
+ )
112
+ return parser.parse_args(args=args, namespace=namespace)
113
+
114
+
115
+ def split(audio, sample_rate, hop_size, db_thresh = -40, min_len = 5000):
116
+ slicer = Slicer(
117
+ sr=sample_rate,
118
+ threshold=db_thresh,
119
+ min_length=min_len)
120
+ chunks = dict(slicer.slice(audio))
121
+ result = []
122
+ for k, v in chunks.items():
123
+ tag = v["split_time"].split(",")
124
+ if tag[0] != tag[1]:
125
+ start_frame = int(int(tag[0]) // hop_size)
126
+ end_frame = int(int(tag[1]) // hop_size)
127
+ if end_frame > start_frame:
128
+ result.append((
129
+ start_frame,
130
+ audio[int(start_frame * hop_size) : int(end_frame * hop_size)]))
131
+ return result
132
+
133
+
134
+ def cross_fade(a: np.ndarray, b: np.ndarray, idx: int):
135
+ result = np.zeros(idx + b.shape[0])
136
+ fade_len = a.shape[0] - idx
137
+ np.copyto(dst=result[:idx], src=a[:idx])
138
+ k = np.linspace(0, 1.0, num=fade_len, endpoint=True)
139
+ result[idx: a.shape[0]] = (1 - k) * a[idx:] + k * b[: fade_len]
140
+ np.copyto(dst=result[a.shape[0]:], src=b[fade_len:])
141
+ return result
142
+
143
+
144
+ if __name__ == '__main__':
145
+ #device = 'cpu'
146
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
147
+
148
+ # parse commands
149
+ cmd = parse_args()
150
+
151
+ # load ddsp model
152
+ model, args = load_model(cmd.model_path, device=device)
153
+
154
+ # load input
155
+ audio, sample_rate = librosa.load(cmd.input, sr=None)
156
+ if len(audio.shape) > 1:
157
+ audio = librosa.to_mono(audio)
158
+ hop_size = args.data.block_size * sample_rate / args.data.sampling_rate
159
+
160
+ # extract f0
161
+ print('Pitch extractor type: ' + cmd.pitch_extractor)
162
+ pitch_extractor = F0_Extractor(
163
+ cmd.pitch_extractor,
164
+ sample_rate,
165
+ hop_size,
166
+ float(cmd.f0_min),
167
+ float(cmd.f0_max))
168
+ print('Extracting the pitch curve of the input audio...')
169
+ f0 = pitch_extractor.extract(audio, uv_interp = True, device = device)
170
+ f0 = torch.from_numpy(f0).float().to(device).unsqueeze(-1).unsqueeze(0)
171
+
172
+ # key change
173
+ f0 = f0 * 2 ** (float(cmd.key) / 12)
174
+
175
+ # extract volume
176
+ print('Extracting the volume envelope of the input audio...')
177
+ volume_extractor = Volume_Extractor(hop_size)
178
+ volume = volume_extractor.extract(audio)
179
+ mask = (volume > 10 ** (float(cmd.threhold) / 20)).astype('float')
180
+ mask = np.pad(mask, (4, 4), constant_values=(mask[0], mask[-1]))
181
+ mask = np.array([np.max(mask[n : n + 9]) for n in range(len(mask) - 8)])
182
+ mask = torch.from_numpy(mask).float().to(device).unsqueeze(-1).unsqueeze(0)
183
+ mask = upsample(mask, args.data.block_size).squeeze(-1)
184
+ volume = torch.from_numpy(volume).float().to(device).unsqueeze(-1).unsqueeze(0)
185
+
186
+ # load units encoder
187
+ units_encoder = Units_Encoder(
188
+ args.data.encoder,
189
+ args.data.encoder_ckpt,
190
+ args.data.encoder_sample_rate,
191
+ args.data.encoder_hop_size,
192
+ device = device)
193
+
194
+ # load enhancer
195
+ if cmd.enhance == 'true':
196
+ print('Enhancer type: ' + args.enhancer.type)
197
+ enhancer = Enhancer(args.enhancer.type, args.enhancer.ckpt, device=device)
198
+ else:
199
+ print('Enhancer type: none (using raw output of ddsp)')
200
+
201
+ # speaker id or mix-speaker dictionary
202
+ spk_mix_dict = literal_eval(cmd.spk_mix_dict)
203
+ if spk_mix_dict is not None:
204
+ print('Mix-speaker mode')
205
+ else:
206
+ print('Speaker ID: '+ str(int(cmd.spk_id)))
207
+ spk_id = torch.LongTensor(np.array([[int(cmd.spk_id)]])).to(device)
208
+ # forward and save the output
209
+ result = np.zeros(0)
210
+ current_length = 0
211
+ segments = split(audio, sample_rate, hop_size)
212
+ print('Cut the input audio into ' + str(len(segments)) + ' slices')
213
+ with torch.no_grad():
214
+ for segment in tqdm(segments):
215
+ start_frame = segment[0]
216
+ seg_input = torch.from_numpy(segment[1]).float().unsqueeze(0).to(device)
217
+ seg_units = units_encoder.encode(seg_input, sample_rate, hop_size)
218
+
219
+ seg_f0 = f0[:, start_frame : start_frame + seg_units.size(1), :]
220
+ seg_volume = volume[:, start_frame : start_frame + seg_units.size(1), :]
221
+
222
+ seg_output, _, (s_h, s_n) = model(seg_units, seg_f0, seg_volume, spk_id = spk_id, spk_mix_dict = spk_mix_dict)
223
+ seg_output *= mask[:, start_frame * args.data.block_size : (start_frame + seg_units.size(1)) * args.data.block_size]
224
+
225
+ if cmd.enhance == 'true':
226
+ seg_output, output_sample_rate = enhancer.enhance(
227
+ seg_output,
228
+ args.data.sampling_rate,
229
+ seg_f0,
230
+ args.data.block_size,
231
+ adaptive_key = float(cmd.enhancer_adaptive_key))
232
+ else:
233
+ output_sample_rate = args.data.sampling_rate
234
+
235
+ seg_output = seg_output.squeeze().cpu().numpy()
236
+
237
+ silent_length = round(start_frame * args.data.block_size * output_sample_rate / args.data.sampling_rate) - current_length
238
+ if silent_length >= 0:
239
+ result = np.append(result, np.zeros(silent_length))
240
+ result = np.append(result, seg_output)
241
+ else:
242
+ result = cross_fade(result, seg_output, current_length + silent_length)
243
+ current_length = current_length + silent_length + len(seg_output)
244
+ sf.write(cmd.output, result, output_sample_rate)
245
+
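The cross_fade helper above joins two overlapping segments with a linear ramp over the overlap region. A tiny numpy check of its behaviour, assuming the repository root is on PYTHONPATH so that main is importable:

```python
# A minimal sketch; the arrays are toy data, not real audio.
import numpy as np
from main import cross_fade

a = np.ones(10)
b = np.zeros(8)
out = cross_fade(a, b, idx=6)   # b starts at sample 6, so samples 6..9 are blended
print(out.shape)                # (14,)
print(out[6:10])                # ramps from 1 towards 0 across the 4-sample overlap
```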
nsf_hifigan/env.py ADDED
@@ -0,0 +1,15 @@
1
+ import os
2
+ import shutil
3
+
4
+
5
+ class AttrDict(dict):
6
+ def __init__(self, *args, **kwargs):
7
+ super(AttrDict, self).__init__(*args, **kwargs)
8
+ self.__dict__ = self
9
+
10
+
11
+ def build_env(config, config_name, path):
12
+ t_path = os.path.join(path, config_name)
13
+ if config != t_path:
14
+ os.makedirs(path, exist_ok=True)
15
+ shutil.copyfile(config, os.path.join(path, config_name))
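AttrDict simply exposes dictionary keys as attributes, which is how the vocoder's config.json is consumed elsewhere in this package. A two-line sketch with placeholder values:

```python
from nsf_hifigan.env import AttrDict

h = AttrDict({'sampling_rate': 44100, 'num_mels': 128})
print(h.sampling_rate, h['num_mels'])   # both access styles work
```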
nsf_hifigan/models.py ADDED
@@ -0,0 +1,435 @@
1
+ import os
2
+ import json
3
+ from .env import AttrDict
4
+ import numpy as np
5
+ import torch
6
+ import torch.nn.functional as F
7
+ import torch.nn as nn
8
+ from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
9
+ from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
10
+ from .utils import init_weights, get_padding
11
+
12
+ LRELU_SLOPE = 0.1
13
+
14
+
15
+ def load_model(model_path, device='cuda'):
16
+ config_file = os.path.join(os.path.split(model_path)[0], 'config.json')
17
+ with open(config_file) as f:
18
+ data = f.read()
19
+
20
+ json_config = json.loads(data)
21
+ h = AttrDict(json_config)
22
+
23
+ generator = Generator(h).to(device)
24
+
25
+ cp_dict = torch.load(model_path, map_location=device)
26
+ generator.load_state_dict(cp_dict['generator'])
27
+ generator.eval()
28
+ generator.remove_weight_norm()
29
+ del cp_dict
30
+ return generator, h
31
+
32
+
33
+ class ResBlock1(torch.nn.Module):
34
+ def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5)):
35
+ super(ResBlock1, self).__init__()
36
+ self.h = h
37
+ self.convs1 = nn.ModuleList([
38
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
39
+ padding=get_padding(kernel_size, dilation[0]))),
40
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
41
+ padding=get_padding(kernel_size, dilation[1]))),
42
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
43
+ padding=get_padding(kernel_size, dilation[2])))
44
+ ])
45
+ self.convs1.apply(init_weights)
46
+
47
+ self.convs2 = nn.ModuleList([
48
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
49
+ padding=get_padding(kernel_size, 1))),
50
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
51
+ padding=get_padding(kernel_size, 1))),
52
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
53
+ padding=get_padding(kernel_size, 1)))
54
+ ])
55
+ self.convs2.apply(init_weights)
56
+
57
+ def forward(self, x):
58
+ for c1, c2 in zip(self.convs1, self.convs2):
59
+ xt = F.leaky_relu(x, LRELU_SLOPE)
60
+ xt = c1(xt)
61
+ xt = F.leaky_relu(xt, LRELU_SLOPE)
62
+ xt = c2(xt)
63
+ x = xt + x
64
+ return x
65
+
66
+ def remove_weight_norm(self):
67
+ for l in self.convs1:
68
+ remove_weight_norm(l)
69
+ for l in self.convs2:
70
+ remove_weight_norm(l)
71
+
72
+
73
+ class ResBlock2(torch.nn.Module):
74
+ def __init__(self, h, channels, kernel_size=3, dilation=(1, 3)):
75
+ super(ResBlock2, self).__init__()
76
+ self.h = h
77
+ self.convs = nn.ModuleList([
78
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
79
+ padding=get_padding(kernel_size, dilation[0]))),
80
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
81
+ padding=get_padding(kernel_size, dilation[1])))
82
+ ])
83
+ self.convs.apply(init_weights)
84
+
85
+ def forward(self, x):
86
+ for c in self.convs:
87
+ xt = F.leaky_relu(x, LRELU_SLOPE)
88
+ xt = c(xt)
89
+ x = xt + x
90
+ return x
91
+
92
+ def remove_weight_norm(self):
93
+ for l in self.convs:
94
+ remove_weight_norm(l)
95
+
96
+
97
+ class SineGen(torch.nn.Module):
98
+ """ Definition of sine generator
99
+ SineGen(samp_rate, harmonic_num = 0,
100
+ sine_amp = 0.1, noise_std = 0.003,
101
+ voiced_threshold = 0,
102
+ flag_for_pulse=False)
103
+ samp_rate: sampling rate in Hz
104
+ harmonic_num: number of harmonic overtones (default 0)
105
+ sine_amp: amplitude of sine-wavefrom (default 0.1)
106
+ noise_std: std of Gaussian noise (default 0.003)
107
+ voiced_thoreshold: F0 threshold for U/V classification (default 0)
108
+ flag_for_pulse: this SinGen is used inside PulseGen (default False)
109
+ Note: when flag_for_pulse is True, the first time step of a voiced
110
+ segment is always sin(np.pi) or cos(0)
111
+ """
112
+
113
+ def __init__(self, samp_rate, harmonic_num=0,
114
+ sine_amp=0.1, noise_std=0.003,
115
+ voiced_threshold=0):
116
+ super(SineGen, self).__init__()
117
+ self.sine_amp = sine_amp
118
+ self.noise_std = noise_std
119
+ self.harmonic_num = harmonic_num
120
+ self.dim = self.harmonic_num + 1
121
+ self.sampling_rate = samp_rate
122
+ self.voiced_threshold = voiced_threshold
123
+
124
+ def _f02uv(self, f0):
125
+ # generate uv signal
126
+ uv = torch.ones_like(f0)
127
+ uv = uv * (f0 > self.voiced_threshold)
128
+ return uv
129
+
130
+ @torch.no_grad()
131
+ def forward(self, f0, upp):
132
+ """ sine_tensor, uv = forward(f0)
133
+ input F0: tensor(batchsize=1, length, dim=1)
134
+ f0 for unvoiced steps should be 0
135
+ output sine_tensor: tensor(batchsize=1, length, dim)
136
+ output uv: tensor(batchsize=1, length, 1)
137
+ """
138
+ f0 = f0.unsqueeze(-1)
139
+ fn = torch.multiply(f0, torch.arange(1, self.dim + 1, device=f0.device).reshape((1, 1, -1)))
140
+ rad_values = (fn / self.sampling_rate) % 1 ### the % 1 here means the n_har product cannot be optimized away in post-processing
141
+ rand_ini = torch.rand(fn.shape[0], fn.shape[2], device=fn.device)
142
+ rand_ini[:, 0] = 0
143
+ rad_values[:, 0, :] = rad_values[:, 0, :] + rand_ini
144
+ is_half = rad_values.dtype is not torch.float32
145
+ tmp_over_one = torch.cumsum(rad_values.double(), 1) # % 1 ##### taking % 1 here would prevent the following cumsum from being optimized
146
+ if is_half:
147
+ tmp_over_one = tmp_over_one.half()
148
+ else:
149
+ tmp_over_one = tmp_over_one.float()
150
+ tmp_over_one *= upp
151
+ tmp_over_one = F.interpolate(
152
+ tmp_over_one.transpose(2, 1), scale_factor=upp,
153
+ mode='linear', align_corners=True
154
+ ).transpose(2, 1)
155
+ rad_values = F.interpolate(rad_values.transpose(2, 1), scale_factor=upp, mode='nearest').transpose(2, 1)
156
+ tmp_over_one %= 1
157
+ tmp_over_one_idx = (tmp_over_one[:, 1:, :] - tmp_over_one[:, :-1, :]) < 0
158
+ cumsum_shift = torch.zeros_like(rad_values)
159
+ cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0
160
+ rad_values = rad_values.double()
161
+ cumsum_shift = cumsum_shift.double()
162
+ sine_waves = torch.sin(torch.cumsum(rad_values + cumsum_shift, dim=1) * 2 * np.pi)
163
+ if is_half:
164
+ sine_waves = sine_waves.half()
165
+ else:
166
+ sine_waves = sine_waves.float()
167
+ sine_waves = sine_waves * self.sine_amp
168
+ uv = self._f02uv(f0)
169
+ uv = F.interpolate(uv.transpose(2, 1), scale_factor=upp, mode='nearest').transpose(2, 1)
170
+ noise_amp = uv * self.noise_std + (1 - uv) * self.sine_amp / 3
171
+ noise = noise_amp * torch.randn_like(sine_waves)
172
+ sine_waves = sine_waves * uv + noise
173
+ return sine_waves, uv, noise
174
+
175
+
176
+ class SourceModuleHnNSF(torch.nn.Module):
177
+ """ SourceModule for hn-nsf
178
+ SourceModule(sampling_rate, harmonic_num=0, sine_amp=0.1,
179
+ add_noise_std=0.003, voiced_threshod=0)
180
+ sampling_rate: sampling_rate in Hz
181
+ harmonic_num: number of harmonic above F0 (default: 0)
182
+ sine_amp: amplitude of sine source signal (default: 0.1)
183
+ add_noise_std: std of additive Gaussian noise (default: 0.003)
184
+ note that amplitude of noise in unvoiced is decided
185
+ by sine_amp
186
+ voiced_threshold: threhold to set U/V given F0 (default: 0)
187
+ Sine_source, noise_source = SourceModuleHnNSF(F0_sampled)
188
+ F0_sampled (batchsize, length, 1)
189
+ Sine_source (batchsize, length, 1)
190
+ noise_source (batchsize, length 1)
191
+ uv (batchsize, length, 1)
192
+ """
193
+
194
+ def __init__(self, sampling_rate, harmonic_num=0, sine_amp=0.1,
195
+ add_noise_std=0.003, voiced_threshod=0):
196
+ super(SourceModuleHnNSF, self).__init__()
197
+
198
+ self.sine_amp = sine_amp
199
+ self.noise_std = add_noise_std
200
+
201
+ # to produce sine waveforms
202
+ self.l_sin_gen = SineGen(sampling_rate, harmonic_num,
203
+ sine_amp, add_noise_std, voiced_threshod)
204
+
205
+ # to merge source harmonics into a single excitation
206
+ self.l_linear = torch.nn.Linear(harmonic_num + 1, 1)
207
+ self.l_tanh = torch.nn.Tanh()
208
+
209
+ def forward(self, x, upp):
210
+ sine_wavs, uv, _ = self.l_sin_gen(x, upp)
211
+ sine_merge = self.l_tanh(self.l_linear(sine_wavs))
212
+ return sine_merge
213
+
214
+
215
+ class Generator(torch.nn.Module):
216
+ def __init__(self, h):
217
+ super(Generator, self).__init__()
218
+ self.h = h
219
+ self.num_kernels = len(h.resblock_kernel_sizes)
220
+ self.num_upsamples = len(h.upsample_rates)
221
+ self.m_source = SourceModuleHnNSF(
222
+ sampling_rate=h.sampling_rate,
223
+ harmonic_num=8
224
+ )
225
+ self.noise_convs = nn.ModuleList()
226
+ self.conv_pre = weight_norm(Conv1d(h.num_mels, h.upsample_initial_channel, 7, 1, padding=3))
227
+ resblock = ResBlock1 if h.resblock == '1' else ResBlock2
228
+
229
+ self.ups = nn.ModuleList()
230
+ for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
231
+ c_cur = h.upsample_initial_channel // (2 ** (i + 1))
232
+ self.ups.append(weight_norm(
233
+ ConvTranspose1d(h.upsample_initial_channel // (2 ** i), h.upsample_initial_channel // (2 ** (i + 1)),
234
+ k, u, padding=(k - u) // 2)))
235
+ if i + 1 < len(h.upsample_rates): #
236
+ stride_f0 = int(np.prod(h.upsample_rates[i + 1:]))
237
+ self.noise_convs.append(Conv1d(
238
+ 1, c_cur, kernel_size=stride_f0 * 2, stride=stride_f0, padding=stride_f0 // 2))
239
+ else:
240
+ self.noise_convs.append(Conv1d(1, c_cur, kernel_size=1))
241
+ self.resblocks = nn.ModuleList()
242
+ ch = h.upsample_initial_channel
243
+ for i in range(len(self.ups)):
244
+ ch //= 2
245
+ for j, (k, d) in enumerate(zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)):
246
+ self.resblocks.append(resblock(h, ch, k, d))
247
+
248
+ self.conv_post = weight_norm(Conv1d(ch, 1, 7, 1, padding=3))
249
+ self.ups.apply(init_weights)
250
+ self.conv_post.apply(init_weights)
251
+ self.upp = int(np.prod(h.upsample_rates))
252
+
253
+ def forward(self, x, f0):
254
+ har_source = self.m_source(f0, self.upp).transpose(1, 2)
255
+ x = self.conv_pre(x)
256
+ for i in range(self.num_upsamples):
257
+ x = F.leaky_relu(x, LRELU_SLOPE)
258
+ x = self.ups[i](x)
259
+ x_source = self.noise_convs[i](har_source)
260
+ x = x + x_source
261
+ xs = None
262
+ for j in range(self.num_kernels):
263
+ if xs is None:
264
+ xs = self.resblocks[i * self.num_kernels + j](x)
265
+ else:
266
+ xs += self.resblocks[i * self.num_kernels + j](x)
267
+ x = xs / self.num_kernels
268
+ x = F.leaky_relu(x)
269
+ x = self.conv_post(x)
270
+ x = torch.tanh(x)
271
+
272
+ return x
273
+
274
+ def remove_weight_norm(self):
275
+ print('Removing weight norm...')
276
+ for l in self.ups:
277
+ remove_weight_norm(l)
278
+ for l in self.resblocks:
279
+ l.remove_weight_norm()
280
+ remove_weight_norm(self.conv_pre)
281
+ remove_weight_norm(self.conv_post)
282
+
283
+
284
+ class DiscriminatorP(torch.nn.Module):
285
+ def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
286
+ super(DiscriminatorP, self).__init__()
287
+ self.period = period
288
+ norm_f = weight_norm if use_spectral_norm == False else spectral_norm
289
+ self.convs = nn.ModuleList([
290
+ norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
291
+ norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
292
+ norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
293
+ norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
294
+ norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(2, 0))),
295
+ ])
296
+ self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
297
+
298
+ def forward(self, x):
299
+ fmap = []
300
+
301
+ # 1d to 2d
302
+ b, c, t = x.shape
303
+ if t % self.period != 0: # pad first
304
+ n_pad = self.period - (t % self.period)
305
+ x = F.pad(x, (0, n_pad), "reflect")
306
+ t = t + n_pad
307
+ x = x.view(b, c, t // self.period, self.period)
308
+
309
+ for l in self.convs:
310
+ x = l(x)
311
+ x = F.leaky_relu(x, LRELU_SLOPE)
312
+ fmap.append(x)
313
+ x = self.conv_post(x)
314
+ fmap.append(x)
315
+ x = torch.flatten(x, 1, -1)
316
+
317
+ return x, fmap
318
+
319
+
320
+ class MultiPeriodDiscriminator(torch.nn.Module):
321
+ def __init__(self, periods=None):
322
+ super(MultiPeriodDiscriminator, self).__init__()
323
+ self.periods = periods if periods is not None else [2, 3, 5, 7, 11]
324
+ self.discriminators = nn.ModuleList()
325
+ for period in self.periods:
326
+ self.discriminators.append(DiscriminatorP(period))
327
+
328
+ def forward(self, y, y_hat):
329
+ y_d_rs = []
330
+ y_d_gs = []
331
+ fmap_rs = []
332
+ fmap_gs = []
333
+ for i, d in enumerate(self.discriminators):
334
+ y_d_r, fmap_r = d(y)
335
+ y_d_g, fmap_g = d(y_hat)
336
+ y_d_rs.append(y_d_r)
337
+ fmap_rs.append(fmap_r)
338
+ y_d_gs.append(y_d_g)
339
+ fmap_gs.append(fmap_g)
340
+
341
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
342
+
343
+
344
+ class DiscriminatorS(torch.nn.Module):
345
+ def __init__(self, use_spectral_norm=False):
346
+ super(DiscriminatorS, self).__init__()
347
+ norm_f = weight_norm if use_spectral_norm == False else spectral_norm
348
+ self.convs = nn.ModuleList([
349
+ norm_f(Conv1d(1, 128, 15, 1, padding=7)),
350
+ norm_f(Conv1d(128, 128, 41, 2, groups=4, padding=20)),
351
+ norm_f(Conv1d(128, 256, 41, 2, groups=16, padding=20)),
352
+ norm_f(Conv1d(256, 512, 41, 4, groups=16, padding=20)),
353
+ norm_f(Conv1d(512, 1024, 41, 4, groups=16, padding=20)),
354
+ norm_f(Conv1d(1024, 1024, 41, 1, groups=16, padding=20)),
355
+ norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
356
+ ])
357
+ self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
358
+
359
+ def forward(self, x):
360
+ fmap = []
361
+ for l in self.convs:
362
+ x = l(x)
363
+ x = F.leaky_relu(x, LRELU_SLOPE)
364
+ fmap.append(x)
365
+ x = self.conv_post(x)
366
+ fmap.append(x)
367
+ x = torch.flatten(x, 1, -1)
368
+
369
+ return x, fmap
370
+
371
+
372
+ class MultiScaleDiscriminator(torch.nn.Module):
373
+ def __init__(self):
374
+ super(MultiScaleDiscriminator, self).__init__()
375
+ self.discriminators = nn.ModuleList([
376
+ DiscriminatorS(use_spectral_norm=True),
377
+ DiscriminatorS(),
378
+ DiscriminatorS(),
379
+ ])
380
+ self.meanpools = nn.ModuleList([
381
+ AvgPool1d(4, 2, padding=2),
382
+ AvgPool1d(4, 2, padding=2)
383
+ ])
384
+
385
+ def forward(self, y, y_hat):
386
+ y_d_rs = []
387
+ y_d_gs = []
388
+ fmap_rs = []
389
+ fmap_gs = []
390
+ for i, d in enumerate(self.discriminators):
391
+ if i != 0:
392
+ y = self.meanpools[i - 1](y)
393
+ y_hat = self.meanpools[i - 1](y_hat)
394
+ y_d_r, fmap_r = d(y)
395
+ y_d_g, fmap_g = d(y_hat)
396
+ y_d_rs.append(y_d_r)
397
+ fmap_rs.append(fmap_r)
398
+ y_d_gs.append(y_d_g)
399
+ fmap_gs.append(fmap_g)
400
+
401
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
402
+
403
+
404
+ def feature_loss(fmap_r, fmap_g):
405
+ loss = 0
406
+ for dr, dg in zip(fmap_r, fmap_g):
407
+ for rl, gl in zip(dr, dg):
408
+ loss += torch.mean(torch.abs(rl - gl))
409
+
410
+ return loss * 2
411
+
412
+
413
+ def discriminator_loss(disc_real_outputs, disc_generated_outputs):
414
+ loss = 0
415
+ r_losses = []
416
+ g_losses = []
417
+ for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
418
+ r_loss = torch.mean((1 - dr) ** 2)
419
+ g_loss = torch.mean(dg ** 2)
420
+ loss += (r_loss + g_loss)
421
+ r_losses.append(r_loss.item())
422
+ g_losses.append(g_loss.item())
423
+
424
+ return loss, r_losses, g_losses
425
+
426
+
427
+ def generator_loss(disc_outputs):
428
+ loss = 0
429
+ gen_losses = []
430
+ for dg in disc_outputs:
431
+ l = torch.mean((1 - dg) ** 2)
432
+ gen_losses.append(l)
433
+ loss += l
434
+
435
+ return loss, gen_losses
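A hedged sketch of loading and running the generator defined above; the checkpoint path is an assumption (load_model expects a config.json next to the .pt file), and the tensor shapes follow Generator.forward(mel, f0):

```python
# A minimal sketch, assuming a pretrained NSF-HiFiGAN checkpoint at this example path.
import torch
from nsf_hifigan.models import load_model

generator, h = load_model('pretrain/nsf_hifigan/model', device='cpu')

frames = 100
mel = torch.randn(1, h.num_mels, frames)   # (batch, num_mels, frames)
f0 = 220.0 * torch.ones(1, frames)         # frame-level f0 in Hz
with torch.no_grad():
    wav = generator(mel, f0)               # (batch, 1, frames * hop)
print(wav.shape)
```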
nsf_hifigan/nvSTFT.py ADDED
@@ -0,0 +1,134 @@
1
+ import math
2
+ import os
3
+ os.environ["LRU_CACHE_CAPACITY"] = "3"
4
+ import random
5
+ import torch
6
+ import torch.utils.data
7
+ import numpy as np
8
+ import librosa
9
+ from librosa.util import normalize
10
+ from librosa.filters import mel as librosa_mel_fn
11
+ from scipy.io.wavfile import read
12
+ import soundfile as sf
13
+ import torch.nn.functional as F
14
+
15
+ def load_wav_to_torch(full_path, target_sr=None, return_empty_on_exception=False):
16
+ sampling_rate = None
17
+ try:
18
+ data, sampling_rate = sf.read(full_path, always_2d=True) # load with soundfile
19
+ except Exception as ex:
20
+ print(f"'{full_path}' failed to load.\nException:")
21
+ print(ex)
22
+ if return_empty_on_exception:
23
+ return [], sampling_rate or target_sr or 48000
24
+ else:
25
+ raise Exception(ex)
26
+
27
+ if len(data.shape) > 1:
28
+ data = data[:, 0]
29
+ assert len(data) > 2# check duration of audio file is > 2 samples (because otherwise the slice operation was on the wrong dimension)
30
+
31
+ if np.issubdtype(data.dtype, np.integer): # if audio data is type int
32
+ max_mag = -np.iinfo(data.dtype).min # maximum magnitude = min possible value of intXX
33
+ else: # if audio data is type fp32
34
+ max_mag = max(np.amax(data), -np.amin(data))
35
+ max_mag = (2**31)+1 if max_mag > (2**15) else ((2**15)+1 if max_mag > 1.01 else 1.0) # data should be either 16-bit INT, 32-bit INT or [-1 to 1] float32
36
+
37
+ data = torch.FloatTensor(data.astype(np.float32))/max_mag
38
+
39
+ if (torch.isinf(data) | torch.isnan(data)).any() and return_empty_on_exception:# resample will crash with inf/NaN inputs. return_empty_on_exception will return empty arr instead of except
40
+ return [], sampling_rate or target_sr or 48000
41
+ if target_sr is not None and sampling_rate != target_sr:
42
+ data = torch.from_numpy(librosa.core.resample(data.numpy(), orig_sr=sampling_rate, target_sr=target_sr))
43
+ sampling_rate = target_sr
44
+
45
+ return data, sampling_rate
46
+
47
+ def dynamic_range_compression(x, C=1, clip_val=1e-5):
48
+ return np.log(np.clip(x, a_min=clip_val, a_max=None) * C)
49
+
50
+ def dynamic_range_decompression(x, C=1):
51
+ return np.exp(x) / C
52
+
53
+ def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
54
+ return torch.log(torch.clamp(x, min=clip_val) * C)
55
+
56
+ def dynamic_range_decompression_torch(x, C=1):
57
+ return torch.exp(x) / C
58
+
59
+ class STFT():
60
+ def __init__(self, sr=22050, n_mels=80, n_fft=1024, win_size=1024, hop_length=256, fmin=20, fmax=11025, clip_val=1e-5):
61
+ self.target_sr = sr
62
+
63
+ self.n_mels = n_mels
64
+ self.n_fft = n_fft
65
+ self.win_size = win_size
66
+ self.hop_length = hop_length
67
+ self.fmin = fmin
68
+ self.fmax = fmax
69
+ self.clip_val = clip_val
70
+ self.mel_basis = {}
71
+ self.hann_window = {}
72
+
73
+ def get_mel(self, y, keyshift=0, speed=1, center=False):
74
+ sampling_rate = self.target_sr
75
+ n_mels = self.n_mels
76
+ n_fft = self.n_fft
77
+ win_size = self.win_size
78
+ hop_length = self.hop_length
79
+ fmin = self.fmin
80
+ fmax = self.fmax
81
+ clip_val = self.clip_val
82
+
83
+ factor = 2 ** (keyshift / 12)
84
+ n_fft_new = int(np.round(n_fft * factor))
85
+ win_size_new = int(np.round(win_size * factor))
86
+ hop_length_new = int(np.round(hop_length * speed))
87
+
88
+ if torch.min(y) < -1.:
89
+ print('min value is ', torch.min(y))
90
+ if torch.max(y) > 1.:
91
+ print('max value is ', torch.max(y))
92
+
93
+ mel_basis_key = str(fmax)+'_'+str(y.device)
94
+ if mel_basis_key not in self.mel_basis:
95
+ mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)
96
+ self.mel_basis[mel_basis_key] = torch.from_numpy(mel).float().to(y.device)
97
+
98
+ keyshift_key = str(keyshift)+'_'+str(y.device)
99
+ if keyshift_key not in self.hann_window:
100
+ self.hann_window[keyshift_key] = torch.hann_window(win_size_new).to(y.device)
101
+
102
+ pad_left = (win_size_new - hop_length_new) //2
103
+ pad_right = max((win_size_new- hop_length_new + 1) //2, win_size_new - y.size(-1) - pad_left)
104
+ if pad_right < y.size(-1):
105
+ mode = 'reflect'
106
+ else:
107
+ mode = 'constant'
108
+ y = torch.nn.functional.pad(y.unsqueeze(1), (pad_left, pad_right), mode = mode)
109
+ y = y.squeeze(1)
110
+
111
+ spec = torch.stft(y, n_fft_new, hop_length=hop_length_new, win_length=win_size_new, window=self.hann_window[keyshift_key],
112
+ center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
113
+ # print(111,spec)
114
+ spec = torch.sqrt(spec.pow(2).sum(-1)+(1e-9))
115
+ if keyshift != 0:
116
+ size = n_fft // 2 + 1
117
+ resize = spec.size(1)
118
+ if resize < size:
119
+ spec = F.pad(spec, (0, 0, 0, size-resize))
120
+ spec = spec[:, :size, :] * win_size / win_size_new
121
+
122
+ # print(222,spec)
123
+ spec = torch.matmul(self.mel_basis[mel_basis_key], spec)
124
+ # print(333,spec)
125
+ spec = dynamic_range_compression_torch(spec, clip_val=clip_val)
126
+ # print(444,spec)
127
+ return spec
128
+
129
+ def __call__(self, audiopath):
130
+ audio, sr = load_wav_to_torch(audiopath, target_sr=self.target_sr)
131
+ spect = self.get_mel(audio.unsqueeze(0)).squeeze(0)
132
+ return spect
133
+
134
+ stft = STFT()
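A short sketch of the STFT helper above, which caches mel filterbanks and Hann windows per device and returns log-mel spectrograms; the parameters are illustrative, not the settings of any particular pretrained vocoder:

```python
# A minimal sketch with example parameters.
import torch
from nsf_hifigan.nvSTFT import STFT

stft = STFT(sr=44100, n_mels=128, n_fft=2048, win_size=2048, hop_length=512, fmin=40, fmax=16000)
audio = torch.randn(1, 44100)     # one second of audio, shape (batch, samples)
mel = stft.get_mel(audio)         # (batch, n_mels, frames)
print(mel.shape)
```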
nsf_hifigan/utils.py ADDED
@@ -0,0 +1,68 @@
1
+ import glob
2
+ import os
3
+ import matplotlib
4
+ import torch
5
+ from torch.nn.utils import weight_norm
6
+ matplotlib.use("Agg")
7
+ import matplotlib.pylab as plt
8
+
9
+
10
+ def plot_spectrogram(spectrogram):
11
+ fig, ax = plt.subplots(figsize=(10, 2))
12
+ im = ax.imshow(spectrogram, aspect="auto", origin="lower",
13
+ interpolation='none')
14
+ plt.colorbar(im, ax=ax)
15
+
16
+ fig.canvas.draw()
17
+ plt.close()
18
+
19
+ return fig
20
+
21
+
22
+ def init_weights(m, mean=0.0, std=0.01):
23
+ classname = m.__class__.__name__
24
+ if classname.find("Conv") != -1:
25
+ m.weight.data.normal_(mean, std)
26
+
27
+
28
+ def apply_weight_norm(m):
29
+ classname = m.__class__.__name__
30
+ if classname.find("Conv") != -1:
31
+ weight_norm(m)
32
+
33
+
34
+ def get_padding(kernel_size, dilation=1):
35
+ return int((kernel_size*dilation - dilation)/2)
36
+
37
+
38
+ def load_checkpoint(filepath, device):
39
+ assert os.path.isfile(filepath)
40
+ print("Loading '{}'".format(filepath))
41
+ checkpoint_dict = torch.load(filepath, map_location=device)
42
+ print("Complete.")
43
+ return checkpoint_dict
44
+
45
+
46
+ def save_checkpoint(filepath, obj):
47
+ print("Saving checkpoint to {}".format(filepath))
48
+ torch.save(obj, filepath)
49
+ print("Complete.")
50
+
51
+
52
+ def del_old_checkpoints(cp_dir, prefix, n_models=2):
53
+ pattern = os.path.join(cp_dir, prefix + '????????')
54
+ cp_list = glob.glob(pattern) # get checkpoint paths
55
+ cp_list = sorted(cp_list)# sort by iter
56
+ if len(cp_list) > n_models: # if more than n_models models are found
57
+ for cp in cp_list[:-n_models]:# delete the oldest models other than lastest n_models
58
+ open(cp, 'w').close()# empty file contents
59
+ os.unlink(cp)# delete file (move to trash when using Colab)
60
+
61
+
62
+ def scan_checkpoint(cp_dir, prefix):
63
+ pattern = os.path.join(cp_dir, prefix + '????????')
64
+ cp_list = glob.glob(pattern)
65
+ if len(cp_list) == 0:
66
+ return None
67
+ return sorted(cp_list)[-1]
68
+
preprocess.py ADDED
@@ -0,0 +1,133 @@
1
+ import os
2
+ import numpy as np
3
+ import librosa
4
+ import torch
5
+ import pyworld as pw
6
+ import parselmouth
7
+ import argparse
8
+ import shutil
9
+ from logger import utils
10
+ from tqdm import tqdm
11
+ from ddsp.vocoder import F0_Extractor, Volume_Extractor, Units_Encoder
12
+ from logger.utils import traverse_dir
13
+ import concurrent.futures
14
+
15
+ def parse_args(args=None, namespace=None):
16
+ """Parse command-line arguments."""
17
+ parser = argparse.ArgumentParser()
18
+ parser.add_argument(
19
+ "-c",
20
+ "--config",
21
+ type=str,
22
+ required=True,
23
+ help="path to the config file")
24
+ return parser.parse_args(args=args, namespace=namespace)
25
+
26
+ def preprocess(path, f0_extractor, volume_extractor, units_encoder, sample_rate, hop_size, device = 'cuda'):
27
+
28
+ path_srcdir = os.path.join(path, 'audio')
29
+ path_unitsdir = os.path.join(path, 'units')
30
+ path_f0dir = os.path.join(path, 'f0')
31
+ path_volumedir = os.path.join(path, 'volume')
32
+ path_skipdir = os.path.join(path, 'skip')
33
+
34
+ # list files
35
+ filelist = traverse_dir(
36
+ path_srcdir,
37
+ extension='wav',
38
+ is_pure=True,
39
+ is_sort=True,
40
+ is_ext=True)
41
+
42
+ # run
43
+ def process(file):
44
+ ext = file.split('.')[-1]
45
+ binfile = file[:-(len(ext)+1)]+'.npy'
46
+ path_srcfile = os.path.join(path_srcdir, file)
47
+ path_unitsfile = os.path.join(path_unitsdir, binfile)
48
+ path_f0file = os.path.join(path_f0dir, binfile)
49
+ path_volumefile = os.path.join(path_volumedir, binfile)
50
+ path_skipfile = os.path.join(path_skipdir, file)
51
+
52
+ # load audio
53
+ audio, _ = librosa.load(path_srcfile, sr=sample_rate)
54
+ if len(audio.shape) > 1:
55
+ audio = librosa.to_mono(audio)
56
+ audio_t = torch.from_numpy(audio).float().to(device)
57
+ audio_t = audio_t.unsqueeze(0)
58
+
59
+ # extract volume
60
+ volume = volume_extractor.extract(audio)
61
+
62
+ # units encode
63
+ units_t = units_encoder.encode(audio_t, sample_rate, hop_size)
64
+ units = units_t.squeeze().to('cpu').numpy()
65
+
66
+ # extract f0
67
+ f0 = f0_extractor.extract(audio, uv_interp = False)
68
+
69
+ uv = f0 == 0
70
+ if len(f0[~uv]) > 0:
71
+ # interpolate the unvoiced f0
72
+ f0[uv] = np.interp(np.where(uv)[0], np.where(~uv)[0], f0[~uv])
73
+
74
+ # save npy
75
+ os.makedirs(os.path.dirname(path_unitsfile), exist_ok=True)
76
+ np.save(path_unitsfile, units)
77
+ os.makedirs(os.path.dirname(path_f0file), exist_ok=True)
78
+ np.save(path_f0file, f0)
79
+ os.makedirs(os.path.dirname(path_volumefile), exist_ok=True)
80
+ np.save(path_volumefile, volume)
81
+ else:
82
+ print('\n[Error] F0 extraction failed: ' + path_srcfile)
83
+ os.makedirs(os.path.dirname(path_skipfile), exist_ok=True)
84
+ shutil.move(path_srcfile, os.path.dirname(path_skipfile))
85
+ print('This file has been moved to ' + path_skipfile)
86
+ print('Preprocess the audio clips in :', path_srcdir)
87
+
88
+ # single process
89
+ for file in tqdm(filelist, total=len(filelist)):
90
+ process(file)
91
+
92
+ # multi-process (have bugs)
93
+ '''
94
+ with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
95
+ list(tqdm(executor.map(process, filelist), total=len(filelist)))
96
+ '''
97
+
98
+ if __name__ == '__main__':
99
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
100
+
101
+ # parse commands
102
+ cmd = parse_args()
103
+
104
+ # load config
105
+ args = utils.load_config(cmd.config)
106
+ sample_rate = args.data.sampling_rate
107
+ hop_size = args.data.block_size
108
+
109
+ # initialize f0 extractor
110
+ f0_extractor = F0_Extractor(
111
+ args.data.f0_extractor,
112
+ args.data.sampling_rate,
113
+ args.data.block_size,
114
+ args.data.f0_min,
115
+ args.data.f0_max)
116
+
117
+ # initialize volume extractor
118
+ volume_extractor = Volume_Extractor(args.data.block_size)
119
+
120
+ # initialize units encoder
121
+ units_encoder = Units_Encoder(
122
+ args.data.encoder,
123
+ args.data.encoder_ckpt,
124
+ args.data.encoder_sample_rate,
125
+ args.data.encoder_hop_size,
126
+ device = device)
127
+
128
+ # preprocess training set
129
+ preprocess(args.data.train_path, f0_extractor, volume_extractor, units_encoder, sample_rate, hop_size, device = device)
130
+
131
+ # preprocess validation set
132
+ preprocess(args.data.valid_path, f0_extractor, volume_extractor, units_encoder, sample_rate, hop_size, device = device)
133
+
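The per-file work done by process() boils down to extracting f0, volume and units for each clip. A hedged sketch of the f0/volume part on a single file, with an example path and extractor settings that would normally come from the config YAML:

```python
# A minimal sketch; the wav path, sample rate, hop size and f0 range are assumptions.
import librosa
import numpy as np
from ddsp.vocoder import F0_Extractor, Volume_Extractor

sample_rate, hop_size = 44100, 512
audio, _ = librosa.load('data/train/audio/1/example.wav', sr=sample_rate)

f0_extractor = F0_Extractor('parselmouth', sample_rate, hop_size, 65, 800)
f0 = f0_extractor.extract(audio, uv_interp=False)
uv = f0 == 0
if len(f0[~uv]) > 0:
    f0[uv] = np.interp(np.where(uv)[0], np.where(~uv)[0], f0[~uv])  # fill unvoiced frames

volume = Volume_Extractor(hop_size).extract(audio)
print(f0.shape, volume.shape)
```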
pretrain/gitkeep ADDED
File without changes
requirements.txt ADDED
@@ -0,0 +1,23 @@
1
+ einops
2
+ fairseq
3
+ flask
4
+ flask_cors
5
+ gin
6
+ gin_config
7
+ librosa
8
+ local_attention
9
+ matplotlib
10
+ numpy
11
+ praat-parselmouth
12
+ pyworld
13
+ PyYAML
14
+ resampy
15
+ scikit_learn
16
+ scipy
17
+ SoundFile
18
+ tensorboard
19
+ torchcrepe
20
+ tqdm
21
+ wave
22
+ pysimplegui
23
+ sounddevice
samples/source.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:22422561a72d7bcb588503be9a1188057f5ebd910c796f7c77c268f484de9115
3
+ size 3087746
samples/svc-kiritan+12key.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f009b74e725aade5acc0902f72085e2a6cb63e3ff7db21e8662f8521ebca18c1
3
+ size 2830380
samples/svc-opencpop+12key.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e41b18c94d12ef0a1b1d0cfcce87ccf3db5da931f4b3718a26c0a2c018d19ba1
3
+ size 2830380
samples/svc-opencpop_kiritan_mix+12key.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4f531ec9601371358da2b83eeb5326482ba534aba9e97d66a5aad5602118ce09
3
+ size 2830380
slicer.py ADDED
@@ -0,0 +1,146 @@
1
+ import librosa
2
+ import torch
3
+ import torchaudio
4
+
5
+
6
+ class Slicer:
7
+ def __init__(self,
8
+ sr: int,
9
+ threshold: float = -40.,
10
+ min_length: int = 5000,
11
+ min_interval: int = 300,
12
+ hop_size: int = 20,
13
+ max_sil_kept: int = 5000):
14
+ if not min_length >= min_interval >= hop_size:
15
+ raise ValueError('The following condition must be satisfied: min_length >= min_interval >= hop_size')
16
+ if not max_sil_kept >= hop_size:
17
+ raise ValueError('The following condition must be satisfied: max_sil_kept >= hop_size')
18
+ min_interval = sr * min_interval / 1000
19
+ self.threshold = 10 ** (threshold / 20.)
20
+ self.hop_size = round(sr * hop_size / 1000)
21
+ self.win_size = min(round(min_interval), 4 * self.hop_size)
22
+ self.min_length = round(sr * min_length / 1000 / self.hop_size)
23
+ self.min_interval = round(min_interval / self.hop_size)
24
+ self.max_sil_kept = round(sr * max_sil_kept / 1000 / self.hop_size)
25
+
26
+ def _apply_slice(self, waveform, begin, end):
27
+ if len(waveform.shape) > 1:
28
+ return waveform[:, begin * self.hop_size: min(waveform.shape[1], end * self.hop_size)]
29
+ else:
30
+ return waveform[begin * self.hop_size: min(waveform.shape[0], end * self.hop_size)]
31
+
32
+ # @timeit
33
+ def slice(self, waveform):
34
+ if len(waveform.shape) > 1:
35
+ samples = librosa.to_mono(waveform)
36
+ else:
37
+ samples = waveform
38
+ if samples.shape[0] <= self.min_length:
39
+ return {"0": {"slice": False, "split_time": f"0,{len(waveform)}"}}
40
+ rms_list = librosa.feature.rms(y=samples, frame_length=self.win_size, hop_length=self.hop_size).squeeze(0)
41
+ sil_tags = []
42
+ silence_start = None
43
+ clip_start = 0
44
+ for i, rms in enumerate(rms_list):
45
+ # Keep looping while frame is silent.
46
+ if rms < self.threshold:
47
+ # Record start of silent frames.
48
+ if silence_start is None:
49
+ silence_start = i
50
+ continue
51
+ # Keep looping while frame is not silent and silence start has not been recorded.
52
+ if silence_start is None:
53
+ continue
54
+ # Clear recorded silence start if interval is not enough or clip is too short
55
+ is_leading_silence = silence_start == 0 and i > self.max_sil_kept
56
+ need_slice_middle = i - silence_start >= self.min_interval and i - clip_start >= self.min_length
57
+ if not is_leading_silence and not need_slice_middle:
58
+ silence_start = None
59
+ continue
60
+ # Need slicing. Record the range of silent frames to be removed.
61
+ if i - silence_start <= self.max_sil_kept:
62
+ pos = rms_list[silence_start: i + 1].argmin() + silence_start
63
+ if silence_start == 0:
64
+ sil_tags.append((0, pos))
65
+ else:
66
+ sil_tags.append((pos, pos))
67
+ clip_start = pos
68
+ elif i - silence_start <= self.max_sil_kept * 2:
69
+ pos = rms_list[i - self.max_sil_kept: silence_start + self.max_sil_kept + 1].argmin()
70
+ pos += i - self.max_sil_kept
71
+ pos_l = rms_list[silence_start: silence_start + self.max_sil_kept + 1].argmin() + silence_start
72
+ pos_r = rms_list[i - self.max_sil_kept: i + 1].argmin() + i - self.max_sil_kept
73
+ if silence_start == 0:
74
+ sil_tags.append((0, pos_r))
75
+ clip_start = pos_r
76
+ else:
77
+ sil_tags.append((min(pos_l, pos), max(pos_r, pos)))
78
+ clip_start = max(pos_r, pos)
79
+ else:
80
+ pos_l = rms_list[silence_start: silence_start + self.max_sil_kept + 1].argmin() + silence_start
81
+ pos_r = rms_list[i - self.max_sil_kept: i + 1].argmin() + i - self.max_sil_kept
82
+ if silence_start == 0:
83
+ sil_tags.append((0, pos_r))
84
+ else:
85
+ sil_tags.append((pos_l, pos_r))
86
+ clip_start = pos_r
87
+ silence_start = None
88
+ # Deal with trailing silence.
89
+ total_frames = rms_list.shape[0]
90
+ if silence_start is not None and total_frames - silence_start >= self.min_interval:
91
+ silence_end = min(total_frames, silence_start + self.max_sil_kept)
92
+ pos = rms_list[silence_start: silence_end + 1].argmin() + silence_start
93
+ sil_tags.append((pos, total_frames + 1))
94
+ # Apply and return slices.
95
+ if len(sil_tags) == 0:
96
+ return {"0": {"slice": False, "split_time": f"0,{len(waveform)}"}}
97
+ else:
98
+ chunks = []
99
+ # the first silent segment does not start at the very beginning, so prepend the leading voiced segment
100
+ if sil_tags[0][0]:
101
+ chunks.append(
102
+ {"slice": False, "split_time": f"0,{min(waveform.shape[0], sil_tags[0][0] * self.hop_size)}"})
103
+ for i in range(0, len(sil_tags)):
104
+ # mark the voiced segments (skipping the first one)
105
+ if i:
106
+ chunks.append({"slice": False,
107
+ "split_time": f"{sil_tags[i - 1][1] * self.hop_size},{min(waveform.shape[0], sil_tags[i][0] * self.hop_size)}"})
108
+ # mark every silent segment
109
+ chunks.append({"slice": True,
110
+ "split_time": f"{sil_tags[i][0] * self.hop_size},{min(waveform.shape[0], sil_tags[i][1] * self.hop_size)}"})
111
+ # the last silent segment does not reach the end, so append the trailing segment
112
+ if sil_tags[-1][1] * self.hop_size < len(waveform):
113
+ chunks.append({"slice": False, "split_time": f"{sil_tags[-1][1] * self.hop_size},{len(waveform)}"})
114
+ chunk_dict = {}
115
+ for i in range(len(chunks)):
116
+ chunk_dict[str(i)] = chunks[i]
117
+ return chunk_dict
118
+
119
+
120
+ def cut(audio_path, db_thresh=-30, min_len=5000, flask_mode=False, flask_sr=None):
121
+ if not flask_mode:
122
+ audio, sr = librosa.load(audio_path, sr=None)
123
+ else:
124
+ audio = audio_path
125
+ sr = flask_sr
126
+ slicer = Slicer(
127
+ sr=sr,
128
+ threshold=db_thresh,
129
+ min_length=min_len
130
+ )
131
+ chunks = slicer.slice(audio)
132
+ return chunks
133
+
134
+
135
+ def chunks2audio(audio_path, chunks):
136
+ chunks = dict(chunks)
137
+ audio, sr = torchaudio.load(audio_path)
138
+ if len(audio.shape) == 2 and audio.shape[1] >= 2:
139
+ audio = torch.mean(audio, dim=0).unsqueeze(0)
140
+ audio = audio.cpu().numpy()[0]
141
+ result = []
142
+ for k, v in chunks.items():
143
+ tag = v["split_time"].split(",")
144
+ if tag[0] != tag[1]:
145
+ result.append((v["slice"], audio[int(tag[0]):int(tag[1])]))
146
+ return result, sr
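For orientation, here is a minimal usage sketch of the two helpers above: `cut` returns the split-point dictionary produced by `Slicer.slice`, and `chunks2audio` turns it back into `(is_silence, samples)` pairs. The file name and settings below are illustrative placeholders, not values taken from this repository.

```python
# Minimal sketch (illustrative path and settings): slice a recording and
# keep only the voiced segments, e.g. before feature extraction.
chunks = cut("input.wav", db_thresh=-30, min_len=5000)
segments, sr = chunks2audio("input.wav", chunks)
voiced = [samples for is_silence, samples in segments if not is_silence]
```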
solver.py ADDED
@@ -0,0 +1,151 @@
+ import os
+ import time
+ import numpy as np
+ import torch
+
+ from logger.saver import Saver
+ from logger import utils
+
+
+ def test(args, model, loss_func, loader_test, saver):
+     print(' [*] testing...')
+     model.eval()
+
+     # losses
+     test_loss = 0.
+     test_loss_rss = 0.
+     test_loss_uv = 0.
+
+     # initialization
+     num_batches = len(loader_test)
+     rtf_all = []
+
+     # run
+     with torch.no_grad():
+         for bidx, data in enumerate(loader_test):
+             fn = data['name'][0]
+             print('--------')
+             print('{}/{} - {}'.format(bidx, num_batches, fn))
+
+             # unpack data
+             for k in data.keys():
+                 if k != 'name':
+                     data[k] = data[k].to(args.device)
+             print('>>', data['name'][0])
+
+             # forward
+             st_time = time.time()
+             signal, _, (s_h, s_n) = model(data['units'], data['f0'], data['volume'], data['spk_id'])
+             ed_time = time.time()
+
+             # crop both signals to the same length before computing the loss
+             min_len = np.min([signal.shape[1], data['audio'].shape[1]])
+             signal = signal[:, :min_len]
+             data['audio'] = data['audio'][:, :min_len]
+
+             # RTF (real-time factor): synthesis time divided by audio duration
+             run_time = ed_time - st_time
+             song_time = data['audio'].shape[-1] / args.data.sampling_rate
+             rtf = run_time / song_time
+             print('RTF: {} | {} / {}'.format(rtf, run_time, song_time))
+             rtf_all.append(rtf)
+
+             # loss
+             loss = loss_func(signal, data['audio'])
+             test_loss += loss.item()
+
+             # log
+             saver.log_audio({fn + '/gt.wav': data['audio'], fn + '/pred.wav': signal})
+
+     # report the average loss over the validation set
+     test_loss /= num_batches
+
+     print(' [test_loss] test_loss:', test_loss)
+     print(' Real Time Factor:', np.mean(rtf_all))
+     return test_loss
+
+
+ def train(args, initial_global_step, model, optimizer, loss_func, loader_train, loader_test):
+     # saver
+     saver = Saver(args, initial_global_step=initial_global_step)
+
+     # model size
+     params_count = utils.get_network_paras_amount({'model': model})
+     saver.log_info('--- model size ---')
+     saver.log_info(params_count)
+
+     # run
+     best_loss = np.inf
+     num_batches = len(loader_train)
+     model.train()
+     saver.log_info('======= start training =======')
+     for epoch in range(args.train.epochs):
+         for batch_idx, data in enumerate(loader_train):
+             saver.global_step_increment()
+             optimizer.zero_grad()
+
+             # unpack data
+             for k in data.keys():
+                 if k != 'name':
+                     data[k] = data[k].to(args.device)
+
+             # forward
+             signal, _, (s_h, s_n) = model(data['units'].float(), data['f0'], data['volume'], data['spk_id'], infer=False)
+
+             # loss
+             loss = loss_func(signal, data['audio'])
+
+             # handle nan loss
+             if torch.isnan(loss):
+                 raise ValueError(' [x] nan loss ')
+             else:
+                 # backpropagate
+                 loss.backward()
+                 optimizer.step()
+
+             # log loss
+             if saver.global_step % args.train.interval_log == 0:
+                 saver.log_info(
+                     'epoch: {} | {:3d}/{:3d} | {} | batch/s: {:.2f} | loss: {:.3f} | time: {} | step: {}'.format(
+                         epoch,
+                         batch_idx,
+                         num_batches,
+                         args.env.expdir,
+                         args.train.interval_log / saver.get_interval_time(),
+                         loss.item(),
+                         saver.get_total_time(),
+                         saver.global_step
+                     )
+                 )
+
+                 saver.log_value({
+                     'train/loss': loss.item()
+                 })
+
+             # validation
+             if saver.global_step % args.train.interval_val == 0:
+                 # save the latest checkpoint
+                 saver.save_model(model, optimizer, postfix=f'{saver.global_step}')
+
+                 # run the validation set
+                 test_loss = test(args, model, loss_func, loader_test, saver)
+
+                 saver.log_info(
+                     ' --- <validation> --- \nloss: {:.3f}. '.format(
+                         test_loss,
+                     )
+                 )
+
+                 saver.log_value({
+                     'validation/loss': test_loss
+                 })
+                 model.train()
+
+                 # save the best model
+                 if test_loss < best_loss:
+                     saver.log_info(' [V] best model updated.')
+                     saver.save_model(model, optimizer, postfix='best')
+                     best_loss = test_loss
+
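A note on dependencies: `solver.py` relies on a `Saver` object from `logger/saver.py`, which is not part of this diff. The stub below is only a sketch of the interface implied by the calls above; the method names and arguments come from the code, but the bodies are placeholders rather than the real implementation.

```python
# Interface sketch inferred from solver.py above; the real class lives in
# logger/saver.py and is not shown in this diff.
class SaverStub:
    def __init__(self, args, initial_global_step=0):
        self.global_step = initial_global_step

    def global_step_increment(self):          # called once per training batch
        self.global_step += 1

    def log_info(self, msg):                  # plain-text training log
        print(msg)

    def log_value(self, scalars):             # e.g. {'train/loss': 0.123}
        pass

    def log_audio(self, audios):              # e.g. {'name/pred.wav': tensor}
        pass

    def save_model(self, model, optimizer, postfix=''):  # write a checkpoint
        pass

    def get_interval_time(self):              # seconds since the last log interval
        return 1.0

    def get_total_time(self):                 # formatted total elapsed time
        return '0:00:00'
```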
train.py ADDED
@@ -0,0 +1,93 @@
+ import os
+ import argparse
+ import torch
+
+ from logger import utils
+ from data_loaders import get_data_loaders
+ from solver import train
+ from ddsp.vocoder import Sins, CombSub, CombSubFast
+ from ddsp.loss import RSSLoss
+
+
+ def parse_args(args=None, namespace=None):
+     """Parse command-line arguments."""
+     parser = argparse.ArgumentParser()
+     parser.add_argument(
+         "-c",
+         "--config",
+         type=str,
+         required=True,
+         help="path to the config file")
+     return parser.parse_args(args=args, namespace=namespace)
+
+
+ if __name__ == '__main__':
+     # parse commands
+     cmd = parse_args()
+
+     # load config
+     args = utils.load_config(cmd.config)
+     print(' > config:', cmd.config)
+     print(' > exp:', args.env.expdir)
+
+     # build the DDSP model selected in the config
+     model = None
+
+     if args.model.type == 'Sins':
+         model = Sins(
+             sampling_rate=args.data.sampling_rate,
+             block_size=args.data.block_size,
+             n_harmonics=args.model.n_harmonics,
+             n_mag_allpass=args.model.n_mag_allpass,
+             n_mag_noise=args.model.n_mag_noise,
+             n_unit=args.data.encoder_out_channels,
+             n_spk=args.model.n_spk)
+
+     elif args.model.type == 'CombSub':
+         model = CombSub(
+             sampling_rate=args.data.sampling_rate,
+             block_size=args.data.block_size,
+             n_mag_allpass=args.model.n_mag_allpass,
+             n_mag_harmonic=args.model.n_mag_harmonic,
+             n_mag_noise=args.model.n_mag_noise,
+             n_unit=args.data.encoder_out_channels,
+             n_spk=args.model.n_spk)
+
+     elif args.model.type == 'CombSubFast':
+         model = CombSubFast(
+             sampling_rate=args.data.sampling_rate,
+             block_size=args.data.block_size,
+             n_unit=args.data.encoder_out_channels,
+             n_spk=args.model.n_spk)
+
+     else:
+         raise ValueError(f" [x] Unknown Model: {args.model.type}")
+
+     # set up the optimizer and restore any existing checkpoint from expdir
+     optimizer = torch.optim.AdamW(model.parameters())
+     initial_global_step, model, optimizer = utils.load_model(args.env.expdir, model, optimizer, device=args.device)
+     for param_group in optimizer.param_groups:
+         param_group['lr'] = args.train.lr
+         param_group['weight_decay'] = args.train.weight_decay
+
+     # loss: spectral loss over FFT scales between fft_min and fft_max
+     loss_func = RSSLoss(args.loss.fft_min, args.loss.fft_max, args.loss.n_scale, device=args.device)
+
+     # device
+     if args.device == 'cuda':
+         torch.cuda.set_device(args.env.gpu_id)
+     model.to(args.device)
+
+     for state in optimizer.state.values():
+         for k, v in state.items():
+             if torch.is_tensor(v):
+                 state[k] = v.to(args.device)
+
+     loss_func.to(args.device)
+
+     # data loaders
+     loader_train, loader_valid = get_data_loaders(args, whole_audio=False)
+
+     # run training
+     train(args, initial_global_step, model, optimizer, loss_func, loader_train, loader_valid)
+
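Finally, a sketch of the configuration that `train.py` and `solver.py` expect. Training is launched as `python train.py -c <config file>`; the key names below are exactly the attributes read in the code above, while the values are illustrative placeholders and the file name `configs/combsub.yaml` is only an assumed example of where such a config would live.

```python
# Illustrative config structure for train.py / solver.py (placeholder values).
# In the repository these fields come from the file passed via -c,
# e.g. something like configs/combsub.yaml, loaded by utils.load_config.
config = {
    'env':    {'expdir': 'exp/my-experiment', 'gpu_id': 0},
    'device': 'cuda',
    'data':   {'sampling_rate': 44100, 'block_size': 512, 'encoder_out_channels': 256},
    'model':  {'type': 'CombSubFast', 'n_spk': 1},   # 'Sins' / 'CombSub' also need n_harmonics / n_mag_* fields (see above)
    'loss':   {'fft_min': 256, 'fft_max': 2048, 'n_scale': 4},
    'train':  {'lr': 0.0005, 'weight_decay': 0, 'epochs': 100000,
               'interval_log': 10, 'interval_val': 2000},
}
```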