# DiffSinger-PNDM
[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446)
[![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger?style=social)](https://github.com/MoonInTheRiver/DiffSinger)
[![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases)

Highlights:

- Training diffusion model: 1000 steps.
- Default `pndm_speedup`: 40.
- Inference diffusion model: (1000 / `pndm_speedup`) = 25 steps with the default.

You can freely control the number of inference steps by adding one of these arguments to your experiment scripts: `--hparams="pndm_speedup=40"`, `--hparams="pndm_speedup=20"`, or `--hparams="pndm_speedup=10"`.

Contributed by @luping-liu.
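
As a quick illustration of the arithmetic above (a standalone sketch, not code from this repository):

```python
# Illustration only: how `pndm_speedup` maps to the number of PNDM
# denoising steps actually run at inference time.
TRAIN_DIFFUSION_STEPS = 1000  # the model is trained with 1000 diffusion steps

def inference_steps(pndm_speedup: int) -> int:
    assert TRAIN_DIFFUSION_STEPS % pndm_speedup == 0, \
        "pndm_speedup should evenly divide the training steps"
    return TRAIN_DIFFUSION_STEPS // pndm_speedup

print(inference_steps(40))  # 25 (the default)
print(inference_steps(20))  # 50
print(inference_steps(10))  # 100
```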



## DiffSinger (MIDI SVS | B version | +PNDM)

### 0. Data Acquisition

For the Opencpop dataset, please strictly follow the instructions of [Opencpop](https://wenet.org.cn/opencpop/). We have no right to grant you access to Opencpop.

The pipeline below is designed for the Opencpop dataset:



### 1. Preparation



#### Data Preparation

a) Download and extract Opencpop, then create a link to the dataset folder: `ln -s /xxx/opencpop data/raw/`



b) Run the following scripts to pack the dataset for training/inference.



```sh
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/midi/cascade/opencs/aux_rel.yaml

# `data/binary/opencpop-midi-dp` will be generated.
```
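
If binarization succeeded, the packed dataset should now be on disk; a quick sanity check you could run (illustrative only, not part of the repo):

```python
# Illustrative sanity check: confirm binarization produced a
# non-empty output directory.
import os

out_dir = 'data/binary/opencpop-midi-dp'
assert os.path.isdir(out_dir) and os.listdir(out_dir), \
    f'{out_dir} is missing or empty; re-run binarize.py'
print(f'found {len(os.listdir(out_dir))} entries in {out_dir}')
```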



#### Vocoder Preparation

We provide the pre-trained model of [HifiGAN-Singing](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0109_hifigan_bigpopcs_hop128.zip), which is specially designed for SVS with the NSF mechanism.



Also, please unzip the pre-trained vocoder and [this companion pitch extractor](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0102_xiaoma_pe.zip) into `checkpoints` before training your acoustic model.



(Update: you can also move [a checkpoint with more training steps](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/model_ckpt_steps_1512000.ckpt) into this vocoder directory.)



This singing vocoder is trained on ~70 hours of singing data and can be viewed as a universal vocoder.



#### Exp Name Preparation

```bash
export MY_DS_EXP_NAME=0831_opencpop_ds1000
```



Your directory structure should look like this:

```
.
|--data
    |--raw
        |--opencpop
            |--segments
                |--transcriptions.txt
                |--wavs
|--checkpoints
    |--MY_DS_EXP_NAME (optional)
    |--0109_hifigan_bigpopcs_hop128 (vocoder)
        |--model_ckpt_steps_1512000.ckpt
        |--config.yaml
```
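
To catch placement mistakes early, you could verify the layout with a few lines of Python (an illustrative helper, not part of the repo; paths taken from the tree above):

```python
# Illustrative helper: verify the expected layout before launching training.
from pathlib import Path

expected = [
    'data/raw/opencpop/segments/transcriptions.txt',
    'data/raw/opencpop/segments/wavs',
    'checkpoints/0109_hifigan_bigpopcs_hop128/config.yaml',
    'checkpoints/0109_hifigan_bigpopcs_hop128/model_ckpt_steps_1512000.ckpt',
]
missing = [p for p in expected if not Path(p).exists()]
print('layout OK' if not missing else f'missing: {missing}')
```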


### 2. Training Example
```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds1000.yaml --exp_name $MY_DS_EXP_NAME --reset
```

### 3. Inference from packed test set
```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds1000.yaml --exp_name $MY_DS_EXP_NAME --reset --infer
```
Inference results will be saved in `./checkpoints/MY_DS_EXP_NAME/generated_` by default.

We also provide the pre-trained model of DiffSinger, which can be found [here](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0831_opencpop_ds1000.zip).

Remember to put the pre-trained model in the `checkpoints` directory.

### 4. Inference from raw inputs
```sh
python inference/svs/ds_e2e.py --config usr/configs/midi/e2e/opencpop/ds1000.yaml --exp_name $MY_DS_EXP_NAME
```
Raw inputs:
```python
# Word-level input: Chinese characters plus word-level notes and durations.
# '|' delimits the note group for each character.
inp = {
    'text': '小酒窝长睫毛AP是你最美的记号',
    'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
    'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
    'input_type': 'word'
}

# Or, phoneme-level input, formatted like the Opencpop dataset:
inp = {
    'text': '小酒窝长睫毛AP是你最美的记号',
    'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
    'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
    'note_dur_seq': '0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
    'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
    'input_type': 'phoneme'
}
```
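
For the phoneme-level format, the four sequences must align one-to-one (29 tokens each in the example above). A small consistency check you might run before inference (illustrative only, assuming the `inp` dict above):

```python
# Illustrative consistency check: ph_seq, note_seq, note_dur_seq and
# is_slur_seq must have the same number of whitespace-separated tokens.
def check_phoneme_input(inp: dict) -> None:
    ph = inp['ph_seq'].split()
    notes = inp['note_seq'].split()
    durs = inp['note_dur_seq'].split()
    slurs = inp['is_slur_seq'].split()
    assert len(ph) == len(notes) == len(durs) == len(slurs), (
        f'length mismatch: ph={len(ph)}, note={len(notes)}, '
        f'dur={len(durs)}, slur={len(slurs)}')
    assert all(float(d) > 0 for d in durs), 'durations must be positive'
    assert set(slurs) <= {'0', '1'}, 'slur flags must be 0 or 1'

check_phoneme_input(inp)  # passes for the example above
```
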
The inference results will be saved in `./infer_out` by default.

### 5. Known issues
a) HifiGAN-Singing is trained on our [vocoder dataset](https://dl.acm.org/doi/abs/10.1145/3474085.3475437) and the training set of [PopCS](https://arxiv.org/abs/2105.02446). For this vocoder, Opencpop is an out-of-domain dataset (unseen speaker), which may degrade audio quality; we are considering fine-tuning the vocoder on the Opencpop training set.

b) In this version of the code, we use the melody frontend ([lyric + MIDI] -> [ph_dur]) to predict phoneme durations. The F0 curve is predicted implicitly, together with the mel-spectrogram.