release to pypi
- .gitignore +1 -0
- README.md +74 -60
- setup.cfg +1 -0
.gitignore
CHANGED
@@ -5,6 +5,7 @@ data
 models
 flagged
 build
+dist
 audiodiffusion.egg-info
 lightning_logs
 taming
README.md
CHANGED
@@ -11,7 +11,7 @@ license: gpl-3.0
 ---
 # audio-diffusion [](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/gradio_app.ipynb)
 
-
 
 ---
 
@@ -41,7 +41,6 @@ A DDPM is trained on a set of mel spectrograms that have been generated from a d
 
 You can play around with some pre-trained models on [Google Colab](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) or [Hugging Face spaces](https://huggingface.co/spaces/teticio/audio-diffusion). Check out some automatically generated loops [here](https://soundcloud.com/teticio2/sets/audio-diffusion-loops).
 
-
 | Model | Dataset | Description |
 |-------|---------|-------------|
 | [teticio/audio-diffusion-256](https://huggingface.co/teticio/audio-diffusion-256) | [teticio/audio-diffusion-256](https://huggingface.co/datasets/teticio/audio-diffusion-256) | My "liked" Spotify playlist |
@@ -54,117 +53,132 @@ You can play around with some pre-trained models on [Google Colab](https://colab
 ---
 
 ## Generate Mel spectrogram dataset from directory of audio files
 #### Install
 ```bash
 pip install .
 ```
 
-#### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results
 ```bash
 python scripts/audio_to_images.py \
-
-
-
-
 ```
 
-#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`)
 ```bash
 python scripts/audio_to_images.py \
-
-
-
-
 ```
 
 Note that the default `sample_rate` is 22050 and audios will be resampled if they are at a different rate. If you change this value, you may find that the results in the `test_mel.ipynb` notebook are not good (for example, if `sample_rate` is 48000) and that it is necessary to adjust `n_fft` (for example, to 2000 instead of the default value of 2048; alternatively, you can resample to a `sample_rate` of 44100). Make sure you use the same parameters for training and inference. You should also bear in mind that not all resolutions work with the neural network architecture as currently configured - you should be safe if you stick to powers of 2.
 
 ## Train model
-
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
-
-
-
-
-
-
-
-
-
-
 ```
 
-#### Run training on local machine with `batch_size` of 2 and `gradient_accumulation_steps` 8 to compensate, so that 256x256 resolution model fits on commercial grade GPU and push to hub
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
-
-
-
-
-
-
-
-
-
-
-
-
-
 ```
 
-#### Run training on SageMaker
 ```bash
 accelerate launch --config_file config/accelerate_sagemaker.yaml \
-
-
-
-
-
-
-
-
-
 ```
 
 ## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))
 #### A DDIM can be trained by adding the parameter
 ```bash
-
 ```
 
 Inference can then be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).
 
 ## Latent Audio Diffusion
 Rather than de-noising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of a much lower dimension, so it is much faster to train de-noising diffusion models and run inference with them. Secondly, similar images tend to be clustered together and interpolating between two images in latent space can produce meaningful combinations.
 
 At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
 
-#### Train latent diffusion model using pre-trained VAE
 ```bash
 accelerate launch ...
-
-
 ```
 
-#### Install dependencies to train with Stable Diffusion
-
 pip install omegaconf pytorch_lightning
 pip install -e git+https://github.com/CompVis/stable-diffusion.git@main#egg=latent-diffusion
 pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
 ```
 
-#### Train an autoencoder
 ```bash
 python scripts/train_vae.py \
-
-
-
 ```
 
-#### Train latent diffusion model
 ```bash
 accelerate launch ...
-
-
 ```
@@ -11,7 +11,7 @@ license: gpl-3.0
 ---
 # audio-diffusion [](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/gradio_app.ipynb)
 
+## Apply diffusion models to synthesize music instead of images using the new Hugging Face [diffusers](https://github.com/huggingface/diffusers) package
 
 ---
 
@@ -41,7 +41,6 @@ A DDPM is trained on a set of mel spectrograms that have been generated from a d
 
 You can play around with some pre-trained models on [Google Colab](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) or [Hugging Face spaces](https://huggingface.co/spaces/teticio/audio-diffusion). Check out some automatically generated loops [here](https://soundcloud.com/teticio2/sets/audio-diffusion-loops).
 
 | Model | Dataset | Description |
 |-------|---------|-------------|
 | [teticio/audio-diffusion-256](https://huggingface.co/teticio/audio-diffusion-256) | [teticio/audio-diffusion-256](https://huggingface.co/datasets/teticio/audio-diffusion-256) | My "liked" Spotify playlist |
@@ -54,117 +53,132 @@ You can play around with some pre-trained models on [Google Colab](https://colab
 ---
 
 ## Generate Mel spectrogram dataset from directory of audio files
+
 #### Install
+
 ```bash
 pip install .
 ```
 
+#### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results
+
 ```bash
 python scripts/audio_to_images.py \
+  --resolution 64,64 \
+  --hop_length 1024 \
+  --input_dir path-to-audio-files \
+  --output_dir path-to-output-data
 ```
 
+#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`)
+
 ```bash
 python scripts/audio_to_images.py \
+  --resolution 256 \
+  --input_dir path-to-audio-files \
+  --output_dir data/audio-diffusion-256 \
+  --push_to_hub teticio/audio-diffusion-256
 ```
 
 Note that the default `sample_rate` is 22050 and audios will be resampled if they are at a different rate. If you change this value, you may find that the results in the `test_mel.ipynb` notebook are not good (for example, if `sample_rate` is 48000) and that it is necessary to adjust `n_fft` (for example, to 2000 instead of the default value of 2048; alternatively, you can resample to a `sample_rate` of 44100). Make sure you use the same parameters for training and inference. You should also bear in mind that not all resolutions work with the neural network architecture as currently configured - you should be safe if you stick to powers of 2.
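
As a rough aid when choosing these parameters, the sketch below checks whether a resolution is a power of 2 and estimates how much audio one spectrogram covers. It is not part of the repository: the helper names are hypothetical, the estimate assumes one spectrogram column per hop, and the `hop_length` of 512 in the second example is an assumed value rather than one stated in the README.

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def approx_slice_seconds(x_res: int, hop_length: int, sample_rate: int = 22050) -> float:
    """Approximate seconds of audio covered by one spectrogram.

    Assumption: one spectrogram column per hop, so x_res columns span
    roughly x_res * hop_length samples.
    """
    return x_res * hop_length / sample_rate


# (resolution, hop_length) pairs; the 512 in the second pair is an assumed value.
for x_res, hop_length in [(64, 1024), (256, 512)]:
    seconds = approx_slice_seconds(x_res, hop_length)
    print(f"x_res={x_res} hop_length={hop_length} "
          f"power_of_two={is_power_of_two(x_res)} seconds={seconds:.2f}")
```

Halving the resolution while doubling `hop_length` keeps the time span per column coarser but the overall slice length comparable, which is one way to read the 64x64 recommendation above.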
 
 ## Train model
+
+#### Run training on local machine
+
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
+  scripts/train_unconditional.py \
+  --dataset_name data/audio-diffusion-64 \
+  --hop_length 1024 \
+  --output_dir models/ddpm-ema-audio-64 \
+  --train_batch_size 16 \
+  --num_epochs 100 \
+  --gradient_accumulation_steps 1 \
+  --learning_rate 1e-4 \
+  --lr_warmup_steps 500 \
+  --mixed_precision no
 ```
 
+#### Run training on local machine with `batch_size` of 2 and `gradient_accumulation_steps` 8 to compensate, so that 256x256 resolution model fits on commercial grade GPU and push to hub
+
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
+  scripts/train_unconditional.py \
+  --dataset_name teticio/audio-diffusion-256 \
+  --output_dir models/audio-diffusion-256 \
+  --num_epochs 100 \
+  --train_batch_size 2 \
+  --eval_batch_size 2 \
+  --gradient_accumulation_steps 8 \
+  --learning_rate 1e-4 \
+  --lr_warmup_steps 500 \
+  --mixed_precision no \
+  --push_to_hub True \
+  --hub_model_id audio-diffusion-256 \
+  --hub_token $(cat $HOME/.huggingface/token)
 ```
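
The "to compensate" in the heading above comes down to simple arithmetic: the number of samples contributing to each optimizer step is the per-device batch size times the accumulation steps. A minimal sketch follows; the helper name and the `num_processes` parameter are hypothetical, and a single GPU is assumed.

```python
def effective_batch_size(train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_processes: int = 1) -> int:
    """Samples contributing to one optimizer step.

    num_processes is an assumption for illustration (1 = single GPU).
    """
    return train_batch_size * gradient_accumulation_steps * num_processes


# 256x256 command: small batches, accumulated over 8 steps.
print(effective_batch_size(2, 8))   # 16
# 64x64 command: one full batch per step.
print(effective_batch_size(16, 1))  # 16
```

Both commands therefore update the weights on 16 samples at a time; the 256x256 run simply spreads them over more forward and backward passes to fit in memory.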
 
+#### Run training on SageMaker
+
 ```bash
 accelerate launch --config_file config/accelerate_sagemaker.yaml \
+  scripts/train_unconditional.py \
+  --dataset_name teticio/audio-diffusion-256 \
+  --output_dir models/ddpm-ema-audio-256 \
+  --train_batch_size 16 \
+  --num_epochs 100 \
+  --gradient_accumulation_steps 1 \
+  --learning_rate 1e-4 \
+  --lr_warmup_steps 500 \
+  --mixed_precision no
 ```
 
 ## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))
+
 #### A DDIM can be trained by adding the parameter
+
 ```bash
+  --scheduler ddim
 ```
 
 Inference can then be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).
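
To make the interpolation step concrete, here is a minimal sketch of spherical linear interpolation on flat lists of floats, using only the standard library. It is an illustration of the general technique, not the `slerp` shipped with the pipeline, whose signature and tensor handling may differ.

```python
import math
from typing import List, Sequence


def slerp(x0: Sequence[float], x1: Sequence[float], alpha: float) -> List[float]:
    """Spherically interpolate from x0 (alpha=0) to x1 (alpha=1)."""
    dot = sum(a * b for a, b in zip(x0, x1))
    norm0 = math.sqrt(sum(a * a for a in x0))
    norm1 = math.sqrt(sum(b * b for b in x1))
    if norm0 == 0.0 or norm1 == 0.0:
        # Angle is undefined for a zero vector: fall back to linear interpolation.
        return [(1.0 - alpha) * a + alpha * b for a, b in zip(x0, x1)]
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if abs(sin_theta) < 1e-8:
        # Vectors are (anti)parallel: the spherical weights are ill-defined.
        return [(1.0 - alpha) * a + alpha * b for a, b in zip(x0, x1)]
    w0 = math.sin((1.0 - alpha) * theta) / sin_theta
    w1 = math.sin(alpha * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


# Halfway between two orthogonal unit vectors stays on the unit circle.
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

Unlike a straight linear blend, the midpoint keeps the same norm as the endpoints, which matters when the inputs are Gaussian noise samples that the de-noising model expects to have a typical magnitude.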
 
 ## Latent Audio Diffusion
+
 Rather than de-noising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of a much lower dimension, so it is much faster to train de-noising diffusion models and run inference with them. Secondly, similar images tend to be clustered together and interpolating between two images in latent space can produce meaningful combinations.
 
 At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
 
+#### Train latent diffusion model using pre-trained VAE
+
 ```bash
 accelerate launch ...
+  ...
+  --vae teticio/latent-audio-diffusion-256
 ```
 
+#### Install dependencies to train with Stable Diffusion
+
+```bash
 pip install omegaconf pytorch_lightning
 pip install -e git+https://github.com/CompVis/stable-diffusion.git@main#egg=latent-diffusion
 pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
 ```
 
+#### Train an autoencoder
+
 ```bash
 python scripts/train_vae.py \
+  --dataset_name teticio/audio-diffusion-256 \
+  --batch_size 2 \
+  --gradient_accumulation_steps 12
 ```
 
+#### Train latent diffusion model
+
 ```bash
 accelerate launch ...
+  ...
+  --vae models/autoencoder-kl
 ```
setup.cfg
CHANGED
@@ -3,6 +3,7 @@ name = audiodiffusion
 version = attr: audiodiffusion.VERSION
 description = Generate Mel spectrogram dataset from directory of audio files.
 long_description = file: README.md
+long_description_content_type = text/markdown
 license = GPL3
 classifiers =
     Programming Language :: Python :: 3
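
The added `long_description_content_type` line tells PyPI to render the README as Markdown; without it, the description is treated as reStructuredText by default. A quick pre-upload check could look like the sketch below. The function name is hypothetical, and the sample text assumes these keys live under a `[metadata]` section, as is conventional for setuptools, since the diff does not show the section header.

```python
import configparser


def declares_markdown_description(setup_cfg_text: str) -> bool:
    """Return True if the [metadata] section declares a Markdown long description."""
    parser = configparser.ConfigParser()
    parser.read_string(setup_cfg_text)
    if not parser.has_section("metadata"):
        return False
    content_type = parser["metadata"].get("long_description_content_type", "")
    return content_type.strip() == "text/markdown"


# Sample text mirroring the keys shown in the diff; the section header is assumed.
SAMPLE_SETUP_CFG = """\
[metadata]
name = audiodiffusion
long_description = file: README.md
long_description_content_type = text/markdown
license = GPL3
"""

print(declares_markdown_description(SAMPLE_SETUP_CFG))  # True
```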