Spaces: teticio/audio-diffusion

teticio committed on Nov 9, 2022

Commit bf017fc

1 Parent(s): 0f3ac5f

release to pypi

Files changed (3)

.gitignore +1 -0
README.md +74 -60
setup.cfg +1 -0

.gitignore CHANGED

@@ -5,6 +5,7 @@ data
 models
 flagged
 build
 audiodiffusion.egg-info
 lightning_logs
 taming

 models
 flagged
 build
+dist
 audiodiffusion.egg-info
 lightning_logs
 taming

README.md CHANGED

@@ -11,7 +11,7 @@ license: gpl-3.0
 ---
 # audio-diffusion [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/gradio_app.ipynb)
-### Apply diffusion models to synthesize music instead of images using the new Hugging Face [diffusers](https://github.com/huggingface/diffusers) package.
 ---
@@ -41,7 +41,6 @@ A DDPM is trained on a set of mel spectrograms that have been generated from a d
 You can play around with some pre-trained models on [Google Colab](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) or [Hugging Face spaces](https://huggingface.co/spaces/teticio/audio-diffusion). Check out some automatically generated loops [here](https://soundcloud.com/teticio2/sets/audio-diffusion-loops).
 | Model | Dataset | Description |
 |-------|---------|-------------|
 | [teticio/audio-diffusion-256](https://huggingface.co/teticio/audio-diffusion-256) | [teticio/audio-diffusion-256](https://huggingface.co/datasets/teticio/audio-diffusion-256) | My "liked" Spotify playlist |
@@ -54,117 +53,132 @@ You can play around with some pre-trained models on [Google Colab](https://colab
 ---
 ## Generate Mel spectrogram dataset from directory of audio files
 #### Install
 ```bash
 pip install .
 ```
-#### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results.
 ```bash
 python scripts/audio_to_images.py \
-  --resolution 64,64 \
-  --hop_length 1024 \
-  --input_dir path-to-audio-files \
-  --output_dir path-to-output-data
 ```
-#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`).
 ```bash
 python scripts/audio_to_images.py \
-  --resolution 256 \
-  --input_dir path-to-audio-files \
-  --output_dir data/audio-diffusion-256 \
-  --push_to_hub teticio/audio-diffusion-256
 ```
 Note that the default `sample_rate` is 22050 and audios will be resampled if they are at a different rate. If you change this value, you may find that the results in the `test_mel.ipynb` notebook are not good (for example, if `sample_rate` is 48000) and that it is necessary to adjust `n_fft` (for example, to 2000 instead of the default value of 2048; alternatively, you can resample to a `sample_rate` of 44100). Make sure you use the same parameters for training and inference. You should also bear in mind that not all resolutions work with the neural network architecture as currently configured - you should be safe if you stick to powers of 2.
 ## Train model
-#### Run training on local machine.
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
-  scripts/train_unconditional.py \
-  --dataset_name data/audio-diffusion-64 \
-  --hop_length 1024 \
-  --output_dir models/ddpm-ema-audio-64 \
-  --train_batch_size 16 \
-  --num_epochs 100 \
-  --gradient_accumulation_steps 1 \
-  --learning_rate 1e-4 \
-  --lr_warmup_steps 500 \
-  --mixed_precision no
 ```
-#### Run training on local machine with `batch_size` of 2 and `gradient_accumulation_steps` 8 to compensate, so that 256x256 resolution model fits on commercial grade GPU and push to hub.
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
-  scripts/train_unconditional.py \
-  --dataset_name teticio/audio-diffusion-256 \
-  --output_dir models/audio-diffusion-256 \
-  --num_epochs 100 \
-  --train_batch_size 2 \
-  --eval_batch_size 2 \
-  --gradient_accumulation_steps 8 \
-  --learning_rate 1e-4 \
-  --lr_warmup_steps 500 \
-  --mixed_precision no \
-  --push_to_hub True \
-  --hub_model_id audio-diffusion-256 \
-  --hub_token $(cat $HOME/.huggingface/token)
 ```
-#### Run training on SageMaker.
 ```bash
 accelerate launch --config_file config/accelerate_sagemaker.yaml \
-  scripts/train_unconditional.py \
-  --dataset_name teticio/audio-diffusion-256 \
-  --output_dir models/ddpm-ema-audio-256 \
-  --train_batch_size 16 \
-  --num_epochs 100 \
-  --gradient_accumulation_steps 1 \
-  --learning_rate 1e-4 \
-  --lr_warmup_steps 500 \
-  --mixed_precision no
 ```
 ## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))
 #### A DDIM can be trained by adding the parameter
 ```bash
-  --scheduler ddim
 ```
 Inference can the be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).
 ## Latent Audio Diffusion
 Rather than de-noising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of a much lower dimension, so it is much faster to train de-noising diffusion models and run inference with them. Secondly, similar images tend to be clustered together and interpolating between two images in latent space can produce meaningful combinations.
 At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
-#### Train latent diffusion model using pre-trained VAE.
 ```bash
 accelerate launch ...
-  ...
-  --vae teticio/latent-audio-diffusion-256
 ```
-#### Install dependencies to train with Stable Diffusion.
-```
 pip install omegaconf pytorch_lightning
 pip install -e git+https://github.com/CompVis/stable-diffusion.git@main#egg=latent-diffusion
 pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
 ```
-#### Train an autoencoder.
 ```bash
 python scripts/train_vae.py \
-  --dataset_name teticio/audio-diffusion-256 \
-  --batch_size 2 \
-  --gradient_accumulation_steps 12
 ```
-#### Train latent diffusion model.
 ```bash
 accelerate launch ...
-  ...
-  --vae models/autoencoder-kl
 ```

 ---
 # audio-diffusion [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/gradio_app.ipynb)
+## Apply diffusion models to synthesize music instead of images using the new Hugging Face [diffusers](https://github.com/huggingface/diffusers) package
 ---
 You can play around with some pre-trained models on [Google Colab](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) or [Hugging Face spaces](https://huggingface.co/spaces/teticio/audio-diffusion). Check out some automatically generated loops [here](https://soundcloud.com/teticio2/sets/audio-diffusion-loops).
 | Model | Dataset | Description |
 |-------|---------|-------------|
 | [teticio/audio-diffusion-256](https://huggingface.co/teticio/audio-diffusion-256) | [teticio/audio-diffusion-256](https://huggingface.co/datasets/teticio/audio-diffusion-256) | My "liked" Spotify playlist |
 ---
 ## Generate Mel spectrogram dataset from directory of audio files
 #### Install
 ```bash
 pip install .
 ```
+#### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results
 ```bash
 python scripts/audio_to_images.py \
+--resolution 64,64 \
+--hop_length 1024 \
+--input_dir path-to-audio-files \
+--output_dir path-to-output-data
 ```
+#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`)
 ```bash
 python scripts/audio_to_images.py \
+--resolution 256 \
+--input_dir path-to-audio-files \
+--output_dir data/audio-diffusion-256 \
+--push_to_hub teticio/audio-diffusion-256
 ```
 Note that the default `sample_rate` is 22050 and audios will be resampled if they are at a different rate. If you change this value, you may find that the results in the `test_mel.ipynb` notebook are not good (for example, if `sample_rate` is 48000) and that it is necessary to adjust `n_fft` (for example, to 2000 instead of the default value of 2048; alternatively, you can resample to a `sample_rate` of 44100). Make sure you use the same parameters for training and inference. You should also bear in mind that not all resolutions work with the neural network architecture as currently configured - you should be safe if you stick to powers of 2.
 ## Train model
+#### Run training on local machine
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
+scripts/train_unconditional.py \
+--dataset_name data/audio-diffusion-64 \
+--hop_length 1024 \
+--output_dir models/ddpm-ema-audio-64 \
+--train_batch_size 16 \
+--num_epochs 100 \
+--gradient_accumulation_steps 1 \
+--learning_rate 1e-4 \
+--lr_warmup_steps 500 \
+--mixed_precision no
 ```
+#### Run training on local machine with `batch_size` of 2 and `gradient_accumulation_steps` 8 to compensate, so that 256x256 resolution model fits on commercial grade GPU and push to hub
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
+scripts/train_unconditional.py \
+--dataset_name teticio/audio-diffusion-256 \
+--output_dir models/audio-diffusion-256 \
+--num_epochs 100 \
+--train_batch_size 2 \
+--eval_batch_size 2 \
+--gradient_accumulation_steps 8 \
+--learning_rate 1e-4 \
+--lr_warmup_steps 500 \
+--mixed_precision no \
+--push_to_hub True \
+--hub_model_id audio-diffusion-256 \
+--hub_token $(cat $HOME/.huggingface/token)
 ```
+#### Run training on SageMaker
 ```bash
 accelerate launch --config_file config/accelerate_sagemaker.yaml \
+scripts/train_unconditional.py \
+--dataset_name teticio/audio-diffusion-256 \
+--output_dir models/ddpm-ema-audio-256 \
+--train_batch_size 16 \
+--num_epochs 100 \
+--gradient_accumulation_steps 1 \
+--learning_rate 1e-4 \
+--lr_warmup_steps 500 \
+--mixed_precision no
 ```
 ## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))
 #### A DDIM can be trained by adding the parameter
 ```bash
+--scheduler ddim
 ```
 Inference can the be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).
 ## Latent Audio Diffusion
 Rather than de-noising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of a much lower dimension, so it is much faster to train de-noising diffusion models and run inference with them. Secondly, similar images tend to be clustered together and interpolating between two images in latent space can produce meaningful combinations.
 At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
+#### Train latent diffusion model using pre-trained VAE
 ```bash
 accelerate launch ...
+...
+--vae teticio/latent-audio-diffusion-256
 ```
+#### Install dependencies to train with Stable Diffusion
+```bash
 pip install omegaconf pytorch_lightning
 pip install -e git+https://github.com/CompVis/stable-diffusion.git@main#egg=latent-diffusion
 pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
 ```
+#### Train an autoencoder
 ```bash
 python scripts/train_vae.py \
+--dataset_name teticio/audio-diffusion-256 \
+--batch_size 2 \
+--gradient_accumulation_steps 12
 ```
+#### Train latent diffusion model
 ```bash
 accelerate launch ...
+...
+--vae models/autoencoder-kl
 ```
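
The README text in this diff mentions interpolating between audios in the latent "noise" space with a `slerp` function. The diff does not show that function, so the following is only a minimal sketch of spherical linear interpolation in general. The signature, the use of flat Python lists in place of noise tensors, and the fallback to linear interpolation for nearly parallel inputs are all assumptions made for illustration, not the repository's implementation.

```python
import math


def slerp(x0, x1, alpha):
    """Spherically interpolate between two equal-length vectors.

    Hypothetical helper: x0 and x1 are flat lists of floats standing in
    for noise tensors; alpha=0 returns x0 and alpha=1 returns x1.
    """
    dot = sum(a * b for a, b in zip(x0, x1))
    norm0 = math.sqrt(sum(a * a for a in x0))
    norm1 = math.sqrt(sum(b * b for b in x1))
    # Angle between the two vectors, clamped to guard against rounding.
    cos_theta = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if abs(sin_theta) < 1e-8:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return [(1 - alpha) * a + alpha * b for a, b in zip(x0, x1)]
    w0 = math.sin((1 - alpha) * theta) / sin_theta
    w1 = math.sin(alpha * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


# Halfway between two orthogonal unit vectors stays on the unit circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(midpoint)
```

Unlike plain linear interpolation, the midpoint of two unit-norm vectors keeps unit norm here, which is the usual reason for preferring slerp when mixing Gaussian noise samples.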

setup.cfg CHANGED

@@ -3,6 +3,7 @@ name = audiodiffusion
 version = attr: audiodiffusion.VERSION
 description = Generate Mel spectrogram dataset from directory of audio files.
 long_description = file: README.md
 license = GPL3
 classifiers =
     Programming Language :: Python :: 3

 version = attr: audiodiffusion.VERSION
 description = Generate Mel spectrogram dataset from directory of audio files.
 long_description = file: README.md
+long_description_content_type = text/markdown
 license = GPL3
 classifiers =
     Programming Language :: Python :: 3
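
The added `long_description_content_type = text/markdown` line tells packaging tools that the README used as the long description is Markdown, which matters for how the project page renders after the PyPI release. As a rough sketch of how one might check this before uploading, the snippet below parses a trimmed copy of the configuration with the standard-library `configparser`. The `[metadata]` section header is assumed (the diff does not show it, though it is the usual setuptools location), and `readme_content_type` is a hypothetical helper, not part of the repository.

```python
import configparser

# Trimmed stand-in for setup.cfg; the [metadata] header is an assumption,
# since the diff only shows the keys below it.
SETUP_CFG = """\
[metadata]
name = audiodiffusion
long_description = file: README.md
long_description_content_type = text/markdown
license = GPL3
"""


def readme_content_type(cfg_text):
    """Return the declared long-description content type, or None if unset."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.get("metadata", "long_description_content_type", fallback=None)


content_type = readme_content_type(SETUP_CFG)
print(content_type)
```

A result of `None` would indicate the pre-commit state of the file, where the content type was not declared.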