update readme

- README.md +27 -2
- scripts/train_vae.py +4 -3
README.md
CHANGED
@@ -15,7 +15,10 @@ license: gpl-3.0
 
 ---
 
-**UPDATES**:
+**UPDATES**:
+
+15/10/2022
+Added latent audio diffusion (see below).
 
 4/10/2022
 It is now possible to mask parts of the input audio during generation which means you can stitch several samples together (think "out-painting").
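The masking update above works the way diffusion in-painting generally does: at each denoising step, the region you want to keep is overwritten with an appropriately noised copy of the input, so only the masked-out region is freely generated. The toy sketch below illustrates that idea with a stand-in denoising loop in plain numpy; `masked_generate` and the 0.9-decay "denoise" step are illustrative inventions, not this repo's API.

```python
import numpy as np

def masked_generate(known, mask, steps=50, rng=None):
    """Toy mask-guided generation: keep `known` where mask == 1, generate elsewhere.

    `known` is the input sample; `mask` is 1 where audio must be preserved and
    0 where it should be generated. The denoising step here is a placeholder.
    """
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(known.shape)  # start from pure noise
    for t in range(steps, 0, -1):
        x = 0.9 * x  # stand-in for one reverse-diffusion denoising update
        # re-impose the known region at the current noise level; the level
        # reaches 0 on the final step, so kept samples match the input exactly
        noise_level = (t - 1) / steps
        noised_known = known + noise_level * rng.standard_normal(known.shape)
        x = mask * noised_known + (1 - mask) * x
    return x
```

Stitching samples together ("out-painting") is then just masking the overlap region with the end of the previous sample before generating the next one.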
@@ -49,6 +52,7 @@ You can play around with some pretrained models on [Google Colab](https://colab.
 ```bash
 pip install .
 ```
+
 #### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results.
 
 ```bash
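To see why resolution and `hop_length` interact: each spectrogram column is one STFT frame, and frames advance by `hop_length` samples, so a 64-frame spectrogram at `hop_length` 1024 covers roughly 64 × 1024 = 65536 samples (about 3 s at 22050 Hz). A minimal numpy sketch of the frame arithmetic (a bare framed FFT, not the repo's Mel pipeline, and without windowing or Mel binning):

```python
import numpy as np

def spectrogram_frames(y, n_fft=2048, hop_length=1024):
    """Magnitude spectrogram via a simple framed FFT (no window, no Mel filterbank)."""
    frames = [y[i:i + n_fft] for i in range(0, len(y) - n_fft + 1, hop_length)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

sr, resolution, hop_length, n_fft = 22050, 64, 1024, 2048
# enough audio for exactly `resolution` frames: (frames - 1) * hop + one full window
y = np.random.default_rng(0).standard_normal((resolution - 1) * hop_length + n_fft)
S = spectrogram_frames(y, n_fft=n_fft, hop_length=hop_length)  # 64 frames x 1025 bins
```

A larger `hop_length` therefore trades time resolution for longer audio per image, which is why 1024 works better here than smaller hops.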
@@ -58,8 +62,8 @@ python scripts/audio_to_images.py \
   --input_dir path-to-audio-files \
   --output_dir path-to-output-data
 ```
-#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`).
 
+#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`).
 ```bash
 python scripts/audio_to_images.py \
   --resolution 256 \
@@ -67,6 +71,7 @@ python scripts/audio_to_images.py \
   --output_dir data/audio-diffusion-256 \
   --push_to_hub teticio/audio-diffusion-256
 ```
+
 ## Train model
 #### Run training on local machine.
 ```bash
@@ -83,6 +88,7 @@ accelerate launch --config_file config/accelerate_local.yaml \
   --lr_warmup_steps 500 \
   --mixed_precision no
 ```
+
 #### Run training on local machine with `batch_size` of 2 and `gradient_accumulation_steps` 8 to compensate, so that 256x256 resolution model fits on commercial grade GPU and push to hub.
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
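The `batch_size` 2 / `gradient_accumulation_steps` 8 combination works because summing appropriately scaled gradients over 8 micro-batches of 2 reproduces the gradient of a single batch of 16, so the effective batch size is unchanged while peak GPU memory only has to hold 2 samples. A numpy sketch of that equivalence for a least-squares loss (an illustration, not the training script):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((16, 3)), rng.standard_normal(16)
w = np.zeros(3)

def grad(Xb, yb, w):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb) ** 2)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                 # gradient of one batch of 16

accum = np.zeros(3)
for i in range(0, 16, 2):            # 8 micro-batches of size 2
    # divide by the number of accumulation steps so the sum is a mean over 16
    accum += grad(X[i:i + 2], y[i:i + 2], w) / 8
```

`accum` and `full` are identical up to floating-point error, which is exactly what `accelerate` does for you when `gradient_accumulation_steps` is set.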
@@ -101,6 +107,7 @@ accelerate launch --config_file config/accelerate_local.yaml \
   --hub_model_id audio-diffusion-256 \
   --hub_token $(cat $HOME/.huggingface/token)
 ```
+
 #### Run training on SageMaker.
 ```bash
 accelerate launch --config_file config/accelerate_sagemaker.yaml \
@@ -115,3 +122,21 @@ accelerate launch --config_file config/accelerate_sagemaker.yaml \
   --lr_warmup_steps 500 \
   --mixed_precision no
 ```
+## Latent Audio Diffusion
+Rather than denoising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of much lower dimension, so it is much faster to train denoising diffusion models and run inference with them. Secondly, as the latent space is really an array (tensor) of Gaussian variables with a particular mean, decoded images are invariant to Gaussian noise. And thirdly, similar images tend to be clustered together, and interpolating between two images in latent space can produce meaningful combinations.
+
+At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality, rather like its cousin `transformers` in the early days of development. In order to train a VAE (Variational Autoencoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format.
+
+#### Train an autoencoder.
+```bash
+python scripts/train_vae.py \
+  --dataset_name teticio/audio-diffusion-256 \
+  --batch_size 2 \
+  --gradient_accumulation_steps 12
+```
+
+#### Train latent diffusion model.
+```bash
+accelerate launch ...
+  --vae models/autoencoder-kl
+```
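On the point that similar images cluster in latent space and interpolation between them produces meaningful combinations: for roughly Gaussian latents, a common choice is spherical linear interpolation (slerp), which stays on the "shell" where the latents live rather than cutting through lower-norm regions as linear interpolation does. A hedged numpy sketch of slerp (the repo's actual interpolation code may differ):

```python
import numpy as np

def slerp(t, a, b):
    """Spherical interpolation from a (t=0) to b (t=1) along their great circle."""
    omega = np.arccos(np.clip(
        np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b)), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * a + t * b  # vectors are parallel: fall back to lerp
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
```

Decoding `slerp(t, z1, z2)` for a few values of `t` between two encoded spectrograms is then a cheap way to audition blends of two samples.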
scripts/train_vae.py
CHANGED
@@ -4,7 +4,6 @@
 
 # TODO
 # grayscale
-# update README
 
 import os
 import argparse
@@ -107,7 +106,7 @@ class ImageLogger(Callback):
 
 class HFModelCheckpoint(ModelCheckpoint):
 
-    def __init__(self, ldm_config, hf_checkpoint
+    def __init__(self, ldm_config, hf_checkpoint, *args, **kwargs):
         super().__init__(*args, **kwargs)
         self.ldm_config = ldm_config
         self.hf_checkpoint = hf_checkpoint
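The `__init__` fix above is the standard pattern for extending a callback: peel off the subclass's own parameters and forward everything else to the parent untouched via `*args, **kwargs`, so the base class keeps accepting all of its usual options. A self-contained sketch of the pattern with a stand-in base class (not pytorch_lightning's `ModelCheckpoint` itself):

```python
class Base:
    """Stand-in for a library callback with its own keyword options."""
    def __init__(self, dirpath=".", every_n_epochs=1):
        self.dirpath, self.every_n_epochs = dirpath, every_n_epochs

class HFCheckpoint(Base):
    def __init__(self, ldm_config, hf_checkpoint, *args, **kwargs):
        super().__init__(*args, **kwargs)  # Base still sees its own options
        self.ldm_config = ldm_config
        self.hf_checkpoint = hf_checkpoint

# subclass-specific args come first; base-class options pass straight through
cb = HFCheckpoint({"lr": 1e-4}, "models/autoencoder-kl", dirpath="ckpts")
```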
@@ -131,7 +130,9 @@ if __name__ == "__main__":
     parser.add_argument("--ldm_checkpoint_dir",
                         type=str,
                         default="models/ldm-autoencoder-kl")
-    parser.add_argument("--hf_checkpoint_dir",
+    parser.add_argument("--hf_checkpoint_dir",
+                        type=str,
+                        default="models/autoencoder-kl")
     parser.add_argument("-r",
                         "--resume_from_checkpoint",
                         type=str,