update readme

- README.md +27 -2
- scripts/train_vae.py +4 -3
README.md
CHANGED
@@ -15,7 +15,10 @@ license: gpl-3.0
 
 ---
 
-**UPDATES**:
+**UPDATES**:
+
+15/10/2022
+Added latent audio diffusion (see below).
 
 4/10/2022
 It is now possible to mask parts of the input audio during generation which means you can stitch several samples together (think "out-painting").
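The masking update above works the way diffusion in-painting generally does: at each denoising step, the region you want to keep is overwritten with an appropriately noised copy of the input, so only the masked-out region is freely generated. The toy sketch below illustrates that idea with a stand-in denoising loop in plain numpy; `masked_generate` and the 0.9-decay "denoise" step are illustrative inventions, not this repo's API.

```python
import numpy as np

def masked_generate(known, mask, steps=50, rng=None):
    """Toy mask-guided generation: keep `known` where mask == 1, generate elsewhere.

    `known` is the input sample; `mask` is 1 where audio must be preserved and
    0 where it should be generated. The denoising step here is a placeholder.
    """
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(known.shape)  # start from pure noise
    for t in range(steps, 0, -1):
        x = 0.9 * x  # stand-in for one reverse-diffusion denoising update
        # re-impose the known region at the current noise level; the level
        # reaches 0 on the final step, so kept samples match the input exactly
        noise_level = (t - 1) / steps
        noised_known = known + noise_level * rng.standard_normal(known.shape)
        x = mask * noised_known + (1 - mask) * x
    return x
```

Stitching samples together ("out-painting") is then just masking the overlap region with the end of the previous sample before generating the next one.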
@@ -49,6 +52,7 @@ You can play around with some pretrained models on [Google Colab](https://colab.
 ```bash
 pip install .
 ```
+
 #### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results.
 
 ```bash
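To see why resolution and `hop_length` interact: each spectrogram column is one STFT frame, and frames advance by `hop_length` samples, so a 64-frame spectrogram at `hop_length` 1024 covers roughly 64 × 1024 = 65536 samples (about 3 s at 22050 Hz). A minimal numpy sketch of the frame arithmetic (a bare framed FFT, not the repo's Mel pipeline, and without windowing or Mel binning):

```python
import numpy as np

def spectrogram_frames(y, n_fft=2048, hop_length=1024):
    """Magnitude spectrogram via a simple framed FFT (no window, no Mel filterbank)."""
    frames = [y[i:i + n_fft] for i in range(0, len(y) - n_fft + 1, hop_length)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

sr, resolution, hop_length, n_fft = 22050, 64, 1024, 2048
# enough audio for exactly `resolution` frames: (frames - 1) * hop + one full window
y = np.random.default_rng(0).standard_normal((resolution - 1) * hop_length + n_fft)
S = spectrogram_frames(y, n_fft=n_fft, hop_length=hop_length)  # 64 frames x 1025 bins
```

A larger `hop_length` therefore trades time resolution for longer audio per image, which is why 1024 works better here than smaller hops.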
@@ -58,8 +62,8 @@ python scripts/audio_to_images.py \
   --input_dir path-to-audio-files \
   --output_dir path-to-output-data
 ```
-#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`).
 
+#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`).
 ```bash
 python scripts/audio_to_images.py \
   --resolution 256 \
@@ -67,6 +71,7 @@ python scripts/audio_to_images.py \
   --output_dir data/audio-diffusion-256 \
   --push_to_hub teticio/audio-diffusion-256
 ```
+
 ## Train model
 #### Run training on local machine.
 ```bash
@@ -83,6 +88,7 @@ accelerate launch --config_file config/accelerate_local.yaml \
   --lr_warmup_steps 500 \
   --mixed_precision no
 ```
+
 #### Run training on local machine with `batch_size` of 2 and `gradient_accumulation_steps` 8 to compensate, so that 256x256 resolution model fits on commercial grade GPU and push to hub.
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
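The `batch_size` 2 / `gradient_accumulation_steps` 8 combination works because summing appropriately scaled gradients over 8 micro-batches of 2 reproduces the gradient of a single batch of 16, so the effective batch size is unchanged while peak GPU memory only has to hold 2 samples. A numpy sketch of that equivalence for a least-squares loss (an illustration, not the training script):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((16, 3)), rng.standard_normal(16)
w = np.zeros(3)

def grad(Xb, yb, w):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb) ** 2)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                 # gradient of one batch of 16

accum = np.zeros(3)
for i in range(0, 16, 2):            # 8 micro-batches of size 2
    # divide by the number of accumulation steps so the sum is a mean over 16
    accum += grad(X[i:i + 2], y[i:i + 2], w) / 8
```

`accum` and `full` are identical up to floating-point error, which is exactly what `accelerate` does for you when `gradient_accumulation_steps` is set.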
@@ -101,6 +107,7 @@ accelerate launch --config_file config/accelerate_local.yaml \
   --hub_model_id audio-diffusion-256 \
   --hub_token $(cat $HOME/.huggingface/token)
 ```
+
 #### Run training on SageMaker.
 ```bash
 accelerate launch --config_file config/accelerate_sagemaker.yaml \
@@ -115,3 +122,21 @@ accelerate launch --config_file config/accelerate_sagemaker.yaml \
   --lr_warmup_steps 500 \
   --mixed_precision no
 ```
+## Latent Audio Diffusion
+Rather than denoising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of much lower dimension, so it is much faster to train denoising diffusion models and run inference with them. Secondly, as the latent space is really an array (tensor) of Gaussian variables with a particular mean, decoded images are invariant to Gaussian noise. And thirdly, similar images tend to be clustered together, and interpolating between two images in latent space can produce meaningful combinations.
+
+At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality, rather like its cousin `transformers` in the early days of development. In order to train a VAE (Variational Autoencoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format.
+
+#### Train an autoencoder.
+```bash
+python scripts/train_vae.py \
+  --dataset_name teticio/audio-diffusion-256 \
+  --batch_size 2 \
+  --gradient_accumulation_steps 12
+```
+
+#### Train latent diffusion model.
+```bash
+accelerate launch ...
+  --vae models/autoencoder-kl
+```
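On the point that similar images cluster in latent space and interpolation between them produces meaningful combinations: for roughly Gaussian latents, a common choice is spherical linear interpolation (slerp), which stays on the "shell" where the latents live rather than cutting through lower-norm regions as linear interpolation does. A hedged numpy sketch of slerp (the repo's actual interpolation code may differ):

```python
import numpy as np

def slerp(t, a, b):
    """Spherical interpolation from a (t=0) to b (t=1) along their great circle."""
    omega = np.arccos(np.clip(
        np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b)), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * a + t * b  # vectors are parallel: fall back to lerp
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
```

Decoding `slerp(t, z1, z2)` for a few values of `t` between two encoded spectrograms is then a cheap way to audition blends of two samples.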
scripts/train_vae.py
CHANGED
@@ -4,7 +4,6 @@
 
 # TODO
 # grayscale
-# update README
 
 import os
 import argparse
@@ -107,7 +106,7 @@ class ImageLogger(Callback):
 
 class HFModelCheckpoint(ModelCheckpoint):
 
-    def __init__(self, ldm_config, hf_checkpoint
+    def __init__(self, ldm_config, hf_checkpoint, *args, **kwargs):
         super().__init__(*args, **kwargs)
         self.ldm_config = ldm_config
         self.hf_checkpoint = hf_checkpoint
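The `__init__` fix above is the standard pattern for extending a callback: peel off the subclass's own parameters and forward everything else to the parent untouched via `*args, **kwargs`, so the base class keeps accepting all of its usual options. A self-contained sketch of the pattern with a stand-in base class (not pytorch_lightning's `ModelCheckpoint` itself):

```python
class Base:
    """Stand-in for a library callback with its own keyword options."""
    def __init__(self, dirpath=".", every_n_epochs=1):
        self.dirpath, self.every_n_epochs = dirpath, every_n_epochs

class HFCheckpoint(Base):
    def __init__(self, ldm_config, hf_checkpoint, *args, **kwargs):
        super().__init__(*args, **kwargs)  # Base still sees its own options
        self.ldm_config = ldm_config
        self.hf_checkpoint = hf_checkpoint

# subclass-specific args come first; base-class options pass straight through
cb = HFCheckpoint({"lr": 1e-4}, "models/autoencoder-kl", dirpath="ckpts")
```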
@@ -131,7 +130,9 @@ if __name__ == "__main__":
     parser.add_argument("--ldm_checkpoint_dir",
                         type=str,
                         default="models/ldm-autoencoder-kl")
-    parser.add_argument("--hf_checkpoint_dir",
+    parser.add_argument("--hf_checkpoint_dir",
+                        type=str,
+                        default="models/autoencoder-kl")
     parser.add_argument("-r",
                         "--resume_from_checkpoint",
                         type=str,