| # Pre Encoding |
|
|
| When training models on latents from a pre-trained autoencoder, the encoder is typically frozen. Because of that, it is common to pre-encode audio to latents and store them on disk instead of computing them on the fly during training. This can improve training throughput and free up GPU memory that would otherwise be used for encoding. |
|
|
| ## Prerequisites |
|
|
| To pre-encode audio to latents, you'll need a dataset config file, an autoencoder model config file, and an **unwrapped** autoencoder checkpoint file. |
|
|
| ## Run the Pre Encoding Script |
|
|
| To pre-encode audio with an autoencoder model, you can use `pre_encode.py`. This script loads a pre-trained autoencoder, encodes the audio into latents (or tokens, for discrete models), and saves them to disk in a format that can be easily loaded during training. |
|
|
| The `pre_encode.py` script accepts the following command line arguments: |
|
|
| - `--model-config` |
|   - Path to model config |
| - `--ckpt-path` |
|   - Path to **unwrapped** autoencoder model checkpoint |
| - `--model-half` |
|   - If true, uses half precision for model weights |
|   - Optional |
| - `--dataset-config` |
|   - Path to dataset config file |
|   - Required |
| - `--output-path` |
|   - Path to output folder |
|   - Required |
| - `--batch-size` |
|   - Batch size for processing |
|   - Optional, defaults to 1 |
| - `--sample-size` |
|   - Number of audio samples to pad/crop to for pre-encoding |
|   - Optional, defaults to 1320960 (~30 seconds) |
| - `--is-discrete` |
|   - If true, treats the model as discrete, saving discrete tokens instead of continuous latents |
|   - Optional |
| - `--num-nodes` |
|   - Number of nodes to use for distributed processing, if available |
|   - Optional, defaults to 1 |
| - `--num-workers` |
|   - Number of dataloader workers |
|   - Optional, defaults to 4 |
| - `--strategy` |
|   - PyTorch Lightning strategy |
|   - Optional, defaults to 'auto' |
| - `--limit-batches` |
|   - Limits the number of batches processed |
|   - Optional |
| - `--shuffle` |
|   - If true, shuffles the dataset |
|   - Optional |
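
To make the types and defaults listed above concrete, here is a minimal `argparse` sketch that mirrors the documented flags. It is a hypothetical reconstruction for illustration, not the actual parser inside `pre_encode.py`: the boolean flags are assumed to be simple switches, and `--model-config` and `--ckpt-path` are left non-required only because the list above does not say either way.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flags; not the script's own code.
    p = argparse.ArgumentParser(description="Sketch of the pre_encode.py CLI")
    p.add_argument("--model-config", type=str)  # required-ness not stated in the docs
    p.add_argument("--ckpt-path", type=str)     # required-ness not stated in the docs
    p.add_argument("--model-half", action="store_true")  # assumed to be a switch
    p.add_argument("--dataset-config", type=str, required=True)
    p.add_argument("--output-path", type=str, required=True)
    p.add_argument("--batch-size", type=int, default=1)
    p.add_argument("--sample-size", type=int, default=1320960)
    p.add_argument("--is-discrete", action="store_true")  # assumed to be a switch
    p.add_argument("--num-nodes", type=int, default=1)
    p.add_argument("--num-workers", type=int, default=4)
    p.add_argument("--strategy", type=str, default="auto")
    p.add_argument("--limit-batches", type=int, default=None)
    p.add_argument("--shuffle", action="store_true")  # assumed to be a switch
    return p


# Parse a minimal command line: only the two flags documented as required,
# plus --model-half to show how a switch behaves.
args = build_parser().parse_args([
    "--dataset-config", "/path/to/dataset/config.json",
    "--output-path", "/path/to/output/dir",
    "--model-half",
])
print(args.batch_size, args.sample_size, args.strategy, args.model_half)
# prints: 1 1320960 auto True
```

Flags that are omitted fall back to the documented defaults, so a short command line is enough for a first test run.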
|
|
| For example, if you wanted to encode latents with padding up to 30 seconds long in half precision, you could run the following: |
|
|
| ```bash |
| $ python3 ./pre_encode.py \ |
| --model-config /path/to/model/config.json \ |
| --ckpt-path /path/to/autoencoder/model.ckpt \ |
| --model-half \ |
| --dataset-config /path/to/dataset/config.json \ |
| --output-path /path/to/output/dir \ |
| --sample-size 1320960 |
| ``` |
|
|
| When you run the above, the `--output-path` directory will contain numbered subdirectories for each GPU process used to encode the latents, and a `details.json` file that keeps track of settings used when the script was run. |
|
|
| Inside the numbered subdirectories, you will find the encoded latents as `.npy` files, along with associated `.json` metadata files. |
|
|
| ```bash |
| /path/to/output/dir/ |
| ├── 0 |
| │   ├── 0000000000000.json |
| │   ├── 0000000000000.npy |
| │   ├── 0000000000001.json |
| │   ├── 0000000000001.npy |
| │   ├── 0000000000002.json |
| │   ├── 0000000000002.npy |
| ... |
| └── details.json |
| ``` |
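
As a quick sanity check on the output, the layout above can be read back with NumPy and the standard `json` module. The sketch below is an assumption-laden illustration: the helper names `iter_latent_files` and `load_latent_pair` are hypothetical, and the array shape and metadata key are placeholders written to a temporary directory so the example runs on its own. The real files may contain different shapes and fields.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def iter_latent_files(output_dir: Path):
    """Yield every .npy file in the numbered subdirectories, in sorted order."""
    for subdir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
        yield from sorted(subdir.glob("*.npy"))


def load_latent_pair(npy_path: Path):
    """Load one latent array and the .json metadata file that shares its stem."""
    latents = np.load(npy_path)
    with open(npy_path.with_suffix(".json")) as f:
        metadata = json.load(f)
    return latents, metadata


# Build a tiny stand-in for the output directory so the sketch runs anywhere.
# The shape (64, 645) and the metadata key are placeholders, not the real format.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "0").mkdir()
    np.save(root / "0" / "0000000000000.npy", np.zeros((64, 645), dtype=np.float32))
    (root / "0" / "0000000000000.json").write_text(
        json.dumps({"example_key": "example_value"})
    )

    files = list(iter_latent_files(root))
    latents, metadata = load_latent_pair(files[0])
    print(files[0].name, latents.shape, metadata)
```

To inspect a real run, point `iter_latent_files` at your own `--output-path` directory instead of the temporary one.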
|
|
| ## Training on Pre Encoded Latents |
|
|
| Once you have saved your latents to disk, you can train a model on them by passing `train.py` a dataset config file that points to the pre-encoded latents and sets `"dataset_type"` to `"pre_encoded"`. Under the hood, this will configure a `stable_audio_tools.data.dataset.PreEncodedDataset`. |
|
|
| The dataset config file should look something like this: |
|
|
| ```json |
| { |
|     "dataset_type": "pre_encoded", |
|     "datasets": [ |
|         { |
|             "id": "my_audio", |
|             "path": "/path/to/output/dir", |
|             "latent_crop_length": 645 |
|         } |
|     ], |
|     "random_crop": false |
| } |
| ``` |
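
The value `645` lines up with the `--sample-size` used earlier if the autoencoder emits one latent frame per 2048 audio samples. That ratio, and the 44.1 kHz sample rate used to convert samples to seconds, are assumptions for illustration only; neither is stated on this page, so check your own model config before reusing the numbers.

```python
SAMPLE_SIZE = 1320960      # --sample-size from the example command
SAMPLE_RATE = 44100        # assumption: 44.1 kHz audio
DOWNSAMPLING_RATIO = 2048  # assumption: audio samples per latent frame

# Number of latent frames produced for one padded/cropped audio clip.
latent_length = SAMPLE_SIZE // DOWNSAMPLING_RATIO

# Duration of the same clip in seconds.
duration_seconds = SAMPLE_SIZE / SAMPLE_RATE

print(latent_length)               # prints: 645
print(round(duration_seconds, 2))  # prints: 29.95
```

Under these assumptions, a `latent_crop_length` of 645 covers the full length of each saved latent.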