Pre-Encoding
When training models on latents from a frozen pre-trained autoencoder, the encoder's outputs never change during training. Because of that, it is common to pre-encode audio to latents once and store them on disk, instead of computing them on-the-fly during training. This improves training throughput and frees up GPU memory that would otherwise be used for encoding.
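The idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual pre_encode.py implementation: `pre_encode`, `dummy_encode`, and the downsampling ratio of 2048 are assumptions for the example.

```python
import os
import numpy as np

def pre_encode(audio_arrays, encode_fn, out_dir):
    """Run the frozen encoder once per example and cache each latent
    to disk as a .npy file (hypothetical sketch)."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, audio in enumerate(audio_arrays):
        latent = encode_fn(audio)  # frozen encoder: computed once, never again
        path = os.path.join(out_dir, f"{i:013d}.npy")
        np.save(path, latent)
        paths.append(path)
    return paths

def dummy_encode(audio, ratio=2048):
    """Stand-in for a real autoencoder: average blocks of `ratio` samples."""
    usable = (len(audio) // ratio) * ratio
    return audio[:usable].reshape(-1, ratio).mean(axis=1)
```

During training, the cached .npy files are loaded directly, so no encoder forward pass (and no encoder weights) are needed on the GPU.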
Prerequisites
To pre-encode audio to latents, you'll need a dataset config file, an autoencoder model config file, and an unwrapped autoencoder checkpoint file.
Run the Pre Encoding Script
To pre-encode latents from an autoencoder model, you can use pre_encode.py. This script will load a pre-trained autoencoder, encode your dataset's audio into latents or tokens, and save them to disk in a format that can be easily loaded during training.
The pre_encode.py script accepts the following command line arguments:
--model-config - Path to model config
--ckpt-path - Path to unwrapped autoencoder model checkpoint
--model-half - If true, uses half precision for model weights
  - Optional
--dataset-config - Path to dataset config file
  - Required
--output-path - Path to output folder
  - Required
--batch-size - Batch size for processing
  - Optional, defaults to 1
--sample-size - Number of audio samples to pad/crop to for pre-encoding
  - Optional, defaults to 1320960 (~30 seconds)
--is-discrete - If true, treats the model as discrete, saving discrete tokens instead of continuous latents
  - Optional
--num-nodes - Number of nodes to use for distributed processing, if available
  - Optional, defaults to 1
--num-workers - Number of dataloader workers
  - Optional, defaults to 4
--strategy - PyTorch Lightning strategy
  - Optional, defaults to 'auto'
--limit-batches - Limits the number of batches processed
  - Optional
--shuffle - If true, shuffles the dataset
  - Optional
For example, if you wanted to encode latents with padding up to 30 seconds long in half precision, you could run the following:
$ python3 ./pre_encode.py \
    --model-config /path/to/model/config.json \
    --ckpt-path /path/to/autoencoder/model.ckpt \
    --model-half \
    --dataset-config /path/to/dataset/config.json \
    --output-path /path/to/output/dir \
    --sample-size 1320960
When you run the above, the --output-path directory will contain numbered subdirectories for each GPU process used to encode the latents, and a details.json file that keeps track of settings used when the script was run.
Inside the numbered subdirectories, you will find the encoded latents as .npy files, along with associated .json metadata files.
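Each example can therefore be read back as a latent array plus its metadata. A minimal sketch of loading one pair, assuming the .json file holds a plain JSON object (the `load_latent` helper and the `stem` naming are illustrative, not part of the library's API):

```python
import json
import numpy as np

def load_latent(stem):
    """Load one pre-encoded example given its path without extension,
    e.g. "/path/to/output/dir/0/0000000000000"."""
    latent = np.load(stem + ".npy")
    with open(stem + ".json") as f:
        metadata = json.load(f)
    return latent, metadata
```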
/path/to/output/dir/
├── 0
│   ├── 0000000000000.json
│   ├── 0000000000000.npy
│   ├── 0000000000001.json
│   ├── 0000000000001.npy
│   ├── 0000000000002.json
│   ├── 0000000000002.npy
│   ...
└── details.json
Training on Pre-Encoded Latents
Once you have saved your latents to disk, you can use them to train a model by providing a dataset config file to train.py that points to the pre-encoded latents, with "dataset_type" set to "pre_encoded". Under the hood, this configures a stable_audio_tools.data.dataset.PreEncodedDataset.
The dataset config file should look something like this:
{
    "dataset_type": "pre_encoded",
    "datasets": [
        {
            "id": "my_audio",
            "path": "/path/to/output/dir",
            "latent_crop_length": 645
        }
    ],
    "random_crop": false
}
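The latent_crop_length is measured in latent frames, not audio samples. Assuming an autoencoder with a total downsampling ratio of 2048 samples per latent frame (an assumption for this example; check your model config), the value 645 lines up with the default pre-encoding sample_size:

```python
# Assumed: 2048 audio samples per latent frame (depends on your autoencoder).
sample_size = 1320960           # audio samples used during pre-encoding
downsampling_ratio = 2048       # assumed total encoder downsampling ratio
latent_crop_length = sample_size // downsampling_ratio
print(latent_crop_length)       # 645
```

If your autoencoder uses a different downsampling ratio, recompute latent_crop_length accordingly so the crop matches the intended audio duration.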