# Pre Encoding When training models on encoded latents from a frozen pre-trained autoencoder, the encoder is typically frozen. Because of that, it is common to pre-encode audio to latents and store them on disk instead of computing them on-the-fly during training. This can improve training throughput as well as free up GPU memory that would otherwise be used for encoding. ## Prerequisites To pre-encode audio to latents, you'll need a dataset config file, an autoencoder model config file, and an **unwrapped** autoencoder checkpoint file. ## Run the Pre Encoding Script To pre-encode latents from an autoencoder model, you can use `pre_encode.py`. This script will load a pre-trained autoencoder, encode the latents/tokens, and save them to disk in a format that can be easily loaded during training. The `pre_encode.py` script accepts the following command line arguments: - `--model-config` - Path to model config - `--ckpt-path` - Path to **unwrapped** autoencoder model checkpoint - `--model-half` - If true, uses half precision for model weights - Optional - `--dataset-config` - Path to dataset config file - Required - `--output-path` - Path to output folder - Required - `--batch-size` - Batch size for processing - Optional, defaults to 1 - `--sample-size` - Number of audio samples to pad/crop to for pre-encoding - Optional, defaults to 1320960 (~30 seconds) - `--is-discrete` - If true, treats the model as discrete, saving discrete tokens instead of continuous latents - Optional - `--num-nodes` - Number of nodes to use for distributed processing, if available. - Optional, defaults to 1 - `--num-workers` - Number of dataloader workers - Optional, defaults to 4 - `--strategy` - PyTorch Lightning strategy - Optional, defaults to 'auto' - `--limit-batches` - Limits the number of batches processed - Optional - `--shuffle` - If true, shuffles the dataset - Optional For example, if you wanted to encode latents with padding up to 30 seconds long in half precision, you could run the following: ```bash $ python3 ./pre_encode.py \ --model-config /path/to/model/config.json \ --ckpt-path /path/to/autoencoder/model.ckpt \ --model-half \ --dataset-config /path/to/dataset/config.json \ --output-path /path/to/output/dir \ --sample-size 1320960 \ ``` When you run the above, the `--output-path` directory will contain numbered subdirectories for each GPU process used to encode the latents, and a `details.json` file that keeps track of settings used when the script was run. Inside the numbered subdirectories, you will find the encoded latents as `.npy` files, along with associated `.json` metadata files. ```bash /path/to/output/dir/ ├── 0 │ ├── 0000000000000.json │ ├── 0000000000000.npy │ ├── 0000000000001.json │ ├── 0000000000001.npy │ ├── 0000000000002.json │ ├── 0000000000002.npy ... └── details.json ``` ## Training on Pre Encoded Latents Once you have saved your latents to disk, you can use them to train a model by providing a dataset config file to `train.py` that points to the pre-encoded latents, specifying `"dataset_type"` is `"pre_encoded"`. Under the hood, this will configure a `stable_audio_tools.data.dataset.PreEncodedDataset`. The dataset config file should look something like this: ```json { "dataset_type": "pre_encoded", "datasets": [ { "id": "my_audio", "path": "/path/to/output/dir", "latent_crop_length": 645 } ], "random_crop": false } ```