# Pre Encoding
When training models on latents from a frozen pre-trained autoencoder, the encoder's outputs never change during training. Because of that, it is common to pre-encode audio to latents once and store them on disk instead of computing them on the fly during training. This can improve training throughput and free up GPU memory that would otherwise be used for encoding.
## Prerequisites
To pre-encode audio to latents, you'll need a dataset config file, an autoencoder model config file, and an **unwrapped** autoencoder checkpoint file.
## Run the Pre Encoding Script
To pre-encode latents from an autoencoder model, you can use `pre_encode.py`. This script will load a pre-trained autoencoder, encode the audio from your dataset into latents (or discrete tokens), and save them to disk in a format that can be easily loaded during training.
The `pre_encode.py` script accepts the following command line arguments:
- `--model-config`
- Path to model config
- `--ckpt-path`
- Path to **unwrapped** autoencoder model checkpoint
- `--model-half`
- If true, uses half precision for model weights
- Optional
- `--dataset-config`
- Path to dataset config file
- Required
- `--output-path`
- Path to output folder
- Required
- `--batch-size`
- Batch size for processing
- Optional, defaults to 1
- `--sample-size`
- Number of audio samples to pad/crop to for pre-encoding
- Optional, defaults to 1320960 (~30 seconds)
- `--is-discrete`
- If true, treats the model as discrete, saving discrete tokens instead of continuous latents
- Optional
- `--num-nodes`
  - Number of nodes to use for distributed processing, if available
- Optional, defaults to 1
- `--num-workers`
- Number of dataloader workers
- Optional, defaults to 4
- `--strategy`
- PyTorch Lightning strategy
- Optional, defaults to 'auto'
- `--limit-batches`
- Limits the number of batches processed
- Optional
- `--shuffle`
- If true, shuffles the dataset
- Optional
For example, if you wanted to encode latents with padding up to 30 seconds long in half precision, you could run the following:
```bash
$ python3 ./pre_encode.py \
--model-config /path/to/model/config.json \
--ckpt-path /path/to/autoencoder/model.ckpt \
--model-half \
--dataset-config /path/to/dataset/config.json \
--output-path /path/to/output/dir \
  --sample-size 1320960
```
When you run the above, the `--output-path` directory will contain numbered subdirectories for each GPU process used to encode the latents, and a `details.json` file that keeps track of settings used when the script was run.
Inside the numbered subdirectories, you will find the encoded latents as `.npy` files, along with associated `.json` metadata files.
```bash
/path/to/output/dir/
β”œβ”€β”€ 0
β”‚ β”œβ”€β”€ 0000000000000.json
β”‚ β”œβ”€β”€ 0000000000000.npy
β”‚ β”œβ”€β”€ 0000000000001.json
β”‚ β”œβ”€β”€ 0000000000001.npy
β”‚ β”œβ”€β”€ 0000000000002.json
β”‚ β”œβ”€β”€ 0000000000002.npy
...
└── details.json
```
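If you want to sanity-check the output before training, each `.npy` file loads directly with NumPy and its sibling `.json` file holds the associated metadata. A minimal sketch (the helper name and the metadata keys are illustrative, not part of the script's API):

```python
import json

import numpy as np

def load_example(npy_path: str, json_path: str):
    """Load one pre-encoded example: the latent array and its metadata dict."""
    latents = np.load(npy_path)
    with open(json_path) as f:
        metadata = json.load(f)
    return latents, metadata
```

For example, `load_example("/path/to/output/dir/0/0000000000000.npy", "/path/to/output/dir/0/0000000000000.json")` returns the latent array (inspect its shape to confirm the channel count and sequence length) and the metadata saved alongside it.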
## Training on Pre Encoded Latents
Once you have saved your latents to disk, you can use them to train a model by providing `train.py` with a dataset config file that points to the pre-encoded latents and sets `"dataset_type"` to `"pre_encoded"`. Under the hood, this will configure a `stable_audio_tools.data.dataset.PreEncodedDataset`.
The dataset config file should look something like this:
```json
{
"dataset_type": "pre_encoded",
"datasets": [
{
"id": "my_audio",
"path": "/path/to/output/dir",
"latent_crop_length": 645
}
],
"random_crop": false
}
```
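Note that `latent_crop_length` is measured in latent frames, not audio samples. A quick sketch of where the 645 above comes from, assuming the 2048x downsampling ratio used in the `--sample-size` discussion earlier (check your autoencoder config for the real ratio):

```python
# Sketch: deriving latent_crop_length from the pre-encoding sample size.
# DOWNSAMPLING_RATIO is an assumed value; read it from your model config.
DOWNSAMPLING_RATIO = 2048
sample_size = 1320960  # audio samples used at pre-encoding time

latent_crop_length = sample_size // DOWNSAMPLING_RATIO
print(latent_crop_length)  # 645, matching the config above
```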