| # AudioCraft training pipelines |
|
|
| AudioCraft training pipelines are built on PyTorch as our core deep learning library, |
| [Flashy](https://github.com/facebookresearch/flashy) as our training pipeline design library, |
| and [Dora](https://github.com/facebookresearch/dora) as our experiment manager. |
| They are designed to be research and experiment-friendly. |
|
|
|
|
| ## Environment setup |
|
|
| For the base installation, follow the instructions from the [README.md](../README.md). |
| Below are some additional instructions for setting up the environment to train new models. |
|
|
| ### Team and cluster configuration |
|
|
| In order to support multiple teams and clusters, AudioCraft uses an environment configuration. |
| The team configuration lets you specify cluster-specific settings (e.g. SLURM configuration) |
| and convenient mappings of paths between the supported environments. |
|
|
| Each team can have a yaml file under the [configuration folder](../config). To select a team, set the |
| `AUDIOCRAFT_TEAM` environment variable to a valid team name (e.g. `labs` or `default`): |
| ```shell |
| conda env config vars set AUDIOCRAFT_TEAM=default |
| ``` |
|
|
| Alternatively, you can add it to your `.bashrc`: |
| ```shell |
| export AUDIOCRAFT_TEAM=default |
| ``` |
|
|
| If not defined, the environment will default to the `default` team. |
|
|
| The cluster is automatically detected, but it is also possible to override it by setting |
| the `AUDIOCRAFT_CLUSTER` environment variable. |
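
The team selection described above can be sketched in a few lines of Python. This is only an illustration, not AudioCraft's actual code: the helper name `resolve_team_config` and the `config/teams/<team>.yaml` layout are assumptions based on the `teams` config group mentioned later in this document.

```python
import os
from pathlib import Path


def resolve_team_config(config_root: str = "config") -> Path:
    """Return the team YAML path implied by AUDIOCRAFT_TEAM (hypothetical helper)."""
    # Fall back to the `default` team when the variable is not defined.
    team = os.environ.get("AUDIOCRAFT_TEAM", "default")
    # Assumed layout: one YAML file per team inside the `teams` config group.
    return Path(config_root) / "teams" / f"{team}.yaml"
```

With `AUDIOCRAFT_TEAM` unset, this sketch points at the `default` team file, mirroring the fallback behavior described above.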
|
|
| Based on this team and cluster, the environment is then configured with: |
| * The dora experiment outputs directory. |
| * The available Slurm partitions, categorized as global or team-specific. |
| * A shared reference directory: in order to facilitate sharing research models while remaining |
| agnostic to the compute cluster in use, we created the `//reference` symbol that can be used in |
| YAML config to point to a defined reference folder containing shared checkpoints |
| (e.g. baselines, models for evaluation...). |
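
To make the `//reference` symbol concrete, here is a minimal sketch of how such a prefix could be expanded into a cluster-specific folder. The function name `resolve_reference_path` and the example paths are hypothetical; the actual resolution logic in AudioCraft may differ.

```python
REFERENCE_PREFIX = "//reference"


def resolve_reference_path(path: str, reference_dir: str) -> str:
    """Expand a leading `//reference` into the given reference directory.

    Hypothetical helper: paths that do not start with the symbol are
    returned unchanged.
    """
    if path == REFERENCE_PREFIX or path.startswith(REFERENCE_PREFIX + "/"):
        suffix = path[len(REFERENCE_PREFIX):]
        return reference_dir.rstrip("/") + suffix
    return path
```

For instance, under these assumptions `resolve_reference_path("//reference/baselines/model.th", "/shared/ref")` would yield a path inside `/shared/ref`, while an ordinary absolute path is left as-is.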
|
|
| **Important:** The default output dir for trained models and checkpoints is under `/tmp/`. This is suitable |
| only for quick testing. If you are doing anything serious you MUST edit the file `default.yaml` and |
| properly set the `dora_dir` entries. |
|
|
| #### Overriding environment configurations |
|
|
| You can set the following environment variables to bypass the team's environment configuration: |
| * `AUDIOCRAFT_CONFIG`: absolute path to a team config yaml file. |
| * `AUDIOCRAFT_DORA_DIR`: absolute path to a custom dora directory. |
| * `AUDIOCRAFT_REFERENCE_DIR`: absolute path to the shared reference directory. |
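
The precedence implied by these variables (environment variable first, team configuration second) can be sketched as follows. This is an illustration under assumptions: the flat `team_cfg` mapping and its `dora_dir` / `reference_dir` keys are simplifications, and `apply_env_overrides` is a hypothetical helper rather than an AudioCraft function.

```python
import os
import typing as tp

# Assumed mapping from config keys to the overriding environment variables.
ENV_OVERRIDES = {
    "dora_dir": "AUDIOCRAFT_DORA_DIR",
    "reference_dir": "AUDIOCRAFT_REFERENCE_DIR",
}


def apply_env_overrides(team_cfg: tp.Mapping[str, str]) -> tp.Dict[str, str]:
    """Return a copy of the team config where set environment variables win."""
    resolved = dict(team_cfg)
    for key, variable in ENV_OVERRIDES.items():
        value = os.environ.get(variable)
        if value:  # unset or empty variables leave the team value untouched
            resolved[key] = value
    return resolved
```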
|
|
| ## Training pipelines |
|
|
| Each task supported in AudioCraft has its own training pipeline and dedicated solver. |
| Learn more about solvers and the key design choices behind AudioCraft training pipelines below. |
| Please refer to the documentation of each task and model for specific information on a given task. |
|
|
|
|
| ### Solvers |
|
|
| The core training component in AudioCraft is the solver. A solver holds the definition |
| of how to solve a given task: it implements the training pipeline logic, combining the datasets, |
| model, optimization criterion and components, and the full training loop. We refer the reader |
| to [Flashy](https://github.com/facebookresearch/flashy) for core principles around solvers. |
|
|
| AudioCraft provides an initial solver, the `StandardSolver`, that is used as the base implementation |
| for downstream solvers. This standard solver handles logging, checkpoint loading/saving, |
| XP restoration, etc. on top of the base Flashy implementation. |
| In AudioCraft, we assume that all tasks follow the same set of stages: |
| train, valid, evaluate and generate, each relying on a dedicated dataset. |
|
|
| Each solver is responsible for defining the task to solve and the associated stages |
| of the training loop in order to leave the full ownership of the training pipeline |
| to the researchers. This includes loading the datasets, building the model and |
| optimisation components, registering them and defining the execution of each stage. |
| To create a new solver for a given task, one should extend the `StandardSolver` |
| and define each stage of the training loop. Alternatively, one can write a custom solver |
| from scratch instead of inheriting from the standard solver. |
|
|
| ```python |
| import typing as tp |
|
| import omegaconf |
| import torch |
|
| from . import base |
| from .. import optim |
|
|
| class MyNewSolver(base.StandardSolver): |
|
|     def __init__(self, cfg: omegaconf.DictConfig): |
|         super().__init__(cfg) |
|         # one can add custom attributes to the solver |
|         self.criterion = torch.nn.L1Loss() |
|
|     def best_metric(self): |
|         # here optionally specify which metric to use to keep track of best state |
|         return 'loss' |
|
|     def build_model(self): |
|         # here you can instantiate your models and optimization related objects |
|         # this method will be called by the StandardSolver init method |
|         self.model = ... |
|         # the self.cfg attribute contains the raw configuration |
|         self.optimizer = optim.build_optimizer(self.model.parameters(), self.cfg.optim) |
|         # don't forget to register the states you'd like to include in your checkpoints! |
|         self.register_stateful('model', 'optimizer') |
|         # keep the model best state based on the best value achieved at validation for the given best_metric |
|         self.register_best('model') |
|         # if you want to add EMA around the model |
|         self.register_ema('model') |
|
|     def build_dataloaders(self): |
|         # here you can instantiate your dataloaders |
|         # this method will be called by the StandardSolver init method |
|         self.dataloaders = ... |
|
|     ... |
|
|     # For both train and valid stages, the StandardSolver relies on |
|     # a shared common_train_valid implementation that is in charge of |
|     # accessing the appropriate loader, iterating over the data up to |
|     # the specified number of updates_per_epoch, running the ``run_step`` |
|     # function that you need to implement to specify the behavior, |
|     # and finally updating the EMA and collecting the metrics properly. |
|     # run_step is abstract in the StandardSolver, so it must be defined here. |
|     def run_step(self, idx: int, batch: tp.Any, metrics: dict): |
|         """Perform one training or valid step on a given batch. |
|         """ |
|         ... # provide your implementation of the solver over a batch |
|
|     def train(self): |
|         """Train stage. |
|         """ |
|         return self.common_train_valid('train') |
|
|     def valid(self): |
|         """Valid stage. |
|         """ |
|         return self.common_train_valid('valid') |
|
|     # evaluate is abstract in the StandardSolver, so it must be defined here. |
|     def evaluate(self): |
|         """Evaluate stage. |
|         """ |
|         ... # provide your implementation here! |
|
|     # generate is abstract in the StandardSolver, so it must be defined here. |
|     def generate(self): |
|         """Generate stage. |
|         """ |
|         ... # provide your implementation here! |
| ``` |
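
To illustrate the shared train/valid loop described in the comments above, here is a toy, framework-free sketch. It is not the `StandardSolver` implementation: `common_train_valid_sketch` and `l1_step` are hypothetical names, the EMA update is omitted, and we assume that `run_step` returns the metrics dictionary for the step.

```python
import typing as tp
from collections import defaultdict


def common_train_valid_sketch(
    loader: tp.Iterable[tp.Any],
    run_step: tp.Callable[[int, tp.Any, dict], dict],
    updates_per_epoch: int,
) -> tp.Dict[str, float]:
    """Run `run_step` on at most `updates_per_epoch` batches and average the metrics."""
    totals: tp.Dict[str, float] = defaultdict(float)
    count = 0
    for idx, batch in enumerate(loader):
        if idx >= updates_per_epoch:
            break  # stop once the requested number of updates is reached
        metrics = run_step(idx, batch, {})
        for name, value in metrics.items():
            totals[name] += float(value)
        count += 1
    if count == 0:
        return {}
    # Report the mean of each metric over the steps that actually ran.
    return {name: total / count for name, total in totals.items()}


def l1_step(idx: int, batch: tp.Tuple[float, float], metrics: dict) -> dict:
    """Example step: absolute error between a scalar prediction and its target."""
    prediction, target = batch
    metrics["loss"] = abs(prediction - target)
    return metrics
```

Calling `common_train_valid_sketch(batches, l1_step, updates_per_epoch=2)` on three batches only consumes the first two, which is the behavior that decouples an epoch from a full pass over the dataset.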
|
|
| ### About Epochs |
|
|
| AudioCraft solvers use the concept of an Epoch. One Epoch doesn't necessarily mean one pass over the entire |
| dataset, but instead represents the smallest amount of computation that we want to work with before checkpointing. |
| Typically, we find that an Epoch time of around 30 minutes is ideal both in terms of safety (checkpointing often enough) |
| and getting updates often enough. One Epoch is at least a `train` stage that lasts for `optim.updates_per_epoch` updates (2000 by default), |
| and a `valid` stage. You can control how long the valid stage takes with `dataset.valid.num_samples`. |
| Other stages (`evaluate`, `generate`) will only happen every X epochs, as given by `evaluate.every` and `generate.every`. |
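
The stage schedule above can be sketched as a small function. This is a simplified illustration: `stages_for_epoch` is a hypothetical helper, and the rule used here (epochs counted from 1, a stage runs when the epoch number is a multiple of its `every` value) is an assumption about how `evaluate.every` and `generate.every` are interpreted.

```python
import typing as tp


def stages_for_epoch(
    epoch: int,
    evaluate_every: tp.Optional[int],
    generate_every: tp.Optional[int],
) -> tp.List[str]:
    """List the stages that would run at a given epoch (counted from 1)."""
    # `train` and `valid` are part of every epoch.
    stages = ["train", "valid"]
    optional = (("evaluate", evaluate_every), ("generate", generate_every))
    for name, every in optional:
        # A missing or zero period disables the stage in this sketch.
        if every and epoch % every == 0:
            stages.append(name)
    return stages
```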
|
|
|
|
| ### Models |
|
|
| In AudioCraft, a model is a container object that wraps one or more torch modules together |
| with potential processing logic to use in a solver. For example, a model would wrap an encoder module, |
| a quantisation bottleneck module, a decoder and some tensor processing logic. Each of the previous components |
| can be considered as a small « model unit » on its own but the container model is a practical component |
| to manipulate and train a set of modules together. |
|
|
| ### Datasets |
|
|
| See the [dedicated documentation on datasets](./DATASETS.md). |
|
|
| ### Metrics |
|
|
| See the [dedicated documentation on metrics](./METRICS.md). |
|
|
| ### Conditioners |
|
|
| AudioCraft language models can be conditioned in various ways and the codebase offers a modular implementation |
| of different conditioners that can be potentially combined together. |
| Learn more in the [dedicated documentation on conditioning](./CONDITIONING.md). |
|
|
| ### Configuration |
|
|
| AudioCraft's configuration is defined in yaml files and the framework relies on |
| [hydra](https://hydra.cc/docs/intro/) and [omegaconf](https://omegaconf.readthedocs.io/) to parse |
| and manipulate the configuration through Dora. |
|
|
| ##### :warning: Important considerations around configurations |
|
|
| Our configuration management relies on Hydra and the concept of group configs to structure |
| and compose configurations. Updating the root default configuration files will then have |
| an impact on all solvers and tasks. |
| **One should never change the default configuration files. Instead they should use Hydra config groups in order to store custom configuration.** |
| Once this configuration is created and used for running experiments, you should not edit it anymore. |
|
|
| Note that as we are using Dora as our experiment manager, all our experiment tracking is based on |
| signatures computed from the delta between configurations. |
| **One must therefore ensure backward compatibility of the configuration at all times.** |
| See [Dora's README](https://github.com/facebookresearch/dora) and the |
| [section below introducing Dora](#running-experiments-with-dora). |
|
|
| ##### Configuration structure |
|
|
| The configuration is organized in config groups: |
| * `conditioner`: default values for conditioning modules. |
| * `dset`: contains all data source related information (paths to manifest files |
| and metadata for a given dataset). |
| * `model`: contains configuration for each model defined in AudioCraft and configurations |
| for different variants of models. |
| * `solver`: contains the default configuration for each solver as well as configuration |
| for each solver task, combining all the above components. |
| * `teams`: contains the cluster configuration per team. See environment setup for more details. |
|
|
| The `config.yaml` file is the main configuration that composes the above groups |
| and contains default configuration for AudioCraft. |
|
|
| ##### Solver's core configuration structure |
|
|
| The core configuration structure shared across solvers is available in `solvers/default.yaml`. |
|
|
| ##### Other configuration modules |
|
|
| AudioCraft configuration contains the different setups we used for our research and publications. |
|
|
| ## Running experiments with Dora |
|
|
| ### Launching jobs |
|
|
| Try launching jobs for different tasks locally with `dora run`: |
|
|
| ```shell |
| # run compression task with lightweight encodec |
| dora run solver=compression/debug |
| ``` |
|
|
| Most of the time, the jobs are launched through dora grids, for example: |
|
|
| ```shell |
| # run compression task through debug grid |
| dora grid compression.debug |
| ``` |
|
|
| Learn more about running experiments with Dora below. |
|
|
| ### A small introduction to Dora |
|
|
| [Dora](https://github.com/facebookresearch/dora) is the experiment manager tool used in AudioCraft. |
| Check out the README to learn how Dora works. Here is a quick summary of what to know: |
| * An XP is a unique set of hyper-parameters with a given signature. The signature is a hash |
| of those hyper-parameters. We always refer to an XP by its signature, e.g. 9357e12e. We will see |
| below that one can retrieve the hyper-params and rerun it in a single command. |
| * In fact, the hash is defined as a delta between the base config and the one obtained |
| with the config overrides you passed from the command line. This means you must never change |
| the `conf/**.yaml` files directly, except for editing things like paths. Changing the default values |
| in the config files means the XP signature won't reflect that change, and wrong checkpoints might be reused. |
| I know, this is annoying, but the reason is that otherwise, any change to the config file would mean |
| that all XPs run so far would see their signature change. |
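
The following toy sketch shows why a signature based on the delta cannot see changes to default values. It is not Dora's actual hashing scheme: the flat dotted-key dictionaries, the SHA-1 digest truncated to 8 characters, and the names `config_delta` and `toy_signature` are all assumptions made for illustration.

```python
import hashlib
import json


def config_delta(base: dict, overrides: dict) -> dict:
    """Keep only the overrides whose value differs from the base config."""
    return {key: value for key, value in overrides.items() if base.get(key) != value}


def toy_signature(base: dict, overrides: dict) -> str:
    """Hash the delta (not the full config) into a short hexadecimal signature."""
    delta = config_delta(base, overrides)
    # Sorting keys makes the signature independent of override order.
    payload = json.dumps(delta, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:8]


base = {"dataset.batch_size": 64, "optim.lr": 1e-4}
edited_base = {"dataset.batch_size": 64, "optim.lr": 1e-3}  # a default was edited in place
overrides = {"dataset.batch_size": 32}

# Same overrides, different defaults: the toy signature is identical,
# so an XP folder (and its checkpoints) would be silently reused.
same_signature = toy_signature(base, overrides) == toy_signature(edited_base, overrides)
```

In this sketch `same_signature` is true even though the learning rate differs between the two runs, which is exactly the situation the rule above is meant to prevent.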
|
|
| #### Dora commands |
|
|
| ```shell |
| dora info -f 81de367c # this will show the hyper-parameters used by a specific XP. |
| # Be careful: some overrides might be present twice, and the rightmost one |
| # will give you the right value for it. |
| |
| dora run -d -f 81de367c # run an XP with the hyper-parameters from XP 81de367c. |
| # `-d` is for distributed, it will use all available GPUs. |
| |
| dora run -d -f 81de367c dataset.batch_size=32 # start from the config of XP 81de367c but change some hyper-params. |
| # This will give you a new XP with a new signature (e.g. 3fe9c332). |
| |
| dora info -f SIG -t # will tail the log (if the XP has been scheduled). |
| # If you need to access the logs of the process for rank > 0, in particular because a crash didn't happen in the main |
| # process, then use `dora info -f SIG` to get the main log name (ending in something like `/5037674_0_0_log.out`); |
| # worker K can then be accessed as `/5037674_0_{K}_log.out`. |
| # This is only for scheduled jobs. For local distributed runs with `-d`, you should go into the XP folder |
| # and look for `worker_{K}.log` logs. |
| ``` |
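
As a convenience, the worker log naming described in the comments above can be turned into a small helper. This is a sketch that only relies on the `<...>_<rank>_log.out` pattern shown in the example; `worker_log_name` is a hypothetical function and not part of Dora.

```python
import re


def worker_log_name(main_log: str, worker: int) -> str:
    """Derive the log name of worker `worker` from the main (rank 0) log name."""
    # The last number before `_log.out` is taken to be the worker rank.
    match = re.fullmatch(r"(.*_)(\d+)(_log\.out)", main_log)
    if match is None:
        raise ValueError(f"Unexpected log name: {main_log!r}")
    prefix, _rank, suffix = match.groups()
    return f"{prefix}{worker}{suffix}"
```

For example, given the main log name `/5037674_0_0_log.out` from the comments above, asking for worker 3 swaps only the last numeric field.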
|
|
| An XP runs from a specific folder based on its signature, under the |
| `<cluster_specific_path>/<user>/experiments/audiocraft/outputs/` folder. |
| You can safely interrupt a training run and resume it: it will reuse any existing checkpoint, |
| as it will reuse the same folder. If you made some changes to the code and need to ignore |
| a previous checkpoint, you can use `dora run --clear [RUN ARGS]`. |
|
|
| If you have a Slurm cluster, you can also use the dora grid command, e.g. |
|
|
| ```shell |
| # run a dummy grid located at `audiocraft/grids/my_grid_folder/my_grid_name.py` |
| dora grid my_grid_folder.my_grid_name |
| # Running the following will simply display the grid and also initialize the Dora experiments database. |
| # You can then simply refer to a config using its signature (e.g. as `dora run -f SIG`). |
| dora grid my_grid_folder.my_grid_name --dry_run --init |
| ``` |
|
|
| Please refer to the [Dora documentation](https://github.com/facebookresearch/dora) for more information. |
|
|
|
|
| #### Clearing up past experiments |
|
|
| ```shell |
| # This will cancel all the XPs and delete their folder and checkpoints. |
| # It will then reschedule them starting from scratch. |
| dora grid my_grid_folder.my_grid_name --clear |
| # The following will delete the folder and checkpoint for a single XP, |
| # and then run it afresh. |
| dora run [-f BASE_SIG] [ARGS] --clear |
| ``` |
|
|