---
title: Dataset Preprocessing
description: How datasets are processed
---

## Overview

Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
the [dataset format](dataset-formats) and prompt strategies to:

- parse the dataset based on the *dataset format*
- transform the dataset to how you would interact with the model based on the *prompt strategy*
- tokenize the dataset based on the configured model & tokenizer
- shuffle and merge multiple datasets together if using more than one
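
For example, a minimal `datasets:` entry ties each dataset to the prompt strategy used to parse
and transform it (the dataset path below is hypothetical):

```yaml
datasets:
  - path: my-org/my-instruction-data  # hypothetical dataset on the HF Hub
    type: alpaca                      # dataset format / prompt strategy
```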

The processing of the datasets can happen in one of two ways:

1. Before kicking off training by calling `axolotl preprocess config.yaml --debug`
2. When training is started
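
For example, from a shell (assuming your config lives at `config.yaml`):

```bash
# Option 1: pre-process ahead of time (--debug enables extra debugging output)
axolotl preprocess config.yaml --debug

# Option 2: skip the explicit step and let processing happen when training starts
axolotl train config.yaml
```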

### What are the benefits of pre-processing?

When training interactively or running sweeps (i.e. restarting the trainer often), processing the
datasets can be frustratingly slow. Pre-processing caches the tokenized/formatted datasets keyed by
a hash of the dependent training parameters, so subsequent runs can intelligently pull from that
cache when possible.

The path of the cache is controlled by `dataset_prepared_path:` and is often left blank in example
YAMLs, as leaving it unset is the more robust option: it prevents unexpectedly reusing cached data.

If `dataset_prepared_path:` is left empty, the processed dataset is cached during training in the
default path `./last_run_prepared/`, but anything already cached there is ignored. By explicitly
setting `dataset_prepared_path: ./last_run_prepared`, the trainer will reuse whatever pre-processed
data is in the cache.
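
For example, to reuse the cache between runs:

```yaml
# Reuse the pre-processed dataset cached at this path on subsequent runs
dataset_prepared_path: ./last_run_prepared
```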

### What are the edge cases?

Let's say you are writing a custom prompt strategy or using a user-defined prompt template. Because
the trainer cannot readily detect these changes, it cannot change the calculated hash value for the
pre-processed dataset.

If you have `dataset_prepared_path: ...` set and change your prompt templating logic, the trainer
may not pick up the changes you made, and you will end up training on the old prompts.
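
One way around this is to clear the cached dataset (or point `dataset_prepared_path:` at a fresh
directory) after changing your templating logic, forcing a re-process. A sketch, assuming the cache
path configured above:

```bash
# Remove the stale cache so the dataset is parsed/tokenized again from scratch
rm -rf ./last_run_prepared
axolotl preprocess config.yaml
```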