---
title: Dataset Preprocessing
description: How datasets are processed
---

## Overview

Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
the [dataset format](dataset-formats) and prompt strategies to:

- parse the dataset based on the *dataset format*
- transform the dataset to how you would interact with the model based on the *prompt strategy*
- tokenize the dataset based on the configured model & tokenizer
- shuffle and merge multiple datasets together if using more than one
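
For example, a minimal `datasets:` entry ties each dataset to the prompt strategy used to parse
and transform it (the dataset path below is hypothetical):

```yaml
datasets:
  - path: my-org/my-instruction-data  # hypothetical dataset on the HF Hub
    type: alpaca                      # dataset format / prompt strategy
```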

The processing of the datasets can happen in one of two ways:

1. Before kicking off training by calling `axolotl preprocess config.yaml --debug`
2. When training is started
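
For example, from a shell (assuming your config lives at `config.yaml`):

```bash
# Option 1: pre-process ahead of time (--debug enables extra debugging output)
axolotl preprocess config.yaml --debug

# Option 2: skip the explicit step and let processing happen when training starts
axolotl train config.yaml
```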

### What are the benefits of pre-processing?

When training interactively or running sweeps (i.e. restarting the trainer often), processing the
datasets can be frustratingly slow. Pre-processing caches the tokenized/formatted datasets keyed by
a hash of the dependent training parameters, so subsequent runs can intelligently pull from that
cache when possible.

The path of the cache is controlled by `dataset_prepared_path:` and is often left blank in example
YAMLs, as leaving it unset is the more robust option: it prevents unexpectedly reusing cached data.

If `dataset_prepared_path:` is left empty, the processed dataset is cached during training in the
default path `./last_run_prepared/`, but anything already cached there is ignored. By explicitly
setting `dataset_prepared_path: ./last_run_prepared`, the trainer will reuse whatever pre-processed
data is in the cache.
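
For example, to reuse the cache between runs:

```yaml
# Reuse the pre-processed dataset cached at this path on subsequent runs
dataset_prepared_path: ./last_run_prepared
```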

### What are the edge cases?

Let's say you are writing a custom prompt strategy or using a user-defined prompt template. Because
the trainer cannot readily detect these changes, it cannot change the calculated hash value for the
pre-processed dataset.

If you have `dataset_prepared_path: ...` set and change your prompt templating logic, the trainer
may not pick up the changes you made, and you will end up training on the old prompts.
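
One way around this is to clear the cached dataset (or point `dataset_prepared_path:` at a fresh
directory) after changing your templating logic, forcing a re-process. A sketch, assuming the cache
path configured above:

```bash
# Remove the stale cache so the dataset is parsed/tokenized again from scratch
rm -rf ./last_run_prepared
axolotl preprocess config.yaml
```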