	# Launchers

	Functions for launching training on distributed processes.

	## notebook_launcher[[accelerate.notebook_launcher]]

	#### accelerate.notebook_launcher[[accelerate.notebook_launcher]]

	[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/launchers.py#L41)

	Launches a training function, using several processes or multiple nodes if the current environment allows it
	(for instance, a TPU with multiple cores).

	To use this function, no calls to a device may be made in the notebook session before calling it. If any
	have been made, you will need to restart the notebook and make sure no cells use any device capability.

	Setting `ACCELERATE_DEBUG_MODE="1"` in your environment runs a test before the actual launch to ensure that no
	such calls have been made.
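As a minimal sketch, the variable can also be set from Python at the top of the notebook. This assumes the launcher reads it from the process environment at the time it is called; the value `"1"` is the one given above.

```python
import os

# Assumption: the launcher reads this variable from the process environment
# when it is called, so setting it earlier in the same session is sufficient.
os.environ["ACCELERATE_DEBUG_MODE"] = "1"
```

Setting the variable in the shell that starts the notebook server should have the same effect.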

	Example:

	```python
	# Assume this is defined in a Jupyter Notebook on an instance with two devices
	from accelerate import notebook_launcher

	def train(*args):
	    # Your training function here
	    ...

	notebook_launcher(train, args=(arg1, arg2), num_processes=2, mixed_precision="fp16")
	```

	Parameters:

	function (`Callable`) : The training function to execute. If it accepts arguments, the first argument should be the index of the process run.

	args (`Tuple`) : Tuple of arguments to pass to the function (it will receive `*args`).

	num_processes (`int`, optional) : The number of processes to use for training. Defaults to 8 in Colab/Kaggle if a TPU is available, and to the number of available devices otherwise.

	mixed_precision (`str`, optional, defaults to `"no"`) : If `"fp16"` or `"bf16"`, mixed precision training is used on multi-device setups.

	use_port (`str`, optional, defaults to `"29500"`) : The port used for communication between processes when launching multi-device training.

	master_addr (`str`, optional, defaults to `"127.0.0.1"`) : The address to use for communication between processes.

	node_rank (`int`, optional, defaults to 0) : The rank of the current node.

	num_nodes (`int`, optional, defaults to 1) : The number of nodes to use for training.

	rdzv_backend (`str`, optional, defaults to `"static"`) : The rendezvous method to use, such as `"static"` (the default) or `"c10d"`.

	rdzv_endpoint (`str`, optional, defaults to `""`) : The endpoint of the rendezvous sync storage.

	rdzv_conf (`Dict`, optional, defaults to `None`) : Additional rendezvous configuration.

	rdzv_id (`str`, optional, defaults to `"none"`) : The unique run id of the job.

	max_restarts (`int`, optional, defaults to 0) : The maximum number of restarts that the elastic agent will perform on workers before failure.

	monitor_interval (`float`, optional, defaults to 0.1) : The interval, in seconds, at which the elastic agent monitors workers.

	log_line_prefix_template (`str`, optional, defaults to `None`) : The prefix template for elastic launch logging. Available from PyTorch 2.2.0.
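To illustrate how the multi-node parameters fit together, here is a hedged sketch for a two-node job. The helper names `node_launch_kwargs` and `launch_on_node`, the address `10.0.0.1`, and all numeric values are hypothetical placeholders, not part of the library; only the keyword names come from the parameter list above.

```python
def train(*args):
    # Placeholder training function.
    ...


def node_launch_kwargs(node_rank: int) -> dict:
    """Collect notebook_launcher keyword arguments for one node of a two-node job."""
    return {
        "num_processes": 2,
        "num_nodes": 2,
        "node_rank": node_rank,
        "master_addr": "10.0.0.1",  # hypothetical address, replace with your own
        "use_port": "29500",  # documented as a string, not an int
        "rdzv_backend": "static",
        "max_restarts": 0,
        "monitor_interval": 0.1,
    }


def launch_on_node(node_rank: int) -> None:
    # Imported lazily so the helper above can be inspected without accelerate installed.
    from accelerate import notebook_launcher

    notebook_launcher(train, args=(), **node_launch_kwargs(node_rank))
```

Under these assumptions, the notebook on the first node would call `launch_on_node(0)` and the one on the second node `launch_on_node(1)`. Any further networking setup depends on the cluster.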

	## debug_launcher[[accelerate.debug_launcher]]

	#### accelerate.debug_launcher[[accelerate.debug_launcher]]

	[Source](https://github.com/huggingface/accelerate/blob/vr_4021/src/accelerate/launchers.py#L276)

	Launches a training function using several processes on CPU for debugging purposes.

	This function is provided for internal testing and debugging and is not intended for real training runs. It
	only uses the CPU.

	Parameters:

	function (`Callable`) : The training function to execute.

	args (`Tuple`) : Tuple of arguments to pass to the function (it will receive `*args`).

	num_processes (`int`, optional, defaults to 2) : The number of processes to use for training.
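A minimal usage sketch, assuming `accelerate` is installed. The `train` function and the wrapper `run_debug` are hypothetical stand-ins for illustration; only `debug_launcher` and its three parameters come from the reference above.

```python
def train(*args):
    # Stand-in for a real training function; it only reports what it received.
    summary = f"received {len(args)} argument(s)"
    print(summary)
    return summary


def run_debug(num_processes: int = 2) -> None:
    # Imported lazily; requires accelerate to be installed.
    from accelerate import debug_launcher

    # Runs `train` in several CPU processes for debugging.
    debug_launcher(train, args=("demo", 3), num_processes=num_processes)
```

Calling `run_debug()` would start two CPU processes with the default shown here. Calling `train` directly first is a quick way to check the function before launching it.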
