JSX_TTS / torch /distributed /elastic /rendezvous /__init__.py

Upload 5875 files

9dd3461 almost 3 years ago

5.82 kB

	# Copyright (c) Facebook, Inc. and its affiliates.
	# All rights reserved.
	#
	# This source code is licensed under the BSD-style license found in the
	# LICENSE file in the root directory of this source tree.

	"""
	In the context of Torch Distributed Elastic we use the term rendezvous to
	refer to a particular functionality that combines a **distributed
	synchronization primitive with peer discovery**.

	It is used by Torch Distributed Elastic to gather participants of a training
	job (i.e. nodes) such that they all agree on the same list of participants and
	everyone's roles, as well as make a consistent collective decision on when
	training can begin/resume.

	Torch Distributed Elastic rendezvous provides the following critical
	functionalities:

	Barrier:

	Nodes performing rendezvous will all block until the rendezvous is considered
	complete - this happens when at least ``min`` total number of nodes have joined
	the rendezvous barrier (for the same job). This also implies the barrier is not
	necessarily of fixed size.

	There's an additional small waiting time after reaching ``min`` number of
	nodes - this is used to ensure the rendezvous is not completed "too quickly"
	(which could potentially exclude additional nodes attempting to join at
	approximately the same time).

	If ``max`` number of nodes is gathered at the barrier, the rendezvous is
	completed immediately.

	There's also an overall timeout which causes the rendezvous to fail if ``min``
	number of nodes is never reached - this is meant to be a simple fail-safe to
	help release partially allocated job resources, in case there's a problem with
	the resource manager, and is meant to be interpreted as non-retryable.

	Exclusivity:

	A simple distributed barrier would not be sufficient, as we also need to ensure
	that only one group of nodes exists at any given time (for a given job). In
	other words, new nodes (i.e. joining late) should not be able to form a parallel
	independent group of workers for the same job.

	Torch Distributed Elastic rendezvous ensures that if a group of nodes has
	already completed a rendezvous (and hence might already be training), then
	additional "late" nodes attempting to rendezvous will only announce themselves
	as waiting, and will have to wait until the (previously completed) existing
	rendezvous is destroyed first.

	Consistency:

	When a rendezvous is completed, all its members will agree on the job membership
	and everyone's role in it. This role is represented using an integer, called
	rank, that is between between 0 and world size.

	Note that ranks are not stable, in the sense that the same node can be
	assigned a different rank in the next (re-)rendezvous.

	Fault-tolerance:

	Torch Distributed Elastic rendezvous is designed to tolerate node failures
	during the rendezvous process. Should a process crash (or lose network
	connectivity, etc), between joining the rendezvous and it being completed, then
	a re-rendezvous with remaining healthy nodes will happen automatically.

	A node can also fail after it has completed (or has been observered by other
	nodes to have completed) the rendezvous - this scenario will be handled by the
	Torch Distributed Elastic ``train_loop`` instead (where it will also trigger a
	re-rendezvous).

	Shared key-value store:

	When the rendezvous is completed, a shared key-value store is created and
	returned. This store implements a ``torch.distributed.Store`` API (see
	`distributed communication docs
	<https://pytorch.org/docs/stable/distributed.html>`__).

	This store is only shared by the members of the completed rendezvous. It
	is intended to be used by Torch Distributed Elastic to exchange information
	necessary to initialize job control and data-planes.

	Waiting workers and rendezvous closing:

	Torch Distributed Elastic rendezvous handler object provides additional
	functionalities, which are technically not part of the rendezvous process:

	1. Querying how many workers arrived late at the barrier, who can participate in
	next rendezvous.

	2. Setting the rendezvous closed to signal all nodes not to participate in
	next rendezvous.

	DynamicRendezvousHandler:

	Torch Distributed Elastic comes with the :py:class:`.DynamicRendezvousHandler`
	class that implements the rendezvous mechanism described above. It is a backend-
	agnostic type that expects a particular :py:class:`.RendezvousBackend` instance
	to be specified during construction.

	Torch distributed users can either implement their own backend type or use one
	of the following implementations that come with PyTorch:

	- :py:class:`.C10dRendezvousBackend`: Uses a C10d store (by default
	``TCPStore``) as the rendezvous backend. The main advantage of using a C10d
	store is that it requires no 3rd-party dependency (such as etcd) to establish
	a rendezvous.
	- :py:class:`.EtcdRendezvousBackend`: Supersedes the legacy
	:py:class:`.EtcdRendezvousHandler` class. Passing an
	:py:class:`.EtcdRendezvousBackend` instance to
	:py:class:`.DynamicRendezvousHandler` is functionally equivalent to
	instantiating an :py:class:`.EtcdRendezvousHandler`.

	::

	store = TCPStore("localhost")

	backend = C10dRendezvousBackend(store, "my_run_id")

	rdzv_handler = DynamicRendezvousHandler.from_backend(
	run_id="my_run_id",
	store=store,
	backend=backend,
	min_nodes=2,
	max_nodes=4
	)
	"""

	from .api import * # noqa: F403
	from .registry import _register_default_handlers


	_register_default_handlers()


	__all__ = [
	"RendezvousClosedError",
	"RendezvousConnectionError",
	"RendezvousError",
	"RendezvousHandler",
	"RendezvousHandlerCreator",
	"RendezvousHandlerRegistry",
	"RendezvousParameters",
	"RendezvousStateError",
	"RendezvousTimeoutError",
	"rendezvous_handler_registry",
	]