.. _parallelisms:

Parallelisms
============

NeMo uses native PyTorch parallelism primitives for distributed training, enabling efficient multi-GPU and multi-node
model training for Speech AI workloads.

DDP (all collections)
---------------------

Distributed Data Parallelism (DDP) is the default strategy for all NeMo collections (ASR, TTS, Audio, SpeechLM2).
It replicates the entire model on every GPU, runs each GPU on a different data shard, and synchronizes
parameter gradients via all-reduce after each backward pass.

**When to use:** DDP works well when the full model fits in a single GPU's memory.
This covers the vast majority of ASR, TTS, and Audio training workloads.
DDP is enabled by default in NeMo. You can configure it explicitly in YAML:

.. code-block:: yaml

    trainer:
      strategy:
        _target_: lightning.pytorch.strategies.DDPStrategy
        gradient_as_bucket_view: true
        find_unused_parameters: true

Or in Python:

.. code-block:: python

    import lightning.pytorch as pl
    from lightning.pytorch.strategies import DDPStrategy

    trainer = pl.Trainer(
        strategy=DDPStrategy(gradient_as_bucket_view=True, find_unused_parameters=True),
        devices=8,
        accelerator="gpu",
    )
ModelParallelStrategy (SpeechLM2)
---------------------------------

For SpeechLM2 models (e.g. SALM / Canary-Qwen), the backbone LLM can be too large for a single GPU.
PyTorch Lightning's ``ModelParallelStrategy`` enables FSDP2, Tensor Parallelism (TP), and
Sequence Parallelism (SP) using PyTorch-native DTensor.

**When to use:** When training or fine-tuning SpeechLM2 models whose LLM backbone does not fit
in a single GPU's memory, or when you want to scale training to many GPUs more efficiently
than DDP allows.

**Requirements:** Each model must implement a ``configure_model()`` method that defines how its
layers are sharded (FSDP2) and parallelized (TP / SP). The SpeechLM2 models (SALM, DuplexEARTTS)
already implement this. You cannot simply switch an arbitrary model from DDP to
``ModelParallelStrategy`` without providing this implementation.
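For orientation, here is a minimal sketch of what such a ``configure_model()`` implementation
can look like. The class and submodule names (``MySpeechLM``, ``self.llm``) are illustrative
placeholders, not the actual SpeechLM2 code:

.. code-block:: python

    import lightning.pytorch as pl
    # PyTorch >= 2.6 exposes fully_shard here; on older versions it lives in
    # torch.distributed._composable.fsdp.
    from torch.distributed.fsdp import fully_shard


    class MySpeechLM(pl.LightningModule):
        def configure_model(self):
            # ModelParallelStrategy exposes a DeviceMesh with named
            # "data_parallel" and "tensor_parallel" dimensions.
            dp_mesh = self.device_mesh["data_parallel"]
            tp_mesh = self.device_mesh["tensor_parallel"]

            if tp_mesh.size() > 1:
                # Apply a TP sharding plan first (see the TP sketch below).
                ...

            # FSDP2: shard each transformer block, then the root module,
            # across the data-parallel mesh dimension.
            for block in self.llm.layers:
                fully_shard(block, mesh=dp_mesh)
            fully_shard(self.llm, mesh=dp_mesh)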
Concepts
^^^^^^^^

**FSDP2 (Fully Sharded Data Parallelism):**
Shards model parameters, gradients, and optimizer states across GPUs in the data-parallel
dimension. Dramatically reduces per-GPU memory -- enabling training of models that would not
fit with DDP. Controlled via the ``data_parallel_size`` argument.

**Tensor Parallelism (TP):**
Splits individual weight matrices across GPUs. For example, a large linear layer's weight
is partitioned column-wise or row-wise so each GPU holds only a slice. Controlled via the
``tensor_parallel_size`` argument. The model must define a TP sharding plan (which layers
are split and how). SpeechLM2 models automatically use the HuggingFace TP plan for the
backbone LLM when available.

**Sequence Parallelism (SP):**
Distributes activation memory along the sequence dimension across the TP group.
SP is typically enabled alongside TP and reduces activation memory further.
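As a rough illustration of what a hand-written TP / SP plan looks like with PyTorch-native
DTensor (the submodule names below are hypothetical; in practice SpeechLM2 reuses the
HuggingFace plan for the backbone LLM):

.. code-block:: python

    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        SequenceParallel,
        parallelize_module,
    )


    def apply_tp_plan(block, tp_mesh):
        # Column-wise layers fan activations out across the TP group and
        # row-wise layers reduce them back in, so each GPU holds only a
        # slice of every large weight matrix. SequenceParallel shards the
        # norm's activations along the sequence dimension.
        plan = {
            "self_attn.q_proj": ColwiseParallel(),
            "self_attn.k_proj": ColwiseParallel(),
            "self_attn.v_proj": ColwiseParallel(),
            "self_attn.o_proj": RowwiseParallel(),
            "mlp.up_proj": ColwiseParallel(),
            "mlp.down_proj": RowwiseParallel(),
            "input_layernorm": SequenceParallel(),
        }
        parallelize_module(block, tp_mesh, plan)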
Configuration
^^^^^^^^^^^^^

To enable ``ModelParallelStrategy`` for SpeechLM2, replace the DDP strategy block in the
trainer config. The product of ``data_parallel_size`` and ``tensor_parallel_size`` must equal
the total number of GPUs (``devices * num_nodes``).
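For example, on a single node with 8 GPUs the valid (``data_parallel_size``,
``tensor_parallel_size``) pairs are (8, 1), (4, 2), (2, 4), and (1, 8). A quick sanity check,
using the same values as the config below:

.. code-block:: python

    devices, num_nodes = 8, 1
    data_parallel_size, tensor_parallel_size = 4, 2
    # DP x TP must exactly cover the world size.
    assert data_parallel_size * tensor_parallel_size == devices * num_nodes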
In YAML (with Hydra):

.. code-block:: yaml

    trainer:
      devices: 8
      num_nodes: 1
      accelerator: gpu
      precision: bf16-true
      strategy:
        _target_: lightning.pytorch.strategies.ModelParallelStrategy
        data_parallel_size: 4    # FSDP2: shard across 4 GPUs
        tensor_parallel_size: 2  # TP: split layers across 2 GPUs
In Python:

.. code-block:: python

    import lightning.pytorch as pl
    from lightning.pytorch.strategies import ModelParallelStrategy

    trainer = pl.Trainer(
        strategy=ModelParallelStrategy(
            data_parallel_size=4,
            tensor_parallel_size=2,
        ),
        devices=8,
        accelerator="gpu",
        precision="bf16-true",
        use_distributed_sampler=False,
    )
.. note::

   When using ``ModelParallelStrategy``, set ``use_distributed_sampler=False`` in the trainer.
   NeMo's data modules handle distributed sampling internally.
Example: SALM with FSDP2 only (no TP)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The simplest ``ModelParallelStrategy`` setup uses FSDP2 alone. This requires no TP plan
and works when individual layers fit in GPU memory:

.. code-block:: yaml

    trainer:
      devices: 8
      strategy:
        _target_: lightning.pytorch.strategies.ModelParallelStrategy
        data_parallel_size: 8
        tensor_parallel_size: 1
Example: SALM with TP + FSDP2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For larger LLM backbones, combine TP with FSDP2. Here, 2-way TP splits each layer across
2 GPUs within a node, and 4-way FSDP2 shards the model across 4 such groups:

.. code-block:: yaml

    trainer:
      devices: 8
      strategy:
        _target_: lightning.pytorch.strategies.ModelParallelStrategy
        data_parallel_size: 4
        tensor_parallel_size: 2

See the SpeechLM2 example configs in ``examples/speechlm2/conf/`` for complete training
configurations including data and optimizer settings.