Communication Overlap
=====================

Data-parallel Communication Overlap
-----------------------------------

NeMo supports overlapping data-parallel (DP) communications with computation in LLM training.

NeMo features a distributed optimizer that shards the optimizer states and the high-precision master parameters across GPUs. This introduces two types of data-parallel communication: reduce-scatter of gradients and all-gather of updated parameters.

The DP communication is chunked at the granularity of a Transformer layer, and each communication chunk is overlapped with computation. This overlap method exposes only a single DP communication chunk, ensuring efficient large-scale LLM training.

When training with pipeline parallelism, the granularity of DP communication becomes the number of Transformer layers per virtual pipeline stage.

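To see why chunked overlap matters, the toy timeline model below (plain Python; the function name and all timings are made-up illustrations, not NeMo measurements) compares the exposed communication time of one bulk reduce-scatter against per-layer chunks hidden under the next layer's backprop:

```python
# Toy timeline model of the DP gradient reduce-scatter (illustrative only;
# the timings are made-up numbers, not NeMo measurements).
def exposed_comm_time(num_layers, compute_ms, comm_chunk_ms, overlap):
    """Return communication time (ms) not hidden behind backprop compute."""
    if not overlap:
        # One bulk reduce-scatter after the backward pass: fully exposed.
        return num_layers * comm_chunk_ms
    exposed = 0.0
    for layer in range(num_layers):
        # Each per-layer chunk hides under the next layer's backprop;
        # the final chunk has no compute left to hide behind.
        hide_budget = compute_ms if layer < num_layers - 1 else 0.0
        exposed += max(0.0, comm_chunk_ms - hide_budget)
    return exposed

print(exposed_comm_time(32, 2.0, 1.5, overlap=False))  # 48.0 -> all comm exposed
print(exposed_comm_time(32, 2.0, 1.5, overlap=True))   # 1.5  -> only the last chunk
```

In these toy numbers, overlap shrinks the exposed communication from the full 48 ms to the single final chunk, mirroring the "only one DP communication chunk exposed" behavior described above.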
DP communication overlap settings can be inspected in Megatron Core via the `DistributedDataParallelConfig <https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/distributed/distributed_data_parallel_config.py>`_ class.

The DP gradient reduce-scatter and parameter all-gather overlaps are enabled by setting ``overlap_grad_sync=True`` and ``overlap_param_gather=True``, respectively.

The precision of the gradient reduce-scatter is controlled by ``grad_reduce_in_fp32``. When ``grad_reduce_in_fp32=False``, gradients are reduced in ``bf16``, halving the communication volume and improving performance in large-scale training compared to the default ``fp32`` precision.

When training in ``fp8`` compute precision, setting ``fp8_param_gather=True`` conducts the parameter all-gather in ``fp8``, cutting the all-gather overhead in half relative to ``bf16``.
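As a back-of-envelope illustration of these precision knobs (plain arithmetic with an assumed 8B-parameter model; real traffic also depends on DP sharding and gradient bucketing):

```python
# Back-of-envelope message sizes for DP communication, assuming an
# 8B-parameter model (illustrative arithmetic; real traffic also depends
# on DP sharding and gradient bucketing).
NUM_PARAMS = 8e9
BYTES_PER_ELEM = {"fp32": 4, "bf16": 2, "fp8": 1}

def message_gib(num_params: float, dtype: str) -> float:
    """Total payload in GiB if every element moves once in `dtype`."""
    return num_params * BYTES_PER_ELEM[dtype] / 2**30

print(round(message_gib(NUM_PARAMS, "fp32"), 1))  # 29.8  (default grad reduce-scatter)
print(round(message_gib(NUM_PARAMS, "bf16"), 1))  # 14.9  (grad_reduce_in_fp32=False)
print(round(message_gib(NUM_PARAMS, "fp8"), 1))   # 7.5   (fp8_param_gather=True)
```

Each step down in precision halves the bytes on the wire, which is where the large-scale speedup comes from.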
To modify these configurations, manually update the training recipe as follows:

.. code-block:: python

    from nemo.collections import llm

    # Load the default training recipe
    recipe = llm.llama3_8b.pretrain_recipe()

    # Toggle the DP communication overlaps (both default to True)
    recipe.trainer.strategy.ddp.overlap_grad_sync = False
    recipe.trainer.strategy.ddp.overlap_param_gather = False
    # Similar changes can be made for other DDP configurations.

Tensor-parallel Communication Overlap
-------------------------------------

Tensor parallelism, when used with sequence-parallel activation sharding (``sequence_parallel=True``), introduces activation (gradient) all-gather and reduce-scatter operations, as shown in the figure below.

NeMo provides various options to overlap these tensor-parallel (TP) communications with computation.

TP communications without a direct computation dependency are overlapped with computation in bulk (the linear-layer and TP-communication pairs in the yellow boxes). Bulk TP communication overlap is enabled by default.

TP communications with a direct computation dependency are overlapped in a pipelined fashion (the linear-layer and TP-communication pairs in the red boxes): the TP communication and the computation are each chunked, and the chunks are overlapped in a pipeline.

In the pipelined overlap, the activation (gradient) all-gather is replaced with multiple steps of input P2P ring exchanges, and the reduce-scatter is replaced with multiple steps of GEMM-output P2P ring exchanges followed by a reduction of the received outputs.

For the reduce-scatter overlap, NeMo also provides the option to pipeline the overlap using chunks of the reduce-scatter itself, which exposes only one reduce-scatter chunk.

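The ring-exchange idea can be sketched in a few lines of plain Python. This single-process toy only tracks which shard reaches which rank at each step; the real pipelined overlap is implemented in Transformer Engine and runs a GEMM chunk alongside each exchange step:

```python
# Single-process sketch of replacing an all-gather with a P2P ring exchange
# (toy bookkeeping only; the real implementation overlaps a GEMM chunk with
# each exchange step).
def ring_all_gather(shards):
    """Each rank starts with one shard; after n-1 ring steps all ranks have all shards."""
    n = len(shards)
    have = {rank: {rank: shards[rank]} for rank in range(n)}
    for step in range(n - 1):
        # In step `step`, a rank receives from its left neighbor the shard
        # that originated `step + 1` hops away on the ring.
        for rank in range(n):
            origin = (rank - 1 - step) % n
            have[rank][origin] = shards[origin]
    # Assemble each rank's full tensor in origin order.
    return [[have[rank][i] for i in range(n)] for rank in range(n)]

print(ring_all_gather(["s0", "s1"]))  # [['s0', 's1'], ['s0', 's1']]
```

After ``n - 1`` steps every rank holds the full tensor, but the data moved as small neighbor-to-neighbor messages that a compute chunk can hide.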
.. image:: ../../nlp/nemo_megatron/images/tp_comm_overlap.png
    :align: center
    :width: 600px
    :alt: Tensor-parallel communication overlap

TP communication overlap configurations are added via the callback `MegatronCommOverlapCallback <https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py#L61>`_.

Pipelined TP communication overlap is implemented in Transformer Engine and can be enabled by setting ``tp_comm_overlap=True``.

The individual bulk, pipelined all-gather, and reduce-scatter overlaps can be enabled or disabled using ``tp_comm_overlap_cfg``. For detailed configuration, refer to `TransformerLayerTPOverlapCfg <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/tp_overlap_configs/userbuffers.py#L64>`_.

To modify these configurations, manually update the training recipe as follows:

.. code-block:: python

    from nemo.collections import llm
    from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback

    # Load the default training recipe
    recipe = llm.llama3_8b.pretrain_recipe()

    # Remove any existing MegatronCommOverlapCallback
    recipe.trainer.callbacks = [
        callback for callback in recipe.trainer.callbacks
        if not isinstance(callback, MegatronCommOverlapCallback)
    ]

    # Append a new callback with the updated configuration
    recipe.trainer.callbacks.append(
        MegatronCommOverlapCallback(tp_comm_overlap=False)
    )

Pipeline-parallel Communication Overlap
---------------------------------------

Pipelining introduces P2P activation (gradient) sends and receives between pipeline-parallel (PP) GPUs.

The PP communication frequency increases with the virtual-pipeline-parallel size, because the number of Transformer layers executed per micro-batch decreases. The growing PP communication overhead offsets the pipeline-bubble reduction that virtual pipelining provides.

NeMo supports overlapping the PP communications with non-dependent computation in the 1F1B stage (the steady-state body of pipelining, where one forward and one backward micro-batch execution are interleaved). The PP communications in the pipeline fill and flush phases remain exposed.

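A toy count makes the frequency argument concrete. Assuming one activation send in forward and one gradient send in backward per model chunk held by a rank (a simplification of the actual interleaved schedule):

```python
# Toy count of P2P transfers issued by one PP rank per micro-batch, assuming
# one activation send (forward) plus one gradient send (backward) per model
# chunk the rank holds. This is a simplification of the real schedule.
def p2p_transfers_per_microbatch(virtual_pipeline_size: int) -> int:
    return 2 * virtual_pipeline_size

for vp in (1, 2, 4):
    print(vp, p2p_transfers_per_microbatch(vp))  # frequency grows linearly with VP size
```

Under this simplified model, communication frequency grows linearly with the virtual-pipeline size, which is why hiding the P2P traffic matters most when virtual pipelining is enabled.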
.. image:: ../../nlp/nemo_megatron/images/pp_comm_overlap.png
    :align: center
    :width: 600px
    :alt: Pipeline-parallel communication overlap in the 1F1B pipelining phase

PP communication overlap is enabled by setting ``overlap_p2p_comm=True``. Additionally, setting ``batch_p2p_comm=False`` uses separate kernels for the send and the receive, which further improves communication efficiency and GPU resource utilization.

NeMo supports PP communication overlap only with virtual pipelining, where PP communication becomes the performance bottleneck. For a usage example, refer to the `GPT3 training config file <https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/launcher_scripts/conf/training/gpt3/175b.yaml>`_, which uses PP communication overlap.

Similar to TP communication overlap, the PP communication overlap configuration is added via the callback `MegatronCommOverlapCallback <https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py#L61>`_.

To modify these configurations, manually update the training recipe as follows:

.. code-block:: python

    from nemo.collections import llm
    from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback

    # Load the default training recipe
    recipe = llm.llama3_8b.pretrain_recipe()

    # Remove any existing MegatronCommOverlapCallback
    recipe.trainer.callbacks = [
        callback for callback in recipe.trainer.callbacks
        if not isinstance(callback, MegatronCommOverlapCallback)
    ]

    # Append a new callback with the updated configuration
    recipe.trainer.callbacks.append(
        MegatronCommOverlapCallback(overlap_p2p_comm=True, batch_p2p_comm=False)
    )

Context-parallel Communication Overlap
--------------------------------------

Context parallelism partitions the activations (gradients) of all layers along the sequence dimension. This introduces all-gather and reduce-scatter of activations (gradients) in the self-attention forward and backward passes.

NeMo hides the context-parallel (CP) communications under the self-attention computation.

As in the pipelined TP communication overlap, the CP communications are chunked and then pipeline-overlapped with the self-attention computation, where the all-gather and reduce-scatter of activations (gradients) are replaced with P2P ring exchanges of data.

CP communication overlap is enabled by default when context parallelism is used (``context_parallel_size > 1``).

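The sequence-dimension partitioning can be sketched as follows (plain Python lists standing in for activation tensors; an even contiguous split is assumed, whereas the actual layout may be balanced differently across ranks):

```python
# Sketch of context-parallel activation sharding along the sequence
# dimension (plain Python lists standing in for tensors; an even contiguous
# split is assumed, whereas the actual layout may be balanced differently).
def shard_sequence(tokens, cp_size):
    assert len(tokens) % cp_size == 0, "sequence length must divide evenly"
    chunk = len(tokens) // cp_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(cp_size)]

shards = shard_sequence(list(range(8)), cp_size=4)
print(shards)  # [[0, 1], [2, 3], [4, 5], [6, 7]] -- one shard per CP rank
# Self-attention needs keys/values from the whole sequence, so the missing
# shards arrive via P2P ring exchanges hidden under the attention compute.
```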