Communication Overlap
=====================

Data-parallel Communication Overlap
-----------------------------------

NeMo supports overlapping data-parallel (DP) communication with computation in LLM training.
NeMo's Distributed Optimizer distributes the optimizer states and the high-precision master parameters across GPUs. This introduces two types of DP communication: reduce-scatter of gradients and all-gather of updated parameters.
The DP communication is chunked at the granularity of a Transformer layer, and each communication chunk is overlapped with computation.
This overlap method exposes only one DP communication chunk, ensuring efficient large-scale LLM training.
When training with pipeline parallelism, the granularity of DP communication becomes the Transformer layers per virtual pipeline stage.
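
As a rough illustration of the granularity rule above, the following sketch computes the number of Transformer layers per DP communication chunk. The function name and structure are hypothetical, for illustration only, and are not a NeMo API:

.. code-block:: python

    def dp_chunk_granularity(num_layers, pp_size=1, vpp_size=1):
        """Transformer layers per DP communication chunk (illustrative sketch).

        Without pipeline parallelism, DP communication is chunked per
        Transformer layer. With pipeline parallelism, the granularity is the
        number of layers in one virtual pipeline stage.
        """
        if pp_size == 1:
            return 1  # one chunk per Transformer layer
        layers_per_stage, rem = divmod(num_layers, pp_size * vpp_size)
        assert rem == 0, "layers must divide evenly across virtual pipeline stages"
        return layers_per_stage

For example, a 32-layer model with pipeline-parallel size 4 and 2 virtual pipeline stages chunks DP communication at a granularity of 4 layers per chunk.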
DP communication overlap settings are controlled in Megatron Core by the `DistributedDataParallelConfig <https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/distributed/distributed_data_parallel_config.py>`_ class.
Gradient reduce-scatter and parameter all-gather overlaps are enabled by setting ``overlap_grad_sync=True`` and ``overlap_param_gather=True``, respectively.
The precision of the gradient reduce-scatter is controlled by ``grad_reduce_in_fp32``. When ``grad_reduce_in_fp32=False``, gradients are reduced in ``bf16``, improving performance in large-scale training compared to the default ``fp32`` precision.
When training in fp8 computing precision, setting ``fp8_param_gather=True`` conducts the parameter all-gather in fp8, halving the all-gather overhead.
To modify these configurations, manually update the training recipe as follows:

.. code-block:: python

    from functools import partial

    from nemo.collections import llm

    # Load training recipe
    recipe = partial(llm.llama3_8b.pretrain_recipe)()

    recipe.strategy.ddp_config.overlap_grad_sync = False  # Default is True
    recipe.strategy.ddp_config.overlap_param_gather = False  # Default is True

    # Similar changes can be made for other DDP configurations.

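
The gradient-reduction precision and fp8 parameter all-gather described above can be adjusted on the same ``ddp_config`` in the same way. The field names follow the ``DistributedDataParallelConfig`` class linked above; this is a sketch, not a tuned configuration:

.. code-block:: python

    from functools import partial

    from nemo.collections import llm

    # Load training recipe
    recipe = partial(llm.llama3_8b.pretrain_recipe)()

    # Reduce gradients in bf16 instead of the default fp32
    # (trades reduction precision for communication bandwidth)
    recipe.strategy.ddp_config.grad_reduce_in_fp32 = False

    # With fp8 training, all-gather updated parameters in fp8
    # to halve the all-gather volume
    recipe.strategy.ddp_config.fp8_param_gather = True
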

Tensor-parallel Communication Overlap
-------------------------------------

Tensor parallelism, when used with sequence-parallel activation sharding (``sequence_parallel=True``), introduces activation (gradient) all-gather and reduce-scatter operations, as shown in the figure below.
NeMo provides various options to overlap these tensor-parallel (TP) communications with computation.
TP communications without a direct computation dependency are overlapped with computation in bulk (the linear-layer and TP-communication pairs in the yellow boxes). The bulk TP communication overlap is enabled by default.
TP communications with a direct computation dependency are overlapped in a pipelined fashion (the linear-layer and TP-communication pairs in the red boxes): the TP communication and the computation are each chunked, and the chunks are overlapped in a pipeline.
In the pipelined overlap, the activation (gradient) all-gather is replaced with multiple steps of input P2P ring exchanges, and the reduce-scatter is replaced with multiple steps of GEMM-output P2P ring exchanges followed by a reduction of the received outputs.
For the reduce-scatter overlap, NeMo also provides the option to pipeline chunks of the reduce-scatter directly, which exposes only one reduce-scatter chunk.

.. image:: ../../nlp/nemo_megatron/images/tp_comm_overlap.png
   :align: center
   :width: 600px
   :alt: Tensor-parallel communication overlap

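
The ring-exchange replacement described above can be sketched in pure Python. This is an illustrative simulation only, not NeMo's or Transformer Engine's implementation: each of ``p`` ranks forwards the chunk it received in the previous step to its neighbor, so after ``p - 1`` P2P steps every rank holds all shards, and each exchange can be hidden behind the GEMM that consumes the previously received chunk:

.. code-block:: python

    def ring_all_gather(shards):
        """Simulate an all-gather over len(shards) ranks via P2P ring exchanges."""
        p = len(shards)
        # gathered[rank] collects chunks, keyed by the origin rank of each shard
        gathered = [{rank: shards[rank]} for rank in range(p)]
        # the (origin, chunk) pair each rank sends in the current step
        in_flight = {rank: (rank, shards[rank]) for rank in range(p)}
        for _ in range(p - 1):
            received = {}
            for rank in range(p):
                # each rank receives from its ring predecessor
                src = (rank - 1) % p
                received[rank] = in_flight[src]
            for rank, (origin, chunk) in received.items():
                gathered[rank][origin] = chunk
            in_flight = received
            # in the real overlap, the GEMM on the just-received chunk
            # runs here, hidden behind the next P2P exchange
        return [[g[i] for i in range(p)] for g in gathered]

After the loop, every rank's gathered list equals the full, ordered set of shards, which is exactly the result an all-gather would produce in one collective call.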
TP communication overlap configurations are added via the `MegatronCommOverlapCallback <https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py#L61>`_ callback.
Pipelined TP communication overlap is implemented in Transformer Engine and can be enabled by setting ``tp_comm_overlap=True``.
The individual bulk, pipelined all-gather, and reduce-scatter overlaps can be enabled or disabled using ``tp_comm_overlap_cfg``.
For detailed configuration, refer to `TransformerLayerTPOverlapCfg <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/tp_overlap_configs/userbuffers.py#L64>`_.
To modify these configurations, manually update the training recipe as follows:

.. code-block:: python

    from functools import partial

    from nemo.collections import llm
    from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback

    # Load training recipe
    recipe = partial(llm.llama3_8b.pretrain_recipe)()

    # Remove the existing MegatronCommOverlapCallback
    recipe.trainer.callbacks = [
        callback for callback in recipe.trainer.callbacks
        if not isinstance(callback, MegatronCommOverlapCallback)
    ]

    # Append a new callback with the updated configuration
    recipe.trainer.callbacks.append(
        MegatronCommOverlapCallback(tp_comm_overlap=False)
    )


Pipeline-parallel Communication Overlap
---------------------------------------

Pipelining introduces P2P activation (gradient) sends and receives between pipeline-parallel (PP) GPUs.
The PP communication frequency increases with the virtual-pipeline-parallel size, because the number of Transformer layers executed per micro-batch decreases.
This growing PP communication overhead can cancel out the pipeline-bubble reduction gained from virtual pipelining.
NeMo supports overlapping the PP communications with non-dependent computations in the 1F1B stage (the body of the pipeline, where one forward and one backward micro-batch execution are interleaved).
The PP communications in the pipeline fill and flush phases remain exposed.

.. image:: ../../nlp/nemo_megatron/images/pp_comm_overlap.png
   :align: center
   :width: 600px
   :alt: Pipeline-parallel communication overlap in 1F1B pipelining phase

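
The 1F1B body referenced above can be sketched as a schedule generator. This is a simplified illustration, not NeMo's scheduler; the warm-up count is passed in as a parameter, whereas in practice it depends on the stage's position in the pipeline:

.. code-block:: python

    def one_f_one_b_schedule(num_microbatches, num_warmup):
        """Order of forward (F) / backward (B) micro-batch executions for one
        pipeline stage under a simplified 1F1B schedule."""
        schedule = [("F", m) for m in range(num_warmup)]  # pipeline fill
        fwd, bwd = num_warmup, 0
        while fwd < num_microbatches:                     # 1F1B body: F and B interleave,
            schedule.append(("F", fwd)); fwd += 1         # so P2P sends/receives here can
            schedule.append(("B", bwd)); bwd += 1         # hide behind non-dependent compute
        while bwd < num_microbatches:                     # pipeline flush
            schedule.append(("B", bwd)); bwd += 1
        return schedule

Only the alternating middle section offers the back-to-back forward and backward work needed to hide P2P communication; the fill and flush phases at either end are why those communications stay exposed.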
Similar to TP communication overlap, PP communication overlap configurations are added via the `MegatronCommOverlapCallback <https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py#L61>`_ callback.
The PP communication overlap is enabled by setting ``overlap_p2p_comm=True``. Additionally, setting ``batch_p2p_comm=False`` uses separate kernels for the send and the receive, further improving communication efficiency and GPU resource utilization.
NeMo supports PP communication overlap only with virtual pipelining, where PP communication becomes the performance bottleneck.
Refer to the `GPT3 training config file <https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/launcher_scripts/conf/training/gpt3/175b.yaml>`_ for an example that uses PP communication overlap.
To modify these configurations, manually update the training recipe as follows:

.. code-block:: python

    from functools import partial

    from nemo.collections import llm
    from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback

    # Load training recipe
    recipe = partial(llm.llama3_8b.pretrain_recipe)()

    # Remove the existing MegatronCommOverlapCallback
    recipe.trainer.callbacks = [
        callback for callback in recipe.trainer.callbacks
        if not isinstance(callback, MegatronCommOverlapCallback)
    ]

    # Append a new callback with the updated configuration
    recipe.trainer.callbacks.append(
        MegatronCommOverlapCallback(overlap_p2p_comm=True, batch_p2p_comm=False)
    )


Context-parallel Communication Overlap
--------------------------------------

Context parallelism partitions the activations (gradients) of all layers along the sequence dimension. This introduces all-gather and reduce-scatter of activations (gradients) in the self-attention forward and backward propagation.
NeMo hides the context-parallel (CP) communications under the self-attention computation.
Like the TP communication overlap, the CP communications are chunked and then pipeline-overlapped with the self-attention computation, with the all-gather and reduce-scatter of activations (gradients) replaced by P2P ring exchanges of data.
The CP communication overlap is enabled by default when context parallelism is used (``context_parallel_size > 1``).
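
Following the recipe-modification pattern used in the earlier sections, context parallelism can be enabled on the trainer strategy. This is a sketch; the exact attribute path is assumed to follow the recipe layout (``recipe.trainer.strategy``), and the CP size shown is illustrative:

.. code-block:: python

    from functools import partial

    from nemo.collections import llm

    # Load training recipe
    recipe = partial(llm.llama3_8b.pretrain_recipe)()

    # Enable context parallelism; the CP communication overlap is
    # then active by default (no extra overlap flag is required)
    recipe.trainer.strategy.context_parallel_size = 2
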