
distributed package

This package contains utilities that finalize model weight gradients on each rank before the optimizer step. These include a distributed data parallelism wrapper that all-reduces or reduce-scatters gradients across data-parallel replicas, and a finalize_model_grads method that synchronizes gradients across different parallelism modes (e.g., 'tied' layers on different pipeline stages, or gradients for experts in a MoE that live on different ranks due to expert parallelism).
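To make the difference between the collectives concrete, here is a minimal single-process sketch that simulates them on plain Python lists. It is not Megatron-LM code and does not call torch.distributed: the function names (all_reduce_mean, reduce_scatter_mean, sync_tied_grads) are hypothetical, and averaging across replicas and summing across tied copies are assumptions made for illustration.

```python
# Single-process simulation of gradient synchronization.
# Each inner list stands for the flattened gradient held by one rank.
# All names below are hypothetical and are not part of the Megatron-LM API.


def _mean_across_ranks(per_rank_grads):
    """Element-wise mean of the gradients held by all ranks."""
    world_size = len(per_rank_grads)
    length = len(per_rank_grads[0])
    return [
        sum(grad[i] for grad in per_rank_grads) / world_size
        for i in range(length)
    ]


def all_reduce_mean(per_rank_grads):
    """All-reduce: every rank ends up with the full averaged gradient."""
    mean = _mean_across_ranks(per_rank_grads)
    return [list(mean) for _ in per_rank_grads]


def reduce_scatter_mean(per_rank_grads):
    """Reduce-scatter: rank r keeps only shard r of the averaged gradient."""
    world_size = len(per_rank_grads)
    length = len(per_rank_grads[0])
    if length % world_size != 0:
        raise ValueError("gradient length must be divisible by world size")
    shard = length // world_size
    mean = _mean_across_ranks(per_rank_grads)
    return [mean[r * shard:(r + 1) * shard] for r in range(world_size)]


def sync_tied_grads(grad_first_stage, grad_last_stage):
    """Sum the gradients of a weight shared by two pipeline stages
    (assumed behaviour) so both copies receive the same update."""
    total = [a + b for a, b in zip(grad_first_stage, grad_last_stage)]
    return list(total), list(total)


# Two data-parallel replicas, four gradient elements each.
grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
full = all_reduce_mean(grads)        # each rank holds all four averaged values
shards = reduce_scatter_mean(grads)  # each rank holds two averaged values
tied_a, tied_b = sync_tied_grads([1.0, 2.0], [3.0, 4.0])
```

In this sketch, all-reduce leaves every rank with the complete averaged gradient, while reduce-scatter leaves each rank with only its own shard, which is the natural fit when optimizer state is sharded across data-parallel ranks.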