# distributed package
This package contains utilities for finalizing model weight gradients
on each rank before the optimizer step. It includes a distributed data
parallelism wrapper that all-reduces or reduce-scatters gradients across
data-parallel replicas, and a `finalize_model_grads` function that
synchronizes gradients across different parallelism modes (e.g., 'tied'
layers on different pipeline stages, or gradients for experts in a
mixture-of-experts (MoE) model that live on different ranks due to expert
parallelism).
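
To make the difference between the two reduction modes concrete, here is a minimal pure-Python sketch that simulates data-parallel replicas as plain lists. It does not use this package's API: `all_reduce_mean` and `reduce_scatter_mean` are hypothetical helper names introduced for illustration, and averaging (rather than summing) across replicas is an assumption of the sketch.

```python
from typing import List


def _mean_across_ranks(grads_per_rank: List[List[float]]) -> List[float]:
    """Element-wise mean of the gradient buffers held by each simulated rank."""
    world_size = len(grads_per_rank)
    numel = len(grads_per_rank[0])
    return [sum(g[i] for g in grads_per_rank) / world_size for i in range(numel)]


def all_reduce_mean(grads_per_rank: List[List[float]]) -> List[List[float]]:
    """All-reduce: every rank ends up with the full averaged gradient."""
    avg = _mean_across_ranks(grads_per_rank)
    return [list(avg) for _ in grads_per_rank]


def reduce_scatter_mean(grads_per_rank: List[List[float]]) -> List[List[float]]:
    """Reduce-scatter: each rank keeps only its contiguous shard of the average."""
    world_size = len(grads_per_rank)
    numel = len(grads_per_rank[0])
    if numel % world_size != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    shard = numel // world_size
    avg = _mean_across_ranks(grads_per_rank)
    return [avg[r * shard:(r + 1) * shard] for r in range(world_size)]


# Two simulated data-parallel replicas, each holding a 4-element gradient buffer.
grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
full = all_reduce_mean(grads)        # both ranks hold [2.0, 3.0, 4.0, 5.0]
shards = reduce_scatter_mean(grads)  # rank 0: [2.0, 3.0], rank 1: [4.0, 5.0]
```

With reduce-scatter each rank holds only `1 / world_size` of the reduced buffer, which is sufficient when each rank updates only its own shard of the parameters; all-reduce leaves every rank with the complete gradient.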