# distributed package

This package contains utilities that finalize model weight gradients
on each rank before the optimizer step. It includes a distributed data
parallelism wrapper that all-reduces or reduce-scatters the gradients across
data-parallel replicas, and a `finalize_model_grads` function that
synchronizes gradients across different parallelism modes (e.g., 'tied'
layers on different pipeline stages, or gradients for experts in a MoE on
different ranks due to expert parallelism).
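
The difference between the two collectives can be illustrated with a small
single-process simulation. This is only a sketch and not this package's API:
the helper names (`all_reduce_mean`, `reduce_scatter_mean`, `sync_tied_grads`)
are hypothetical, gradients are plain Python lists, averaging across replicas
is an assumption, and no real communication takes place.

```python
from typing import List, Tuple


def all_reduce_mean(grads_per_rank: List[List[float]]) -> List[List[float]]:
    """Simulated all-reduce: every rank receives the full averaged gradient."""
    world_size = len(grads_per_rank)
    # Element-wise average over the per-rank gradients (averaging is assumed).
    averaged = [sum(vals) / world_size for vals in zip(*grads_per_rank)]
    return [list(averaged) for _ in range(world_size)]


def reduce_scatter_mean(grads_per_rank: List[List[float]]) -> List[List[float]]:
    """Simulated reduce-scatter: rank r receives only shard r of the average."""
    world_size = len(grads_per_rank)
    numel = len(grads_per_rank[0])
    # For simplicity, assume the gradient length divides evenly across ranks.
    assert numel % world_size == 0, "gradient must split evenly across ranks"
    shard = numel // world_size
    averaged = [sum(vals) / world_size for vals in zip(*grads_per_rank)]
    return [averaged[r * shard:(r + 1) * shard] for r in range(world_size)]


def sync_tied_grads(
    grad_a: List[float], grad_b: List[float]
) -> Tuple[List[float], List[float]]:
    """Sum the gradients of one weight that is held by two pipeline stages."""
    total = [a + b for a, b in zip(grad_a, grad_b)]
    # Both stages keep the same summed gradient so the tied weights stay equal.
    return list(total), list(total)


# Two data-parallel replicas, each holding a gradient for the same 4 parameters.
grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
full = all_reduce_mean(grads)         # each rank holds all 4 averaged values
sharded = reduce_scatter_mean(grads)  # each rank holds 2 of the 4 averaged values
```

With all-reduce, every replica ends up holding the complete averaged gradient.
With reduce-scatter, each replica holds only its own shard, which suits setups
where the optimizer state is also sharded across data-parallel ranks.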