distributed package

This package contains utilities that finalize model weight gradients on each rank before the optimizer step. It includes a distributed data parallelism wrapper that all-reduces or reduce-scatters gradients across data-parallel replicas, and a finalize_model_grads function that synchronizes gradients across the other parallelism modes (e.g., 'tied' layers that live on different pipeline stages, or gradients for experts in a MoE that live on different ranks due to expert parallelism).
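To make the difference between the collectives concrete, here is a minimal pure-Python simulation. It is only a sketch and not this package's implementation: ranks are modeled as plain lists, the helper names (all_reduce_mean, reduce_scatter_mean, sync_tied_grads) are hypothetical, and averaging over replicas is an assumption (a real setup may sum instead, or scale at a different point).

```python
from typing import List

Grad = List[float]


def all_reduce_mean(grads_per_rank: List[Grad]) -> List[Grad]:
    """Simulated all-reduce: every rank ends up with the full averaged gradient."""
    world_size = len(grads_per_rank)
    averaged = [sum(vals) / world_size for vals in zip(*grads_per_rank)]
    return [list(averaged) for _ in range(world_size)]


def reduce_scatter_mean(grads_per_rank: List[Grad]) -> List[Grad]:
    """Simulated reduce-scatter: rank r keeps only shard r of the averaged gradient."""
    world_size = len(grads_per_rank)
    numel = len(grads_per_rank[0])
    # Assumption: the flat gradient buffer divides evenly across ranks.
    assert numel % world_size == 0
    shard = numel // world_size
    averaged = [sum(vals) / world_size for vals in zip(*grads_per_rank)]
    return [averaged[r * shard:(r + 1) * shard] for r in range(world_size)]


def sync_tied_grads(first_stage_grad: Grad, last_stage_grad: Grad) -> Grad:
    """Simulated sync of a 'tied' weight held on two pipeline stages.

    Both copies must apply the same update, so their gradients are summed
    and the result is given to both stages.
    """
    return [a + b for a, b in zip(first_stage_grad, last_stage_grad)]


# Two data-parallel replicas, each holding a local gradient for 4 parameters.
grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 6.0, 5.0, 8.0]]

full = all_reduce_mean(grads)        # each rank holds all 4 averaged values
shards = reduce_scatter_mean(grads)  # each rank holds 2 of the 4 averaged values
tied = sync_tied_grads([0.5, 1.0], [1.5, 2.0])
```

In this sketch both collectives compute the same averaged values. The reduce-scatter variant leaves each rank with only its own slice, which is the natural fit when each rank updates only a shard of the parameters.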