## Zero Bubble Schedules

The key to achieving zero bubble is splitting the backward pass into a B pass (computing gradients with respect to the stage input) and a W pass (computing gradients with respect to the weights). B on one stage depends only on the B of the next stage, whereas in 1F1B it depends on both the B and the W of the next stage.
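
As a rough illustration of this split (a minimal toy sketch, not the schedule implementation in this repository), the two halves of a stage's backward pass can be computed separately with `torch.autograd.grad`: the B pass produces the input gradient that the previous stage is waiting for, while the W pass has no downstream consumers and can be deferred to fill bubbles.

```python
import torch

# Toy single pipeline stage; real stages are transformer layer chunks.
x = torch.randn(4, 8, requires_grad=True)       # activation from the previous stage
weight = torch.randn(8, 8, requires_grad=True)  # this stage's parameters

out = torch.relu(x @ weight)                    # forward (F) pass
grad_out = torch.randn_like(out)                # gradient arriving from the next stage

# B pass: gradient w.r.t. the stage input only. The previous stage's backward
# depends on this result, so it is scheduled as early as possible.
(grad_x,) = torch.autograd.grad(out, x, grad_out, retain_graph=True)

# W pass: gradient w.r.t. the weights. Nothing downstream depends on it,
# so it can be delayed to fill what would otherwise be pipeline bubbles.
(grad_w,) = torch.autograd.grad(out, weight, grad_out)
```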
### Comparison of Schedules
* 1F1B

* ZB1P

* ZB2P

* ZBV - Each device is assigned exactly 2 chunks (virtual stages), where white text denotes the first chunk and black text denotes the second chunk. The sequence of dependencies among model chunks follows a "V" shape pattern for both the forward and backward passes.

| Comparison assuming T_F=T_B=T_W | 1F1B | ZB1P | ZB2P | ZBV (Recommended) |
| ----------------------------------------------------- | ------------- | ---------------- | ---- | ----------------- |
| Bubble Rate | (p-1)/(m+p-1) | (p-1)/(3(m+p-1)) | 0 | 0 |
| Activation Memory <br> (Compared to 1F1B) | 1x | 1x | 2x | 1x |
| Pipeline Communication Volume <br> (Compared to 1F1B) | 1x | 1x | 1x | 2x |
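
For a concrete sense of these numbers, the bubble-rate expressions in the table can be evaluated directly. The script below is a small illustration; the pipeline depth p = 8 and microbatch count m = 32 are arbitrary example values.

```python
# Bubble rates from the table above, assuming T_F = T_B = T_W,
# for p pipeline stages and m microbatches per iteration.

def bubble_rate_1f1b(p: int, m: int) -> float:
    return (p - 1) / (m + p - 1)

def bubble_rate_zb1p(p: int, m: int) -> float:
    # ZB1P keeps 1F1B's activation memory but shrinks the bubble to one third.
    return (p - 1) / (3 * (m + p - 1))

p, m = 8, 32  # example values only
print(f"1F1B bubble rate: {bubble_rate_1f1b(p, m):.3f}")  # ~0.179
print(f"ZB1P bubble rate: {bubble_rate_zb1p(p, m):.3f}")  # ~0.060
print("ZB2P / ZBV bubble rate: 0 (zero bubble)")
```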
## Optimizer Post Validation
In most PP practices there is an all-reduce across all pipeline stages for numerical robustness, e.g. computing the global gradient norm for gradient clipping, or the INF/NAN check for mixed-precision training. This all-reduce breaks the parallelogram shape of the schedule and makes zero bubble impossible.
Under the observation that gradient clipping and INF/NAN conditions rarely trigger during stable training, we replace the up-front synchronizations with a post-update validation.

We eagerly step the optimizers, assuming the gradient clipping and INF/NAN conditions are not triggered. In case an amendment to the gradients is required, a rollback is issued and the optimizer step is redone based on the fully reduced global state.
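
As a conceptual sketch of this mechanism (simplified and hypothetical, not the actual implementation in this repository), an eager optimizer step with rollback could look as follows; here `global_grad_norm` stands in for the result of the cross-stage all-reduce, which can now overlap with subsequent computation instead of blocking the step.

```python
import copy
import math
import torch

def eager_step_with_post_validation(optimizer, params, global_grad_norm, clip_norm=1.0):
    """Hypothetical sketch: step eagerly, validate afterwards, roll back if needed."""
    # Snapshot enough state to undo the eager step if validation fails.
    saved_params = [p.detach().clone() for p in params]
    saved_state = copy.deepcopy(optimizer.state_dict())

    # Eager step: assume gradient clipping / INF-NAN skipping will not trigger,
    # so we do not wait for the cross-stage all-reduce before updating weights.
    optimizer.step()

    # Post validation, using the fully reduced global gradient norm.
    if math.isfinite(global_grad_norm) and global_grad_norm <= clip_norm:
        return  # the optimistic step was valid; keep it

    # Rollback: restore parameters and optimizer state.
    with torch.no_grad():
        for p, saved in zip(params, saved_params):
            p.copy_(saved)
    optimizer.load_state_dict(saved_state)

    if math.isfinite(global_grad_norm):
        # Amend the gradients (clip to the global norm) and redo the step.
        scale = clip_norm / (global_grad_norm + 1e-6)
        with torch.no_grad():
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(scale)
        optimizer.step()
    # If the global norm is INF/NAN, the update for this step is simply skipped.
```

Because clipping and INF/NAN conditions rarely trigger during stable training, the rollback path is almost never taken and the common path stays free of the blocking synchronization.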