Buckets:
| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Parallelism methods","local":"parallelism-methods","sections":[{"title":"Scalability strategy","local":"scalability-strategy","sections":[],"depth":2},{"title":"Data parallelism","local":"data-parallelism","sections":[{"title":"DataParallel","local":"dataparallel","sections":[],"depth":3},{"title":"DistributedDataParallel","local":"distributeddataparallel","sections":[],"depth":3},{"title":"ZeRO data parallelism","local":"zero-data-parallelism","sections":[],"depth":3}],"depth":2},{"title":"Model parallelism","local":"model-parallelism","sections":[],"depth":2},{"title":"Pipeline parallelism","local":"pipeline-parallelism","sections":[],"depth":2},{"title":"Tensor parallelism","local":"tensor-parallelism","sections":[],"depth":2},{"title":"Hybrid parallelism","local":"hybrid-parallelism","sections":[{"title":"Data parallelism and pipeline parallelism","local":"data-parallelism-and-pipeline-parallelism","sections":[],"depth":3},{"title":"ZeRO data parallelism, pipeline parallelism, and model parallelism (3D parallelism)","local":"zero-data-parallelism-pipeline-parallelism-and-model-parallelism-3d-parallelism","sections":[],"depth":3}],"depth":2}],"depth":1}"> | |
| <link href="/docs/transformers/pr_36839/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/transformers/pr_36839/en/_app/immutable/entry/start.6be8d590.js"> | |
| <link rel="modulepreload" href="/docs/transformers/pr_36839/en/_app/immutable/chunks/scheduler.01eeda35.js"> | |
| <link rel="modulepreload" href="/docs/transformers/pr_36839/en/_app/immutable/chunks/singletons.177df05e.js"> | |
| <link rel="modulepreload" href="/docs/transformers/pr_36839/en/_app/immutable/chunks/index.4862150a.js"> | |
| <link rel="modulepreload" href="/docs/transformers/pr_36839/en/_app/immutable/chunks/paths.517376d1.js"> | |
| <link rel="modulepreload" href="/docs/transformers/pr_36839/en/_app/immutable/entry/app.09748b4b.js"> | |
| <link rel="modulepreload" href="/docs/transformers/pr_36839/en/_app/immutable/chunks/index.6dd51b66.js"> | |
| <link rel="modulepreload" href="/docs/transformers/pr_36839/en/_app/immutable/nodes/0.8897c14d.js"> | |
| <link rel="modulepreload" href="/docs/transformers/pr_36839/en/_app/immutable/chunks/each.e59479a4.js"> | |
| <link rel="modulepreload" href="/docs/transformers/pr_36839/en/_app/immutable/nodes/420.18fde185.js"> | |
| <link rel="modulepreload" href="/docs/transformers/pr_36839/en/_app/immutable/chunks/EditOnGithub.7faefd25.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Parallelism methods","local":"parallelism-methods","sections":[{"title":"Scalability strategy","local":"scalability-strategy","sections":[],"depth":2},{"title":"Data parallelism","local":"data-parallelism","sections":[{"title":"DataParallel","local":"dataparallel","sections":[],"depth":3},{"title":"DistributedDataParallel","local":"distributeddataparallel","sections":[],"depth":3},{"title":"ZeRO data parallelism","local":"zero-data-parallelism","sections":[],"depth":3}],"depth":2},{"title":"Model parallelism","local":"model-parallelism","sections":[],"depth":2},{"title":"Pipeline parallelism","local":"pipeline-parallelism","sections":[],"depth":2},{"title":"Tensor parallelism","local":"tensor-parallelism","sections":[],"depth":2},{"title":"Hybrid parallelism","local":"hybrid-parallelism","sections":[{"title":"Data parallelism and pipeline parallelism","local":"data-parallelism-and-pipeline-parallelism","sections":[],"depth":3},{"title":"ZeRO data parallelism, pipeline parallelism, and model parallelism (3D parallelism)","local":"zero-data-parallelism-pipeline-parallelism-and-model-parallelism-3d-parallelism","sections":[],"depth":3}],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="parallelism-methods" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#parallelism-methods"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Parallelism methods</span></h1> <p data-svelte-h="svelte-1qer6co">Multi-GPU setups are effective for accelerating training and fitting large models in memory that otherwise wouldn’t fit on a single GPU. It relies on parallelizing the workload across GPUs. There are several types of parallelism such as data parallelism, tensor parallelism, pipeline parallelism, and model parallelism. Each type of parallelism splits the workload differently, whether it’s the data or the model.</p> <p data-svelte-h="svelte-14ilg8d">This guide will discuss the various parallelism methods, combining them, and choosing an appropriate strategy for your setup. For more details about distributed training, refer to the <a href="https://hf.co/docs/accelerate/index" rel="nofollow">Accelerate</a> documentation.</p> <p data-svelte-h="svelte-518p0f">For a comprehensive guide on scaling large language models, check out the <a href="https://huggingface.co/spaces/nanotron/ultrascale-playbook" rel="nofollow">Ultrascale Playbook</a>, which provides detailed strategies and best practices for training at scale.</p> <h2 class="relative group"><a id="scalability-strategy" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#scalability-strategy"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Scalability strategy</span></h2> <p data-svelte-h="svelte-145rkrl">Use the <a href="https://huggingface.co/spaces/hf-accelerate/model-memory-usage" rel="nofollow">Model Memory Calculator</a> to calculate how much memory a model requires. Then refer to the table below to select a strategy based on your setup.</p> <table data-svelte-h="svelte-170m9u1"><thead><tr><th>setup</th> <th>scenario</th> <th>strategy</th></tr></thead> <tbody><tr><td>single node/multi-GPU</td> <td>fits on single GPU</td> <td>DistributedDataParallel or ZeRO</td></tr> <tr><td></td> <td>doesn’t fit on single GPU</td> <td>PipelineParallel, ZeRO or TensorParallel</td></tr> <tr><td></td> <td>largest model layer doesn’t fit</td> <td>TensorParallel or ZeRO</td></tr> <tr><td>multi-node/multi-GPU</td> <td>fast inter-node connectivity (NVLink or NVSwitch)</td> <td>ZeRO or 3D parallelism (PipelineParallel, TensorParallel, DataParallel)</td></tr> <tr><td></td> <td>slow inter-node connectivity</td> <td>ZeRO or 3D parallelism (PipelineParallel, TensorParallel, DataParallel)</td></tr></tbody></table> <h2 class="relative group"><a id="data-parallelism" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#data-parallelism"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Data parallelism</span></h2> <p data-svelte-h="svelte-rz1c4v">Data parallelism evenly distributes data across multiple GPUs. Each GPU holds a copy of the model and concurrently processes their portion of the data. At the end, the results from each GPU are synchronized and combined.</p> <p data-svelte-h="svelte-wh4l6y">Data parallelism significantly reduces training time by processing data in parallel, and it is scalable to the number of GPUs available. However, synchronizing results from each GPU can add overhead.</p> <p data-svelte-h="svelte-ws2e5u">There are two types of data parallelism, DataParallel (DP) and DistributedDataParallel (DDP).</p> <h3 class="relative group"><a id="dataparallel" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#dataparallel"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>DataParallel</span></h3> <p data-svelte-h="svelte-ca23rj"><a href="https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html" rel="nofollow">DataParallel</a> supports distributed training on a <em>single machine</em> with multiple GPUs.</p> <ol data-svelte-h="svelte-gxqupu"><li>The default GPU, <code>GPU 0</code>, reads a batch of data and sends a mini batch of it to the other GPUs.</li> <li>An up-to-date model is replicated from <code>GPU 0</code> to the other GPUs.</li> <li>A <code>forward</code> pass is performed on each GPU and their outputs are sent to <code>GPU 0</code> to compute the loss.</li> <li>The loss is distributed from <code>GPU 0</code> to the other GPUs for the <code>backward</code> pass.</li> <li>The gradients from each GPU are sent back to <code>GPU 0</code> and averaged.</li></ol> <h3 class="relative group"><a id="distributeddataparallel" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#distributeddataparallel"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>DistributedDataParallel</span></h3> <p data-svelte-h="svelte-1gl0r79"><a href="https://pytorch.org/docs/main/notes/ddp.html" rel="nofollow">DistributedDataParallel</a> supports distributed training across <em>multiple machines</em> with multiple GPUs.</p> <ol data-svelte-h="svelte-7q0d0c"><li>The main process replicates the model from the default GPU, <code>GPU 0</code>, to each GPU.</li> <li>Each GPU directly processes a mini batch of data.</li> <li>The local gradients are averaged across all GPUs during the <code>backward</code> pass.</li></ol> <p data-svelte-h="svelte-10vdoy0">DDP is recommended because it reduces communication overhead between GPUs, efficiently utilizes each GPU, and scales to more than one machine.</p> <h3 class="relative group"><a id="zero-data-parallelism" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#zero-data-parallelism"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>ZeRO data parallelism</span></h3> <p data-svelte-h="svelte-1kxh7qs"><a href="https://www.deepspeed.ai/tutorials/zero/" rel="nofollow">Zero Redundancy Optimizer</a> is a more memory efficient type of data parallelism. It significantly improves memory efficiency by partitioning parameters, gradients, and optimizer states across data parallel processes to reduce memory usage. There are three ZeRO stages:</p> <ul data-svelte-h="svelte-1ltnzk6"><li>Stage 1 partitions the optimizer states</li> <li>Stage 2 partitions the optimizer and gradient states</li> <li>Stage 3 partitions the optimizer, gradient, and parameters</li></ul> <div class="flex justify-center" data-svelte-h="svelte-5tac98"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero.png"></div> <h2 class="relative group"><a id="model-parallelism" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#model-parallelism"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Model parallelism</span></h2> <p data-svelte-h="svelte-jq6w6f">Model parallelism distributes a model across multiple GPUs. There are several ways to split a model, but the typical method distributes the model layers across GPUs. On the <code>forward</code> pass, the first GPU processes a batch of data and passes it to the next group of layers on the next GPU. For the <code>backward</code> pass, the data is sent backward from the final layer to the first layer.</p> <p data-svelte-h="svelte-qbmj8o">Model parallelism is a useful strategy for training models that are too large to fit into the memory of a single GPU. However, GPU utilization is unbalanced because only one GPU is active at a time. Passing results between GPUs also adds communication overhead and it can be a bottleneck.</p> <h2 class="relative group"><a id="pipeline-parallelism" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#pipeline-parallelism"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Pipeline parallelism</span></h2> <p data-svelte-h="svelte-6fcjk9">Pipeline parallelism is conceptually very similar to model parallelism, but it’s more efficient because it reduces the amount of idle GPU time. Instead of waiting for each GPU to finish processing a batch of data, pipeline parallelism creates <em>micro-batches</em> of data. As soon as one micro-batch is finished, it is passed to the next GPU. This way, each GPU can concurrently process part of the data without waiting for the other GPU to completely finish processing a mini batch of data.</p> <p data-svelte-h="svelte-1lu6x7v">Pipeline parallelism shares the same advantages as model parallelism, but it optimizes GPU utilization and reduces idle time. But pipeline parallelism can be more complex because models may need to be rewritten as a sequence of <a href="https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html" rel="nofollow">nn.Sequential</a> modules and it also isn’t possible to completely reduce idle time because the last <code>forward</code> pass must also wait for the <code>backward</code> pass to finish.</p> <h2 class="relative group"><a id="tensor-parallelism" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#tensor-parallelism"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Tensor parallelism</span></h2> <p data-svelte-h="svelte-f2grft">Tensor parallelism distributes large tensor computations across multiple GPUs. The tensors are sliced horizontally or vertically and each slice is processed by a separate GPU. Each GPU performs its calculations on its tensor slice and the results are synchronized at the end to reconstruct the final result.</p> <p data-svelte-h="svelte-c1tumf">Tensor parallelism is effective for training large models that don’t fit into the memory of a single GPU. It is also faster and more efficient because each GPU can process its tensor slice in parallel, and it can be combined with other parallelism methods. Like other parallelism methods though, tensor parallelism adds communication overhead between GPUs.</p> <h2 class="relative group"><a id="hybrid-parallelism" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#hybrid-parallelism"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Hybrid parallelism</span></h2> <p data-svelte-h="svelte-fef8gx">Parallelism methods can be combined to achieve even greater memory savings and more efficiently train models with billions of parameters.</p> <h3 class="relative group"><a id="data-parallelism-and-pipeline-parallelism" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#data-parallelism-and-pipeline-parallelism"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Data parallelism and pipeline parallelism</span></h3> <p data-svelte-h="svelte-n4ng03">Data and pipeline parallelism distributes the data across GPUs and divides each mini batch of data into micro-batches to achieve pipeline parallelism.</p> <p data-svelte-h="svelte-1kl2xop">Each data parallel rank treats the process as if there were only one GPU instead of two, but GPUs 0 and 1 can offload micro-batches of data to GPUs 2 and 3 and reduce idle time.</p> <p data-svelte-h="svelte-3x6w3o">This approach optimizes parallel data processing by reducing idle GPU utilization.</p> <div class="flex justify-center" data-svelte-h="svelte-9tungy"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero-dp-pp.png"></div> <h3 class="relative group"><a id="zero-data-parallelism-pipeline-parallelism-and-model-parallelism-3d-parallelism" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#zero-data-parallelism-pipeline-parallelism-and-model-parallelism-3d-parallelism"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>ZeRO data parallelism, pipeline parallelism, and model parallelism (3D parallelism)</span></h3> <p data-svelte-h="svelte-rqxrr3">Data, pipeline and model parallelism combine to form <a href="https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/" rel="nofollow">3D parallelism</a> to optimize memory and compute efficiency.</p> <p data-svelte-h="svelte-n76min">Memory effiiciency is achieved by splitting the model across GPUs and also dividing it into stages to create a pipeline. This allows GPUs to work in parallel on micro-batches of data, reducing the memory usage of the model, optimizer, and activations.</p> <p data-svelte-h="svelte-17jf8fp">Compute efficiency is enabled by ZeRO data parallelism where each GPU only stores a slice of the model, optimizer, and activations. This allows higher communication bandwidth between data parallel nodes because communication can occur independently or in parallel with the other pipeline stages.</p> <p data-svelte-h="svelte-1orw26f">This approach is scalable to extremely large models with trillions of parameters.</p> <div class="flex justify-center" data-svelte-h="svelte-1qk46jz"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-deepspeed-3d.png"></div> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/transformers/blob/main/docs/source/en/perf_train_gpu_many.md" target="_blank"><span data-svelte-h="svelte-1kd6by1"><</span> <span data-svelte-h="svelte-x0xyl0">></span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p> | |
| <script> | |
| { | |
| __sveltekit_1bm5psi = { | |
| assets: "/docs/transformers/pr_36839/en", | |
| base: "/docs/transformers/pr_36839/en", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/transformers/pr_36839/en/_app/immutable/entry/start.6be8d590.js"), | |
| import("/docs/transformers/pr_36839/en/_app/immutable/entry/app.09748b4b.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 420], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |
Xet Storage Details
- Size:
- 29 kB
- Xet hash:
- 42153ff33fe2e7c2cb5adc7bccf073d17f62052a5cca3d359b7c499933b3edf3
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.