<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Distributed inference","local":"distributed-inference","sections":[{"title":"Sending chunks of a batch automatically to each loaded model","local":"sending-chunks-of-a-batch-automatically-to-each-loaded-model","sections":[],"depth":2},{"title":"Memory-efficient pipeline parallelism (experimental)","local":"memory-efficient-pipeline-parallelism-experimental","sections":[],"depth":2}],"depth":1}">
| <link rel="modulepreload" href="/docs/accelerate/pr_4021/en/_app/immutable/chunks/CodeBlock.844ff9c3.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Distributed inference","local":"distributed-inference","sections":[{"title":"Sending chunks of a batch automatically to each loaded model","local":"sending-chunks-of-a-batch-automatically-to-each-loaded-model","sections":[],"depth":2},{"title":"Memory-efficient pipeline parallelism (experimental)","local":"memory-efficient-pipeline-parallelism-experimental","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <div class="items-center shrink-0 min-w-[100px] max-sm:min-w-[50px] justify-end ml-auto flex" style="float: right; margin-left: 10px; display: inline-flex; position: relative; z-index: 10;"><div class="inline-flex rounded-md max-sm:rounded-sm"><button class="inline-flex items-center gap-1 h-7 max-sm:h-7 px-2 max-sm:px-1.5 text-sm font-medium text-gray-800 border border-r-0 rounded-l-md max-sm:rounded-l-sm border-gray-200 bg-white hover:shadow-inner dark:border-gray-850 dark:bg-gray-950 dark:text-gray-200 dark:hover:bg-gray-800" aria-live="polite"><span class="inline-flex items-center justify-center rounded-md p-0.5 max-sm:p-0 hover:text-gray-800 dark:hover:text-gray-200"><svg class="sm:size-3.5 size-3" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg></span> <span>Copy page</span></button> <button class="inline-flex items-center justify-center w-6 max-sm:w-5 h-7 max-sm:h-7 disabled:pointer-events-none text-sm text-gray-500 hover:text-gray-700 
dark:hover:text-white rounded-r-md max-sm:rounded-r-sm border border-l transition border-gray-200 bg-white hover:shadow-inner dark:border-gray-850 dark:bg-gray-950 dark:text-gray-200 dark:hover:bg-gray-800" aria-haspopup="menu" aria-expanded="false" aria-label="Open copy menu"><svg class="transition-transform text-gray-400 overflow-visible sm:size-3.5 size-3 rotate-0" width="1em" height="1em" viewBox="0 0 12 7" fill="none" xmlns="http://www.w3.org/2000/svg"><path d="M1 1L6 6L11 1" stroke="currentColor"></path></svg></button></div> </div> <h1 class="relative group"><a id="distributed-inference" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#distributed-inference"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Distributed inference</span></h1> <p data-svelte-h="svelte-pzys4w">Distributed inference can fall into three brackets:</p> <ol data-svelte-h="svelte-auhvyh"><li>Loading an entire model onto each GPU and sending chunks of a batch through each GPU’s model copy at a time</li> <li>Loading parts of a model onto each GPU and processing a single input at one time</li> <li>Loading parts of a model onto each GPU and using what 
is called scheduled Pipeline Parallelism to combine the two prior techniques.</li></ol> <p data-svelte-h="svelte-105ktyk">We’re going to go through the first and the last bracket, showcasing how to do each as they are more realistic scenarios.</p> <h2 class="relative group"><a id="sending-chunks-of-a-batch-automatically-to-each-loaded-model" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#sending-chunks-of-a-batch-automatically-to-each-loaded-model"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Sending chunks of a batch automatically to each loaded model</span></h2> <p data-svelte-h="svelte-1ojc10y">This is the most memory-intensive solution, as it requires each GPU to keep a full copy of the model in memory at a given time.</p> <p data-svelte-h="svelte-1qi053h">Normally when doing this, users send the model to a specific device to load it from the CPU, and then move each prompt to a different device.</p> <p data-svelte-h="svelte-rhvic4">A basic pipeline using the <code>diffusers</code> library might look something like so:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button 
class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch | |
<span class="hljs-keyword">import</span> torch.distributed <span class="hljs-keyword">as</span> dist
<span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> DiffusionPipeline
| pipe = DiffusionPipeline.from_pretrained(<span class="hljs-string">"runwayml/stable-diffusion-v1-5"</span>, torch_dtype=torch.float16)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1dzspg1">Followed then by performing inference based on the specific prompt:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">run_inference</span>(<span class="hljs-params">rank, world_size</span>): | |
    dist.init_process_group(<span class="hljs-string">"nccl"</span>, rank=rank, world_size=world_size)
    pipe.to(rank)
    <span class="hljs-keyword">if</span> torch.distributed.get_rank() == <span class="hljs-number">0</span>:
        prompt = <span class="hljs-string">"a dog"</span>
    <span class="hljs-keyword">elif</span> torch.distributed.get_rank() == <span class="hljs-number">1</span>:
        prompt = <span class="hljs-string">"a cat"</span>
    result = pipe(prompt).images[<span class="hljs-number">0</span>]
    result.save(<span class="hljs-string">f"result_<span class="hljs-subst">{rank}</span>.png"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1ov43u6">One will notice how we have to check the rank to know what prompt to send, which can be a bit tedious.</p> <p data-svelte-h="svelte-1n5a8vc">A user might then also think that with Accelerate, using the <code>Accelerator</code> to prepare a dataloader for such a task might also be
a simple way to manage this. (To learn more, check out the relevant section in the <a href="../quicktour#distributed-evaluation">Quick Tour</a>.)</p> <p data-svelte-h="svelte-12rt8sz">Can it manage this? Yes. However, it also adds unneeded extra code.</p> <p data-svelte-h="svelte-gw34xi">With Accelerate, we can simplify this process by using the <a href="/docs/accelerate/pr_4021/en/package_reference/accelerator#accelerate.Accelerator.split_between_processes">Accelerator.split_between_processes()</a> context manager (which also exists in <code>PartialState</code> and <code>AcceleratorState</code>).
This function will automatically split whatever data you pass to it (be it a prompt, a set of tensors, a dictionary of the prior data, etc.) across all the processes (with a potential
| to be padded) for you to use right away.</p> <p data-svelte-h="svelte-8uk323">Let’s rewrite the above example using this context manager:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch | |
| <span class="hljs-keyword">from</span> accelerate <span class="hljs-keyword">import</span> PartialState <span class="hljs-comment"># Can also be Accelerator or AcceleratorState</span> | |
| <span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> DiffusionPipeline | |
| pipe = DiffusionPipeline.from_pretrained(<span class="hljs-string">"runwayml/stable-diffusion-v1-5"</span>, torch_dtype=torch.float16) | |
| distributed_state = PartialState() | |
| pipe.to(distributed_state.device) | |
| <span class="hljs-comment"># Assume two processes</span> | |
| <span class="hljs-keyword">with</span> distributed_state.split_between_processes([<span class="hljs-string">"a dog"</span>, <span class="hljs-string">"a cat"</span>]) <span class="hljs-keyword">as</span> prompt: | |
| result = pipe(prompt).images[<span class="hljs-number">0</span>] | |
    result.save(<span class="hljs-string">f"result_<span class="hljs-subst">{distributed_state.process_index}</span>.png"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-k1tgc3">Then, to launch the code, we can use Accelerate.</p> <p data-svelte-h="svelte-glszdf">If you have generated a config file using <code>accelerate config</code>:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->accelerate launch distributed_inference.py<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1beq8se">If you have a specific config file you want to use:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 
" title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->accelerate launch --config_file my_config.json distributed_inference.py<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-52p69u">Or if you don’t want to make any config files and want to launch on two GPUs:</p> <blockquote data-svelte-h="svelte-ix7ij8"><p>Note: You will get some warnings about values being guessed based on your system. 
To remove these you can do <code>accelerate config default</code> or go through <code>accelerate config</code> to create a config file.</p></blockquote> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->accelerate launch --num_processes 2 distributed_inference.py<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-ui4crb">We’ve now reduced the boilerplate code needed to split this data to a few lines of code quite easily.</p> <p data-svelte-h="svelte-dyqt5a">But what if we have an odd distribution of prompts to GPUs? For example, what if we have 3 prompts, but only 2 GPUs?</p> <p data-svelte-h="svelte-1o56krx">Under the context manager, the first GPU would receive the first two prompts and the second GPU the third, ensuring that | |
| all prompts are split and no overhead is needed.</p> <p data-svelte-h="svelte-1y9mak3"><em>However</em>, what if we then wanted to do something with the results of <em>all the GPUs</em>? (Say gather them all and perform some kind of post processing) | |
You can pass in <code>apply_padding=True</code> to ensure that the lists of prompts are padded to the same length, with extra data being taken
from the last sample. This way all GPUs will have the same number of prompts, and you can then gather the results.</p> <blockquote class="tip"><p data-svelte-h="svelte-rtowy2">This is only needed when trying to perform an action such as gathering the results, where the data on each device
| needs to be the same length. Basic inference does not require this.</p></blockquote> <p data-svelte-h="svelte-4vay6o">For instance:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch | |
| <span class="hljs-keyword">from</span> accelerate <span class="hljs-keyword">import</span> PartialState <span class="hljs-comment"># Can also be Accelerator or AcceleratorState</span> | |
| <span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> DiffusionPipeline | |
| pipe = DiffusionPipeline.from_pretrained(<span class="hljs-string">"runwayml/stable-diffusion-v1-5"</span>, torch_dtype=torch.float16) | |
| distributed_state = PartialState() | |
| pipe.to(distributed_state.device) | |
| <span class="hljs-comment"># Assume two processes</span> | |
| <span class="hljs-keyword">with</span> distributed_state.split_between_processes([<span class="hljs-string">"a dog"</span>, <span class="hljs-string">"a cat"</span>, <span class="hljs-string">"a chicken"</span>], apply_padding=<span class="hljs-literal">True</span>) <span class="hljs-keyword">as</span> prompt: | |
| result = pipe(prompt).images<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-42h08l">On the first GPU, the prompts will be <code>["a dog", "a cat"]</code>, and on the second GPU it will be <code>["a chicken", "a chicken"]</code>. | |
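This split-and-pad behavior can be sketched in a few lines of plain Python. Note this is an illustrative stand-in, not Accelerate’s actual implementation, and <code>split_for_process</code> is a hypothetical helper name:

```python
# Illustrative sketch of how split_between_processes divides an uneven
# list across processes; NOT the actual Accelerate implementation.
def split_for_process(inputs, num_processes, process_index, apply_padding=False):
    # Earlier processes receive one extra item when the split is uneven.
    base, extra = divmod(len(inputs), num_processes)
    start = process_index * base + min(process_index, extra)
    end = start + base + (1 if process_index < extra else 0)
    chunk = list(inputs[start:end])
    if apply_padding and extra:
        # Pad shorter chunks by repeating the final sample so every
        # process ends up holding the same number of items (base + 1).
        while len(chunk) < base + 1:
            chunk.append(inputs[-1])
    return chunk

prompts = ["a dog", "a cat", "a chicken"]
# Without padding: process 0 gets ["a dog", "a cat"], process 1 gets ["a chicken"]
# With apply_padding=True: process 1 gets ["a chicken", "a chicken"]
```

Earlier processes receive any extra items, and padding simply repeats the final sample until every chunk has the same length.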
| Make sure to drop the final sample, as it will be a duplicate of the previous one.</p> <p data-svelte-h="svelte-13go2fo">You can find more complex examples <a href="https://github.com/huggingface/accelerate/tree/main/examples/inference/distributed" rel="nofollow">here</a> such as how to use it with LLMs.</p> <h2 class="relative group"><a id="memory-efficient-pipeline-parallelism-experimental" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#memory-efficient-pipeline-parallelism-experimental"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Memory-efficient pipeline parallelism (experimental)</span></h2> <p data-svelte-h="svelte-ynzvto">This next part will discuss using <em>pipeline parallelism</em>. This is an <strong>experimental</strong> API that utilizes <a href="https://pytorch.org/docs/stable/distributed.pipelining.html#" rel="nofollow">torch.distributed.pipelining</a> as a native solution.</p> <p data-svelte-h="svelte-1l3sfxh">The general idea with pipeline parallelism is: say you have 4 GPUs and a model big enough it can be <em>split</em> on four GPUs using <code>device_map="auto"</code>. 
With this method you can send in 4 inputs at a time (for example here, any amount works) and each model chunk will work on an input, then receive the next input once the prior chunk finished, making it <em>much</em> more efficient <strong>and faster</strong> than the method described earlier. Here’s a visual taken from the PyTorch repository:</p> <p data-svelte-h="svelte-l2zgqj"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/accelerate/pipeline_parallel.png" alt="Pipeline parallelism example"></p> <p data-svelte-h="svelte-heqe2k">To illustrate how you can use this with Accelerate, we have created an <a href="https://github.com/huggingface/accelerate/tree/main/examples/inference" rel="nofollow">example zoo</a> showcasing a number of different models and situations. In this tutorial, we’ll show this method for GPT2 across two GPUs.</p> <p data-svelte-h="svelte-1pr52sj">Before you proceed, please make sure you have the latest PyTorch version installed by running the following:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div 
class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip install torch<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mviugk">Start by creating the model on the CPU:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->from transformers <span class="hljs-keyword">import</span> GPT2ForSequenceClassification, GPT2Config | |
config = <span class="hljs-built_in">GPT2Config</span>()
model = <span class="hljs-built_in">GPT2ForSequenceClassification</span>(config)
model.<span class="hljs-built_in">eval</span>()<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-15xaqzv">Next you’ll need to create some example inputs to use. These help <code>torch.distributed.pipelining</code> trace the model.</p> <blockquote class="warning">How you construct this example determines the relative batch size that will be used/passed
| through the model at a given time, so make sure to remember how many items there are!</blockquote> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->input = torch.randint( | |
| <span class="hljs-attribute">low</span>=0, | |
| <span class="hljs-attribute">high</span>=config.vocab_size, | |
| size=(2, 1024), # bs x seq_len | |
| <span class="hljs-attribute">device</span>=<span class="hljs-string">"cpu"</span>, | |
| <span class="hljs-attribute">dtype</span>=torch.int64, | |
| <span class="hljs-attribute">requires_grad</span>=<span class="hljs-literal">False</span>, | |
| )<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-11mjb3e">Next we need to actually perform the tracing and get the model ready. To do so, use the <a href="/docs/accelerate/pr_4021/en/package_reference/inference#accelerate.prepare_pippy">inference.prepare_pippy()</a> function and it will fully wrap the model for pipeline parallelism automatically:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> accelerate.inference <span class="hljs-keyword">import</span> prepare_pippy | |
| model = prepare_pippy(model, example_args=(<span class="hljs-keyword">input</span>,))<!-- HTML_TAG_END --></pre></div> <blockquote class="tip"><p data-svelte-h="svelte-123k9fd">There are a variety of parameters you can pass through to <code>prepare_pippy</code>:</p> <ul data-svelte-h="svelte-1bax0io"><li><p><code>split_points</code> lets you determine which layers to split the model at. By default, we use wherever <code>device_map="auto"</code> declares, such as <code>fc</code> or <code>conv1</code>.</p></li> <li><p><code>num_chunks</code> determines how the batch will be split and sent to the model itself (so <code>num_chunks=1</code> with four split points/four GPUs results in naive model parallelism, where a single input is passed between the four layer split points).</p></li></ul></blockquote> <p data-svelte-h="svelte-t45mca">From here, all that’s left is to actually perform the distributed inference!</p> <blockquote class="warning"><p data-svelte-h="svelte-1cjitly">When passing inputs, we highly recommend passing them in as a tuple of arguments. 
Using <code>kwargs</code> is supported; however, this approach is experimental.</p></blockquote> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-variable">args</span> = <span class="hljs-variable">some_more_arguments</span> | |
| <span class="hljs-variable">with</span> <span class="hljs-variable">torch.no_grad</span>(): | |
| <span class="hljs-variable">output</span> = <span class="hljs-function"><span class="hljs-title">model</span>(*<span class="hljs-variable">args</span>)</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1f6tfb4">When finished, all the data will be on the last process only:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> accelerate <span class="hljs-keyword">import</span> PartialState | |
| <span class="hljs-keyword">if</span> PartialState().is_last_process: | |
| <span class="hljs-built_in">print</span>(output)<!-- HTML_TAG_END --></pre></div> <blockquote class="tip"><p data-svelte-h="svelte-fqz35p">If you pass in <code>gather_output=True</code> to <a href="/docs/accelerate/pr_4021/en/package_reference/inference#accelerate.prepare_pippy">inference.prepare_pippy()</a>, the output will be sent | |
| across to all the GPUs afterwards without needing the <code>is_last_process</code> check. This is | |
| <code>False</code> by default as it incurs a communication call.</p></blockquote> <p data-svelte-h="svelte-bz2a3m">And that’s it! To explore more, please check out the inference examples in the <a href="https://github.com/huggingface/accelerate/tree/main/examples/inference/pippy" rel="nofollow">Accelerate repo</a> and our <a href="../package_reference/inference">documentation</a> as we work on improving this integration.</p> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/accelerate/blob/main/docs/source/usage_guides/distributed_inference.md" target="_blank"><svg class="mr-1" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M31,16l-7,7l-1.41-1.41L28.17,16l-5.58-5.59L24,9l7,7z"></path><path d="M1,16l7-7l1.41,1.41L3.83,16l5.58,5.59L8,23l-7-7z"></path><path d="M12.419,25.484L17.639,6.552l1.932,0.518L14.351,26.002z"></path></svg> <span data-svelte-h="svelte-zjs2n5"><span class="underline">Update</span> on GitHub</span></a> <p></p> | |
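To build intuition for the <code>num_chunks</code> parameter described earlier, here is a minimal plain-Python sketch of how a batch can be divided into microbatches. This is a conceptual illustration only, not Accelerate's actual implementation (which operates on tensors); <code>split_into_chunks</code> is a hypothetical helper invented for this example.

```python
# Conceptual sketch of microbatch splitting for pipeline parallelism.
# With num_chunks=1 the whole batch moves through the pipeline stages one
# after another (naive model parallelism); with num_chunks > 1 the resulting
# microbatches can overlap across stages, keeping more GPUs busy.

def split_into_chunks(batch, num_chunks):
    """Split `batch` into up to `num_chunks` near-equal microbatches."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be >= 1")
    base, extra = divmod(len(batch), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        chunks.append(batch[start:start + size])
        start += size
    return [c for c in chunks if c]  # drop empty chunks for tiny batches

batch = list(range(8))               # stand-in for a batch of 8 sequences
print(split_into_chunks(batch, 1))   # one chunk of 8: naive MP
print(split_into_chunks(batch, 4))   # four microbatches of 2
```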
| <script> | |
| { | |
| __sveltekit_1q7nz6m = { | |
| assets: "/docs/accelerate/pr_4021/en", | |
| base: "/docs/accelerate/pr_4021/en", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/accelerate/pr_4021/en/_app/immutable/entry/start.8a49e72b.js"), | |
| import("/docs/accelerate/pr_4021/en/_app/immutable/entry/app.1df4d18e.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 44], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |
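The <code>gather_output</code> behavior from the tip above can also be sketched in plain Python. This toy simulation (not Accelerate's implementation; <code>run_pipeline</code> and its arguments are invented for illustration) shows why the <code>is_last_process</code> check is needed by default, and what the extra communication in <code>gather_output=True</code> buys you.

```python
# Toy illustration of gather_output: with gather_output=False only the last
# pipeline stage ends up holding the result; with gather_output=True the
# result is copied to every rank, at the cost of an extra communication step.

def run_pipeline(ranks, result, gather_output=False):
    """Return a mapping of rank -> output after a simulated forward pass."""
    outputs = {rank: None for rank in ranks}
    outputs[ranks[-1]] = result          # output lands on the last stage
    if gather_output:
        for rank in ranks:               # simulate broadcasting to all ranks
            outputs[rank] = result
    return outputs

ranks = [0, 1, 2, 3]
print(run_pipeline(ranks, "logits"))                      # only rank 3 has it
print(run_pipeline(ranks, "logits", gather_output=True))  # every rank has it
```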