<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Debugging Distributed Operations&quot;,&quot;local&quot;:&quot;debugging-distributed-operations&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Visualizing the problem&quot;,&quot;local&quot;:&quot;visualizing-the-problem&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;The solution&quot;,&quot;local&quot;:&quot;the-solution&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}">
<link href="/docs/accelerate/main/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/accelerate/main/en/_app/immutable/entry/start.9292d64f.js">
<link rel="modulepreload" href="/docs/accelerate/main/en/_app/immutable/chunks/scheduler.00bde567.js">
<link rel="modulepreload" href="/docs/accelerate/main/en/_app/immutable/chunks/singletons.7ccea875.js">
<link rel="modulepreload" href="/docs/accelerate/main/en/_app/immutable/chunks/paths.33977732.js">
<link rel="modulepreload" href="/docs/accelerate/main/en/_app/immutable/entry/app.599b7725.js">
<link rel="modulepreload" href="/docs/accelerate/main/en/_app/immutable/chunks/index.752e2ff6.js">
<link rel="modulepreload" href="/docs/accelerate/main/en/_app/immutable/nodes/0.5413a06e.js">
<link rel="modulepreload" href="/docs/accelerate/main/en/_app/immutable/chunks/each.e59479a4.js">
<link rel="modulepreload" href="/docs/accelerate/main/en/_app/immutable/nodes/30.064d6cb9.js">
<link rel="modulepreload" href="/docs/accelerate/main/en/_app/immutable/chunks/CodeBlock.e62cd1dc.js">
<link rel="modulepreload" href="/docs/accelerate/main/en/_app/immutable/chunks/Heading.476d3364.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Debugging Distributed Operations&quot;,&quot;local&quot;:&quot;debugging-distributed-operations&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Visualizing the problem&quot;,&quot;local&quot;:&quot;visualizing-the-problem&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;The solution&quot;,&quot;local&quot;:&quot;the-solution&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="debugging-distributed-operations" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#debugging-distributed-operations"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Debugging Distributed Operations</span></h1> <p data-svelte-h="svelte-1d1ce2t">When running scripts in a distributed fashion, often functions such as <a href="/docs/accelerate/main/en/package_reference/accelerator#accelerate.Accelerator.gather">Accelerator.gather()</a> and 
<a href="/docs/accelerate/main/en/package_reference/accelerator#accelerate.Accelerator.reduce">Accelerator.reduce()</a> (and others) are necessary to gather tensors across devices and perform certain operations on them. However, if the tensors being gathered do not have the proper shapes, your code will hang. The only sign that this has happened is a timeout exception from <code>torch.distributed</code>, which can be quite costly because the timeout is usually 10 minutes.</p> <p data-svelte-h="svelte-mq3h5o">Accelerate now has a <code>debug</code> mode which adds a negligible amount of time to each operation, but allows it to verify that the inputs you are bringing in can <em>actually</em> perform the operation you want <strong>without</strong> hitting this timeout problem!</p> <h2 class="relative group"><a id="visualizing-the-problem" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#visualizing-the-problem"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Visualizing the problem</span></h2> <p data-svelte-h="svelte-1k3x71u">To have a tangible example of this issue, let’s 
take the following setup (on 2 GPUs):</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> accelerate <span class="hljs-keyword">import</span> PartialState
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> accelerate.utils <span class="hljs-keyword">import</span> broadcast

state = PartialState()
<span class="hljs-keyword">if</span> state.process_index == <span class="hljs-number">0</span>:
    tensor = torch.tensor([[<span class="hljs-number">0.0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>]]).to(state.device)
<span class="hljs-keyword">else</span>:
    tensor = torch.tensor([[[<span class="hljs-number">0.0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>], [<span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">9</span>]]]).to(state.device)
broadcast_tensor = broadcast(tensor)
<span class="hljs-built_in">print</span>(broadcast_tensor)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-q2oe9t">We’ve created a single tensor on each device, with two radically different shapes. With this setup, an operation such as <a href="/docs/accelerate/main/en/package_reference/utilities#accelerate.utils.broadcast">utils.broadcast()</a> will hang until it times out, because <code>torch.distributed</code> requires the tensors passed to these operations to have the <strong>exact same shape</strong> across all processes.</p> <p data-svelte-h="svelte-xp75q2">If you run this yourself, you will find that <code>broadcast_tensor</code> is printed on the main process, although its result is not quite right, and the script then hangs without ever printing it on any of the other processes:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> 
<pre class=""><!-- HTML_TAG_START -->&gt;&gt;&gt; tensor(<span class="hljs-string">[[0, 1, 2, 3, 4]]</span>, device=<span class="hljs-string">&#x27;cuda:0&#x27;</span>)<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="the-solution" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#the-solution"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>The solution</span></h2> <p data-svelte-h="svelte-1gx18x0">With Accelerate’s operational debug mode enabled, errors such as this are caught and reported immediately with a clear traceback:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path 
d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->Traceback (most recent call last):
  File <span class="hljs-string">&quot;/home/zach_mueller_huggingface_co/test.py&quot;</span>, line <span class="hljs-number">18</span>, <span class="hljs-keyword">in</span> &lt;module&gt;
    <span class="hljs-selector-tag">main</span>()
  File <span class="hljs-string">&quot;/home/zach_mueller_huggingface_co/test.py&quot;</span>, line <span class="hljs-number">15</span>, <span class="hljs-keyword">in</span> <span class="hljs-selector-tag">main</span>
    broadcast_tensor = <span class="hljs-built_in">broadcast</span>(tensor)
  File <span class="hljs-string">&quot;/home/zach_mueller_huggingface_co/accelerate/src/accelerate/utils/operations.py&quot;</span>, line <span class="hljs-number">303</span>, <span class="hljs-keyword">in</span> wrapper
    broadcast_tensor = <span class="hljs-built_in">broadcast</span>(tensor)
accelerate<span class="hljs-selector-class">.utils</span><span class="hljs-selector-class">.operations</span><span class="hljs-selector-class">.DistributedOperationException</span>: Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid.
Operation: `accelerate<span class="hljs-selector-class">.utils</span><span class="hljs-selector-class">.operations</span>.broadcast`
Input shapes:
- Process <span class="hljs-number">0</span>: <span class="hljs-selector-attr">[1, 5]</span>
- Process <span class="hljs-number">1</span>: <span class="hljs-selector-attr">[1, 2, 5]</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1zoq07">This explains that the shapes across our devices were <em>not</em> the same, and that we should ensure that they match properly to be compatible. Typically this means that there is either an extra dimension, or certain dimensions are incompatible with the operation.</p> <p data-svelte-h="svelte-161v7q2">To enable debug mode, do one of the following:</p> <p data-svelte-h="svelte-1oml3n7">Enable it through the questionnaire during <code>accelerate config</code> (recommended)</p> <p data-svelte-h="svelte-1wczss9">From the CLI:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-comment">accelerate launch</span> <span class="hljs-literal">--</span><span 
class="hljs-comment">debug {my_script</span><span class="hljs-string">.</span><span class="hljs-comment">py}</span> <span class="hljs-literal">--</span><span class="hljs-comment">arg1</span> <span class="hljs-literal">--</span><span class="hljs-comment">arg2</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-15izvyq">As an environment variable (which avoids the need for <code>accelerate launch</code>):</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->ACCELERATE_DEBUG_MODE=<span class="hljs-string">&quot;1&quot;</span> accelerate <span class="hljs-built_in">launch</span> {my_script.py} <span class="hljs-comment">--arg1 --arg2</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-wougyt">Manually changing the <code>config.yaml</code> file:</p> <div class="code-block relative"><div class="absolute 
top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --> compute_environment: LOCAL_MACHINE
<span class="hljs-addition">+debug: true</span><!-- HTML_TAG_END --></pre></div> <p></p>
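<p>To make the idea behind the shape check concrete, here is a minimal, single-machine sketch of the kind of verification described above: compare the input shape reported by every process and fail fast with a readable message instead of waiting for a timeout. The names <code>verify_operation_shapes</code> and <code>ShapeMismatchError</code> are hypothetical and are not part of Accelerate's API. The sketch also assumes the per-process shapes have already been collected into one list, which a real distributed run would have to do first.</p>

```python
from typing import Sequence


class ShapeMismatchError(RuntimeError):
    """Hypothetical error type used only in this sketch."""


def verify_operation_shapes(operation: str, shapes: Sequence[Sequence[int]]) -> None:
    """Raise if the per-process input shapes are not all identical.

    `shapes[i]` is the shape of the tensor held by process `i`. They are
    passed in directly (an assumption) so the sketch runs without any
    distributed setup.
    """
    normalized = [list(shape) for shape in shapes]
    if all(shape == normalized[0] for shape in normalized):
        return  # Every process holds the same shape; the operation can proceed.
    details = "\n".join(
        f"  - Process {index}: {shape}" for index, shape in enumerate(normalized)
    )
    raise ShapeMismatchError(
        "Cannot apply desired operation due to shape mismatches.\n"
        f"Operation: `{operation}`\nInput shapes:\n{details}"
    )


if __name__ == "__main__":
    # Matching shapes pass silently.
    verify_operation_shapes("broadcast", [[1, 5], [1, 5]])

    # The shapes from the two-GPU example fail immediately instead of hanging.
    try:
        verify_operation_shapes("broadcast", [[1, 5], [1, 2, 5]])
    except ShapeMismatchError as error:
        print(error)
```

<p>Checking shapes before the collective call is what turns a long wait for a timeout into an immediate, descriptive error; the cost is one extra comparison per operation.</p>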
<script>
{
__sveltekit_12ratix = {
assets: "/docs/accelerate/main/en",
base: "/docs/accelerate/main/en",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/accelerate/main/en/_app/immutable/entry/start.9292d64f.js"),
import("/docs/accelerate/main/en/_app/immutable/entry/app.599b7725.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 30],
data,
form: null,
error: null
});
});
}
</script>
