Buckets:

hf-doc-build/doc / diffusers /v0.25.0 /pt /tutorials /fast_diffusion.html
rtrm's picture
download
raw
49.1 kB
<!-- META HERE --><meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;local&quot;:&quot;accelerate-inference-of-texttoimage-diffusion-models&quot;,&quot;sections&quot;:[{&quot;local&quot;:&quot;setup&quot;,&quot;title&quot;:&quot;Setup&quot;},{&quot;local&quot;:&quot;baseline&quot;,&quot;title&quot;:&quot;Baseline&quot;},{&quot;local&quot;:&quot;running-inference-in-bfloat16&quot;,&quot;title&quot;:&quot;Running inference in bfloat16&quot;},{&quot;local&quot;:&quot;running-attention-efficiently&quot;,&quot;title&quot;:&quot;Running attention efficiently&quot;},{&quot;local&quot;:&quot;use-faster-kernels-with-torchcompile&quot;,&quot;title&quot;:&quot;Use faster kernels with torch.compile&quot;},{&quot;local&quot;:&quot;combine-the-projection-matrices-of-attention&quot;,&quot;title&quot;:&quot;Combine the projection matrices of attention&quot;},{&quot;local&quot;:&quot;dynamic-quantization&quot;,&quot;title&quot;:&quot;Dynamic quantization&quot;}],&quot;title&quot;:&quot;Accelerate inference of text-to-image diffusion models&quot;}">
<link href="/docs/diffusers/v0.25.0/pt/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/entry/start.66168f11.js">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/chunks/scheduler.182ea377.js">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/chunks/singletons.dd1a9a70.js">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/chunks/index.1f6d62f6.js">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/chunks/paths.50799595.js">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/entry/app.ca5b0bd9.js">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/chunks/index.008d68e4.js">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/nodes/0.fb7f344b.js">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/chunks/each.e59479a4.js">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/nodes/148.d7fb1e1b.js">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/chunks/Tip.4f096367.js">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/chunks/IconCopyLink.96bbb92b.js">
<link rel="modulepreload" href="/docs/diffusers/v0.25.0/pt/_app/immutable/chunks/CodeBlock.5ed6eb7b.js"><!-- HEAD_svelte-1phssyn_START --><!-- HEAD_svelte-1phssyn_END --> <h1 class="relative group"><a id="accelerate-inference-of-texttoimage-diffusion-models" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#accelerate-inference-of-texttoimage-diffusion-models"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span data-svelte-h="svelte-dvdb0g">Accelerate inference of text-to-image diffusion models</span></h1> <p data-svelte-h="svelte-pkpiyo">Diffusion models are known to be slower than their counter parts, GANs, because of the iterative and sequential reverse diffusion process. Recent works try to address limitation with:</p> <ul data-svelte-h="svelte-1n3ja6f"><li>progressive timestep distillation (such as <a href="../using-diffusers/inference_with_lcm_lora.md">LCM LoRA</a>)</li> <li>model compression (such as <a href="https://huggingface.co/segmind/SSD-1B" rel="nofollow">SSD-1B</a>)</li> <li>reusing adjacent features of the denoiser (such as <a href="https://github.com/horseee/DeepCache" rel="nofollow">DeepCache</a>)</li></ul> <p data-svelte-h="svelte-1ahq65p">In this tutorial, we focus on leveraging the power of PyTorch 2 to accelerate the inference latency of text-to-image diffusion pipeline, instead. We will use <a href="../using-diffusers/sdxl.md">Stable Diffusion XL (SDXL)</a> as a case study, but the techniques we will discuss should extend to other text-to-image diffusion pipelines.</p> <h2 class="relative group"><a id="setup" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#setup"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span data-svelte-h="svelte-1bbncge">Setup</span></h2> <p data-svelte-h="svelte-609t4e">Make sure you’re on the latest version of <code>diffusers</code>:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START -->pip install -U diffusers<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1s5viqc">Then upgrade the other required libraries too:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START -->pip install -U transformers accelerate peft<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-2y23yy">To benefit from the fastest kernels, use PyTorch nightly. You can find the installation instructions <a href="https://pytorch.org/" rel="nofollow">here</a>.</p> <p data-svelte-h="svelte-zfizj4">To report the numbers shown below, we used an 80GB 400W A100 with its clock rate set to the maximum.</p> <p data-svelte-h="svelte-teht3n"><em>This tutorial doesn’t present the benchmarking code and focuses on how to perform the optimizations, instead. For the full benchmarking code, refer to: <a href="https://github.com/huggingface/diffusion-fast" rel="nofollow">https://github.com/huggingface/diffusion-fast</a>.</em></p> <h2 class="relative group"><a id="baseline" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#baseline"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span data-svelte-h="svelte-pq3mj4">Baseline</span></h2> <p data-svelte-h="svelte-zqipth">Let’s start with a baseline. Disable the use of a reduced precision and <a href="../optimization/torch2.0.md"><code>scaled_dot_product_attention</code></a>:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> StableDiffusionXLPipeline
<span class="hljs-comment"># Load the pipeline in full-precision and place its model components on CUDA.</span>
pipe = StableDiffusionXLPipeline.from_pretrained(
<span class="hljs-string">&quot;stabilityai/stable-diffusion-xl-base-1.0&quot;</span>
).to(<span class="hljs-string">&quot;cuda&quot;</span>)
<span class="hljs-comment"># Run the attention ops without efficiency.</span>
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()
prompt = <span class="hljs-string">&quot;Astronaut in a jungle, cold color palette, muted colors, detailed, 8k&quot;</span>
image = pipe(prompt, num_inference_steps=<span class="hljs-number">30</span>).images[<span class="hljs-number">0</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1x1lehr">This takes 7.36 seconds:</p> <div align="center" data-svelte-h="svelte-1do855s"><img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_0.png" width="500"></div> <h2 class="relative group"><a id="running-inference-in-bfloat16" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#running-inference-in-bfloat16"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span data-svelte-h="svelte-upf0w9">Running inference in bfloat16</span></h2> <p data-svelte-h="svelte-59kzwp">Enable the first optimization: use a reduced precision to run the inference.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> StableDiffusionXLPipeline
<span class="hljs-keyword">import</span> torch
pipe = StableDiffusionXLPipeline.from_pretrained(
<span class="hljs-string">&quot;stabilityai/stable-diffusion-xl-base-1.0&quot;</span>, torch_dtype=torch.bfloat16
).to(<span class="hljs-string">&quot;cuda&quot;</span>)
<span class="hljs-comment"># Run the attention ops without efficiency.</span>
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()
prompt = <span class="hljs-string">&quot;Astronaut in a jungle, cold color palette, muted colors, detailed, 8k&quot;</span>
image = pipe(prompt, num_inference_steps=<span class="hljs-number">30</span>).images[<span class="hljs-number">0</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1kdjw5n">bfloat16 reduces the latency from 7.36 seconds to 4.63 seconds:</p> <div align="center" data-svelte-h="svelte-lxnqqr"><img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_1.png" width="500"></div> <p data-svelte-h="svelte-17gxfqr"><strong>Why bfloat16?</strong></p> <ul data-svelte-h="svelte-eu6tje"><li>Using a reduced numerical precision (such as float16, bfloat16) to run inference doesn’t affect the generation quality but significantly improves latency.</li> <li>The benefits of using the bfloat16 numerical precision as compared to float16 are hardware-dependent. Modern generations of GPUs tend to favor bfloat16.</li> <li>Furthermore, in our experiments, we bfloat16 to be much more resilient when used with quantization in comparison to float16.</li></ul> <p data-svelte-h="svelte-ezaz6j">We have a <a href="../optimization/fp16.md">dedicated guide</a> for running inference in a reduced precision.</p> <h2 class="relative group"><a id="running-attention-efficiently" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#running-attention-efficiently"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span data-svelte-h="svelte-bntvlu">Running attention efficiently</span></h2> <p data-svelte-h="svelte-i6cij5">Attention blocks are intensive to run. But with PyTorch’s <a href="../optimization/torch2.0.md"><code>scaled_dot_product_attention</code></a>, we can run them efficiently.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> StableDiffusionXLPipeline
<span class="hljs-keyword">import</span> torch
pipe = StableDiffusionXLPipeline.from_pretrained(
<span class="hljs-string">&quot;stabilityai/stable-diffusion-xl-base-1.0&quot;</span>, torch_dtype=torch.bfloat16
).to(<span class="hljs-string">&quot;cuda&quot;</span>)
prompt = <span class="hljs-string">&quot;Astronaut in a jungle, cold color palette, muted colors, detailed, 8k&quot;</span>
image = pipe(prompt, num_inference_steps=<span class="hljs-number">30</span>).images[<span class="hljs-number">0</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1ya4pwz"><code>scaled_dot_product_attention</code> improves the latency from 4.63 seconds to 3.31 seconds.</p> <div align="center" data-svelte-h="svelte-1vt8crq"><img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_2.png" width="500"></div> <h2 class="relative group"><a id="use-faster-kernels-with-torchcompile" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#use-faster-kernels-with-torchcompile"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span data-svelte-h="svelte-1rdd1jk">Use faster kernels with torch.compile</span></h2> <p data-svelte-h="svelte-125vukj">Compile the UNet and the VAE to benefit from the faster kernels. First, configure a few compiler flags:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> StableDiffusionXLPipeline
<span class="hljs-keyword">import</span> torch
torch._inductor.config.conv_1x1_as_mm = <span class="hljs-literal">True</span>
torch._inductor.config.coordinate_descent_tuning = <span class="hljs-literal">True</span>
torch._inductor.config.epilogue_fusion = <span class="hljs-literal">False</span>
torch._inductor.config.coordinate_descent_check_all_directions = <span class="hljs-literal">True</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-wn8tu2">For the full list of compiler flags, refer to <a href="https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py" rel="nofollow">this file</a>.</p> <p data-svelte-h="svelte-1xhp6zo">It is also important to change the memory layout of the UNet and the VAE to “channels_last” when compiling them. This ensures maximum speed:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START -->pipe.unet.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1huthp1">Then, compile and perform inference:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START --><span class="hljs-comment"># Compile the UNet and VAE.</span>
pipe.unet = torch.<span class="hljs-built_in">compile</span>(pipe.unet, mode=<span class="hljs-string">&quot;max-autotune&quot;</span>, fullgraph=<span class="hljs-literal">True</span>)
pipe.vae.decode = torch.<span class="hljs-built_in">compile</span>(pipe.vae.decode, mode=<span class="hljs-string">&quot;max-autotune&quot;</span>, fullgraph=<span class="hljs-literal">True</span>)
prompt = <span class="hljs-string">&quot;Astronaut in a jungle, cold color palette, muted colors, detailed, 8k&quot;</span>
<span class="hljs-comment"># First call to `pipe` will be slow, subsequent ones will be faster.</span>
image = pipe(prompt, num_inference_steps=<span class="hljs-number">30</span>).images[<span class="hljs-number">0</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1r146mn"><code>torch.compile</code> offers different backends and modes. As we’re aiming for maximum inference speed, we opt for the inductor backend using the “max-autotune”. “max-autotune” uses CUDA graphs and optimizes the compilation graph specifically for latency. Specifying fullgraph to be True ensures that there are no graph breaks in the underlying model, ensuring the fullest potential of <code>torch.compile</code>.</p> <p data-svelte-h="svelte-ax8e93">Using SDPA attention and compiling both the UNet and VAE reduces the latency from 3.31 seconds to 2.54 seconds.</p> <div align="center" data-svelte-h="svelte-h7n375"><img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_3.png" width="500"></div> <h2 class="relative group"><a id="combine-the-projection-matrices-of-attention" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#combine-the-projection-matrices-of-attention"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span data-svelte-h="svelte-56ftf5">Combine the projection matrices of attention</span></h2> <p data-svelte-h="svelte-1ok923t">Both the UNet and the VAE used in SDXL make use of Transformer-like blocks. A Transformer block consists of attention blocks and feed-forward blocks.</p> <p data-svelte-h="svelte-1vgoh8g">In an attention block, the input is projected into three sub-spaces using three different projection matrices – Q, K, and V. In the naive implementation, these projections are performed separately on the input. But we can horizontally combine the projection matrices into a single matrix and perform the projection in one shot. This increases the size of the matmuls of the input projections and improves the impact of quantization (to be discussed next).</p> <p data-svelte-h="svelte-1g4675p">Enabling this kind of computation in Diffusers just takes a single line of code:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START -->pipe.fuse_qkv_projections()<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1ip60rj">It provides a minor boost from 2.54 seconds to 2.52 seconds.</p> <div align="center" data-svelte-h="svelte-bdlltg"><img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_4.png" width="500"></div> <div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400"><p data-svelte-h="svelte-izvj6c">Support for <code>fuse_qkv_projections()</code> is limited and experimental. As such, it’s not available for many non-SD pipelines such as <a href="../using-diffusers/kandinsky.md">Kandinsky</a>. You can refer to <a href="https://github.com/huggingface/diffusers/pull/6179" rel="nofollow">this PR</a> to get an idea about how to support this kind of computation.</p></div> <h2 class="relative group"><a id="dynamic-quantization" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#dynamic-quantization"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span data-svelte-h="svelte-1fsz0t7">Dynamic quantization</span></h2> <p data-svelte-h="svelte-k2whf5">Aapply <a href="https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html" rel="nofollow">dynamic int8 quantization</a> to both the UNet and the VAE. This is because quantization adds additional conversion overhead to the model that is hopefully made up for by faster matmuls (dynamic quantization). If the matmuls are too small, these techniques may degrade performance.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1xe2fym">Through experimentation, we found that certain linear layers in the UNet and the VAE don’t benefit from dynamic int8 quantization. You can check out the full code for filtering those layers <a href="https://github.com/huggingface/diffusion-fast/blob/0f169640b1db106fe6a479f78c1ed3bfaeba3386/utils/pipeline_utils.py#L16" rel="nofollow">here</a> (referred to as <code>dynamic_quant_filter_fn</code> below).</p></div> <p data-svelte-h="svelte-mctow8">You will leverage the ultra-lightweight pure PyTorch library <a href="https://github.com/pytorch-labs/ao" rel="nofollow">torchao</a> to use its user-friendly APIs for quantization.</p> <p data-svelte-h="svelte-gvi9jq">First, configure all the compiler tags:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> diffusers <span class="hljs-keyword">import</span> StableDiffusionXLPipeline
<span class="hljs-keyword">import</span> torch
<span class="hljs-comment"># Notice the two new flags at the end.</span>
torch._inductor.config.conv_1x1_as_mm = <span class="hljs-literal">True</span>
torch._inductor.config.coordinate_descent_tuning = <span class="hljs-literal">True</span>
torch._inductor.config.epilogue_fusion = <span class="hljs-literal">False</span>
torch._inductor.config.coordinate_descent_check_all_directions = <span class="hljs-literal">True</span>
torch._inductor.config.force_fuse_int_mm_with_mul = <span class="hljs-literal">True</span>
torch._inductor.config.use_mixed_mm = <span class="hljs-literal">True</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-tgt4ub">Define the filtering functions:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">dynamic_quant_filter_fn</span>(<span class="hljs-params">mod, *args</span>):
<span class="hljs-keyword">return</span> (
<span class="hljs-built_in">isinstance</span>(mod, torch.nn.Linear)
<span class="hljs-keyword">and</span> mod.in_features &gt; <span class="hljs-number">16</span>
<span class="hljs-keyword">and</span> (mod.in_features, mod.out_features)
<span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> [
(<span class="hljs-number">1280</span>, <span class="hljs-number">640</span>),
(<span class="hljs-number">1920</span>, <span class="hljs-number">1280</span>),
(<span class="hljs-number">1920</span>, <span class="hljs-number">640</span>),
(<span class="hljs-number">2048</span>, <span class="hljs-number">1280</span>),
(<span class="hljs-number">2048</span>, <span class="hljs-number">2560</span>),
(<span class="hljs-number">2560</span>, <span class="hljs-number">1280</span>),
(<span class="hljs-number">256</span>, <span class="hljs-number">128</span>),
(<span class="hljs-number">2816</span>, <span class="hljs-number">1280</span>),
(<span class="hljs-number">320</span>, <span class="hljs-number">640</span>),
(<span class="hljs-number">512</span>, <span class="hljs-number">1536</span>),
(<span class="hljs-number">512</span>, <span class="hljs-number">256</span>),
(<span class="hljs-number">512</span>, <span class="hljs-number">512</span>),
(<span class="hljs-number">640</span>, <span class="hljs-number">1280</span>),
(<span class="hljs-number">640</span>, <span class="hljs-number">1920</span>),
(<span class="hljs-number">640</span>, <span class="hljs-number">320</span>),
(<span class="hljs-number">640</span>, <span class="hljs-number">5120</span>),
(<span class="hljs-number">640</span>, <span class="hljs-number">640</span>),
(<span class="hljs-number">960</span>, <span class="hljs-number">320</span>),
(<span class="hljs-number">960</span>, <span class="hljs-number">640</span>),
]
)
<span class="hljs-keyword">def</span> <span class="hljs-title function_">conv_filter_fn</span>(<span class="hljs-params">mod, *args</span>):
<span class="hljs-keyword">return</span> (
<span class="hljs-built_in">isinstance</span>(mod, torch.nn.Conv2d) <span class="hljs-keyword">and</span> mod.kernel_size == (<span class="hljs-number">1</span>, <span class="hljs-number">1</span>) <span class="hljs-keyword">and</span> <span class="hljs-number">128</span> <span class="hljs-keyword">in</span> [mod.in_channels, mod.out_channels]
)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1848xq7">Then apply all the optimizations discussed so far:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START --><span class="hljs-comment"># SDPA + bfloat16.</span>
pipe = StableDiffusionXLPipeline.from_pretrained(
<span class="hljs-string">&quot;stabilityai/stable-diffusion-xl-base-1.0&quot;</span>, torch_dtype=torch.bfloat16
).to(<span class="hljs-string">&quot;cuda&quot;</span>)
<span class="hljs-comment"># Combine attention projection matrices.</span>
pipe.fuse_qkv_projections()
<span class="hljs-comment"># Change the memory layout.</span>
pipe.unet.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1nbn6cl">Since this quantization support is limited to linear layers only, we also turn suitable pointwise convolution layers into linear layers to maximize the benefit.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> torchao <span class="hljs-keyword">import</span> swap_conv2d_1x1_to_linear
swap_conv2d_1x1_to_linear(pipe.unet, conv_filter_fn)
swap_conv2d_1x1_to_linear(pipe.vae, conv_filter_fn)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-13vdbpw">Apply dynamic quantization:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> torchao <span class="hljs-keyword">import</span> apply_dynamic_quant
apply_dynamic_quant(pipe.unet, dynamic_quant_filter_fn)
apply_dynamic_quant(pipe.vae, dynamic_quant_filter_fn)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-145l8jb">Finally, compile and perform inference:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre><!-- HTML_TAG_START -->pipe.unet = torch.<span class="hljs-built_in">compile</span>(pipe.unet, mode=<span class="hljs-string">&quot;max-autotune&quot;</span>, fullgraph=<span class="hljs-literal">True</span>)
pipe.vae.decode = torch.<span class="hljs-built_in">compile</span>(pipe.vae.decode, mode=<span class="hljs-string">&quot;max-autotune&quot;</span>, fullgraph=<span class="hljs-literal">True</span>)
prompt = <span class="hljs-string">&quot;Astronaut in a jungle, cold color palette, muted colors, detailed, 8k&quot;</span>
image = pipe(prompt, num_inference_steps=<span class="hljs-number">30</span>).images[<span class="hljs-number">0</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-od9x3v">Applying dynamic quantization improves the latency from 2.52 seconds to 2.43 seconds.</p> <div align="center" data-svelte-h="svelte-j96t4n"><img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_5.png" width="500"></div>
<script>
{
__sveltekit_1aw1dqo = {
assets: "/docs/diffusers/v0.25.0/pt",
base: "/docs/diffusers/v0.25.0/pt",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/diffusers/v0.25.0/pt/_app/immutable/entry/start.66168f11.js"),
import("/docs/diffusers/v0.25.0/pt/_app/immutable/entry/app.ca5b0bd9.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 148],
data,
form: null,
error: null
});
});
}
</script>

Xet Storage Details

Size:
49.1 kB
·
Xet hash:
25731ec73eafd13c81a2c3b3be7df90315eb00d61e8b1b99386742a2c9e2a050

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.