| | Stable Diffusion XL Int8 Quantization |
| | ======================================= |
| |
|
| | This example shows how to use ModelOpt to calibrate and quantize the UNet part of the SDXL. The UNet part typically consumes |
| | >95% of the e2e Stable Diffusion latency. |
| |
|
| | We also provide instructions on deploying and running E2E SDXL pipeline |
| | with ModelOpt quantized int8 UNet to generate images and measure latency on target GPUs. |
| |
|
| | To get started, you need a pretrained SDXL checkpoint in NeMo format. Example training configurations are available here: Stable Diffusion Examples `<https://github.com/NVIDIA/NeMo/tree/main/examples/multimodal/text_to_image/stable_diffusion>`_. |
| |
|
| |
|
| | Calibration |
| | |
| | The first step is to run the quantization script with the default config. The script will export the quantized unet to the onnx file. |
| | Here is the default config for SDXL Quantize Script: `<https://github.com/NVIDIA/NeMo/blob/main/examples/multimodal/text_to_image/stable_diffusion/sd_xl_quantize.py>`_. |
| |
|
| |
|
| | .. code-block:: yaml |
| |
|
| | quantize |
| | exp_name: nemo |
| | n_steps: 20 # number of inference steps |
| | format: 'int8' # only int8 quantization is supported now |
| | percentile: 1.0 # Control quantization scaling factors (amax) collecting range, meaning that we will collect the minimum amax in the range of `(n_steps * percentile)` steps. Recommendation: 1.0 |
| | batch_size: 1 # batch size calling sdxl inference pipeline during calibration |
| | calib_size: 32 # For SDXL, we recommend 32, 64 or 128 |
| | quant_level: 2.5 #Which layers to be quantized, 1: `CNNs`, 2: `CNN + FFN`, 2.5: `CNN + FFN + QKV`, 3: `CNN + Linear`. Recommendation: 2, 2.5 and 3, depending on the requirements for image quality & speedup. |
| | alpha: 0.8 # A parameter in SmoothQuant, used for linear layers only. Recommendation: 0.8 for SDXL |
| |
|
| |
|
| |
|
| | Important Parameters |
| | ^^^^^^^^^^^^^^^^^^^^ |
| | - percentile: Controls the range for collecting quantization scaling factors (amax). This means we will collect the minimum amax over a range of (n_steps * percentile) steps. We recommend 1.0. |
| | - alpha: A parameter in SmoothQuant, used exclusively for linear layers only. We recommend 0.8 for SDXL and 1.0 for SD 1.5. |
| | - quant-level: Which layers to be quantized, 1: CNNs, 2: CNN + FFN, 2.5: CNN + FFN + QKV, 3: CNN + Linear. Recommendation: 2, 2.5 and 3, depending on the requirements for image quality & speedup. |
| | - calib-size: For SDXL, we recommend 32, 64 or 128, for SD 1.5 and set to 512 or 1024. |
| |
|
| |
|
| | Build the TRT engine for the Quantized ONNX UNet |
| | |
| |
|
| | .. code-block:: bash |
| |
|
| | trtexec |
| |
|
| |
|
| | Important Parameters |
| | ^^^^^^^^^^^^^^^^^^^^ |
| | When building the TRT engine, you must provide the input shape as follows: |
| |
|
| | - x: Input image latent shape (B * C * H * W) |
| | - context: Input text conditioning (B * S * hidden_dimension) |
| | - y: Additional embedding (B * adm_in_channels) |
| |
|
| | Build End-to-end Stable Diffusion XL Pipeline with NeMo |
| | |
| |
|
| | We provide a script to build an end-to-end TRT inference pipeline with the NeMo backend here: SDXL Export Script `<https://github.com/NVIDIA/NeMo/blob/main/examples/multimodal/text_to_image/stable_diffusion/sd_xl_export.py>`_. |
| |
|
| | .. code-block:: yaml |
| |
|
| | infer: |
| | out_path: sdxl_export |
| | width: 1024 |
| | height: 1024 |
| | batch_size: 2 |
| |
|
| | trt: |
| | static_batch: False |
| | min_batch_size: 1 |
| | max_batch_size: 8 |
| |
|
| | Important Parameters |
| | ^^^^^^^^^^^^^^^^^^^^ |
| | - out_path: Directory to save onnx file and TRT engine files. |
| | - width and height: Image resolution of inference output. |
| | - batch_size: Only used for dummy input generation and onnx sanity check. |
| | - {min,max}_batch_size: The input batch size of TRT engine along its dynamic axis. |
| |
|
| |
|
| | Run End-to-End Stable Diffusion XL TRT Pipeline |
| | |
| |
|
| | The inference script can be found here: SDXL TRT Inference Script `<https://github.com/NVIDIA/NeMo/blob/main/examples/multimodal/text_to_image/stable_diffusion/sd_xl_trt_inference.py>`_. |
| |
|
| | .. code-block:: yaml |
| |
|
| | unet_xl: sdxl_export/plan/unet_xl.plan |
| | vae: sdxl_export/plan/vae.plan |
| | clip1: sdxl_export/plan/clip1.plan |
| | clip2: sdxl_export/plan/clip2.plan |
| |
|
| | out_path: trt_output |
| |
|
| |
|
| | Please specify unet_xl as the quantized Unet engine to run the quantized solution. The system will load the original engine file by default. |
| |
|
| | Inference Speedup |
| | |
| | TRT version 9.3.0 |
| | GPU: H100 |
| |
|
| | TRT int8 vs Framework fp16 |
| | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
|
| | + |
| | | Pipeline | Batch Size | Latency (ms)| Pipeline | Batch Size | Latency | Speedup | |
| | +=====================+============+=============+====================+============+=========+============+ |
| | | Framework fp16 base | 1 | 3056.01 | ModelOpt TRT Int8 | 1 | 1406.68 | 2.172498365| |
| | + |
| | | Framework fp16 base | 2 | 4832.24 | ModelOpt TRT Int8 | 2 | 2403.29 | 2.01067703 | |
| | + |
| | | Framework fp16 base | 4 | 8433.71 | ModelOpt TRT Int8 | 4 | 4252.6 | 1.983189108| |
| | + |
| |
|
| |
|
| |
|
| | TRT int8 vs TRT fp16 |
| | ^^^^^^^^^^^^^^^^^^^^^^^ |
| |
|
| |
|
| | + |
| | | Pipeline | Batch Size | Latency (ms) | Precision | Batch Size | Latency | Speedup | |
| | +=============+============+==============+===============+============+============+=============+ |
| | | fp16 base | 1 | 1723.97 | ModelOpt Int8 | 1 | 1406.68 | 1.225559473 | |
| | + |
| | | fp16 base | 2 | 3004.47 | ModelOpt Int8 | 2 | 2403.29 | 1.250148754 | |
| | + |
| | | fp16 base | 4 | 5657.19 | ModelOpt Int8 | 4 | 4252.6 | 1.330289705 | |
| | + |
| |
|
| |
|
| | FP16 inference vs Int8 inference |
| | |
| |
|
| | .. image:: https://github.com/NVIDIA/NeMo/releases/download/v1.23.0/asset-githubio-home-sdxl_trt_fp16_1.png |
| | :width: 50% |
| | .. image:: https://github.com/NVIDIA/NeMo/releases/download/v1.23.0/asset-githubio-home-sdxl_trt_int8_1.png |
| | :width: 50% |
| |
|
| | Prompt: A photo of a Shiba Inu dog with a backpack riding a bike. It is wearing sunglasses and a beach hat. (FP16 upper vs Int8 lower) |
| |
|
| | .. image:: https://github.com/NVIDIA/NeMo/releases/download/v1.23.0/asset-githubio-home-sdxl_trt_fp16_2.png |
| | :width: 50% |
| | .. image:: https://github.com/NVIDIA/NeMo/releases/download/v1.23.0/asset-githubio-home-sdxl_trt_int8_2.png |
| | :width: 50% |
| |
|
| | Prompt: A cute corgi lives in a house made out of sushi. (FP16 upper vs Int8 lower) |
| |
|
| | .. image:: https://github.com/NVIDIA/NeMo/releases/download/v1.23.0/asset-githubio-home-sdxl_trt_fp16_3.png |
| | :width: 50% |
| | .. image:: https://github.com/NVIDIA/NeMo/releases/download/v1.23.0/asset-githubio-home-sdxl_trt_int8_3.png |
| | :width: 50% |
| |
|
| | Prompt: A high contrast portrait of a very happy fuzzy panda dressed as a chef in a high end kitchen making dough. There is a painting of flowers on the wall behind him. (FP16 upper vs Int8 lower) |
| |
|
| |
|