| # auto-diffusers-docs | |
Still a WIP. Uses an LLM to generate reasonable, hardware-aware code snippets for Diffusers.
## Motivation
Within Diffusers, we support a number of optimization techniques (refer [here](https://huggingface.co/docs/diffusers/main/en/optimization/memory), [here](https://huggingface.co/docs/diffusers/main/en/optimization/cache), and [here](https://huggingface.co/docs/diffusers/main/en/optimization/fp16)). However, it can be
daunting for our users to determine when to use what. Hence, this repository takes a stab
at using an LLM to generate reasonable code snippets for a given pipeline checkpoint that respect
the user's hardware configuration.
| ## Getting started | |
| Install the requirements from `requirements.txt`. | |
| Configure `GOOGLE_API_KEY` in the environment: `export GOOGLE_API_KEY=...`. | |
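A minimal setup sketch, assuming a standard `pip`-based environment (substitute your own API key):

```bash
# Assumes you are at the repository root with a working Python environment.
pip install -r requirements.txt
export GOOGLE_API_KEY=...  # your Gemini API key
```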
| Then run: | |
| ```bash | |
| python e2e_example.py | |
| ``` | |
By default, the `e2e_example.py` script uses FLUX.1-dev, but this can be configured through the `--ckpt_id` argument.
| Full usage: | |
| ```sh | |
| usage: e2e_example.py [-h] [--ckpt_id CKPT_ID] [--gemini_model GEMINI_MODEL] [--variant VARIANT] [--enable_lossy] | |
| options: | |
| -h, --help show this help message and exit | |
| --ckpt_id CKPT_ID Can be a repo id from the Hub or a local path where the checkpoint is stored. | |
| --gemini_model GEMINI_MODEL | |
| Gemini model to use. Choose from https://ai.google.dev/gemini-api/docs/models. | |
| --variant VARIANT If the `ckpt_id` has variants, supply this flag to estimate compute. Example: 'fp16'. | |
| --enable_lossy When enabled, the code will include snippets for enabling quantization. | |
| ``` | |
| ## Example outputs | |
| <details> | |
| <summary>python e2e_example.py (ran on an H100)</summary> | |
| ````sh | |
| System RAM: 1999.99 GB | |
| RAM Category: large | |
| GPU VRAM: 79.65 GB | |
| VRAM Category: large | |
| current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 1999.9855346679688\navailable_gpu_vram_GB: 79.6474609375\nenable_lossy_outputs: False\nenable_torch_compile: True\n' | |
| Sending request to Gemini... | |
| ```python | |
| from diffusers import DiffusionPipeline | |
| import torch | |
| # User-provided information: | |
| # pipeline_loading_memory_GB: 31.424 | |
| # available_system_ram_GB: 1999.9855346679688 (Large RAM) | |
| # available_gpu_vram_GB: 79.6474609375 (Large VRAM) | |
| # enable_lossy_outputs: False | |
| # enable_torch_compile: True | |
| # --- Configuration based on user needs and system capabilities --- | |
| # Placeholder for the actual checkpoint ID | |
| # Please replace this with your desired model checkpoint ID. | |
| CKPT_ID = "black-forest-labs/FLUX.1-dev" | |
| # Determine dtype. bfloat16 is generally recommended for performance on compatible GPUs. | |
| # Ensure your GPU supports bfloat16 for optimal performance. | |
| dtype = torch.bfloat16 | |
| # 1. Pipeline Loading and Device Placement: | |
| # Available VRAM (79.64 GB) is significantly greater than the pipeline's loading memory (31.42 GB). | |
| # Therefore, the entire pipeline can comfortably fit and run on the GPU. | |
| print(f"Loading pipeline '{CKPT_ID}' with {dtype} precision...") | |
| pipe = DiffusionPipeline.from_pretrained(CKPT_ID, torch_dtype=dtype) | |
| print("Moving pipeline to CUDA (GPU) as VRAM is sufficient...") | |
| pipe = pipe.to("cuda") | |
| # 2. Quantization: | |
| # User specified `enable_lossy_outputs: False`, so no quantization is applied. | |
| print("Quantization is NOT applied as per user's preference for lossless outputs.") | |
| # 3. Torch Compile: | |
| # User specified `enable_torch_compile: True`. | |
| # Since no offloading was applied (the entire model is on GPU), we can use `fullgraph=True` | |
| # for potentially greater performance benefits. | |
| print("Applying torch.compile() to the transformer for accelerated inference...") | |
| # The transformer is typically the most compute-intensive part of the diffusion pipeline. | |
| # Compiling it can lead to significant speedups. | |
| pipe.transformer.compile(fullgraph=True) | |
| # --- Inference --- | |
| print("Starting inference...") | |
| prompt = "photo of a dog sitting beside a river, high quality, 4k" | |
| image = pipe(prompt).images[0] | |
| print("Inference completed. Displaying image.") | |
| # Save or display the image | |
| image.save("generated_image.png") | |
| print("Image saved as generated_image.png") | |
| # You can also display the image directly if running in an environment that supports it | |
| # image.show() | |
| ``` | |
| ```` | |
| <br> | |
| </details> | |
| <br> | |
| <details> | |
| <summary>python e2e_example.py --enable_lossy</summary> | |
| ````sh | |
| System RAM: 1999.99 GB | |
| RAM Category: large | |
| GPU VRAM: 79.65 GB | |
| VRAM Category: large | |
| current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 1999.9855346679688\navailable_gpu_vram_GB: 79.6474609375\nenable_lossy_outputs: True\nenable_torch_compile: True\n' | |
| Sending request to Gemini... | |
| ```python | |
| import torch | |
| from diffusers import DiffusionPipeline | |
| from diffusers.quantizers import PipelineQuantizationConfig | |
| import os | |
| # --- User-provided information and derived constants --- | |
| # Checkpoint ID (assuming a placeholder since it was not provided in the user input) | |
| # Using the example CKPT_ID from the problem description | |
| CKPT_ID = "black-forest-labs/FLUX.1-dev" | |
| # Derived from available_gpu_vram_GB (79.64 GB) and pipeline_loading_memory_GB (31.424 GB) | |
| # VRAM is ample to load the entire pipeline | |
| use_cuda_direct_load = True | |
| # Derived from enable_lossy_outputs (True) | |
| enable_quantization = True | |
| # Derived from enable_torch_compile (True) | |
| enable_torch_compile = True | |
| # --- Inference Code --- | |
| print(f"Loading pipeline: {CKPT_ID}") | |
| # 1. Quantization Configuration (since enable_lossy_outputs is True) | |
| quant_config = None | |
| if enable_quantization: | |
| # Default to bitsandbytes 4-bit as per guidance | |
| print("Enabling bitsandbytes 4-bit quantization for 'transformer' component.") | |
| quant_config = PipelineQuantizationConfig( | |
| quant_backend="bitsandbytes_4bit", | |
| quant_kwargs={"load_in_4bit": True, "bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"}, | |
| # For FLUX.1-dev, the main generative component is typically 'transformer'. | |
| # For other pipelines, you might include 'unet', 'text_encoder', 'text_encoder_2', etc. | |
| components_to_quantize=["transformer"] | |
| ) | |
| # 2. Load the Diffusion Pipeline | |
| # Use bfloat16 for better performance and modern GPU compatibility | |
| pipe = DiffusionPipeline.from_pretrained( | |
| CKPT_ID, | |
| torch_dtype=torch.bfloat16, | |
| quantization_config=quant_config if enable_quantization else None | |
| ) | |
| # 3. Move Pipeline to GPU (since VRAM is ample) | |
| if use_cuda_direct_load: | |
| print("Moving the entire pipeline to CUDA (GPU).") | |
| pipe = pipe.to("cuda") | |
| # 4. Apply torch.compile() (since enable_torch_compile is True) | |
| if enable_torch_compile: | |
| print("Applying torch.compile() for speedup.") | |
| # This setting is beneficial when bitsandbytes is used | |
| torch._dynamo.config.capture_dynamic_output_shape_ops = True | |
| # Since no offloading is applied (model fits fully in VRAM), use fullgraph=True | |
| # The primary component for compilation in FLUX.1-dev is 'transformer' | |
| print("Compiling pipe.transformer with fullgraph=True.") | |
| pipe.transformer = torch.compile(pipe.transformer, fullgraph=True) | |
| # 5. Perform Inference | |
| print("Starting image generation...") | |
| prompt = "photo of a dog sitting beside a river" | |
| num_inference_steps = 28 # A reasonable number of steps for good quality | |
| # Ensure all inputs are on the correct device for inference after compilation | |
| with torch.no_grad(): | |
| image = pipe(prompt, num_inference_steps=num_inference_steps).images[0] | |
| print("Image generation complete.") | |
| # Save or display the image | |
| output_path = "generated_image.png" | |
| image.save(output_path) | |
| print(f"Image saved to {output_path}") | |
| ``` | |
| ```` | |
| </details> | |
| <br> | |
When invoked on an RTX 4090, the script outputs:
| <details> | |
| <summary>Expand</summary> | |
| ````sh | |
| System RAM: 125.54 GB | |
| RAM Category: large | |
| GPU VRAM: 23.99 GB | |
| VRAM Category: medium | |
| current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 125.54026794433594\navailable_gpu_vram_GB: 23.98828125\nenable_lossy_outputs: False\nenable_torch_compile: True\n' | |
| Sending request to Gemini... | |
| ```python | |
| import torch | |
| from diffusers import DiffusionPipeline | |
| import os # For creating offload directories if needed, though not directly used in this solution | |
| # --- User-provided information (interpreted) --- | |
| # Checkpoint ID will be a placeholder as it's not provided by the user directly in the input. | |
| # pipeline_loading_memory_GB: 31.424 GB | |
| # available_system_ram_GB: 125.54 GB (Categorized as "large": > 40GB) | |
| # available_gpu_vram_GB: 23.98 GB (Categorized as "medium": > 8GB <= 24GB) | |
| # enable_lossy_outputs: False (User prefers no quantization) | |
| # enable_torch_compile: True (User wants to enable torch.compile) | |
| # --- Configuration --- | |
| # Placeholder for the actual checkpoint ID. Replace with the desired model ID. | |
| CKPT_ID = "black-forest-labs/FLUX.1-dev" # Example from Diffusers library. | |
| PROMPT = "photo of a dog sitting beside a river" | |
| print(f"--- Optimizing inference for CKPT_ID: {CKPT_ID} ---") | |
| print(f"Pipeline loading memory: {31.424} GB") | |
| print(f"Available System RAM: {125.54} GB (Large)") | |
| print(f"Available GPU VRAM: {23.98} GB (Medium)") | |
| print(f"Lossy outputs (quantization): {'Disabled' if not False else 'Enabled'}") | |
| print(f"Torch.compile: {'Enabled' if True else 'Disabled'}") | |
| print("-" * 50) | |
| # --- 1. Load the Diffusion Pipeline --- | |
| # Use bfloat16 for a good balance of memory and performance. | |
| print(f"Loading pipeline '{CKPT_ID}' with torch_dtype=torch.bfloat16...") | |
| pipe = DiffusionPipeline.from_pretrained(CKPT_ID, torch_dtype=torch.bfloat16) | |
| print("Pipeline loaded.") | |
| # --- 2. Apply Memory Optimizations --- | |
| # Analysis: | |
| # - Pipeline memory (31.424 GB) exceeds available GPU VRAM (23.98 GB). | |
| # - System RAM (125.54 GB) is large. | |
| # Strategy: Use `enable_model_cpu_offload()`. This moves model components to CPU when not | |
| # in use, swapping them to GPU on demand. This is ideal when VRAM is insufficient but system | |
| # RAM is abundant. | |
| print("Applying memory optimization: `pipe.enable_model_cpu_offload()`...") | |
| pipe.enable_model_cpu_offload() | |
| print("Model CPU offloading enabled. Components will dynamically move between CPU and GPU.") | |
| # --- 3. Apply Speed Optimizations (torch.compile) --- | |
| # Analysis: | |
| # - `enable_torch_compile` is True. | |
| # - Model offloading (`enable_model_cpu_offload`) is applied. | |
| # Strategy: Enable torch.compile with `recompile_limit` as offloading is used. | |
| # Do not use `fullgraph=True` when offloading is active. | |
| print("Applying speed optimization: `torch.compile()`...") | |
| torch._dynamo.config.recompile_limit = 1000 # Recommended when offloading is applied. | |
| # torch._dynamo.config.capture_dynamic_output_shape_ops = True # Only for bitsandbytes, not applicable here. | |
| # Compile the main computational component (e.g., transformer or unet). | |
| # FLUX models primarily use a transformer. For other models, it might be `pipe.unet`. | |
| if hasattr(pipe, "transformer"): | |
| print("Compiling `pipe.transformer`...") | |
| pipe.transformer.compile() | |
| elif hasattr(pipe, "unet"): | |
| print("Compiling `pipe.unet`...") | |
| pipe.unet.compile() | |
| else: | |
| print("Warning: Neither `pipe.transformer` nor `pipe.unet` found for compilation. Skipping `torch.compile` for core component.") | |
| print("Speed optimizations applied.") | |
| # --- 4. Perform Inference --- | |
| print(f"Starting image generation for prompt: '{PROMPT}'") | |
| # The pipeline handles component swapping automatically due to `enable_model_cpu_offload()`. | |
| image = pipe(PROMPT).images[0] | |
| print("Image generation complete.") | |
| # --- 5. Save or Display the Result --- | |
| output_path = "generated_image.png" | |
| image.save(output_path) | |
| print(f"Generated image saved to '{output_path}'") | |
| print("\n--- Inference process finished successfully ---") | |
| ``` | |
| ```` | |
| </details> | |
| ### More outputs | |
| <details> | |
| <summary>"Wan-AI/Wan2.1-T2V-14B-Diffusers" with lossy outputs enabled</summary> | |
| ````sh | |
| System RAM: 125.54 GB | |
| RAM Category: large | |
| GPU VRAM: 23.99 GB | |
| VRAM Category: medium | |
| ("current_generate_prompt='\\nckpt_id: " | |
| 'Wan-AI/Wan2.1-T2V-14B-Diffusers\\npipeline_loading_memory_GB: ' | |
| '37.432\\navailable_system_ram_GB: ' | |
| '125.54026794433594\\navailable_gpu_vram_GB: ' | |
| '23.98828125\\nenable_lossy_outputs: True\\nis_fp8_supported: ' | |
| "True\\nenable_torch_compile: True\\n'") | |
| Sending request to Gemini... | |
| ```python | |
| from diffusers import DiffusionPipeline | |
| from diffusers.quantizers import PipelineQuantizationConfig | |
| import torch | |
| ckpt_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers" | |
| quant_config = PipelineQuantizationConfig( | |
| quant_backend="torchao", | |
| quant_kwargs={"quant_type": "float8dq_e4m3_row"}, | |
| components_to_quantize=["transformer"] | |
| ) | |
| pipe = DiffusionPipeline.from_pretrained(ckpt_id, quantization_config=quant_config, torch_dtype=torch.bfloat16) | |
| # Apply model CPU offload due to VRAM constraints | |
| pipe.enable_model_cpu_offload() | |
| # torch.compile() configuration | |
| torch._dynamo.config.recompile_limit = 1000 | |
| pipe.transformer.compile() | |
| # pipe.vae.decode = torch.compile(pipe.vae.decode) # Uncomment if you want to compile VAE decode as well | |
| prompt = "photo of a dog sitting beside a river" | |
| # Modify the pipe call arguments as needed. | |
| image = pipe(prompt).images[0] | |
| # You can save the image or perform further operations here | |
| # image.save("generated_image.png") | |
| ``` | |
| ```` | |
| </details> | |
| <small>Ran on an RTX 4090</small> |