| # auto-diffusers-docs | |
Still a WIP. Uses an LLM to generate reasonable, hardware-aware code snippets for Diffusers.
## Motivation
Within Diffusers, we support a number of optimization techniques (refer [here](https://huggingface.co/docs/diffusers/main/en/optimization/memory), [here](https://huggingface.co/docs/diffusers/main/en/optimization/cache), and [here](https://huggingface.co/docs/diffusers/main/en/optimization/fp16)). However, it can be
daunting for our users to determine when to use what. Hence, this repository takes a stab
at using an LLM to generate reasonable code snippets for a given pipeline checkpoint that respect
the user's hardware configuration.
| ## Getting started | |
| Install the requirements from `requirements.txt`. | |
| Configure `GOOGLE_API_KEY` in the environment: `export GOOGLE_API_KEY=...`. | |
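A minimal setup sketch, assuming a standard `pip`-based environment (substitute your own API key):

```bash
# Assumes you are at the repository root with a working Python environment.
pip install -r requirements.txt
export GOOGLE_API_KEY=...  # your Gemini API key
```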
| Then run: | |
| ```bash | |
| python e2e_example.py | |
| ``` | |
By default, the `e2e_example.py` script uses FLUX.1-dev, but this can be configured through the `--ckpt_id` argument.
| Full usage: | |
| ```sh | |
| usage: e2e_example.py [-h] [--ckpt_id CKPT_ID] [--gemini_model GEMINI_MODEL] [--variant VARIANT] [--enable_lossy] | |
| options: | |
| -h, --help show this help message and exit | |
| --ckpt_id CKPT_ID Can be a repo id from the Hub or a local path where the checkpoint is stored. | |
| --gemini_model GEMINI_MODEL | |
| Gemini model to use. Choose from https://ai.google.dev/gemini-api/docs/models. | |
| --variant VARIANT If the `ckpt_id` has variants, supply this flag to estimate compute. Example: 'fp16'. | |
| --enable_lossy When enabled, the code will include snippets for enabling quantization. | |
| ``` | |
| ## Example outputs | |
| <details> | |
| <summary>python e2e_example.py (ran on an H100)</summary> | |
| ````sh | |
| System RAM: 1999.99 GB | |
| RAM Category: large | |
| GPU VRAM: 79.65 GB | |
| VRAM Category: large | |
| current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 1999.9855346679688\navailable_gpu_vram_GB: 79.6474609375\nenable_lossy_outputs: False\nenable_torch_compile: True\n' | |
| Sending request to Gemini... | |
| ```python | |
| from diffusers import DiffusionPipeline | |
| import torch | |
| # User-provided information: | |
| # pipeline_loading_memory_GB: 31.424 | |
| # available_system_ram_GB: 1999.9855346679688 (Large RAM) | |
| # available_gpu_vram_GB: 79.6474609375 (Large VRAM) | |
| # enable_lossy_outputs: False | |
| # enable_torch_compile: True | |
| # --- Configuration based on user needs and system capabilities --- | |
| # Placeholder for the actual checkpoint ID | |
| # Please replace this with your desired model checkpoint ID. | |
| CKPT_ID = "black-forest-labs/FLUX.1-dev" | |
| # Determine dtype. bfloat16 is generally recommended for performance on compatible GPUs. | |
| # Ensure your GPU supports bfloat16 for optimal performance. | |
| dtype = torch.bfloat16 | |
| # 1. Pipeline Loading and Device Placement: | |
| # Available VRAM (79.64 GB) is significantly greater than the pipeline's loading memory (31.42 GB). | |
| # Therefore, the entire pipeline can comfortably fit and run on the GPU. | |
| print(f"Loading pipeline '{CKPT_ID}' with {dtype} precision...") | |
| pipe = DiffusionPipeline.from_pretrained(CKPT_ID, torch_dtype=dtype) | |
| print("Moving pipeline to CUDA (GPU) as VRAM is sufficient...") | |
| pipe = pipe.to("cuda") | |
| # 2. Quantization: | |
| # User specified `enable_lossy_outputs: False`, so no quantization is applied. | |
| print("Quantization is NOT applied as per user's preference for lossless outputs.") | |
| # 3. Torch Compile: | |
| # User specified `enable_torch_compile: True`. | |
| # Since no offloading was applied (the entire model is on GPU), we can use `fullgraph=True` | |
| # for potentially greater performance benefits. | |
| print("Applying torch.compile() to the transformer for accelerated inference...") | |
| # The transformer is typically the most compute-intensive part of the diffusion pipeline. | |
| # Compiling it can lead to significant speedups. | |
| pipe.transformer.compile(fullgraph=True) | |
| # --- Inference --- | |
| print("Starting inference...") | |
| prompt = "photo of a dog sitting beside a river, high quality, 4k" | |
| image = pipe(prompt).images[0] | |
| print("Inference completed. Displaying image.") | |
| # Save or display the image | |
| image.save("generated_image.png") | |
| print("Image saved as generated_image.png") | |
| # You can also display the image directly if running in an environment that supports it | |
| # image.show() | |
| ``` | |
| ```` | |
| <br> | |
| </details> | |
| <br> | |
| <details> | |
| <summary>python e2e_example.py --enable_lossy</summary> | |
| ````sh | |
| System RAM: 1999.99 GB | |
| RAM Category: large | |
| GPU VRAM: 79.65 GB | |
| VRAM Category: large | |
| current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 1999.9855346679688\navailable_gpu_vram_GB: 79.6474609375\nenable_lossy_outputs: True\nenable_torch_compile: True\n' | |
| Sending request to Gemini... | |
| ```python | |
| import torch | |
| from diffusers import DiffusionPipeline | |
| from diffusers.quantizers import PipelineQuantizationConfig | |
| import os | |
| # --- User-provided information and derived constants --- | |
| # Checkpoint ID (assuming a placeholder since it was not provided in the user input) | |
| # Using the example CKPT_ID from the problem description | |
| CKPT_ID = "black-forest-labs/FLUX.1-dev" | |
| # Derived from available_gpu_vram_GB (79.64 GB) and pipeline_loading_memory_GB (31.424 GB) | |
| # VRAM is ample to load the entire pipeline | |
| use_cuda_direct_load = True | |
| # Derived from enable_lossy_outputs (True) | |
| enable_quantization = True | |
| # Derived from enable_torch_compile (True) | |
| enable_torch_compile = True | |
| # --- Inference Code --- | |
| print(f"Loading pipeline: {CKPT_ID}") | |
| # 1. Quantization Configuration (since enable_lossy_outputs is True) | |
| quant_config = None | |
| if enable_quantization: | |
| # Default to bitsandbytes 4-bit as per guidance | |
| print("Enabling bitsandbytes 4-bit quantization for 'transformer' component.") | |
| quant_config = PipelineQuantizationConfig( | |
| quant_backend="bitsandbytes_4bit", | |
| quant_kwargs={"load_in_4bit": True, "bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"}, | |
| # For FLUX.1-dev, the main generative component is typically 'transformer'. | |
| # For other pipelines, you might include 'unet', 'text_encoder', 'text_encoder_2', etc. | |
| components_to_quantize=["transformer"] | |
| ) | |
| # 2. Load the Diffusion Pipeline | |
| # Use bfloat16 for better performance and modern GPU compatibility | |
| pipe = DiffusionPipeline.from_pretrained( | |
| CKPT_ID, | |
| torch_dtype=torch.bfloat16, | |
| quantization_config=quant_config if enable_quantization else None | |
| ) | |
| # 3. Move Pipeline to GPU (since VRAM is ample) | |
| if use_cuda_direct_load: | |
| print("Moving the entire pipeline to CUDA (GPU).") | |
| pipe = pipe.to("cuda") | |
| # 4. Apply torch.compile() (since enable_torch_compile is True) | |
| if enable_torch_compile: | |
| print("Applying torch.compile() for speedup.") | |
| # This setting is beneficial when bitsandbytes is used | |
| torch._dynamo.config.capture_dynamic_output_shape_ops = True | |
| # Since no offloading is applied (model fits fully in VRAM), use fullgraph=True | |
| # The primary component for compilation in FLUX.1-dev is 'transformer' | |
| print("Compiling pipe.transformer with fullgraph=True.") | |
| pipe.transformer = torch.compile(pipe.transformer, fullgraph=True) | |
| # 5. Perform Inference | |
| print("Starting image generation...") | |
| prompt = "photo of a dog sitting beside a river" | |
| num_inference_steps = 28 # A reasonable number of steps for good quality | |
| # Ensure all inputs are on the correct device for inference after compilation | |
| with torch.no_grad(): | |
| image = pipe(prompt, num_inference_steps=num_inference_steps).images[0] | |
| print("Image generation complete.") | |
| # Save or display the image | |
| output_path = "generated_image.png" | |
| image.save(output_path) | |
| print(f"Image saved to {output_path}") | |
| ``` | |
| ```` | |
| </details> | |
| <br> | |
When invoked on an RTX 4090, the script outputs:
| <details> | |
| <summary>Expand</summary> | |
| ````sh | |
| System RAM: 125.54 GB | |
| RAM Category: large | |
| GPU VRAM: 23.99 GB | |
| VRAM Category: medium | |
| current_generate_prompt='\npipeline_loading_memory_GB: 31.424\navailable_system_ram_GB: 125.54026794433594\navailable_gpu_vram_GB: 23.98828125\nenable_lossy_outputs: False\nenable_torch_compile: True\n' | |
| Sending request to Gemini... | |
| ```python | |
| import torch | |
| from diffusers import DiffusionPipeline | |
| import os # For creating offload directories if needed, though not directly used in this solution | |
| # --- User-provided information (interpreted) --- | |
| # Checkpoint ID will be a placeholder as it's not provided by the user directly in the input. | |
| # pipeline_loading_memory_GB: 31.424 GB | |
| # available_system_ram_GB: 125.54 GB (Categorized as "large": > 40GB) | |
| # available_gpu_vram_GB: 23.98 GB (Categorized as "medium": > 8GB <= 24GB) | |
| # enable_lossy_outputs: False (User prefers no quantization) | |
| # enable_torch_compile: True (User wants to enable torch.compile) | |
| # --- Configuration --- | |
| # Placeholder for the actual checkpoint ID. Replace with the desired model ID. | |
| CKPT_ID = "black-forest-labs/FLUX.1-dev" # Example from Diffusers library. | |
| PROMPT = "photo of a dog sitting beside a river" | |
| print(f"--- Optimizing inference for CKPT_ID: {CKPT_ID} ---") | |
| print(f"Pipeline loading memory: {31.424} GB") | |
| print(f"Available System RAM: {125.54} GB (Large)") | |
| print(f"Available GPU VRAM: {23.98} GB (Medium)") | |
| print(f"Lossy outputs (quantization): {'Disabled' if not False else 'Enabled'}") | |
| print(f"Torch.compile: {'Enabled' if True else 'Disabled'}") | |
| print("-" * 50) | |
| # --- 1. Load the Diffusion Pipeline --- | |
| # Use bfloat16 for a good balance of memory and performance. | |
| print(f"Loading pipeline '{CKPT_ID}' with torch_dtype=torch.bfloat16...") | |
| pipe = DiffusionPipeline.from_pretrained(CKPT_ID, torch_dtype=torch.bfloat16) | |
| print("Pipeline loaded.") | |
| # --- 2. Apply Memory Optimizations --- | |
| # Analysis: | |
| # - Pipeline memory (31.424 GB) exceeds available GPU VRAM (23.98 GB). | |
| # - System RAM (125.54 GB) is large. | |
| # Strategy: Use `enable_model_cpu_offload()`. This moves model components to CPU when not | |
| # in use, swapping them to GPU on demand. This is ideal when VRAM is insufficient but system | |
| # RAM is abundant. | |
| print("Applying memory optimization: `pipe.enable_model_cpu_offload()`...") | |
| pipe.enable_model_cpu_offload() | |
| print("Model CPU offloading enabled. Components will dynamically move between CPU and GPU.") | |
| # --- 3. Apply Speed Optimizations (torch.compile) --- | |
| # Analysis: | |
| # - `enable_torch_compile` is True. | |
| # - Model offloading (`enable_model_cpu_offload`) is applied. | |
| # Strategy: Enable torch.compile with `recompile_limit` as offloading is used. | |
| # Do not use `fullgraph=True` when offloading is active. | |
| print("Applying speed optimization: `torch.compile()`...") | |
| torch._dynamo.config.recompile_limit = 1000 # Recommended when offloading is applied. | |
| # torch._dynamo.config.capture_dynamic_output_shape_ops = True # Only for bitsandbytes, not applicable here. | |
| # Compile the main computational component (e.g., transformer or unet). | |
| # FLUX models primarily use a transformer. For other models, it might be `pipe.unet`. | |
| if hasattr(pipe, "transformer"): | |
| print("Compiling `pipe.transformer`...") | |
| pipe.transformer.compile() | |
| elif hasattr(pipe, "unet"): | |
| print("Compiling `pipe.unet`...") | |
| pipe.unet.compile() | |
| else: | |
| print("Warning: Neither `pipe.transformer` nor `pipe.unet` found for compilation. Skipping `torch.compile` for core component.") | |
| print("Speed optimizations applied.") | |
| # --- 4. Perform Inference --- | |
| print(f"Starting image generation for prompt: '{PROMPT}'") | |
| # The pipeline handles component swapping automatically due to `enable_model_cpu_offload()`. | |
| image = pipe(PROMPT).images[0] | |
| print("Image generation complete.") | |
| # --- 5. Save or Display the Result --- | |
| output_path = "generated_image.png" | |
| image.save(output_path) | |
| print(f"Generated image saved to '{output_path}'") | |
| print("\n--- Inference process finished successfully ---") | |
| ``` | |
| ```` | |
| </details> | |
| ### More outputs | |
| <details> | |
| <summary>"Wan-AI/Wan2.1-T2V-14B-Diffusers" with lossy outputs enabled</summary> | |
| ````sh | |
| System RAM: 125.54 GB | |
| RAM Category: large | |
| GPU VRAM: 23.99 GB | |
| VRAM Category: medium | |
| ("current_generate_prompt='\\nckpt_id: " | |
| 'Wan-AI/Wan2.1-T2V-14B-Diffusers\\npipeline_loading_memory_GB: ' | |
| '37.432\\navailable_system_ram_GB: ' | |
| '125.54026794433594\\navailable_gpu_vram_GB: ' | |
| '23.98828125\\nenable_lossy_outputs: True\\nis_fp8_supported: ' | |
| "True\\nenable_torch_compile: True\\n'") | |
| Sending request to Gemini... | |
| ```python | |
| from diffusers import DiffusionPipeline | |
| from diffusers.quantizers import PipelineQuantizationConfig | |
| import torch | |
| ckpt_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers" | |
| quant_config = PipelineQuantizationConfig( | |
| quant_backend="torchao", | |
| quant_kwargs={"quant_type": "float8dq_e4m3_row"}, | |
| components_to_quantize=["transformer"] | |
| ) | |
| pipe = DiffusionPipeline.from_pretrained(ckpt_id, quantization_config=quant_config, torch_dtype=torch.bfloat16) | |
| # Apply model CPU offload due to VRAM constraints | |
| pipe.enable_model_cpu_offload() | |
| # torch.compile() configuration | |
| torch._dynamo.config.recompile_limit = 1000 | |
| pipe.transformer.compile() | |
| # pipe.vae.decode = torch.compile(pipe.vae.decode) # Uncomment if you want to compile VAE decode as well | |
| prompt = "photo of a dog sitting beside a river" | |
| # Modify the pipe call arguments as needed. | |
| image = pipe(prompt).images[0] | |
| # You can save the image or perform further operations here | |
| # image.save("generated_image.png") | |
| ``` | |
| ```` | |
| </details> | |
| <small>Ran on an RTX 4090</small> |