# Diffusers Pipeline Integration Guide
## Overview
This guide covers the complete process of integrating custom CUDA kernels into HuggingFace Diffusers pipelines. It includes model architecture analysis, custom processor creation, kernel injection, and verification.
## Model Architecture Analysis
Before writing any kernels, you must understand the target model's architecture. The key operations to identify are:
1. **Normalization layers** (RMSNorm, LayerNorm, GroupNorm)
2. **Activation functions** (GELU, GEGLU, SiLU)
3. **Attention mechanisms** (self-attention, cross-attention)
4. **Linear projections** (QKV projections, feed-forward)
5. **Positional encodings** (RoPE, learned, sinusoidal)
### Inspecting a Diffusers Model
```python
from diffusers import LTXPipeline
import torch
pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video",
    torch_dtype=torch.bfloat16
)

# List all module types
module_types = set()
for name, module in pipe.transformer.named_modules():
    module_types.add(type(module).__name__)

print("Module types found:")
for t in sorted(module_types):
    print(f"  - {t}")

# Count occurrences of each type
from collections import Counter

type_counts = Counter(
    type(m).__name__ for _, m in pipe.transformer.named_modules()
)
print("\nModule counts:")
for name, count in type_counts.most_common():
    print(f"  {name}: {count}")
```
### Identifying Optimization Targets
Look for operations that are:
1. **Frequently called** -- operations inside the main transformer blocks
2. **Memory-bound** -- normalization, activations, and element-wise operations
3. **Fusible** -- adjacent operations that can be combined into a single kernel
```python
# Trace the model to understand the call graph
with torch.no_grad():
    # Create dummy inputs matching the model's expected shapes
    sample = torch.randn(1, 16, 32, 32, dtype=torch.bfloat16, device="cuda")
    timestep = torch.tensor([500.0], device="cuda")
    encoder_hidden_states = torch.randn(1, 77, 2048, dtype=torch.bfloat16, device="cuda")

    # Profile with PyTorch profiler
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        output = pipe.transformer(sample, timestep, encoder_hidden_states)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```
## LTX-Video Architecture
LTX-Video is a video generation model based on a Transformer architecture. Here is its structure:
```
LTXVideoTransformer3DModel
β”œβ”€β”€ patch_embed # Patch embedding (Conv3D)
β”œβ”€β”€ time_embed # Timestep embedding
β”œβ”€β”€ transformer_blocks # Main transformer blocks (N layers)
β”‚ β”œβ”€β”€ norm1 # RMSNorm (pre-attention)
β”‚ β”œβ”€β”€ attn1 # Self-attention
β”‚ β”‚ β”œβ”€β”€ to_q # Linear (query projection)
β”‚ β”‚ β”œβ”€β”€ to_k # Linear (key projection)
β”‚ β”‚ β”œβ”€β”€ to_v # Linear (value projection)
β”‚ β”‚ β”œβ”€β”€ to_out[0] # Linear (output projection)
β”‚ β”‚ └── processor # Attention processor
β”‚ β”œβ”€β”€ norm2 # RMSNorm (pre-FFN)
β”‚ β”œβ”€β”€ ff # Feed-forward network
β”‚ β”‚ β”œβ”€β”€ net[0] # GELU activation (NOT GEGLU!)
β”‚ β”‚ └── net[2] # Linear (down projection)
β”‚ β”œβ”€β”€ norm3 # RMSNorm (pre-cross-attention, if present)
β”‚ └── attn2 # Cross-attention (if present)
β”œβ”€β”€ norm_out # Final RMSNorm
└── proj_out # Output projection
```
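The tree above can be spot-checked programmatically with `nn.Module.get_submodule`. A minimal sketch (the helper name `check_submodule_paths` is ours, and the stand-in model below only mimics part of the real structure; run it against `pipe.transformer` with paths from the tree above):

```python
import torch.nn as nn

def check_submodule_paths(model: nn.Module, paths):
    """Return the subset of dotted module paths that exist on `model`."""
    found = []
    for path in paths:
        try:
            model.get_submodule(path)
            found.append(path)
        except AttributeError:
            # get_submodule raises AttributeError for missing paths
            pass
    return found

# Demo with a tiny stand-in model mirroring part of the tree above
model = nn.Module()
model.transformer_blocks = nn.ModuleList([nn.Module()])
model.transformer_blocks[0].norm1 = nn.LayerNorm(16)

present = check_submodule_paths(
    model, ["transformer_blocks.0.norm1", "transformer_blocks.0.attn1"]
)
print(present)  # ['transformer_blocks.0.norm1']
```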
### Key Observations for LTX-Video
| Feature | Detail |
|---|---|
| Normalization | RMSNorm (diffusers version) |
| Activation | GELU (not GEGLU!) |
| Attention | Standard scaled dot-product |
| Positional encoding | RoPE (Rotary Position Embedding) |
| Precision | BF16 preferred |
| Hidden sizes | Varies by model variant (1024, 2048, etc.) |
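As a numerical baseline for the kernels discussed below, the diffusers-style RMSNorm is a few lines of plain PyTorch. This is a reference sketch (the name `rmsnorm_reference` is ours, not a diffusers API); it mirrors the variance-then-rsqrt formulation used in the fallback paths later in this guide:

```python
import torch

def rmsnorm_reference(x: torch.Tensor, weight=None, eps: float = 1e-6) -> torch.Tensor:
    """RMSNorm over the last dim: x / sqrt(mean(x^2) + eps), optionally scaled by weight."""
    variance = x.pow(2).mean(-1, keepdim=True)
    x = x * torch.rsqrt(variance + eps)
    if weight is not None:
        # Note: for LTX-Video, weight MAY be None (see table above)
        x = x * weight
    return x

x = torch.randn(2, 4, 8)
out = rmsnorm_reference(x, weight=torch.ones(8))
print(out.shape)  # torch.Size([2, 4, 8])
```

A custom CUDA kernel should match this reference to within BF16 tolerance before it is injected.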
## Custom Attention Processor
Diffusers uses a processor pattern for attention. You can replace the default processor with a custom one that uses your optimized kernels.
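Before swapping processors in, it helps to see which modules accept one. A small sketch (the helper name `find_processor_modules` is ours) using the same `set_processor` attribute check the injection code below relies on; the dummy class stands in for a real diffusers `Attention` module:

```python
import torch.nn as nn

def find_processor_modules(model: nn.Module):
    """List names of submodules that expose a `set_processor` method."""
    return [
        name for name, module in model.named_modules()
        if hasattr(module, "set_processor")
    ]

# Demo with a dummy module standing in for a diffusers Attention layer
class DummyAttention(nn.Module):
    def set_processor(self, processor):
        self.processor = processor

block = nn.Module()
block.attn1 = DummyAttention()
block.norm1 = nn.LayerNorm(8)
print(find_processor_modules(block))  # ['attn1']
```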
### OptimizedLTXVideoAttnProcessor
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional
class OptimizedLTXVideoAttnProcessor:
    """
    Custom attention processor for LTX-Video that uses optimized CUDA kernels.

    This replaces the default attention computation with:
    1. Fused QKV projection (optional)
    2. RoPE application via custom kernel
    3. Flash Attention or custom attention kernel
    4. Fused output projection (optional)
    """

    def __init__(
        self,
        use_custom_rope: bool = True,
        use_custom_softmax: bool = False,
    ):
        self.use_custom_rope = use_custom_rope
        self.use_custom_softmax = use_custom_softmax

    def __call__(
        self,
        attn,  # The Attention module
        hidden_states: torch.Tensor,  # [batch, seq_len, hidden_dim]
        encoder_hidden_states: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        temb: Optional[torch.Tensor] = None,
        image_rotary_emb: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # Determine if this is self-attention or cross-attention
        is_cross_attention = encoder_hidden_states is not None
        input_states = encoder_hidden_states if is_cross_attention else hidden_states
        batch_size, seq_len, _ = hidden_states.shape

        # QKV projections
        query = attn.to_q(hidden_states)
        key = attn.to_k(input_states)
        value = attn.to_v(input_states)

        # Reshape for multi-head attention
        # (derive head_dim from the projection width; Attention modules do not
        # reliably expose a head_dim attribute across diffusers versions)
        num_heads = attn.heads
        head_dim = query.shape[-1] // num_heads
        query = query.view(batch_size, -1, num_heads, head_dim).transpose(1, 2)
        key = key.view(batch_size, -1, num_heads, head_dim).transpose(1, 2)
        value = value.view(batch_size, -1, num_heads, head_dim).transpose(1, 2)

        # Apply RoPE if provided
        if image_rotary_emb is not None:
            if self.use_custom_rope:
                # Use custom CUDA RoPE kernel
                query = cuda_apply_rope(query, image_rotary_emb)
                key = cuda_apply_rope(key, image_rotary_emb)
            else:
                # Fallback to PyTorch implementation
                query = apply_rotary_emb(query, image_rotary_emb)
                key = apply_rotary_emb(key, image_rotary_emb)

        # Scaled dot-product attention
        # PyTorch's SDPA will use Flash Attention on H100 when available
        attn_output = F.scaled_dot_product_attention(
            query, key, value,
            attn_mask=attention_mask,
            dropout_p=0.0,
            is_causal=False,
        )

        # Reshape back
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(batch_size, -1, num_heads * head_dim)

        # Output projection
        attn_output = attn.to_out[0](attn_output)
        # Dropout (if any)
        if len(attn.to_out) > 1:
            attn_output = attn.to_out[1](attn_output)

        return attn_output
```
### Setting the Processor
```python
def set_custom_attention_processors(model):
    """Replace attention processors in all attention modules."""
    count = 0
    for name, module in model.named_modules():
        if hasattr(module, 'set_processor'):
            # Set the processor directly on each module; set_attn_processor
            # would require dict keys ending in ".processor", not module names
            module.set_processor(OptimizedLTXVideoAttnProcessor(
                use_custom_rope=True,
                use_custom_softmax=False,
            ))
            count += 1
    print(f"Set {count} custom attention processors")
    return model
```
## RMSNorm Module Patcher
The RMSNorm patcher replaces the forward method of all RMSNorm modules with a custom CUDA implementation.
```python
import torch
import torch.nn as nn
from diffusers.models.normalization import RMSNorm

class RMSNormPatcher:
    """
    Patches RMSNorm modules in a model to use custom CUDA kernels.

    Usage:
        patcher = RMSNormPatcher()
        patcher.patch(model)
        # ... run inference ...
        patcher.unpatch(model)  # Restore original behavior
    """

    def __init__(self, cuda_rmsnorm_fn=None, cuda_rmsnorm_no_weight_fn=None):
        """
        Args:
            cuda_rmsnorm_fn: Custom CUDA function for weighted RMSNorm.
                Signature: (input: Tensor, weight: Tensor, eps: float) -> Tensor
            cuda_rmsnorm_no_weight_fn: Custom CUDA function for unweighted RMSNorm.
                Signature: (input: Tensor, eps: float) -> Tensor
        """
        self.cuda_rmsnorm = cuda_rmsnorm_fn
        self.cuda_rmsnorm_no_weight = cuda_rmsnorm_no_weight_fn
        self._original_forwards = {}

    def patch(self, model: nn.Module) -> int:
        """
        Patch all RMSNorm modules in the model.

        Returns:
            Number of modules patched.
        """
        count = 0
        for name, module in model.named_modules():
            if isinstance(module, RMSNorm):
                # Save original forward
                self._original_forwards[name] = module.forward
                # Create patched forward
                module.forward = self._make_patched_forward(module)
                count += 1
        return count

    def unpatch(self, model: nn.Module) -> int:
        """Restore original forward methods."""
        count = 0
        for name, module in model.named_modules():
            if name in self._original_forwards:
                module.forward = self._original_forwards[name]
                count += 1
        self._original_forwards.clear()
        return count

    def _make_patched_forward(self, module):
        """Create a patched forward function for an RMSNorm module."""
        cuda_fn = self.cuda_rmsnorm
        cuda_fn_no_weight = self.cuda_rmsnorm_no_weight

        def patched_forward(hidden_states: torch.Tensor) -> torch.Tensor:
            # Handle the case where weight is None
            if module.weight is None:
                if cuda_fn_no_weight is not None:
                    return cuda_fn_no_weight(hidden_states, module.eps)
                else:
                    # Fallback: manual RMSNorm without weight
                    variance = hidden_states.pow(2).mean(-1, keepdim=True)
                    return hidden_states * torch.rsqrt(variance + module.eps)
            else:
                if cuda_fn is not None:
                    return cuda_fn(hidden_states, module.weight, module.eps)
                else:
                    # Fallback: manual RMSNorm with weight
                    variance = hidden_states.pow(2).mean(-1, keepdim=True)
                    hidden_states = hidden_states * torch.rsqrt(variance + module.eps)
                    return hidden_states * module.weight

        return patched_forward
```
## Kernel Injection Function
The main injection function ties everything together:
```python
import torch
import torch.nn as nn
from typing import Optional, Dict, Any

def inject_custom_kernels(
    model: nn.Module,
    kernel_config: Optional[Dict[str, Any]] = None,
) -> nn.Module:
    """
    Inject optimized CUDA kernels into a diffusers model.

    This function patches:
    1. RMSNorm layers -> custom CUDA RMSNorm
    2. Attention processors -> custom attention processor
    3. Activation functions -> custom CUDA activations

    Args:
        model: A diffusers model (e.g., pipe.transformer)
        kernel_config: Optional configuration dict with keys:
            - 'rmsnorm': bool (default True)
            - 'attention': bool (default True)
            - 'activation': bool (default True)
            - 'rope': bool (default True)

    Returns:
        The patched model (modified in place)
    """
    config = {
        'rmsnorm': True,
        'attention': True,
        'activation': True,
        'rope': True,
    }
    if kernel_config:
        config.update(kernel_config)

    stats = {'rmsnorm': 0, 'attention': 0, 'activation': 0}

    # Step 1: Patch RMSNorm
    if config['rmsnorm']:
        patcher = RMSNormPatcher(
            cuda_rmsnorm_fn=cuda_rmsnorm,
            cuda_rmsnorm_no_weight_fn=cuda_rmsnorm_no_weight,
        )
        stats['rmsnorm'] = patcher.patch(model)

    # Step 2: Patch attention processors
    if config['attention']:
        for name, module in model.named_modules():
            if hasattr(module, 'set_processor'):
                processor = OptimizedLTXVideoAttnProcessor(
                    use_custom_rope=config['rope'],
                )
                module.set_processor(processor)
                stats['attention'] += 1

    # Step 3: Patch activation functions
    if config['activation']:
        stats['activation'] = _patch_activations(model)

    print("Kernel injection complete:")
    print(f"  RMSNorm layers patched: {stats['rmsnorm']}")
    print(f"  Attention processors patched: {stats['attention']}")
    print(f"  Activation functions patched: {stats['activation']}")
    return model

def _patch_activations(model: nn.Module) -> int:
    """Patch activation functions with custom CUDA kernels."""
    count = 0
    for name, module in model.named_modules():
        # Check for GELU activation; replace forward with the custom kernel
        if isinstance(module, nn.GELU):
            module.forward = cuda_gelu
            count += 1
    return count
```
## Model-Specific Differences
Different diffusion models have different architectures. Here are the key differences:
### LTX-Video
```python
# Architecture: Transformer-based video model
# Normalization: RMSNorm (diffusers)
# Activation: GELU (plain, NOT GEGLU)
# Attention: Standard multi-head with RoPE
# Positional encoding: 3D RoPE (spatial + temporal)
# Weight: RMSNorm weight MAY be None

def inject_ltx_video(pipe):
    model = pipe.transformer
    inject_custom_kernels(model, {
        'rmsnorm': True,
        'attention': True,
        'activation': True,  # Patches GELU
        'rope': True,
    })
```
### Stable Diffusion 3 (SD3)
```python
# Architecture: MMDiT (Multi-Modal Diffusion Transformer)
# Normalization: RMSNorm and AdaLayerNorm
# Activation: GEGLU (gated, NOT plain GELU)
# Attention: Joint attention (text + image in same sequence)
# Positional encoding: Learned + RoPE

def inject_sd3(pipe):
    model = pipe.transformer
    inject_custom_kernels(model, {
        'rmsnorm': True,
        'attention': True,
        'activation': True,  # Patches GEGLU
        'rope': True,
    })
    # SD3-specific: Also patch AdaLayerNorm if supported
    _patch_adalayernorm(model)
```
### FLUX
```python
# Architecture: Similar to SD3 (MMDiT variant)
# Normalization: RMSNorm
# Activation: GEGLU
# Attention: Joint attention with different block structure
# Positional encoding: RoPE
# Note: FLUX has single-stream and double-stream blocks

def inject_flux(pipe):
    model = pipe.transformer
    inject_custom_kernels(model, {
        'rmsnorm': True,
        'attention': True,
        'activation': True,  # Patches GEGLU
        'rope': True,
    })
```
### Comparison Table
| Feature | LTX-Video | SD3 | FLUX |
|---|---|---|---|
| Normalization | RMSNorm | RMSNorm + AdaLN | RMSNorm |
| Activation | GELU | GEGLU | GEGLU |
| Attention type | Standard | Joint (MMDiT) | Joint (MMDiT) |
| RoPE | 3D (spatial+temporal) | 2D (spatial) | 2D (spatial) |
| Weight may be None | Yes | Rare | No |
| `set_processor` | Yes | Yes | Yes |
| Block structure | Uniform | Uniform | Single + Double stream |
| Hidden sizes | 1024-2048 | 1536-4096 | 3072 |
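Given these differences, a thin dispatcher can pick the right injection routine from the pipeline's class name. A sketch, assuming the `inject_*` helpers from the snippets above are in scope (the registry contents and fallback behavior here are illustrative):

```python
def inject_for_pipeline(pipe, registry=None):
    """Dispatch to a model-specific injection routine by pipeline class name."""
    # Default registry built from the helpers defined above; pass your own
    # mapping to cover additional pipelines
    registry = registry or {
        "LTXPipeline": inject_ltx_video,
        "StableDiffusion3Pipeline": inject_sd3,
        "FluxPipeline": inject_flux,
    }
    name = type(pipe).__name__
    if name not in registry:
        raise ValueError(f"No kernel injection registered for {name}")
    return registry[name](pipe)
```

Usage: `inject_for_pipeline(pipe)` after loading, or `inject_for_pipeline(pipe, registry={...})` to override the mapping.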
## Verification Steps
After injecting kernels, verify correctness and performance.
### Step 1: Numerical Correctness
```python
import torch

def verify_correctness(pipe, rtol=1e-2, atol=1e-3):
    """
    Verify that custom kernels produce numerically correct output.
    """
    device = "cuda"
    dtype = torch.bfloat16

    # Generate reference output without custom kernels
    torch.manual_seed(42)
    pipe_ref = load_pipeline()  # Fresh pipeline
    pipe_ref.to(device)
    with torch.no_grad():
        ref_output = pipe_ref(
            "a cat sitting on a mat",
            num_inference_steps=2,
            output_type="latent",
            generator=torch.Generator(device).manual_seed(42),
        ).images

    # Generate output with custom kernels
    torch.manual_seed(42)
    inject_custom_kernels(pipe.transformer)
    with torch.no_grad():
        test_output = pipe(
            "a cat sitting on a mat",
            num_inference_steps=2,
            output_type="latent",
            generator=torch.Generator(device).manual_seed(42),
        ).images

    # Compare
    max_diff = (ref_output - test_output).abs().max().item()
    mean_diff = (ref_output - test_output).abs().mean().item()
    print(f"Max absolute difference: {max_diff:.6f}")
    print(f"Mean absolute difference: {mean_diff:.6f}")

    is_close = torch.allclose(ref_output, test_output, rtol=rtol, atol=atol)
    print(f"Numerically close (rtol={rtol}, atol={atol}): {is_close}")
    return is_close
```
### Step 2: Module-Level Verification
```python
def verify_rmsnorm(hidden_size=2048, batch_size=4, eps=1e-6):
    """Verify custom RMSNorm against PyTorch reference."""
    from diffusers.models.normalization import RMSNorm

    # Create reference module
    ref_norm = RMSNorm(hidden_size, eps=eps).cuda().to(torch.bfloat16)

    # Test input
    x = torch.randn(batch_size, 128, hidden_size, dtype=torch.bfloat16, device="cuda")

    # Reference output
    with torch.no_grad():
        ref_out = ref_norm(x)

    # Custom kernel output
    with torch.no_grad():
        custom_out = cuda_rmsnorm(x, ref_norm.weight, eps)

    max_diff = (ref_out - custom_out).abs().max().item()
    print(f"RMSNorm max diff: {max_diff:.8f}")
    assert max_diff < 1e-2, f"RMSNorm verification failed: max_diff={max_diff}"
    print("RMSNorm verification PASSED")
```
### Step 3: Performance Verification
```python
import time
import torch

def verify_performance(pipe, num_runs=5, num_steps=20):
    """
    Verify that custom kernels provide a speedup.
    """
    prompt = "A beautiful sunset over the ocean, 4K, detailed"
    generator = torch.Generator("cuda").manual_seed(42)

    # Warmup
    pipe(prompt, num_inference_steps=2, generator=generator)

    # Benchmark
    times = []
    for i in range(num_runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        pipe(
            prompt,
            num_inference_steps=num_steps,
            generator=torch.Generator("cuda").manual_seed(42),
        )
        torch.cuda.synchronize()
        end = time.perf_counter()
        times.append(end - start)

    avg_time = sum(times) / len(times)
    std_time = (sum((t - avg_time) ** 2 for t in times) / len(times)) ** 0.5
    print(f"Average time: {avg_time:.3f}s +/- {std_time:.3f}s")
    print(f"Per-step time: {avg_time / num_steps * 1000:.1f}ms")
    return avg_time
```
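To turn timings from `verify_performance` into a speedup figure, run it once on a baseline pipeline and once on a patched one, then compare. A minimal sketch (the helper name `summarize_speedup` is ours):

```python
from statistics import mean, pstdev

def summarize_speedup(baseline_times, optimized_times):
    """Report mean times and the speedup factor from two timing lists (seconds)."""
    base, opt = mean(baseline_times), mean(optimized_times)
    speedup = base / opt
    print(f"Baseline:  {base:.3f}s +/- {pstdev(baseline_times):.3f}s")
    print(f"Optimized: {opt:.3f}s +/- {pstdev(optimized_times):.3f}s")
    print(f"Speedup:   {speedup:.2f}x")
    return speedup

# Example with made-up timings
speedup = summarize_speedup([2.0, 2.1, 1.9], [1.0, 1.1, 0.9])
print(round(speedup, 2))  # 2.0
```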
### Step 4: Edge Case Testing
```python
def test_edge_cases():
    """Test edge cases that commonly cause issues."""
    # Test 1: RMSNorm with None weight
    print("Test 1: RMSNorm with None weight...")
    x = torch.randn(2, 64, 2048, dtype=torch.bfloat16, device="cuda")
    try:
        result = cuda_rmsnorm_no_weight(x, 1e-6)
        assert result.shape == x.shape
        print("  PASSED")
    except Exception as e:
        print(f"  FAILED: {e}")

    # Test 2: Non-contiguous input
    print("Test 2: Non-contiguous input...")
    x = torch.randn(2, 2048, 64, dtype=torch.bfloat16, device="cuda").transpose(1, 2)
    w = torch.randn(2048, dtype=torch.bfloat16, device="cuda")
    assert not x.is_contiguous()
    try:
        result = cuda_rmsnorm(x.contiguous(), w, 1e-6)
        print("  PASSED")
    except Exception as e:
        print(f"  FAILED: {e}")

    # Test 3: Very small hidden size
    print("Test 3: Small hidden size (64)...")
    x = torch.randn(2, 128, 64, dtype=torch.bfloat16, device="cuda")
    w = torch.randn(64, dtype=torch.bfloat16, device="cuda")
    try:
        result = cuda_rmsnorm(x, w, 1e-6)
        assert result.shape == x.shape
        print("  PASSED")
    except Exception as e:
        print(f"  FAILED: {e}")

    # Test 4: Very large hidden size
    print("Test 4: Large hidden size (8192)...")
    x = torch.randn(1, 32, 8192, dtype=torch.bfloat16, device="cuda")
    w = torch.randn(8192, dtype=torch.bfloat16, device="cuda")
    try:
        result = cuda_rmsnorm(x, w, 1e-6)
        assert result.shape == x.shape
        print("  PASSED")
    except Exception as e:
        print(f"  FAILED: {e}")

    # Test 5: Single element batch
    print("Test 5: Single element batch...")
    x = torch.randn(1, 1, 2048, dtype=torch.bfloat16, device="cuda")
    w = torch.randn(2048, dtype=torch.bfloat16, device="cuda")
    try:
        result = cuda_rmsnorm(x, w, 1e-6)
        assert result.shape == x.shape
        print("  PASSED")
    except Exception as e:
        print(f"  FAILED: {e}")

    print("\nAll edge case tests complete.")
```
## Complete Integration Example
```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

def main():
    # Load pipeline
    pipe = LTXPipeline.from_pretrained(
        "Lightricks/LTX-Video",
        torch_dtype=torch.bfloat16,
    )

    # Step 1: Inject custom kernels BEFORE moving to device or enabling offloading
    inject_custom_kernels(pipe.transformer)

    # Step 2: Move to device or enable offloading
    pipe.enable_model_cpu_offload()

    # Step 3: Optionally compile for additional speedup
    pipe.transformer = torch.compile(pipe.transformer, mode="reduce-overhead")

    # Step 4: Run inference
    output = pipe(
        "A time-lapse of a flower blooming",
        num_inference_steps=30,
        num_frames=16,
    )

    # Step 5: Save output (frames are a list of images, not a video object)
    export_to_video(output.frames[0], "output.mp4", fps=8)

if __name__ == "__main__":
    main()
```