
Training ControlNet Brightness for SDXL - Feasibility Analysis

Executive Summary

Training a brightness ControlNet for SDXL is technically feasible and recommended as the critical upgrade path from SD 1.5 to SDXL for QR code generation. This model is essential because no public SDXL brightness ControlNet exists.

Key Estimates (Updated December 2024 - Single H100 GPU):

  • Time: 45 minutes (99k samples) to 24 hours (3M samples) on a single H100
  • Cost: ~$2 (99k samples) to ~$60 (3M samples) in GPU credits
  • Platform: Lightning.ai, with optional Pro plan ($20/month for multi-GPU)
  • Priority: High - enables SDXL migration for QR code generation
  • Complexity: Medium - well-documented training pipeline with reference implementation

Recommended Path:

  • Start with a single H100 for 99k samples (~45 min, ~$2)
  • If successful, optionally upgrade to the Pro plan for faster 3M training
  • Total investment: ~$2-$82 depending on training size and plan choice

Background Context

Current Implementation (SD 1.5)

  • Location: app.py:1880-1886, 2343-2349
  • Model: control_v1p_sd15_brightness.safetensors from latentcat/latentcat-controlnet
  • Purpose: Controls QR code pattern visibility via brightness conditioning
  • Critical: Essential for QR code readability - cannot be removed

Why SDXL Brightness ControlNet is Needed

  1. No Public Alternative: No SDXL-equivalent brightness ControlNet exists on HuggingFace
  2. Migration Blocker: Current SD 1.5 brightness ControlNet incompatible with SDXL architecture
  3. QR Readability: Brightness control is core to balancing aesthetic quality with QR scannability
  4. Flux is Too Heavy: SDXL is the practical upgrade path (Flux requires 32-40GB VRAM)

Flux Model Landscape (Updated Analysis)

Flux Schnell (Apache 2.0 License)

  • License: Fully open for commercial use - no restrictions
  • Architecture: Same 12B parameters as Flux Dev, but distilled for speed (3× faster)
  • Quality: Lower than Dev due to aggressive distillation trading detail for speed
  • VRAM: Still requires 32-40GB (same as Dev)
  • ControlNet Status: ⚠️ No existing ControlNet models or training scripts
  • Training Risk: Would require adapting Flux Dev training script - pioneering work
  • Community: Active requests for Schnell ControlNets but no official releases

Flux Dev (Non-Commercial License)

  • License: Non-commercial only - cannot be used for commercial QR code generation
  • ControlNet Status: ✅ Extensive support (XLabs-AI, InstantX collections)
  • Training Scripts: Available from XLabs-AI and HuggingFace Diffusers
  • Quality: Superior to Schnell, but license restrictions make it unsuitable

Flux Pro (Commercial API)

  • License: API-only, commercial pricing
  • Status: Not suitable for self-hosted training

Assessment: While Flux Schnell has an attractive license, the lack of proven ControlNet training pipeline makes it high-risk. SDXL remains the proven, practical choice.

Hardware Selection & Platform Strategy

Lightning.ai Pricing Tiers (December 2024)

Lightning.ai offers different tiers with varying multi-GPU capabilities:

| Plan | Cost | Multi-GPU | Max GPUs | Credits Included | Best For |
|------|------|-----------|----------|------------------|----------|
| Free | $0 | ❌ No | 1 | 15/month | Quick 99k test |
| Pro | $20/month (annual) | ✅ Yes | 6 | 240/year (~$13/mo) | Recommended |
| Teams | $119/month (annual) | ✅ Yes | 12 | 600/year | Large teams |

Pro Plan Benefits:

  • Only $20/month if paid annually ($240/year vs $600 monthly)
  • Includes 240 credits/year = ~$13 of free GPU time
  • Net cost: ~$7/month after credits
  • Multi-GPU training up to 6 GPUs
  • Can cancel after training completes

GPU Comparison Analysis (Lightning.ai)

Single GPU Performance:

| GPU | TFLOPs | Memory | Cost/hr | 99k Time | 99k Cost | 3M Time | 3M Cost |
|-----|--------|--------|---------|----------|----------|---------|---------|
| A100 | 312 | 40GB | ~$1.50 | 4-6 hours | $6-9 | 120-180 hours | $180-270 |
| H100 | 1979 | 80GB | ~$2.50 | 45 min | $1.88 | 24 hours | $60 |

Cost Efficiency:

  • H100 is 6.3× faster than A100 (1979 vs 312 TFLOPs)
  • H100 costs 1.67× more per hour on Lightning.ai
  • Net result: ~3.8× better cost efficiency
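The efficiency figures above follow from simple arithmetic. A quick sketch using the table's estimates (the hourly rates are the plan's Lightning.ai approximations, not quoted prices):

```python
# Cost-efficiency comparison using the figures from the table above.
a100_tflops, a100_rate = 312, 1.50    # TFLOPs, ~$/hr (plan's estimates)
h100_tflops, h100_rate = 1979, 2.50

speedup = h100_tflops / a100_tflops          # ~6.3x faster
price_ratio = h100_rate / a100_rate          # ~1.67x more per hour
efficiency_gain = speedup / price_ratio      # ~3.8x better cost efficiency

print(f"H100 speedup:        {speedup:.1f}x")
print(f"H100 price ratio:    {price_ratio:.2f}x")
print(f"Net cost efficiency: {efficiency_gain:.1f}x")
```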

Single vs Multi-GPU: Should You Get Pro Plan?

Option A: Free Plan (Single H100)

| Training Size | Duration | GPU Cost | Total Cost | Timeline |
|---------------|----------|----------|------------|----------|
| 99k samples | 45 min | $1.88 | $1.88 | Same day |
| 500k samples | 4 hours | $10 | $10 | Same day |
| 3M samples | 24 hours | $60 | $60 | 1-2 days |

Pros:

  • ✅ $0 subscription cost
  • ✅ Very cheap for 99k testing
  • ✅ Good for one-off training

Cons:

  • ❌ 24 hours for 3M training (must babysit)
  • ❌ Can't test multiple hyperparameters quickly
  • ❌ Limited to 15 free credits/month

Option B: Pro Plan (6× H100)

| Training Size | Duration | GPU Cost | Subscription | Total Cost | Timeline |
|---------------|----------|----------|--------------|------------|----------|
| 99k samples | 7.5 min | $1.88 | $20 | $21.88 | Minutes |
| 500k samples | 40 min | $10 | $20 | $30 | Same hour |
| 3M samples | 4 hours | $60 | $20 | $80 | Same day |

Multi-GPU total cost stays the same because:

  • 6× GPUs = 6× faster
  • 6× GPUs = 6× more expensive per hour
  • Net: same total GPU cost, much faster completion
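Spelled out, the invariance holds because cost is hours × GPU count × hourly rate, and the GPU count cancels (assuming near-linear scaling, as the plan does):

```python
# GPU cost is invariant to GPU count under linear scaling:
# cost = (hours / n) * n * rate_per_gpu_hour = hours * rate.
rate_per_gpu_hr = 2.50      # ~$/hr per H100 (plan's estimate)
single_gpu_hours = 24       # 3M samples on one H100

def total_cost(n_gpus: int) -> float:
    hours = single_gpu_hours / n_gpus   # n GPUs finish n times faster
    return hours * n_gpus * rate_per_gpu_hr

print(total_cost(1))  # 60.0
print(total_cost(6))  # 60.0 - same cost, ~4 hours instead of 24
```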

Pros:

  • ✅ 3M training finishes in 4 hours (vs 24 hours)
  • ✅ Can test 3-4 hyperparameter configs in one day
  • ✅ Includes 240 credits/year (~$13 value)
  • ✅ Real net cost: $7/month after credits
  • ✅ Can cancel after training is done

Cons:

  • ❌ $20 upfront cost (annual commitment)

Recommendation Matrix

If you're doing ONE 99k training run:

  • ✅ Use Free tier ($1.88 total, 45 min)
  • Skip Pro plan - not worth $20 for 7.5 min vs 45 min

If you're doing 500k OR 3M training:

  • ✅ Get Pro plan ($20/month)
  • 3M: 4 hours vs 24 hours = worth it
  • Can test multiple configs same day
  • Net cost after credits: ~$7/month

If you're doing multiple experiments:

  • ✅ Definitely get Pro plan
  • Test 99k + 500k + 3M all in one day
  • Total time: ~5 hours vs 30+ hours
  • Total cost: $20 + ~$72 GPU = $92
  • Cancel Pro after training complete

Most Cost-Effective Strategy:

  1. Start with Free tier for 99k test ($1.88, 45 min)
  2. If results promising, upgrade to Pro for 3M training
  3. Run full training in 4 hours
  4. Cancel Pro after done
  5. Total: $20 Pro + $60 GPU + $1.88 test = $81.88

Updated Training Timeline Estimates

Single H100 (Free Tier):

| Training Size | Duration | Total Cost | When to Use |
|---------------|----------|------------|-------------|
| 99k samples | 45 min | $1.88 | Quick validation, hyperparameter testing |
| 500k samples | 4 hours | $10 | Medium quality, budget option |
| 3M samples | 24 hours | $60 | Max quality, have patience |

6× H100 (Pro Plan at $20/month):

| Training Size | Duration | Total Cost | When to Use |
|---------------|----------|------------|-------------|
| 99k samples | 7.5 min | $21.88 | Ultra-fast iteration |
| 500k samples | 40 min | $30 | Production ready, same day |
| 3M samples | 4 hours | $80 | Best quality, same day results |

Training Strategy

Dataset: latentcat/grayscale_image_aesthetic_3M

  • Size: 3 million images at 512×512 resolution
  • Format: Parquet files with image/conditioning_image/text columns
  • Same Dataset: Used for original SD 1.5 brightness ControlNet training
  • License: Latent Cat (check license before commercial use)
  • Quality: Pre-processed grayscale images with aesthetic filtering

Reference Training Results (from latentcat article)

| Configuration | Samples | Hardware | Duration | Cost Estimate |
|---------------|---------|----------|----------|---------------|
| Original SD 1.5 | 100k | A6000 | 13 hours | ~$20 (est.) |
| Original SD 1.5 | 3M | TPU v4-8 | 25 hours | N/A (TPU) |

SDXL Training Scaling Estimates

Updated Based on Latentcat Article:

  • Training at 512×512 resolution (NOT 1024×1024) - matches the dataset and the original training
  • SDXL has a larger UNet architecture (~2.5GB vs 1.7GB for SD 1.5)
  • Expected slowdown: 2-3× compared to SD 1.5 training

Time Estimates for 99k Training Samples (Lightning.ai Single H100):

Calculation Methodology

Baseline Reference:

  • Latentcat article: 100k samples on A6000 = 13 hours (SD 1.5)
  • SDXL overhead: 13h × 2.5 (larger architecture) = ~32.5 hours for 100k
  • A6000 ≈ A100 in performance (~300-312 TFLOPs)

Scaling to H100:

  • A100: 312 TFLOPs → ~4-6 hours for 99k samples
  • H100: 1979 TFLOPs → 6.3× faster
  • H100 single GPU: ~38-57 minutes for 99k samples
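The chain of estimates can be written out directly. Inputs are the plan's assumptions (the 4-6 hour A100 estimate for 99k SDXL samples and published TFLOPs figures), not benchmarks:

```python
# H100 time estimate derived from the plan's A100 figures (assumptions,
# not measurements; real throughput depends on batch size and I/O).
a100_tflops, h100_tflops = 312, 1979
h100_speedup = h100_tflops / a100_tflops          # ~6.3x

a100_hours_99k = (4, 6)                           # plan's A100 estimate for 99k
h100_minutes_99k = [h * 60 / h100_speedup for h in a100_hours_99k]
print(f"Single H100, 99k samples: "
      f"~{h100_minutes_99k[0]:.0f}-{h100_minutes_99k[1]:.0f} min")  # ~38-57 min
```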

Multi-GPU Scaling (Pro Plan):

  • 6× H100 GPUs = 6× faster = ~7.5 minutes for 99k
  • Total cost stays the same (6× faster but 6× more expensive per hour)

Recommended Configurations

πŸ† OPTION 1: Free Tier (Single H100) - Best for Testing

  • 99k samples: 45 min, $1.88
  • 500k samples: 4 hours, $10
  • 3M samples: 24 hours, $60
  • Best for: One-off training, budget-conscious, have patience

🚀 OPTION 2: Pro Plan (6× H100) - Best for Production

  • Subscription: $20/month (annual), includes $13 credits = $7 net cost
  • 99k samples: 7.5 min, $21.88 total ($1.88 GPU + $20 sub)
  • 500k samples: 40 min, $30 total ($10 GPU + $20 sub)
  • 3M samples: 4 hours, $80 total ($60 GPU + $20 sub)
  • Best for: Multiple experiments, 3M training, need results same day

Cost Comparison Summary:

| Scenario | Free Tier | Pro Plan | Savings (Pro) |
|----------|-----------|----------|---------------|
| Single 99k test | $1.88 | $21.88 | ❌ $20 more |
| Single 3M training | $60 | $80 | ❌ $20 more |
| 99k + 500k + 3M | $71.88 (30 hours) | $92 (5 hours) | ✅ Saves 25 hours |
| 3+ experiments | $71.88+ (30+ hours) | $92 (5-6 hours) | ✅ Saves 24+ hours |

Recommendation:

  • For single 99k test: Use Free Tier (not worth $20 for speed)
  • For 3M training: Consider Pro (4 hrs vs 24 hrs = big difference)
  • For multiple runs: Definitely Pro (can test everything in one day)

Technical Implementation Plan

Dataset Verification Script

Create this script to verify dataset before training:

```bash
cat > verify_dataset.py << 'EOF'
#!/usr/bin/env python3
"""
Dataset verification script for SDXL ControlNet Brightness training.
Downloads a subset of the dataset and verifies structure.

Usage: python verify_dataset.py
"""

from datasets import load_dataset
from PIL import Image
import sys

def verify_dataset():
    print("=" * 60)
    print("SDXL ControlNet Brightness - Dataset Verification")
    print("=" * 60)

    print("\n[1/4] Loading dataset subset (99k samples)...")
    print("This will download ~10-15GB to cache...")

    try:
        train_dataset = load_dataset(
            "latentcat/grayscale_image_aesthetic_3M",
            split="train[:99000]",
            cache_dir="~/.cache/huggingface/datasets"
        )
        print(f"✅ Successfully loaded {len(train_dataset)} samples")
    except Exception as e:
        print(f"❌ Failed to load dataset: {e}")
        sys.exit(1)

    print("\n[2/4] Verifying dataset structure...")
    expected_columns = {"image", "conditioning_image", "text"}
    actual_columns = set(train_dataset.column_names)

    if actual_columns == expected_columns:
        print(f"✅ Columns correct: {train_dataset.column_names}")
    else:
        print(f"❌ Column mismatch!")
        print(f"   Expected: {expected_columns}")
        print(f"   Got: {actual_columns}")
        sys.exit(1)

    print("\n[3/4] Checking sample data...")
    sample = train_dataset[0]

    # Check images
    if isinstance(sample['image'], Image.Image):
        img_size = sample['image'].size
        print(f"✅ Image type: PIL.Image, size: {img_size}")
    else:
        print(f"❌ Unexpected image type: {type(sample['image'])}")

    if isinstance(sample['conditioning_image'], Image.Image):
        cond_size = sample['conditioning_image'].size
        print(f"✅ Conditioning image type: PIL.Image, size: {cond_size}")
    else:
        print(f"❌ Unexpected conditioning image type: {type(sample['conditioning_image'])}")

    if isinstance(sample['text'], str):
        caption_len = len(sample['text'])
        print(f"✅ Caption type: str, length: {caption_len} chars")
        print(f"   Sample caption: '{sample['text'][:100]}...'")
    else:
        print(f"❌ Unexpected caption type: {type(sample['text'])}")

    print("\n[4/4] Checking validation split (last 1000 samples)...")
    try:
        # IMPORTANT: Always use the last 1000 samples for validation
        # This ensures consistent validation across all training sizes
        val_dataset = load_dataset(
            "latentcat/grayscale_image_aesthetic_3M",
            split="train[2999000:3000000]",
            cache_dir="~/.cache/huggingface/datasets"
        )
        print(f"✅ Validation split loaded: {len(val_dataset)} samples")
        print(f"   Validation uses: train[2999000:3000000] (last 1k)")
    except Exception as e:
        print(f"❌ Failed to load validation split: {e}")
        sys.exit(1)

    print("\n" + "=" * 60)
    print("✅ ALL CHECKS PASSED!")
    print("=" * 60)
    print(f"\nDataset cached at: ~/.cache/huggingface/datasets/")
    print(f"Training samples: {len(train_dataset)}")
    print(f"Validation samples: {len(val_dataset)}")
    print(f"\n⚠️  IMPORTANT: Validation always uses samples 2,999,000-2,999,999")
    print(f"   This ensures consistent validation across all training sizes")
    print(f"   (99k, 500k, 3M all use same validation set)")
    print(f"\nYou can now proceed with training!")
    print("The training script will automatically use this cached data.")

if __name__ == "__main__":
    verify_dataset()
EOF
```

Make executable and run:

```bash
chmod +x verify_dataset.py
python verify_dataset.py
```

Expected output: should confirm the dataset structure and cache the first 99k training samples plus the last 1k validation slice.

Manual Preparation Checklist (Do This First!)

Split into two phases to minimize GPU costs:


Part A: Local Preparation (BEFORE Launching GPU Instance)

Do these steps on your local machine or any CPU instance - no GPU needed, $0 cost:

Step 1: Get Your Authentication Tokens

Prepare these before launching GPU:

  • HuggingFace access token (needed for huggingface-cli login in Step 5 of Part B)
  • Weights & Biases API key (needed for wandb login in Step 5 of Part B)

Save these somewhere - you'll need them on the GPU instance.

Step 2: Prepare Dataset Verification Script Locally

The full verify_dataset.py script is provided in the "Dataset Verification Script" section above (under Technical Implementation Plan).

You can either:

  • Copy that script to a file on your local machine, OR
  • Recreate it directly on the GPU instance in Part B below

No need to prepare this locally if you prefer to create it on the GPU instance.


Part B: GPU Instance Setup (AFTER Launching GPU, BEFORE Training)

Complete these steps on your GPU instance to avoid wasting GPU credits on training failures:

Estimated time: 30-60 minutes (mostly dataset download). GPU credits used: ~$0.75-$1.50 (30-60 min @ ~$1.50/hr for an A100).

Step 1: System Dependencies

```bash
# Update system packages
sudo apt-get update && sudo apt-get install -y git git-lfs build-essential

# Initialize Git LFS
git lfs install
```

Step 2: Python Environment with CUDA

```bash
# Install PyTorch with CUDA 11.8 (requires GPU instance!)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install core ML libraries
pip install diffusers transformers accelerate datasets

# Install utilities
pip install huggingface_hub pillow wandb xformers bitsandbytes
```

Step 3: Verify CUDA (Critical!)

```bash
# Verify CUDA availability - MUST show "True"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name(0)}')"
```

Expected output:

```
CUDA available: True
CUDA version: 11.8
GPU: NVIDIA A100-SXM4-40GB
```

If CUDA shows False: Stop and troubleshoot before proceeding!

Step 4: Clone Training Repository

```bash
# Clone HuggingFace diffusers
git clone https://github.com/huggingface/diffusers.git
cd diffusers/examples/controlnet

# Verify training script exists
ls -la train_controlnet_sdxl.py  # Should show the file
```

Step 5: Authentication Setup

```bash
# Login to HuggingFace (use token from Part A)
huggingface-cli login
# Paste your token when prompted

# Login to Weights & Biases (use API key from Part A)
wandb login
# Paste your API key when prompted
```

Step 6: Dataset Verification (CRITICAL!)

```bash
# Create the verify_dataset.py script using the code from the
# "Dataset Verification Script" section at the top of this plan
# (see the lines after the "Technical Implementation Plan" heading)

# Once created, run it:
chmod +x verify_dataset.py
python verify_dataset.py
```

Expected output:

```
============================================================
SDXL ControlNet Brightness - Dataset Verification
============================================================

[1/4] Loading dataset subset (99k samples)...
This will download ~10-15GB to cache...
✅ Successfully loaded 99000 samples

[2/4] Verifying dataset structure...
✅ Columns correct: ['image', 'conditioning_image', 'text']

[3/4] Checking sample data...
✅ Image type: PIL.Image, size: (512, 512)
✅ Conditioning image type: PIL.Image, size: (512, 512)
✅ Caption type: str, length: 87 chars

[4/4] Checking validation split (last 1000 samples)...
✅ Validation split loaded: 1000 samples
   Validation uses: train[2999000:3000000] (last 1k)

============================================================
✅ ALL CHECKS PASSED!
============================================================

Dataset cached at: ~/.cache/huggingface/datasets/
Training samples: 99000
Validation samples: 1000

⚠️  IMPORTANT: Validation always uses samples 2,999,000-2,999,999
   This ensures consistent validation across all training sizes
   (99k, 500k, 3M all use same validation set)

You can now proceed with training!
```

Step 7: Pre-Flight Verification

```bash
# Check all packages are installed
pip list | grep -E "torch|diffusers|transformers|accelerate|datasets|xformers"

# Check disk space (need ~20GB free for checkpoints)
df -h ~

# Verify dataset cache exists
ls -lh ~/.cache/huggingface/datasets/
```

Step 8: Create Output Directory

```bash
# Create directory for training outputs
mkdir -p ~/controlnet-brightness-sdxl

# Return to training directory
cd ~/diffusers/examples/controlnet
```

✅ Preparation Complete!

Once all Part B steps pass, you're ready to start GPU training.

The training command (shown in Phase 3 below) will now:

  • ✅ Use pre-downloaded dataset from cache (no re-download)
  • ✅ Have all required libraries installed with CUDA support
  • ✅ Be authenticated to HuggingFace and W&B
  • ✅ Save checkpoints to the prepared directory

Total preparation cost: ~$0.75-$1.50 (vs $60 for a full 3M training run). Why it's worth it: catches setup issues early, without wasting up to 24 hours of GPU time.

Hardware Selection (Updated for Lightning.ai):

  • πŸ† RECOMMENDED FOR TESTING: Single H100 on Free Tier
    • 99k training in 45 min for $1.88
    • Perfect for validation and hyperparameter tuning
    • 80GB VRAM allows good batch sizes
    • No subscription required
  • πŸš€ RECOMMENDED FOR PRODUCTION: 6Γ— H100 on Pro Plan ($20/month annual)
    • 3M training in 4 hours for $80 total
    • Can test multiple configs in one day
    • Net cost: ~$7/month after included credits
    • Cancel subscription after training complete
  • Not Recommended: A100 - H100 is faster and more cost-efficient

Phase 2: Dataset Preparation

Dataset Split Strategy (for 99k quick training):

  • Training: 99,000 samples (split="train[:99000]")
  • Validation: 1,000 samples (split="train[2999000:3000000]") - ALWAYS last 1k
  • Total loaded: 100,000 samples (99k + last 1k of 3M dataset)

⚠️ CRITICAL: Validation Always Uses Last 1000 Samples

  • All training sizes (99k, 500k, 3M) use train[2999000:3000000] for validation
  • This ensures consistent validation set across all training runs
  • Allows fair comparison of model quality at different training stages
  • No overlap between training and validation for any training size

Why This Matters:

```
❌ WRONG: Using different validation sets for different training sizes
   - 99k training:  train[:99000] + validation train[99000:100000]
   - 500k training: train[:499000] + validation train[499000:500000]
   - 3M training:   train[:2999000] + validation train[2999000:3000000]
   Problem: Can't compare results! Each uses different validation data.

✅ CORRECT: Same validation set for all training sizes
   - 99k training:  train[:99000] + validation train[2999000:3000000]
   - 500k training: train[:499000] + validation train[2999000:3000000]
   - 3M training:   train[:2999000] + validation train[2999000:3000000]
   Benefit: Fair comparison across all training runs on same validation set.
```
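The correct scheme maps directly onto the `datasets` split-slicing syntax. A small sketch (the slice arithmetic assumes the 3M-sample layout described above; the helper name is illustrative):

```python
# Build split strings so every training size shares the same
# fixed validation slice (the dataset's last 1,000 samples).
TOTAL_SAMPLES = 3_000_000
VAL_SIZE = 1_000
VAL_SPLIT = f"train[{TOTAL_SAMPLES - VAL_SIZE}:{TOTAL_SAMPLES}]"

def splits_for(train_size: int) -> tuple[str, str]:
    """Return (train_split, val_split) strings for datasets.load_dataset."""
    if train_size > TOTAL_SAMPLES - VAL_SIZE:
        raise ValueError("training subset would overlap the validation slice")
    return f"train[:{train_size}]", VAL_SPLIT

for size in (99_000, 499_000, 2_999_000):
    print(splits_for(size))  # val slice is always train[2999000:3000000]
```

The resulting strings are passed unchanged as the `split=` argument of `load_dataset`.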

Understanding HuggingFace Dataset Caching

Important: The HuggingFace datasets library automatically caches all downloads to ~/.cache/huggingface/datasets/. This means:

  • ✅ Cache reuse is automatic: when the training script runs, it checks the cache first and reuses any previously downloaded data
  • ✅ No re-downloads: you won't download the full 3M dataset if you've already downloaded a subset
  • ✅ The pre-download step is OPTIONAL: the training command can handle downloading on its own

Pre-download Benefits:

  • Verify dataset structure before training starts
  • Separate download time from training time
  • Ensure dataset access works before committing GPU hours

Pre-download is NOT required: The training script's --max_train_samples=99000 parameter will work whether you pre-download or not.

Dataset Download Options

Option A: Pre-download for verification (RECOMMENDED)

```python
from datasets import load_dataset

# This downloads and caches ~100k samples for verification
train_dataset = load_dataset(
    "latentcat/grayscale_image_aesthetic_3M",
    split="train[:99000]",
    cache_dir="~/.cache/huggingface/datasets"  # Default cache location
)

# Verify the dataset structure
print(f"Dataset size: {len(train_dataset)}")
print(f"Columns: {train_dataset.column_names}")
print(f"First sample keys: {train_dataset[0].keys()}")

# Check a sample
sample = train_dataset[0]
print(f"Image size: {sample['image'].size}")
print(f"Conditioning image size: {sample['conditioning_image'].size}")
print(f"Caption: {sample['text']}")
```

Option B: Let training script handle download

  • Simply run the training command with --dataset_name and --max_train_samples
  • The script will download to cache automatically
  • Slightly riskier if there are dataset access issues

Recommended: Use the full verify_dataset.py script (see "Dataset Verification Script" section above) which implements Option A with comprehensive validation checks.

Data Format Validation:

  • Verify columns: image, conditioning_image, text
  • Check image resolution: 512×512 (matches the --resolution=512 training setting; no upscaling to 1024×1024 is needed)
  • Validate grayscale format

Steps Calculation (IMPORTANT):

  • Training samples: 99,000
  • Batch size: 16
  • Gradient accumulation: 4
  • Effective batch size: 16 × 4 = 64 samples/step
  • Steps per epoch: 99,000 ÷ 64 ≈ 1,547 steps
  • For 2 epochs: ~3,094 total steps
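The same arithmetic as a reusable helper (a sketch; the function name is illustrative and not part of the training script):

```python
import math

def training_steps(samples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Total optimizer steps: one step consumes batch_size * grad_accum samples."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(samples / effective_batch)
    return steps_per_epoch * epochs

# The plan's 99k configuration: batch 16, accumulation 4, 2 epochs.
print(training_steps(99_000, 16, 4, 2))  # 3094
```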

Phase 3: Training Configuration

Prerequisites: Complete the "Manual Preparation Checklist" above before running this command.

Training Command (Based on Latentcat Article):

```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl"

accelerate launch train_controlnet_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --dataset_name="latentcat/grayscale_image_aesthetic_3M" \
  --max_train_samples=99000 \
  --conditioning_image_column="conditioning_image" \
  --image_column="image" \
  --caption_column="text" \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --train_batch_size=16 \
  --gradient_accumulation_steps=4 \
  --num_train_epochs=2 \
  --checkpointing_steps=1500 \
  --validation_steps=1500 \
  --tracker_project_name="brightness-controlnet-sdxl" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam
```

Key Parameters Explained:

  • --max_train_samples=99000: Limit to 99k samples (reserves 1k for validation)
  • --resolution=512: Match dataset resolution (latentcat article used 512, not 1024)
  • --learning_rate=1e-5: From latentcat article
  • --train_batch_size=16: From latentcat article
  • --gradient_accumulation_steps=4: Effective batch = 16 × 4 = 64
  • --num_train_epochs=2: From latentcat article
  • --checkpointing_steps=1500: Save every 1500 STEPS (~once per epoch)
    • Total training: ~3,094 steps for 2 epochs
    • Checkpoints at: 1500, 3000 steps
  • --validation_steps=1500: Run validation every 1500 STEPS
  • --gradient_checkpointing: Reduces VRAM usage
  • --use_8bit_adam: Memory optimization
  • --enable_xformers_memory_efficient_attention: Memory-efficient attention

Critical Understanding - Steps vs Samples:

  • 1 STEP = processing 1 effective batch = 64 samples
  • Checkpoint every 1500 steps = every 1500 × 64 = 96,000 samples (~1 epoch)
  • NOT a checkpoint every 1500 samples!
  • Total steps for 2 epochs: 99,000 ÷ 64 × 2 ≈ 3,094 steps

VRAM Requirements with These Settings:

The settings above are optimized for memory efficiency:

  • --mixed_precision="fp16": Halves memory usage
  • --gradient_checkpointing: Trades compute for memory (~40% VRAM savings)
  • --use_8bit_adam: Reduces optimizer state memory
  • --enable_xformers_memory_efficient_attention: Memory-efficient attention

Estimated VRAM usage:

  • SDXL base model (FP16): ~6-7GB
  • ControlNet model: ~2.5GB
  • 8-bit Adam optimizer states: ~3-4GB
  • Gradients (with checkpointing): ~2-3GB
  • Activations (batch 16, 512×512, gradient checkpointing): ~8-12GB
  • Total: ~22-28GB peak
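As a sanity check, the component estimates above sum to roughly the quoted peak (ranges in GB, copied from the list; these are the plan's estimates, not profiled numbers):

```python
# Sum the low/high VRAM estimates from the list above (GB).
components = {
    "SDXL base model (fp16)":          (6.0, 7.0),
    "ControlNet model":                (2.5, 2.5),
    "8-bit Adam optimizer states":     (3.0, 4.0),
    "Gradients (with checkpointing)":  (2.0, 3.0),
    "Activations (batch 16, 512x512)": (8.0, 12.0),
}
low = sum(lo for lo, _ in components.values())
high = sum(hi for _, hi in components.values())
print(f"Estimated peak VRAM: {low:g}-{high:g} GB")  # ~21.5-28.5 GB
```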

GPU Compatibility:

| GPU | VRAM | Will It Fit? | Batch Size | Notes |
|-----|------|--------------|------------|-------|
| L4 | 24GB | ⚠️ Tight | 8-12 | Reduce --train_batch_size to 8 or 12 |
| A100 40GB | 40GB | ✅ Yes | 16 | Comfortable fit |
| A100 80GB | 80GB | ✅ Yes | 16-24 | Plenty of headroom, can increase batch |
| H100 80GB | 80GB | ✅ Yes | 16-24 | Fastest training, plenty of VRAM |

In short: the settings above fit comfortably on an A100 40GB with batch size 16, though the H100 remains the better cost-efficiency pick.

If using an L4 24GB, modify the command:

```bash
# Change this line:
  --train_batch_size=16 \
# To:
  --train_batch_size=8 \
```

This keeps the effective batch size at 8 × 4 = 32 (half of 64), which still works well.

Accelerate Configuration for Multi-GPU Training

Important: Multi-GPU training on Lightning.ai requires the Pro plan ($20/month annual).

Single GPU (Free Tier) - No Configuration Needed

For single GPU training on Free tier, accelerate launch works without any configuration:

```bash
# No accelerate config needed - auto-detects single GPU
accelerate launch train_controlnet_sdxl.py [args...]
```

Multi-GPU (Pro Plan) - Configure Before Training

For 6× H100 training on the Pro plan, configure accelerate once:

```bash
# Run configuration wizard
accelerate config
```

Configuration Options for 6× H100:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU  # Distributed data parallel (DDP) across GPUs
num_machines: 1  # Single machine with 6 GPUs
num_processes: 6  # One process per GPU
gpu_ids: all  # Use all available GPUs
mixed_precision: fp16  # Match training script
use_cpu: false
dynamo_backend: NO  # Disable torch.compile for compatibility
```

Quick Config (Non-Interactive):

```bash
# Create accelerate config file directly
cat > ~/.cache/huggingface/accelerate/default_config.yaml << 'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 6
gpu_ids: all
mixed_precision: fp16
use_cpu: false
dynamo_backend: NO
EOF
```

Verify Configuration:

```bash
# Check configuration
accelerate env

# Test multi-GPU setup
accelerate test
```

Launch Multi-GPU Training:

```bash
# With a configuration file, launch works the same as single GPU
accelerate launch train_controlnet_sdxl.py [args...]

# Or specify the config explicitly
accelerate launch --config_file ~/.cache/huggingface/accelerate/default_config.yaml \
  train_controlnet_sdxl.py [args...]
```

H100-Optimized Training Parameters

The H100 GPU has 80GB VRAM and 1979 TFLOPs, allowing for larger batch sizes and better optimization than A100.

Optimal Batch Size for H100

Default settings (designed for A100 40GB):

```
--train_batch_size=16
--gradient_accumulation_steps=4
# Effective batch size: 16 × 4 = 64 samples/step
# VRAM usage: ~22-28GB
```

H100-optimized settings (80GB VRAM):

```
--train_batch_size=32  # 2× larger than A100
--gradient_accumulation_steps=4
# Effective batch size: 32 × 4 = 128 samples/step
# VRAM usage: ~40-48GB (still plenty of headroom)
```

Aggressive H100 settings (maximum throughput):

```
--train_batch_size=48  # 3× larger than A100
--gradient_accumulation_steps=2  # Reduce accumulation since batch is larger
# Effective batch size: 48 × 2 = 96 samples/step
# VRAM usage: ~55-65GB
# Faster training due to fewer gradient accumulation steps
```

Single H100 Training Command (99k samples)

Optimized for H100 80GB:

```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-h100"

accelerate launch train_controlnet_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --dataset_name="latentcat/grayscale_image_aesthetic_3M" \
  --max_train_samples=99000 \
  --conditioning_image_column="conditioning_image" \
  --image_column="image" \
  --caption_column="text" \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --train_batch_size=32 \
  --gradient_accumulation_steps=4 \
  --num_train_epochs=2 \
  --checkpointing_steps=750 \
  --validation_steps=750 \
  --tracker_project_name="brightness-controlnet-sdxl-h100" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam \
  --dataloader_num_workers=8 \
  --set_grads_to_none
```

Key H100 Optimizations:

  • --train_batch_size=32 (vs 16 on A100) - 2× larger batches
  • --gradient_accumulation_steps=4 - Effective batch = 128
  • --checkpointing_steps=750 - More frequent (every ~96k samples)
  • --dataloader_num_workers=8 - Faster data loading
  • --set_grads_to_none - Faster than zero_grad() on modern GPUs

Expected Performance:

  • Steps per epoch: 99,000 ÷ 128 ≈ 773 steps
  • Total steps (2 epochs): ~1,546 steps
  • Training time: ~38-45 minutes on single H100
  • Checkpoints saved at: 750, 1500 steps

6× H100 Training Command (3M samples) - Pro Plan

For Pro plan multi-GPU training:

```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-multi-h100"

# Configure accelerate for 6 GPUs (if not done already)
accelerate config  # Select MULTI_GPU, 6 processes

# Launch training
accelerate launch train_controlnet_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --dataset_name="latentcat/grayscale_image_aesthetic_3M" \
  --max_train_samples=2999000 \
  --conditioning_image_column="conditioning_image" \
  --image_column="image" \
  --caption_column="text" \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --train_batch_size=24 \
  --gradient_accumulation_steps=2 \
  --num_train_epochs=1 \
  --checkpointing_steps=2500 \
  --validation_steps=2500 \
  --tracker_project_name="brightness-controlnet-sdxl-3M" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam \
  --dataloader_num_workers=8 \
  --set_grads_to_none \
  --resume_from_checkpoint="latest"
```

Multi-GPU Optimizations:

  • --train_batch_size=24 per GPU × 6 GPUs = 144 samples per step (before accumulation)
  • --gradient_accumulation_steps=2 - Effective batch = 144 × 2 = 288
  • --checkpointing_steps=2500 - Save every ~720k samples
  • --resume_from_checkpoint="latest" - Auto-resume if interrupted

Expected Performance:

  • Effective batch size: 288 samples/step
  • Steps per epoch: 2,999,000 ÷ 288 ≈ 10,413 steps
  • Training time: ~4 hours on 6× H100
  • Checkpoints: 2500, 5000, 7500, 10000 steps + final

Batch Size Selection Guide

| GPU Config | VRAM | Recommended batch_size | grad_accum_steps | Effective Batch | Training Speed |
|------------|------|------------------------|------------------|-----------------|----------------|
| Single L4 | 24GB | 8 | 4 | 32 | Slow (baseline) |
| Single A100 | 40GB | 16 | 4 | 64 | 2× faster than L4 |
| Single H100 | 80GB | 32 | 4 | 128 | 6× faster than L4 |
| 6× H100 (Pro) | 480GB | 24/GPU | 2 | 288 | 36× faster than L4 |

Rule of Thumb:

  • Larger train_batch_size = better GPU utilization, faster training
  • Larger effective_batch_size = more stable training, better convergence
  • H100 can handle 2-3× larger batch sizes than A100 with the same settings

Memory Optimization Tips

If you encounter OOM (Out of Memory) errors on H100:

  1. Reduce batch size incrementally:

    --train_batch_size=32  # Start here
    --train_batch_size=24  # If OOM
    --train_batch_size=16  # If still OOM
    
  2. Enable additional memory optimizations:

    --gradient_checkpointing \  # Already enabled
    --use_8bit_adam \           # Already enabled
    --enable_xformers_memory_efficient_attention \  # Already enabled
    --set_grads_to_none \       # Use this instead of zero_grad()
    
  3. Use gradient accumulation to maintain effective batch size:

    # If reducing from batch_size=32 to batch_size=16
    --train_batch_size=16
    --gradient_accumulation_steps=8  # Double accumulation to keep effective=128
    
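The rule in tip 3 can be captured in a small helper. This is an illustrative sketch (the function name and defaults are mine, not part of the diffusers script): given the effective batch size you want to preserve and the per-GPU batch that fits, it returns the accumulation steps to pass.

```python
def grad_accum_for(target_effective: int, batch_size: int, num_gpus: int = 1) -> int:
    """Accumulation steps that keep a target effective batch size."""
    per_step = batch_size * num_gpus
    if target_effective % per_step != 0:
        raise ValueError("target_effective must be divisible by batch_size * num_gpus")
    return target_effective // per_step

# Dropping from batch_size=32 to 16 on one GPU: accumulate 8 to keep effective=128
accum = grad_accum_for(128, 16)             # 8
# The 6x H100 config above: 24 per GPU x 6 GPUs, accumulate 2 for effective=288
accum_multi = grad_accum_for(288, 24, 6)    # 2
```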

Full 3M Dataset Training Options

For maximum quality training on the complete dataset:

Option A: Single H100 (Free Tier)

| Metric | Value |
|---|---|
| GPU | 1× H100 80GB (~$2.50/hr on Lightning.ai) |
| Dataset | 2,999,000 training + 1,000 validation |
| Estimated Duration | ~24 hours |
| Estimated Cost | $60 GPU credits |
| Subscription Cost | $0 (Free tier) |
| Total Cost | $60 |
| Checkpoints | Every 5000 steps (~every 480k samples) |

Pros:

  • ✅ Lowest total cost
  • ✅ No subscription required
  • ✅ Good for one-time training

Cons:

  • ❌ 24 hours training time (must monitor)
  • ❌ Can't quickly iterate if issues arise

Option B: 6Γ— H100 (Pro Plan - $20/month)

| Metric | Value |
|---|---|
| GPU | 6× H100 80GB (~$2.50/hr × 6 = $15/hr) |
| Dataset | 2,999,000 training + 1,000 validation |
| Estimated Duration | ~4 hours |
| Estimated Cost | $60 GPU credits |
| Subscription Cost | $20/month (annual billing) |
| Total Cost | $80 |
| Net Cost | $67 (after $13 annual credit value) |
| Checkpoints | Every 2500 steps (~every 720k samples) |

Pros:

  • ✅ Completes in 4 hours vs 24 hours
  • ✅ Can run same-day if needed
  • ✅ Can test multiple configs quickly
  • ✅ Net subscription cost only $7 after the $13 credit value
  • ✅ Can cancel after training

Cons:

  • ❌ $20 upfront subscription cost

Scaling Math:

  • Single H100: 99k in 45 min → 3M in 45 min × 30.3 = ~24 hours
  • 6× H100: 24 hours ÷ 6 = ~4 hours

Cost Comparison:

  • Free tier: $60, 24 hours wait
  • Pro plan: $80, 4 hours wait
  • Price difference: $20 to save 20 hours
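The scaling math above can be folded into a throwaway estimator. A rough sketch under this document's assumptions (linear extrapolation from the measured 99k / 45-minute single-H100 baseline at ~$2.50/GPU-hour; it ignores the epoch-count difference between the 99k and 3M runs, and the helper name is mine):

```python
def estimate_run(samples: int, num_gpus: int = 1,
                 baseline_samples: int = 99_000, baseline_minutes: float = 45.0,
                 usd_per_gpu_hour: float = 2.50):
    """Return (wall-clock hours, total GPU cost in USD) by linear extrapolation."""
    hours = (baseline_minutes / 60) * (samples / baseline_samples) / num_gpus
    cost = hours * num_gpus * usd_per_gpu_hour
    return hours, cost

# Single H100, full 3M dataset: roughly a day of wall-clock time, ~$60 of credits
hours_1x, cost_1x = estimate_run(2_999_000)
# Six H100s cut wall-clock time ~6x; total GPU cost stays the same
hours_6x, cost_6x = estimate_run(2_999_000, num_gpus=6)
```

Note that adding GPUs divides the wall-clock time but not the GPU-hour bill, which is why the Free-tier and Pro options above quote the same $60 in credits.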

Adjusted Training Command

export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-3M"

accelerate launch train_controlnet_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --dataset_name="latentcat/grayscale_image_aesthetic_3M" \
  --max_train_samples=2999000 \
  --conditioning_image_column="conditioning_image" \
  --image_column="image" \
  --caption_column="text" \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --train_batch_size=24 \
  --gradient_accumulation_steps=4 \
  --num_train_epochs=1 \
  --checkpointing_steps=5000 \
  --validation_steps=5000 \
  --validation_prompt="a beautiful garden scene" "modern city street" "abstract art pattern" \
  --tracker_project_name="brightness-controlnet-sdxl-3M" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam \
  --resume_from_checkpoint="latest"

Key Adjustments Explained

Batch Size Scaling:

  • --train_batch_size=24 (increased from 16)
    • H100 80GB has 2x VRAM of A100 40GB
    • Can safely increase batch size by 50%
    • Alternative: --train_batch_size=32 if you have headroom
  • --gradient_accumulation_steps=4 (kept same)
    • Effective batch size: 24 × 4 = 96 samples/step
    • If using batch_size=32: 32 × 4 = 128 samples/step

Dataset & Checkpointing:

  • --max_train_samples=2999000 (vs 99,000 for quick training)
    • Training split: train[:2999000] (first 2,999,000 samples)
    • Validation split: train[2999000:3000000] (last 1,000 samples, held out)
    • ✅ No overlap between training and validation data
    • ✅ For directly comparable validation metrics, point the 99k run at the same validation slice
  • --num_train_epochs=1 (vs 2)
    • For 3M samples, 1 epoch is usually sufficient
    • Can increase to 2 if quality needs improvement
  • --checkpointing_steps=5000 (vs 1,500)
    • More frequent checkpoints would create too many files
    • 5000 steps = every ~480k samples
    • Total checkpoints: ~6-7 for full run
  • --validation_steps=5000 (matches checkpointing)
    • Run validation at each checkpoint

Resumption:

  • --resume_from_checkpoint="latest"
    • CRITICAL for multi-day training
    • If training crashes, automatically resumes from last checkpoint
    • Saves days of retraining if interrupted

Training Math

Steps Calculation:

  • Training samples: 2,999,000 (validation: 1,000)
  • Effective batch size: 96 (or 128 with batch_size=32)
  • Steps per epoch: 2,999,000 ÷ 96 = 31,240 steps
    • With batch_size=32: 2,999,000 ÷ 128 = ~23,430 steps
  • For 1 epoch: 31,240 steps total
  • For 2 epochs: 62,480 steps total
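The arithmetic above, as a small helper (the function name is mine; it rounds the final partial batch up, matching the step counts quoted):

```python
import math

def steps_for(samples: int, effective_batch: int, epochs: int = 1) -> int:
    """Total optimizer steps for a run, rounding the last partial batch up."""
    return math.ceil(samples / effective_batch) * epochs

one_epoch = steps_for(2_999_000, 96)        # 31,240 steps, matching the math above
two_epochs = steps_for(2_999_000, 96, 2)    # 62,480 steps
```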

Checkpoints:

  • Saved every 5,000 steps
  • Checkpoint locations: steps 5000, 10000, 15000, 20000, 25000, 30000, 31240 (final)
  • Each checkpoint: ~2.5GB (ControlNet weights)
  • Total storage: ~20GB for all checkpoints + training state
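Since each checkpoint is ~2.5GB, pruning old ones keeps disk usage bounded. A minimal sketch (the helper name and `keep` default are mine; be aware it also deletes checkpoints you might want for the early-evaluation comparisons described later):

```python
import shutil
from pathlib import Path

def prune_checkpoints(output_dir: str, keep: int = 2) -> list:
    """Delete all but the `keep` newest checkpoint-<step> directories."""
    ckpts = sorted(
        (p for p in Path(output_dir).glob("checkpoint-*") if p.is_dir()),
        key=lambda p: int(p.name.split("-")[1]),  # sort numerically by step
    )
    for old in (ckpts[:-keep] if keep else ckpts):
        shutil.rmtree(old)
    return [p.name for p in ckpts[-keep:]] if keep else []
```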

VRAM Usage (H100 80GB)

With batch_size=24:

  • SDXL base model (FP16): ~6-7GB
  • ControlNet model: ~2.5GB
  • 8-bit Adam optimizer: ~3-4GB
  • Gradients (with checkpointing): ~3-4GB
  • Activations (batch 24): ~15-20GB
  • Total: ~35-40GB ✅ Fits comfortably in 80GB

With batch_size=32 (max):

  • Activations increase to ~20-25GB
  • Total: ~42-48GB ✅ Still fits with headroom

Recommended: Start with batch_size=24, monitor VRAM in W&B, can increase to 32 if using <60GB.

Risk Mitigation for Long Training

Strategy 1: Incremental Training

# Start with 500k samples to validate approach
--max_train_samples=500000
# Cost: ~$10, Duration: ~4 hours on a single H100
# If results good, continue to full 3M

Strategy 2: Early Checkpoint Evaluation

# Evaluate quality at checkpoints (single-H100 estimates):
# - checkpoint-5000  (~480k samples, ~4 hours, ~$10)
# - checkpoint-10000 (~960k samples, ~8 hours, ~$19)
# - checkpoint-15000 (~1.4M samples, ~12 hours, ~$29)
# Can stop early if quality plateaus

Strategy 3: Use Spot Instances

  • Many cloud providers offer H100 spot instances at 50-70% discount
  • Cost could drop to ~$0.75-$1.25/hr (~$18-$30 for the full 24-hour run)
  • Requires --resume_from_checkpoint="latest" (already included)
  • Risk: Training may be interrupted, but will resume automatically

When to Use Full 3M Training

Use 99k samples if:

  • ✅ First time training ControlNet
  • ✅ Testing hyperparameters
  • ✅ Budget constrained (<$50)
  • ✅ Need results quickly (1-2 days)

Use 3M samples if:

  • ✅ 99k results are good but want better quality
  • ✅ Commercial production use (worth the investment)
  • ✅ Training other ControlNet types (can reuse knowledge)
  • ✅ Contributing to research/community (publishable results)
  • ✅ Budget allows ($60-$82 on the H100 plans)

Phase 4: Training Monitoring

Setup Weights & Biases:

wandb login
# Use wandb to track:
# - Loss curves
# - Validation images every 1,500 steps
# - Learning rate schedule
# - GPU utilization

Checkpoints:

  • Saved every 1,500 steps to $OUTPUT_DIR/checkpoint-{step}
  • With ~3,094 total steps, will get checkpoints at:
    • checkpoint-1500 (~97% of epoch 1)
    • checkpoint-3000 (~94% of epoch 2)
    • Final model at end of training
  • Can resume training if interrupted: --resume_from_checkpoint="./controlnet-brightness-sdxl/checkpoint-1500"

Validation:

  • Uses 1,000 validation samples from train[99000:100000]
  • Runs every 1,500 steps (at checkpoints)
  • W&B logs validation images and metrics
  • No need for manual validation prompts/images

Validation Metrics (Automatic)

No configuration needed! The training script automatically computes validation metrics:

Loss Function (Automatic):

  • Default: MSE (Mean Squared Error) loss between predicted and target images
  • Optional: Huber loss - add --loss_type="huber" to training command
  • Formula: loss = F.mse_loss(model_pred.float(), target.float())
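For intuition, here is a pure-Python sketch of the two loss functions (the training script itself computes them with torch.nn.functional on tensors; this scalar version is only illustrative):

```python
def mse_loss(pred, target):
    """Mean squared error over paired values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def huber_loss(pred, target, delta=1.0):
    """Quadratic near zero, linear beyond delta -> less sensitive to outliers."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        total += 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)
    return total / len(pred)
```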

What Gets Logged to W&B:

  1. Training loss (every step)
  2. Validation loss (every --validation_steps=1500 steps)
  3. Validation images (generated samples at validation time)
  4. Learning rate (schedule tracking)
  5. GPU utilization (hardware monitoring)

Validation Process:

  1. Every 1500 steps, training pauses
  2. Model generates images from validation set
  3. Same MSE/Huber loss computed on validation samples
  4. Loss + images logged to W&B
  5. Training resumes

No manual metrics needed - everything is handled by the training script!

Phase 5: Model Evaluation & Publishing

Test Inference:

First, install QR code library if needed:

pip install qrcode[pil]

Then run inference:

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
import torch
import qrcode
from PIL import Image

# Generate QR code for testing
print("Generating QR code for https://google.com...")
qr = qrcode.QRCode(
    version=1,
    error_correction=qrcode.constants.ERROR_CORRECT_H,
    box_size=10,
    border=4,
)
qr.add_data("https://google.com")
qr.make(fit=True)

# Create QR code image and resize to 1024x1024
qr_image = qr.make_image(fill_color="black", back_color="white").convert("RGB")
qr_image = qr_image.resize((1024, 1024), Image.LANCZOS)
print(f"QR code generated: {qr_image.size}")

# Load trained ControlNet
print("Loading ControlNet model...")
controlnet = ControlNetModel.from_pretrained(
    "./controlnet-brightness-sdxl/checkpoint-3000",  # or checkpoint-1500
    torch_dtype=torch.float16
)

# Load SDXL pipeline with ControlNet
print("Loading SDXL pipeline...")
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16
)
pipe.enable_xformers_memory_efficient_attention()
pipe.to("cuda")

# Generate artistic QR code
print("Generating artistic QR code...")
image = pipe(
    prompt="a beautiful garden scene with flowers, highly detailed, professional photography",
    negative_prompt="blurry, low quality, distorted",
    image=qr_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.45,  # Adjust 0.3-0.6 for balance
    guidance_scale=7.5,
).images[0]

# Save results
qr_image.save("original_qr.png")
image.save("artistic_qr_result.png")
print("✅ Done! Check artistic_qr_result.png")
print("📱 Scan with phone to verify QR code still works!")

Testing Different Conditioning Scales:

# Test multiple conditioning scales to find best balance
for scale in [0.3, 0.4, 0.5, 0.6]:
    print(f"Testing conditioning_scale={scale}...")
    image = pipe(
        prompt="a beautiful garden scene with flowers",
        image=qr_image,
        num_inference_steps=30,
        controlnet_conditioning_scale=scale,
    ).images[0]
    image.save(f"result_scale_{scale}.png")

Publish to HuggingFace Hub:

# After validation
huggingface-cli login
python scripts/upload_to_hub.py \
  --model_path="./controlnet-brightness-sdxl-3M/checkpoint-30000" \
  --repo_name="Oysiyl/controlnet-brightness-sdxl"

Cost-Benefit Analysis

Investment Required (Updated for Single H100)

Strategy A: Free Tier (99k Quick Test)

| Component | Cost/Time |
|---|---|
| GPU Credits (99k samples, 2 epochs, single H100) | $1.88 |
| Setup Time | 1-2 hours |
| Training Duration | 45 minutes ⚡ |
| Testing & Validation | 2-3 hours |
| Total Time | ~4-6 hours (same day) |
| Total Cost | $1.88 |

Strategy B: Pro Plan (Full 3M Training)

| Component | Cost/Time |
|---|---|
| Pro Subscription (can cancel after) | $20/month |
| Included credits value | -$13 (240 credits/year) |
| GPU Credits (3M samples, 1 epoch, 6× H100) | $60 |
| Setup Time | 1-2 hours |
| Training Duration | 4 hours ⚡ |
| Testing & Validation | 2-3 hours |
| Total Time | ~8 hours (same day) |
| Total Cost | $80 ($20 sub + $60 GPU) |
| Net Cost | $67 (after annual credit value) |

Strategy C: All-in-One (Pro Plan, Test Everything)

| Component | Cost/Time |
|---|---|
| Pro Subscription | $20/month |
| 99k test (6× H100) | $1.88 (7.5 min) |
| 500k training (6× H100) | $10 (40 min) |
| 3M training (6× H100) | $60 (4 hours) |
| Total GPU Time | ~5 hours |
| Total GPU Cost | $71.88 |
| Total with Sub | $91.88 |
| Net after credits | $78.88 |

Recommendation: Start with Strategy A ($1.88), upgrade to Strategy B if promising

Value Delivered

  1. Unblocks SDXL Migration: Enables upgrade from SD 1.5 to higher quality SDXL
  2. Better Image Quality: SDXL produces superior 1024×1024 images vs SD 1.5's 512×512
  3. Community Value: First public SDXL brightness ControlNet (potential citations/recognition)
  4. No Alternatives: Cannot proceed with SDXL QR code generation without this model
  5. Reusable Asset: Once trained, can be used indefinitely

Risk Mitigation

  • Start Small: Train on the 99k subset first (~$2, ~45 minutes on a single H100)
  • Evaluate Early: Check quality at checkpoint-5000, checkpoint-10000
  • Iterative Approach: Extend training only if initial results are promising
  • Fallback: Can continue using SD 1.5 if SDXL training fails

Alternative Approaches Considered

Option 1: Train Brightness ControlNet for SDXL (RECOMMENDED)

  • Pros:
    • Proven training pipeline (diffusers script exists)
    • Same dataset as original SD 1.5 model
    • Good quality/cost balance
    • Community support and documentation
    • License-friendly (SDXL is permissive)
  • Cons:
    • Requires GPU time investment ($62-$82 on Lightning.ai H100s)
    • Up to 24 hours training on a single H100 (~4 hours on 6× H100)
    • Still requires 24GB+ VRAM for inference
  • Cost: ~$60 for the full 3M dataset on a single H100 (recommended)
  • Risk: Low - well-documented process
  • Verdict: ✅ Best choice for production use

Option 2: Train Brightness ControlNet for Flux Schnell

  • Pros:
    • Apache 2.0 license (fully commercial)
    • Faster inference than Flux Dev (3× speedup)
    • Same architecture as Dev (12B parameters)
    • Would be first-of-its-kind community contribution
  • Cons:
    • ⚠️ No existing training scripts for Schnell
    • Would need to adapt Flux Dev training code
    • Unknown if distillation affects ControlNet training
    • Still requires 32-40GB VRAM (heavier than SDXL)
    • Higher risk and uncertainty
    • Longer training time due to larger model
  • Cost: $200-$500 (estimated, higher due to larger model)
  • Risk: High - experimental, no precedent
  • Verdict: 🔬 Experimental - only if willing to pioneer new territory

Option 3: Use SDXL LoRA for Brightness Control

  • Pros: No training required, immediate availability
  • Cons: Less precise control than dedicated ControlNet, may not work well for QR codes
  • Verdict: Worth testing but likely insufficient for QR code use case

Option 4: Latent Initialization Approach

  • Pros: Architecture-agnostic, works with both SDXL and Flux
  • Cons: Less control over brightness distribution, requires experimentation
  • Verdict: Good fallback but not as reliable as ControlNet

Option 5: Wait for Community Release

  • Pros: Zero cost, zero effort
  • Cons: No timeline, may never happen, blocks project progress
  • Verdict: Not viable for active development

Option 6: Hybrid Tile ControlNet + Post-Processing

  • Pros: Tile ControlNet available for SDXL
  • Cons: Doesn't address brightness control directly
  • Verdict: Complementary but not a replacement

Conclusion: Training SDXL ControlNet is the most reliable solution. Flux Schnell is interesting for research but carries significant execution risk.

Recommended Action Plan

Immediate Setup (Day 1)

  1. Launch Lightning AI Instance: H100 80GB GPU
  2. Run Setup Commands: Install all dependencies (see Phase 3 above)
  3. Authenticate: HuggingFace and W&B login
  4. Clone Diffusers: Get training scripts

Training Phase (Day 1 - Morning) ⚑

  1. Start Training: Launch training with 99k samples (~45 minutes on a single H100)
  2. Monitor W&B: Track loss curves and validation images in real-time
  3. First Checkpoint: Review checkpoint-1500 (~25 minutes in)
  4. Training Complete: Total ~45 minutes for full 2-epoch run

Evaluation Phase (Day 1 - Afternoon)

  1. Post-Training Validation: Run inference on 1k validation set
  2. QR Code Testing: Test with actual QR codes, measure scannability
  3. Quality Assessment: Compare to SD 1.5 brightness ControlNet
  4. Decision Point:
    • If quality good: Publish and integrate (move to next phase)
    • If needs improvement: Launch 2nd training run with adjusted hyperparameters (~45 min)
    • Can try 3-4 different configurations in same day!

Optional: Full Dataset Training (Day 1 - Evening)

  • 12a. If 99k results promising: Launch full 3M training (~4 hours on 6× H100, or ~24 hours on a single H100)
  • 12b. Monitor overnight: W&B tracks progress automatically
  • 12c. Next morning: Evaluate final model quality

Integration Phase (Day 2)

  1. Publish to HuggingFace: Upload best checkpoint
  2. Update app_sdxl.py: Integrate new ControlNet model
  3. Production Testing: End-to-end QR code generation tests
  4. Documentation: Update README with SDXL support

Total Timeline: 1-2 days (vs previous estimate of 5 days)

Success Metrics

  1. QR Code Scannability: 95%+ scan rate on generated images
  2. Visual Quality: Subjective improvement over SD 1.5 outputs
  3. Control Precision: Ability to adjust brightness strength (0.0-1.0 range)
  4. Training Loss: Convergence to < 0.1 validation loss
  5. Community Adoption: Positive feedback if published publicly

Critical Files to Modify

Once model is trained:

  • app.py:48-56 - Add SDXL ControlNet loading
  • app.py:1880-1886 - Update standard pipeline with SDXL support
  • app.py:2343-2349 - Update artistic pipeline with SDXL support
  • app_sdxl.py - Complete SDXL-specific implementation
  • comfy/sd_configs/ - Add SDXL configuration if needed

Flux Schnell Training Considerations (If Pursuing)

If you decide to pursue Flux Schnell ControlNet training despite the risks:

Required Adaptations:

  1. Training Script Modification: Adapt train_controlnet_flux.py to work with Schnell

    • Model path: black-forest-labs/FLUX.1-schnell instead of FLUX.1-dev
    • Verify architecture compatibility (distillation may affect ControlNet layers)
    • Test with small pilot run (1000 steps) before full training
  2. Hardware Requirements:

    • Minimum: H100 (80GB VRAM) - $1.99/hr
    • A100 40GB likely insufficient for Flux training
    • Estimated training: 150-250 hours on H100 (~$300-$500)
  3. Dataset Considerations:

    • Flux uses 1024×1024 resolution (same as SDXL)
    • Dataset would need upscaling from 512×512 or re-preprocessing
    • Consider starting with 100k subset for validation
  4. Verification Steps:

    • Test if Schnell's distillation preserves ControlNet training capability
    • Compare with Flux Dev training (if available for testing)
    • Validate brightness control precision matches SD 1.5 quality

Risk Assessment:

  • Technical Risk: High - no proven training path
  • Time Risk: Medium-High - debugging could extend timeline significantly
  • Cost Risk: High - may require multiple training attempts ($500+)
  • Success Probability: 50-70% (educated guess based on architecture similarity)

Recommendation: Only pursue if:

  1. SDXL training completes successfully first (de-risk approach)
  2. You're willing to contribute pioneering work to the community
  3. Budget allows for experimental work ($500-1000 total including failed attempts)

References

SDXL Training

Flux Information

Final Recommendation (Updated December 2024 - Lightning.ai)

Proceed with SDXL Brightness ControlNet Training on Single H100 (Free Tier)

Based on Lightning.ai pricing and multi-GPU requirements, the recommended path is:

Phase 1: Quick Validation (Free Tier)

  1. Start with 99k samples on single H100
    • Cost: $1.88 in GPU credits
    • Duration: 45 minutes
    • Platform: Lightning.ai Free tier
    • Purpose: Validate training pipeline and quality

Phase 2: Production Training (Choose Based on Phase 1)

Option A: Budget Approach (Free Tier)

  • Run full 3M dataset on single H100
  • Cost: $60 GPU credits, $0 subscription
  • Duration: 24 hours
  • Total: $60
  • Best for: One-time training, have patience

Option B: Speed Approach (Pro Plan)

  • Upgrade to Pro plan ($20/month annual)
  • Run full 3M dataset on 6× H100
  • Cost: $60 GPU + $20 subscription = $80
  • Net cost: $67 (after $13 annual credit value)
  • Duration: 4 hours
  • Best for: Need results same day, may iterate

Recommended Strategy

Most Cost-Effective Path:

  1. Day 1 Morning: Run 99k test on Free tier ($1.88, 45 min)
  2. Day 1 Afternoon: Evaluate results
  3. If promising:
    • Budget route: Start 3M on Free tier ($60, 24 hrs) → Total: $61.88
    • Speed route: Upgrade to Pro, run 3M ($80, 4 hrs) → Total: $81.88
  4. Cancel Pro after training if using speed route

Why This Path

  • Low Risk Entry: Only $1.88 to validate entire pipeline
  • Flexible Scaling: Choose speed vs cost based on results
  • Proven Pipeline: HuggingFace Diffusers battle-tested script
  • Reference Success: Original SD 1.5 model trained on same dataset
  • H100 Advantage: 6.3× faster than A100 even on single GPU
  • Cost-Effective: $62-$82 total (vs $900+ on older plans)
  • Unblocks Migration: Enables full SDXL upgrade from SD 1.5

Cost Breakdown Comparison

| Approach | Hardware | Duration | GPU Cost | Sub Cost | Total | Timeline |
|---|---|---|---|---|---|---|
| Old Plan (A100) | Single A100 | 180 hours | $900-1,200 | $0 | $900-1,200 | 1 week |
| NEW: Free Tier | Single H100 | 24.75 hours | $61.88 | $0 | $61.88 | 2 days |
| NEW: Pro Plan | 6× H100 | 4.75 hours | $61.88 | $20 | $81.88 | 1 day |

Savings vs Old Plan:

  • Free tier: Save $838-$1,138 and 6 days
  • Pro plan: Save $818-$1,118 and 6 days

Pro Plan ROI Analysis

When is Pro worth it?

  • $20 extra to save 20 hours (24h → 4h)
  • = $1/hour saved
  • Plus: Can test multiple hyperparameters same day
  • Plus: Includes $13/year in credits

Get Pro if:

  • ✅ You value time over $1/hour
  • ✅ Planning to iterate on hyperparameters
  • ✅ Need results urgently
  • ✅ Want to test 99k + 500k + 3M in one session

Skip Pro if:

  • ✅ Doing one-time training only
  • ✅ Can wait 24 hours
  • ✅ Budget constrained
  • ✅ 99k test was sufficient

Next Steps

Once plan is approved:

  1. Set up Lightning AI account with H100 GPU access
  2. Clone diffusers repository and install requirements
  3. Verify dataset access and download capabilities
  4. Prepare validation QR codes for quality testing
  5. Launch training with recommended hyperparameters
  6. Monitor via Weights & Biases for loss curves and validation images
  7. Evaluate checkpoints every 5k steps (e.g., 5k, 15k, 30k)
  8. Complete training and publish to HuggingFace Hub
  9. Integrate into app_sdxl.py for production use

Flux Schnell remains an option for future exploration once SDXL is production-ready, but is deprioritized due to experimental nature and higher resource requirements.