
Training ControlNet Brightness for SDXL - Feasibility Analysis

Executive Summary

Training a brightness ControlNet for SDXL is technically feasible and recommended as the critical upgrade path from SD 1.5 to SDXL for QR code generation. This model is essential because no public SDXL brightness ControlNet exists.

Key Estimates (Updated December 2024 - Single H100 GPU):

  • Time: 45 minutes (99k samples) to 24 hours (3M samples) on a single H100
  • Cost: ~$2 (99k samples) to ~$60 (3M samples) in GPU credits
  • Platform: Lightning.ai, with optional Pro plan ($20/month for multi-GPU)
  • Priority: High - enables SDXL migration for QR code generation
  • Complexity: Medium - well-documented training pipeline with reference implementation

Recommended Path:

  • Start with a single H100 for 99k samples (~45 min, ~$2)
  • If successful, optionally upgrade to the Pro plan for faster 3M training
  • Total investment: ~$2-$82 depending on training size and plan choice

Background Context

Current Implementation (SD 1.5)

  • Location: app.py:1880-1886, 2343-2349
  • Model: control_v1p_sd15_brightness.safetensors from latentcat/latentcat-controlnet
  • Purpose: Controls QR code pattern visibility via brightness conditioning
  • Critical: Essential for QR code readability - cannot be removed

Why SDXL Brightness ControlNet is Needed

  1. No Public Alternative: No SDXL-equivalent brightness ControlNet exists on HuggingFace
  2. Migration Blocker: Current SD 1.5 brightness ControlNet incompatible with SDXL architecture
  3. QR Readability: Brightness control is core to balancing aesthetic quality with QR scannability
  4. Flux is Too Heavy: SDXL is the practical upgrade path (Flux requires 32-40GB VRAM)

Flux Model Landscape (Updated Analysis)

Flux Schnell (Apache 2.0 License)

  • License: Fully open for commercial use - no restrictions
  • Architecture: Same 12B parameters as Flux Dev, but distilled for speed (3× faster)
  • Quality: Lower than Dev due to aggressive distillation trading detail for speed
  • VRAM: Still requires 32-40GB (same as Dev)
  • ControlNet Status: ⚠️ No existing ControlNet models or training scripts
  • Training Risk: Would require adapting Flux Dev training script - pioneering work
  • Community: Active requests for Schnell ControlNets but no official releases

Flux Dev (Non-Commercial License)

  • License: Non-commercial only - cannot be used for commercial QR code generation
  • ControlNet Status: ✅ Extensive support (XLabs-AI, InstantX collections)
  • Training Scripts: Available from XLabs-AI and HuggingFace Diffusers
  • Quality: Superior to Schnell, but license restrictions make it unsuitable

Flux Pro (Commercial API)

  • License: API-only, commercial pricing
  • Status: Not suitable for self-hosted training

Assessment: While Flux Schnell has an attractive license, the lack of proven ControlNet training pipeline makes it high-risk. SDXL remains the proven, practical choice.

Hardware Selection & Platform Strategy

Lightning.ai Pricing Tiers (December 2024)

Lightning.ai offers different tiers with varying multi-GPU capabilities:

| Plan | Cost | Multi-GPU | Max GPUs | Credits Included | Best For |
|------|------|-----------|----------|------------------|----------|
| Free | $0 | ❌ No | 1 | 15/month | Quick 99k test |
| Pro | $20/month (annual) | ✅ Yes | 6 | 240/year (~$13/mo) | Recommended |
| Teams | $119/month (annual) | ✅ Yes | 12 | 600/year | Large teams |

Pro Plan Benefits:

  • Only $20/month if paid annually ($240/year vs $600 monthly)
  • Includes 240 credits/year = ~$13 of free GPU time
  • Net cost: ~$7/month after credits
  • Multi-GPU training up to 6 GPUs
  • Can cancel after training completes

GPU Comparison Analysis (Lightning.ai)

Single GPU Performance:

| GPU | TFLOPs | Memory | Cost/hr | 99k Time | 99k Cost | 3M Time | 3M Cost |
|-----|--------|--------|---------|----------|----------|---------|---------|
| A100 | 312 | 40GB | ~$1.50 | 4-6 hours | $6-9 | 120-180 hours | $180-270 |
| H100 | 1979 | 80GB | ~$2.50 | 45 min | $1.88 | 24 hours | $60 |

Cost Efficiency:

  • H100 is 6.3× faster than A100 (1979 vs 312 TFLOPs)
  • H100 costs 1.67× more per hour on Lightning.ai
  • Net result: ~3.8× better cost efficiency
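The efficiency figures above follow from simple arithmetic. A quick sketch using the table's estimates (the hourly rates are the plan's Lightning.ai approximations, not quoted prices):

```python
# Cost-efficiency comparison using the figures from the table above.
a100_tflops, a100_rate = 312, 1.50    # TFLOPs, ~$/hr (plan's estimates)
h100_tflops, h100_rate = 1979, 2.50

speedup = h100_tflops / a100_tflops          # ~6.3x faster
price_ratio = h100_rate / a100_rate          # ~1.67x more per hour
efficiency_gain = speedup / price_ratio      # ~3.8x better cost efficiency

print(f"H100 speedup:        {speedup:.1f}x")
print(f"H100 price ratio:    {price_ratio:.2f}x")
print(f"Net cost efficiency: {efficiency_gain:.1f}x")
```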

Single vs Multi-GPU: Should You Get Pro Plan?

Option A: Free Plan (Single H100)

| Training Size | Duration | GPU Cost | Total Cost | Timeline |
|---------------|----------|----------|------------|----------|
| 99k samples | 45 min | $1.88 | $1.88 | Same day |
| 500k samples | 4 hours | $10 | $10 | Same day |
| 3M samples | 24 hours | $60 | $60 | 1-2 days |

Pros:

  • ✅ $0 subscription cost
  • ✅ Very cheap for 99k testing
  • ✅ Good for one-off training

Cons:

  • ❌ 24 hours for 3M training (must babysit)
  • ❌ Can't test multiple hyperparameters quickly
  • ❌ Limited to 15 free credits/month

Option B: Pro Plan (6× H100)

| Training Size | Duration | GPU Cost | Subscription | Total Cost | Timeline |
|---------------|----------|----------|--------------|------------|----------|
| 99k samples | 7.5 min | $1.88 | $20 | $21.88 | Minutes |
| 500k samples | 40 min | $10 | $20 | $30 | Same hour |
| 3M samples | 4 hours | $60 | $20 | $80 | Same day |

Multi-GPU total cost stays the same because:

  • 6× GPUs = 6× faster
  • 6× GPUs = 6× more expensive per hour
  • Net: same total GPU cost, much faster completion
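Spelled out, the invariance holds because cost is hours × GPU count × hourly rate, and the GPU count cancels (assuming near-linear scaling, as the plan does):

```python
# GPU cost is invariant to GPU count under linear scaling:
# cost = (hours / n) * n * rate_per_gpu_hour = hours * rate.
rate_per_gpu_hr = 2.50      # ~$/hr per H100 (plan's estimate)
single_gpu_hours = 24       # 3M samples on one H100

def total_cost(n_gpus: int) -> float:
    hours = single_gpu_hours / n_gpus   # n GPUs finish n times faster
    return hours * n_gpus * rate_per_gpu_hr

print(total_cost(1))  # 60.0
print(total_cost(6))  # 60.0 - same cost, ~4 hours instead of 24
```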

Pros:

  • ✅ 3M training finishes in 4 hours (vs 24 hours)
  • ✅ Can test 3-4 hyperparameter configs in one day
  • ✅ Includes 240 credits/year (~$13 value)
  • ✅ Real net cost: $7/month after credits
  • ✅ Can cancel after training is done

Cons:

  • ❌ $20 upfront cost (annual commitment)

Recommendation Matrix

If you're doing ONE 99k training run:

  • ✅ Use Free tier ($1.88 total, 45 min)
  • Skip Pro plan - not worth $20 for 7.5 min vs 45 min

If you're doing 500k OR 3M training:

  • ✅ Get Pro plan ($20/month)
  • 3M: 4 hours vs 24 hours = worth it
  • Can test multiple configs same day
  • Net cost after credits: ~$7/month

If you're doing multiple experiments:

  • ✅ Definitely get Pro plan
  • Test 99k + 500k + 3M all in one day
  • Total time: ~5 hours vs 30+ hours
  • Total cost: $20 + ~$72 GPU = $92
  • Cancel Pro after training complete

Most Cost-Effective Strategy:

  1. Start with Free tier for 99k test ($1.88, 45 min)
  2. If results promising, upgrade to Pro for 3M training
  3. Run full training in 4 hours
  4. Cancel Pro after done
  5. Total: $20 Pro + $60 GPU + $1.88 test = $81.88

Updated Training Timeline Estimates

Single H100 (Free Tier):

| Training Size | Duration | Total Cost | When to Use |
|---------------|----------|------------|-------------|
| 99k samples | 45 min | $1.88 | Quick validation, hyperparameter testing |
| 500k samples | 4 hours | $10 | Medium quality, budget option |
| 3M samples | 24 hours | $60 | Max quality, have patience |

6× H100 (Pro Plan at $20/month):

| Training Size | Duration | Total Cost | When to Use |
|---------------|----------|------------|-------------|
| 99k samples | 7.5 min | $21.88 | Ultra-fast iteration |
| 500k samples | 40 min | $30 | Production ready, same day |
| 3M samples | 4 hours | $80 | Best quality, same day results |

Training Strategy

Dataset: latentcat/grayscale_image_aesthetic_3M

  • Size: 3 million images at 512×512 resolution
  • Format: Parquet files with image/conditioning_image/text columns
  • Same Dataset: Used for original SD 1.5 brightness ControlNet training
  • License: Latent Cat (check license before commercial use)
  • Quality: Pre-processed grayscale images with aesthetic filtering

Reference Training Results (from latentcat article)

| Configuration | Samples | Hardware | Duration | Cost Estimate |
|---------------|---------|----------|----------|---------------|
| Original SD 1.5 | 100k | A6000 | 13 hours | ~$20 (est.) |
| Original SD 1.5 | 3M | TPU v4-8 | 25 hours | N/A (TPU) |

SDXL Training Scaling Estimates

Updated Based on Latentcat Article:

  • Training at 512×512 resolution (NOT 1024×1024) - matches the dataset and the original training
  • SDXL has a larger UNet architecture (~2.5GB vs 1.7GB for SD 1.5)
  • Expected slowdown: 2-3× compared to SD 1.5 training

Time Estimates for 99k Training Samples (Lightning.ai Single H100):

Calculation Methodology

Baseline Reference:

  • Latentcat article: 100k samples on A6000 = 13 hours (SD 1.5)
  • SDXL overhead: 13h × 2.5 (larger architecture) = ~32.5 hours for 100k
  • A6000 ≈ A100 in performance (~300-312 TFLOPs)

Scaling to H100:

  • A100: 312 TFLOPs → ~4-6 hours for 99k samples
  • H100: 1979 TFLOPs → 6.3× faster
  • H100 single GPU: ~38-57 minutes for 99k samples
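The chain of estimates can be written out directly. Inputs are the plan's assumptions (the 4-6 hour A100 estimate for 99k SDXL samples and published TFLOPs figures), not benchmarks:

```python
# H100 time estimate derived from the plan's A100 figures (assumptions,
# not measurements; real throughput depends on batch size and I/O).
a100_tflops, h100_tflops = 312, 1979
h100_speedup = h100_tflops / a100_tflops          # ~6.3x

a100_hours_99k = (4, 6)                           # plan's A100 estimate for 99k
h100_minutes_99k = [h * 60 / h100_speedup for h in a100_hours_99k]
print(f"Single H100, 99k samples: "
      f"~{h100_minutes_99k[0]:.0f}-{h100_minutes_99k[1]:.0f} min")  # ~38-57 min
```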

Multi-GPU Scaling (Pro Plan):

  • 6× H100 GPUs = 6× faster = ~7.5 minutes for 99k
  • Total cost stays the same (6× faster but 6× more expensive per hour)

Recommended Configurations

πŸ† OPTION 1: Free Tier (Single H100) - Best for Testing

  • 99k samples: 45 min, $1.88
  • 500k samples: 4 hours, $10
  • 3M samples: 24 hours, $60
  • Best for: One-off training, budget-conscious, have patience

🚀 OPTION 2: Pro Plan (6× H100) - Best for Production

  • Subscription: $20/month (annual), includes $13 credits = $7 net cost
  • 99k samples: 7.5 min, $21.88 total ($1.88 GPU + $20 sub)
  • 500k samples: 40 min, $30 total ($10 GPU + $20 sub)
  • 3M samples: 4 hours, $80 total ($60 GPU + $20 sub)
  • Best for: Multiple experiments, 3M training, need results same day

Cost Comparison Summary:

| Scenario | Free Tier | Pro Plan | Savings (Pro) |
|----------|-----------|----------|---------------|
| Single 99k test | $1.88 | $21.88 | ❌ $20 more |
| Single 3M training | $60 | $80 | ❌ $20 more |
| 99k + 500k + 3M | $71.88 (30 hours) | $92 (5 hours) | ✅ Saves 25 hours |
| 3+ experiments | $71.88+ (30+ hours) | $92 (5-6 hours) | ✅ Saves 24+ hours |

Recommendation:

  • For single 99k test: Use Free Tier (not worth $20 for speed)
  • For 3M training: Consider Pro (4 hrs vs 24 hrs = big difference)
  • For multiple runs: Definitely Pro (can test everything in one day)

Technical Implementation Plan

Dataset Verification Script

Create this script to verify dataset before training:

```bash
cat > verify_dataset.py << 'EOF'
#!/usr/bin/env python3
"""
Dataset verification script for SDXL ControlNet Brightness training.
Downloads a subset of the dataset and verifies structure.

Usage: python verify_dataset.py
"""

from datasets import load_dataset
from PIL import Image
import sys

def verify_dataset():
    print("=" * 60)
    print("SDXL ControlNet Brightness - Dataset Verification")
    print("=" * 60)

    print("\n[1/4] Loading dataset subset (99k samples)...")
    print("This will download ~10-15GB to cache...")

    try:
        train_dataset = load_dataset(
            "latentcat/grayscale_image_aesthetic_3M",
            split="train[:99000]",
            cache_dir="~/.cache/huggingface/datasets"
        )
        print(f"✅ Successfully loaded {len(train_dataset)} samples")
    except Exception as e:
        print(f"❌ Failed to load dataset: {e}")
        sys.exit(1)

    print("\n[2/4] Verifying dataset structure...")
    expected_columns = {"image", "conditioning_image", "text"}
    actual_columns = set(train_dataset.column_names)

    if actual_columns == expected_columns:
        print(f"✅ Columns correct: {train_dataset.column_names}")
    else:
        print(f"❌ Column mismatch!")
        print(f"   Expected: {expected_columns}")
        print(f"   Got: {actual_columns}")
        sys.exit(1)

    print("\n[3/4] Checking sample data...")
    sample = train_dataset[0]

    # Check images
    if isinstance(sample['image'], Image.Image):
        img_size = sample['image'].size
        print(f"✅ Image type: PIL.Image, size: {img_size}")
    else:
        print(f"❌ Unexpected image type: {type(sample['image'])}")

    if isinstance(sample['conditioning_image'], Image.Image):
        cond_size = sample['conditioning_image'].size
        print(f"✅ Conditioning image type: PIL.Image, size: {cond_size}")
    else:
        print(f"❌ Unexpected conditioning image type: {type(sample['conditioning_image'])}")

    if isinstance(sample['text'], str):
        caption_len = len(sample['text'])
        print(f"✅ Caption type: str, length: {caption_len} chars")
        print(f"   Sample caption: '{sample['text'][:100]}...'")
    else:
        print(f"❌ Unexpected caption type: {type(sample['text'])}")

    print("\n[4/4] Checking validation split (last 1000 samples)...")
    try:
        # IMPORTANT: Always use the last 1000 samples for validation
        # This ensures consistent validation across all training sizes
        val_dataset = load_dataset(
            "latentcat/grayscale_image_aesthetic_3M",
            split="train[2999000:3000000]",
            cache_dir="~/.cache/huggingface/datasets"
        )
        print(f"✅ Validation split loaded: {len(val_dataset)} samples")
        print(f"   Validation uses: train[2999000:3000000] (last 1k)")
    except Exception as e:
        print(f"❌ Failed to load validation split: {e}")
        sys.exit(1)

    print("\n" + "=" * 60)
    print("✅ ALL CHECKS PASSED!")
    print("=" * 60)
    print(f"\nDataset cached at: ~/.cache/huggingface/datasets/")
    print(f"Training samples: {len(train_dataset)}")
    print(f"Validation samples: {len(val_dataset)}")
    print(f"\n⚠️  IMPORTANT: Validation always uses samples 2,999,000-2,999,999")
    print(f"   This ensures consistent validation across all training sizes")
    print(f"   (99k, 500k, 3M all use same validation set)")
    print(f"\nYou can now proceed with training!")
    print("The training script will automatically use this cached data.")

if __name__ == "__main__":
    verify_dataset()
EOF
```

Make executable and run:

```bash
chmod +x verify_dataset.py
python verify_dataset.py
```

Expected output: should confirm the dataset structure and cache the first 99k training samples plus the last 1k validation slice.

Manual Preparation Checklist (Do This First!)

Split into two phases to minimize GPU costs:


Part A: Local Preparation (BEFORE Launching GPU Instance)

Do these steps on your local machine or any CPU instance - no GPU needed, $0 cost:

Step 1: Get Your Authentication Tokens

Prepare these before launching GPU:

  • HuggingFace access token (needed for huggingface-cli login in Step 5 of Part B)
  • Weights & Biases API key (needed for wandb login in Step 5 of Part B)

Save these somewhere - you'll need them on the GPU instance.

Step 2: Prepare Dataset Verification Script Locally

The full verify_dataset.py script is provided in the "Dataset Verification Script" section above (under Technical Implementation Plan).

You can either:

  • Copy that script to a file on your local machine, OR
  • Recreate it directly on the GPU instance in Part B below

No need to prepare this locally if you prefer to create it on the GPU instance.


Part B: GPU Instance Setup (AFTER Launching GPU, BEFORE Training)

Complete these steps on your GPU instance to avoid wasting GPU credits on training failures:

Estimated time: 30-60 minutes (mostly dataset download). GPU credits used: ~$0.75-$1.50 (30-60 min @ ~$1.50/hr for an A100).

Step 1: System Dependencies

```bash
# Update system packages
sudo apt-get update && sudo apt-get install -y git git-lfs build-essential

# Initialize Git LFS
git lfs install
```

Step 2: Python Environment with CUDA

```bash
# Install PyTorch with CUDA 11.8 (requires GPU instance!)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install core ML libraries
pip install diffusers transformers accelerate datasets

# Install utilities
pip install huggingface_hub pillow wandb xformers bitsandbytes
```

Step 3: Verify CUDA (Critical!)

```bash
# Verify CUDA availability - MUST show "True"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name(0)}')"
```

Expected output:

```
CUDA available: True
CUDA version: 11.8
GPU: NVIDIA A100-SXM4-40GB
```

If CUDA shows False: Stop and troubleshoot before proceeding!

Step 4: Clone Training Repository

```bash
# Clone HuggingFace diffusers
git clone https://github.com/huggingface/diffusers.git
cd diffusers/examples/controlnet

# Verify training script exists
ls -la train_controlnet_sdxl.py  # Should show the file
```

Step 5: Authentication Setup

```bash
# Login to HuggingFace (use token from Part A)
huggingface-cli login
# Paste your token when prompted

# Login to Weights & Biases (use API key from Part A)
wandb login
# Paste your API key when prompted
```

Step 6: Dataset Verification (CRITICAL!)

```bash
# Create the verify_dataset.py script using the code from the
# "Dataset Verification Script" section at the top of this plan
# (see the lines after the "Technical Implementation Plan" heading)

# Once created, run it:
chmod +x verify_dataset.py
python verify_dataset.py
```

Expected output:

```
============================================================
SDXL ControlNet Brightness - Dataset Verification
============================================================

[1/4] Loading dataset subset (99k samples)...
This will download ~10-15GB to cache...
✅ Successfully loaded 99000 samples

[2/4] Verifying dataset structure...
✅ Columns correct: ['image', 'conditioning_image', 'text']

[3/4] Checking sample data...
✅ Image type: PIL.Image, size: (512, 512)
✅ Conditioning image type: PIL.Image, size: (512, 512)
✅ Caption type: str, length: 87 chars

[4/4] Checking validation split (last 1000 samples)...
✅ Validation split loaded: 1000 samples
   Validation uses: train[2999000:3000000] (last 1k)

============================================================
✅ ALL CHECKS PASSED!
============================================================

Dataset cached at: ~/.cache/huggingface/datasets/
Training samples: 99000
Validation samples: 1000

⚠️  IMPORTANT: Validation always uses samples 2,999,000-2,999,999
   This ensures consistent validation across all training sizes
   (99k, 500k, 3M all use same validation set)

You can now proceed with training!
```

Step 7: Pre-Flight Verification

```bash
# Check all packages are installed
pip list | grep -E "torch|diffusers|transformers|accelerate|datasets|xformers"

# Check disk space (need ~20GB free for checkpoints)
df -h ~

# Verify dataset cache exists
ls -lh ~/.cache/huggingface/datasets/
```

Step 8: Create Output Directory

```bash
# Create directory for training outputs
mkdir -p ~/controlnet-brightness-sdxl

# Return to training directory
cd ~/diffusers/examples/controlnet
```

✅ Preparation Complete!

Once all Part B steps pass, you're ready to start GPU training.

The training command (shown in Phase 3 below) will now:

  • ✅ Use pre-downloaded dataset from cache (no re-download)
  • ✅ Have all required libraries installed with CUDA support
  • ✅ Be authenticated to HuggingFace and W&B
  • ✅ Save checkpoints to the prepared directory

Total preparation cost: ~$0.75-$1.50 (vs $60 for a full 3M training run). Why it's worth it: catches setup issues early, without wasting up to 24 hours of GPU time.

Hardware Selection (Updated for Lightning.ai):

  • πŸ† RECOMMENDED FOR TESTING: Single H100 on Free Tier
    • 99k training in 45 min for $1.88
    • Perfect for validation and hyperparameter tuning
    • 80GB VRAM allows good batch sizes
    • No subscription required
  • πŸš€ RECOMMENDED FOR PRODUCTION: 6Γ— H100 on Pro Plan ($20/month annual)
    • 3M training in 4 hours for $80 total
    • Can test multiple configs in one day
    • Net cost: ~$7/month after included credits
    • Cancel subscription after training complete
  • Not Recommended: A100 - H100 is faster and more cost-efficient

Phase 2: Dataset Preparation

Dataset Split Strategy (for 99k quick training):

  • Training: 99,000 samples (split="train[:99000]")
  • Validation: 1,000 samples (split="train[2999000:3000000]") - ALWAYS last 1k
  • Total loaded: 100,000 samples (99k + last 1k of 3M dataset)

⚠️ CRITICAL: Validation Always Uses Last 1000 Samples

  • All training sizes (99k, 500k, 3M) use train[2999000:3000000] for validation
  • This ensures consistent validation set across all training runs
  • Allows fair comparison of model quality at different training stages
  • No overlap between training and validation for any training size

Why This Matters:

```
❌ WRONG: Using different validation sets for different training sizes
   - 99k training:  train[:99000] + validation train[99000:100000]
   - 500k training: train[:499000] + validation train[499000:500000]
   - 3M training:   train[:2999000] + validation train[2999000:3000000]
   Problem: Can't compare results! Each uses different validation data.

✅ CORRECT: Same validation set for all training sizes
   - 99k training:  train[:99000] + validation train[2999000:3000000]
   - 500k training: train[:499000] + validation train[2999000:3000000]
   - 3M training:   train[:2999000] + validation train[2999000:3000000]
   Benefit: Fair comparison across all training runs on same validation set.
```
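The correct scheme maps directly onto the `datasets` split-slicing syntax. A small sketch (the slice arithmetic assumes the 3M-sample layout described above; the helper name is illustrative):

```python
# Build split strings so every training size shares the same
# fixed validation slice (the dataset's last 1,000 samples).
TOTAL_SAMPLES = 3_000_000
VAL_SIZE = 1_000
VAL_SPLIT = f"train[{TOTAL_SAMPLES - VAL_SIZE}:{TOTAL_SAMPLES}]"

def splits_for(train_size: int) -> tuple[str, str]:
    """Return (train_split, val_split) strings for datasets.load_dataset."""
    if train_size > TOTAL_SAMPLES - VAL_SIZE:
        raise ValueError("training subset would overlap the validation slice")
    return f"train[:{train_size}]", VAL_SPLIT

for size in (99_000, 499_000, 2_999_000):
    print(splits_for(size))  # val slice is always train[2999000:3000000]
```

The resulting strings are passed unchanged as the `split=` argument of `load_dataset`.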

Understanding HuggingFace Dataset Caching

Important: The HuggingFace datasets library automatically caches all downloads to ~/.cache/huggingface/datasets/. This means:

  • ✅ Cache reuse is automatic: when the training script runs, it checks the cache first and reuses any previously downloaded data
  • ✅ No re-downloads: you won't download the full 3M dataset if you've already downloaded a subset
  • ✅ The pre-download step is OPTIONAL: the training command can handle downloading on its own

Pre-download Benefits:

  • Verify dataset structure before training starts
  • Separate download time from training time
  • Ensure dataset access works before committing GPU hours

Pre-download is NOT required: The training script's --max_train_samples=99000 parameter will work whether you pre-download or not.

Dataset Download Options

Option A: Pre-download for verification (RECOMMENDED)

```python
from datasets import load_dataset

# This downloads and caches ~100k samples for verification
train_dataset = load_dataset(
    "latentcat/grayscale_image_aesthetic_3M",
    split="train[:99000]",
    cache_dir="~/.cache/huggingface/datasets"  # Default cache location
)

# Verify the dataset structure
print(f"Dataset size: {len(train_dataset)}")
print(f"Columns: {train_dataset.column_names}")
print(f"First sample keys: {train_dataset[0].keys()}")

# Check a sample
sample = train_dataset[0]
print(f"Image size: {sample['image'].size}")
print(f"Conditioning image size: {sample['conditioning_image'].size}")
print(f"Caption: {sample['text']}")
```

Option B: Let training script handle download

  • Simply run the training command with --dataset_name and --max_train_samples
  • The script will download to cache automatically
  • Slightly riskier if there are dataset access issues

Recommended: Use the full verify_dataset.py script (see "Dataset Verification Script" section above) which implements Option A with comprehensive validation checks.

Data Format Validation:

  • Verify columns: image, conditioning_image, text
  • Check image resolution: 512×512 (matches the --resolution=512 training setting; no upscaling to 1024×1024 is needed)
  • Validate grayscale format

Steps Calculation (IMPORTANT):

  • Training samples: 99,000
  • Batch size: 16
  • Gradient accumulation: 4
  • Effective batch size: 16 × 4 = 64 samples/step
  • Steps per epoch: 99,000 ÷ 64 ≈ 1,547 steps
  • For 2 epochs: ~3,094 total steps
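The same arithmetic as a reusable helper (a sketch; the function name is illustrative and not part of the training script):

```python
import math

def training_steps(samples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Total optimizer steps: one step consumes batch_size * grad_accum samples."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(samples / effective_batch)
    return steps_per_epoch * epochs

# The plan's 99k configuration: batch 16, accumulation 4, 2 epochs.
print(training_steps(99_000, 16, 4, 2))  # 3094
```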

Phase 3: Training Configuration

Prerequisites: Complete the "Manual Preparation Checklist" above before running this command.

Training Command (Based on Latentcat Article):

```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl"

accelerate launch train_controlnet_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --dataset_name="latentcat/grayscale_image_aesthetic_3M" \
  --max_train_samples=99000 \
  --conditioning_image_column="conditioning_image" \
  --image_column="image" \
  --caption_column="text" \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --train_batch_size=16 \
  --gradient_accumulation_steps=4 \
  --num_train_epochs=2 \
  --checkpointing_steps=1500 \
  --validation_steps=1500 \
  --tracker_project_name="brightness-controlnet-sdxl" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam
```

Key Parameters Explained:

  • --max_train_samples=99000: Limit to 99k samples (reserves 1k for validation)
  • --resolution=512: Match dataset resolution (latentcat article used 512, not 1024)
  • --learning_rate=1e-5: From latentcat article
  • --train_batch_size=16: From latentcat article
  • --gradient_accumulation_steps=4: Effective batch = 16 × 4 = 64
  • --num_train_epochs=2: From latentcat article
  • --checkpointing_steps=1500: Save every 1500 STEPS (~once per epoch)
    • Total training: ~3,094 steps for 2 epochs
    • Checkpoints at: 1500, 3000 steps
  • --validation_steps=1500: Run validation every 1500 STEPS
  • --gradient_checkpointing: Reduces VRAM usage
  • --use_8bit_adam: Memory optimization
  • --enable_xformers_memory_efficient_attention: Memory-efficient attention

Critical Understanding - Steps vs Samples:

  • 1 STEP = processing 1 effective batch = 64 samples
  • Checkpoint every 1500 steps = every 1500 × 64 = 96,000 samples (~1 epoch)
  • NOT a checkpoint every 1500 samples!
  • Total steps for 2 epochs: 99,000 ÷ 64 × 2 ≈ 3,094 steps

VRAM Requirements with These Settings:

The settings above are optimized for memory efficiency:

  • --mixed_precision="fp16": Halves memory usage
  • --gradient_checkpointing: Trades compute for memory (~40% VRAM savings)
  • --use_8bit_adam: Reduces optimizer state memory
  • --enable_xformers_memory_efficient_attention: Memory-efficient attention

Estimated VRAM usage:

  • SDXL base model (FP16): ~6-7GB
  • ControlNet model: ~2.5GB
  • 8-bit Adam optimizer states: ~3-4GB
  • Gradients (with checkpointing): ~2-3GB
  • Activations (batch 16, 512×512, gradient checkpointing): ~8-12GB
  • Total: ~22-28GB peak
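As a sanity check, the component estimates above sum to roughly the quoted peak (ranges in GB, copied from the list; these are the plan's estimates, not profiled numbers):

```python
# Sum the low/high VRAM estimates from the list above (GB).
components = {
    "SDXL base model (fp16)":          (6.0, 7.0),
    "ControlNet model":                (2.5, 2.5),
    "8-bit Adam optimizer states":     (3.0, 4.0),
    "Gradients (with checkpointing)":  (2.0, 3.0),
    "Activations (batch 16, 512x512)": (8.0, 12.0),
}
low = sum(lo for lo, _ in components.values())
high = sum(hi for _, hi in components.values())
print(f"Estimated peak VRAM: {low:g}-{high:g} GB")  # ~21.5-28.5 GB
```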

GPU Compatibility:

| GPU | VRAM | Will It Fit? | Batch Size | Notes |
|-----|------|--------------|------------|-------|
| L4 | 24GB | ⚠️ Tight | 8-12 | Reduce --train_batch_size to 8 or 12 |
| A100 40GB | 40GB | ✅ Yes | 16 | Comfortable fit |
| A100 80GB | 80GB | ✅ Yes | 16-24 | Plenty of headroom, can increase batch |
| H100 80GB | 80GB | ✅ Yes | 16-24 | Fastest training, plenty of VRAM |

In short: the settings above fit comfortably on an A100 40GB with batch size 16, though the H100 remains the better cost-efficiency pick.

If using an L4 24GB, modify the command:

```bash
# Change this line:
  --train_batch_size=16 \
# To:
  --train_batch_size=8 \
```

This keeps the effective batch size at 8 × 4 = 32 (half of 64), which still works well.

Accelerate Configuration for Multi-GPU Training

Important: Multi-GPU training on Lightning.ai requires the Pro plan ($20/month annual).

Single GPU (Free Tier) - No Configuration Needed

For single GPU training on Free tier, accelerate launch works without any configuration:

```bash
# No accelerate config needed - auto-detects single GPU
accelerate launch train_controlnet_sdxl.py [args...]
```

Multi-GPU (Pro Plan) - Configure Before Training

For 6× H100 training on the Pro plan, configure accelerate once:

```bash
# Run configuration wizard
accelerate config
```

Configuration Options for 6× H100:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU  # Distributed data parallel (DDP) across GPUs
num_machines: 1  # Single machine with 6 GPUs
num_processes: 6  # One process per GPU
gpu_ids: all  # Use all available GPUs
mixed_precision: fp16  # Match training script
use_cpu: false
dynamo_backend: NO  # Disable torch.compile for compatibility
```

Quick Config (Non-Interactive):

```bash
# Create accelerate config file directly
cat > ~/.cache/huggingface/accelerate/default_config.yaml << 'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 6
gpu_ids: all
mixed_precision: fp16
use_cpu: false
dynamo_backend: NO
EOF
```

Verify Configuration:

```bash
# Check configuration
accelerate env

# Test multi-GPU setup
accelerate test
```

Launch Multi-GPU Training:

```bash
# With a configuration file, launch works the same as single GPU
accelerate launch train_controlnet_sdxl.py [args...]

# Or specify the config explicitly
accelerate launch --config_file ~/.cache/huggingface/accelerate/default_config.yaml \
  train_controlnet_sdxl.py [args...]
```

H100-Optimized Training Parameters

The H100 GPU has 80GB VRAM and 1979 TFLOPs, allowing for larger batch sizes and better optimization than A100.

Optimal Batch Size for H100

Default settings (designed for A100 40GB):

```
--train_batch_size=16
--gradient_accumulation_steps=4
# Effective batch size: 16 × 4 = 64 samples/step
# VRAM usage: ~22-28GB
```

H100-optimized settings (80GB VRAM):

```
--train_batch_size=32  # 2× larger than A100
--gradient_accumulation_steps=4
# Effective batch size: 32 × 4 = 128 samples/step
# VRAM usage: ~40-48GB (still plenty of headroom)
```

Aggressive H100 settings (maximum throughput):

```
--train_batch_size=48  # 3× larger than A100
--gradient_accumulation_steps=2  # Reduce accumulation since batch is larger
# Effective batch size: 48 × 2 = 96 samples/step
# VRAM usage: ~55-65GB
# Faster training due to fewer gradient accumulation steps
```

Single H100 Training Command (99k samples)

Optimized for H100 80GB:

```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-h100"

accelerate launch train_controlnet_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --dataset_name="latentcat/grayscale_image_aesthetic_3M" \
  --max_train_samples=99000 \
  --conditioning_image_column="conditioning_image" \
  --image_column="image" \
  --caption_column="text" \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --train_batch_size=32 \
  --gradient_accumulation_steps=4 \
  --num_train_epochs=2 \
  --checkpointing_steps=750 \
  --validation_steps=750 \
  --tracker_project_name="brightness-controlnet-sdxl-h100" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam \
  --dataloader_num_workers=8 \
  --set_grads_to_none
```

Key H100 Optimizations:

  • --train_batch_size=32 (vs 16 on A100) - 2× larger batches
  • --gradient_accumulation_steps=4 - Effective batch = 128
  • --checkpointing_steps=750 - More frequent (every ~96k samples)
  • --dataloader_num_workers=8 - Faster data loading
  • --set_grads_to_none - Faster than zero_grad() on modern GPUs

Expected Performance:

  • Steps per epoch: 99,000 ÷ 128 ≈ 773 steps
  • Total steps (2 epochs): ~1,546 steps
  • Training time: ~38-45 minutes on single H100
  • Checkpoints saved at: 750, 1500 steps

6× H100 Training Command (3M samples) - Pro Plan

For Pro plan multi-GPU training:

```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-multi-h100"

# Configure accelerate for 6 GPUs (if not done already)
accelerate config  # Select MULTI_GPU, 6 processes

# Launch training
accelerate launch train_controlnet_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --dataset_name="latentcat/grayscale_image_aesthetic_3M" \
  --max_train_samples=2999000 \
  --conditioning_image_column="conditioning_image" \
  --image_column="image" \
  --caption_column="text" \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --train_batch_size=24 \
  --gradient_accumulation_steps=2 \
  --num_train_epochs=1 \
  --checkpointing_steps=2500 \
  --validation_steps=2500 \
  --tracker_project_name="brightness-controlnet-sdxl-3M" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam \
  --dataloader_num_workers=8 \
  --set_grads_to_none \
  --resume_from_checkpoint="latest"
```

Multi-GPU Optimizations:

  • --train_batch_size=24 per GPU × 6 GPUs = 144 samples per step (before accumulation)
  • --gradient_accumulation_steps=2 - Effective batch = 144 × 2 = 288
  • --checkpointing_steps=2500 - Save every ~720k samples
  • --resume_from_checkpoint="latest" - Auto-resume if interrupted

Expected Performance:

  • Effective batch size: 288 samples/step
  • Steps per epoch: 2,999,000 ÷ 288 ≈ 10,413 steps
  • Training time: ~4 hours on 6× H100
  • Checkpoints: 2500, 5000, 7500, 10000 steps + final

Batch Size Selection Guide

| GPU Config | VRAM | Recommended batch_size | grad_accum_steps | Effective Batch | Training Speed |
|------------|------|------------------------|------------------|-----------------|----------------|
| Single L4 | 24GB | 8 | 4 | 32 | Slow (baseline) |
| Single A100 | 40GB | 16 | 4 | 64 | 2× faster than L4 |
| Single H100 | 80GB | 32 | 4 | 128 | 6× faster than L4 |
| 6× H100 (Pro) | 480GB | 24/GPU | 2 | 288 | 36× faster than L4 |

Rule of Thumb:

  • Larger train_batch_size = better GPU utilization, faster training
  • Larger effective_batch_size = more stable training, better convergence
  • H100 can handle 2-3× larger batch sizes than A100 with the same settings

Memory Optimization Tips

If you encounter OOM (Out of Memory) errors on H100:

  1. Reduce batch size incrementally:

    --train_batch_size=32  # Start here
    --train_batch_size=24  # If OOM
    --train_batch_size=16  # If still OOM
    
  2. Enable additional memory optimizations:

    --gradient_checkpointing \  # Already enabled
    --use_8bit_adam \           # Already enabled
    --enable_xformers_memory_efficient_attention \  # Already enabled
    --set_grads_to_none \       # Use this instead of zero_grad()
    
  3. Use gradient accumulation to maintain effective batch size:

    # If reducing from batch_size=32 to batch_size=16
    --train_batch_size=16
    --gradient_accumulation_steps=8  # Double accumulation to keep effective=128
    
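The rule in tip 3 can be captured in a small helper. This is an illustrative sketch (the function name and defaults are mine, not part of the diffusers script): given the effective batch size you want to preserve and the per-GPU batch that fits, it returns the accumulation steps to pass.

```python
def grad_accum_for(target_effective: int, batch_size: int, num_gpus: int = 1) -> int:
    """Accumulation steps that keep a target effective batch size."""
    per_step = batch_size * num_gpus
    if target_effective % per_step != 0:
        raise ValueError("target_effective must be divisible by batch_size * num_gpus")
    return target_effective // per_step

# Dropping from batch_size=32 to 16 on one GPU: accumulate 8 to keep effective=128
accum = grad_accum_for(128, 16)             # 8
# The 6x H100 config above: 24 per GPU x 6 GPUs, accumulate 2 for effective=288
accum_multi = grad_accum_for(288, 24, 6)    # 2
```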

Full 3M Dataset Training Options

For maximum quality training on the complete dataset:

Option A: Single H100 (Free Tier)

| Metric | Value |
|---|---|
| GPU | 1× H100 80GB (~$2.50/hr on Lightning.ai) |
| Dataset | 2,999,000 training + 1,000 validation |
| Estimated Duration | ~24 hours |
| Estimated Cost | $60 GPU credits |
| Subscription Cost | $0 (Free tier) |
| Total Cost | $60 |
| Checkpoints | Every 5000 steps (~every 480k samples) |

Pros:

  • ✅ Lowest total cost
  • ✅ No subscription required
  • ✅ Good for one-time training

Cons:

  • ❌ 24 hours training time (must monitor)
  • ❌ Can't quickly iterate if issues arise

Option B: 6Γ— H100 (Pro Plan - $20/month)

| Metric | Value |
|---|---|
| GPU | 6× H100 80GB (~$2.50/hr × 6 = $15/hr) |
| Dataset | 2,999,000 training + 1,000 validation |
| Estimated Duration | ~4 hours |
| Estimated Cost | $60 GPU credits |
| Subscription Cost | $20/month (annual billing) |
| Total Cost | $80 |
| Net Cost | $67 (after $13 annual credit value) |
| Checkpoints | Every 2500 steps (~every 720k samples) |

Pros:

  • ✅ Completes in 4 hours vs 24 hours
  • ✅ Can run same-day if needed
  • ✅ Can test multiple configs quickly
  • ✅ Net subscription cost only $7 after the $13 credit value
  • ✅ Can cancel after training

Cons:

  • ❌ $20 upfront subscription cost

Scaling Math:

  • Single H100: 99k in 45 min → 3M in 45 min × 30.3 = ~24 hours
  • 6× H100: 24 hours ÷ 6 = ~4 hours

Cost Comparison:

  • Free tier: $60, 24 hours wait
  • Pro plan: $80, 4 hours wait
  • Price difference: $20 to save 20 hours
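The scaling math above can be folded into a throwaway estimator. A rough sketch under this document's assumptions (linear extrapolation from the measured 99k / 45-minute single-H100 baseline at ~$2.50/GPU-hour; it ignores the epoch-count difference between the 99k and 3M runs, and the helper name is mine):

```python
def estimate_run(samples: int, num_gpus: int = 1,
                 baseline_samples: int = 99_000, baseline_minutes: float = 45.0,
                 usd_per_gpu_hour: float = 2.50):
    """Return (wall-clock hours, total GPU cost in USD) by linear extrapolation."""
    hours = (baseline_minutes / 60) * (samples / baseline_samples) / num_gpus
    cost = hours * num_gpus * usd_per_gpu_hour
    return hours, cost

# Single H100, full 3M dataset: roughly a day of wall-clock time, ~$60 of credits
hours_1x, cost_1x = estimate_run(2_999_000)
# Six H100s cut wall-clock time ~6x; total GPU cost stays the same
hours_6x, cost_6x = estimate_run(2_999_000, num_gpus=6)
```

Note that adding GPUs divides the wall-clock time but not the GPU-hour bill, which is why the Free-tier and Pro options above quote the same $60 in credits.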

Adjusted Training Command

export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-3M"

accelerate launch train_controlnet_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --dataset_name="latentcat/grayscale_image_aesthetic_3M" \
  --max_train_samples=2999000 \
  --conditioning_image_column="conditioning_image" \
  --image_column="image" \
  --caption_column="text" \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --train_batch_size=24 \
  --gradient_accumulation_steps=4 \
  --num_train_epochs=1 \
  --checkpointing_steps=5000 \
  --validation_steps=5000 \
  --validation_prompt="a beautiful garden scene" "modern city street" "abstract art pattern" \
  --tracker_project_name="brightness-controlnet-sdxl-3M" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam \
  --resume_from_checkpoint="latest"

Key Adjustments Explained

Batch Size Scaling:

  • --train_batch_size=24 (increased from 16)
    • H100 80GB has 2x VRAM of A100 40GB
    • Can safely increase batch size by 50%
    • Alternative: --train_batch_size=32 if you have headroom
  • --gradient_accumulation_steps=4 (kept same)
    • Effective batch size: 24 × 4 = 96 samples/step
    • If using batch_size=32: 32 × 4 = 128 samples/step

Dataset & Checkpointing:

  • --max_train_samples=2999000 (vs 99,000 for quick training)
    • Training split: train[:2999000] (first 2,999,000 samples)
    • Validation split: train[2999000:3000000] (last 1,000 samples, held out)
    • ✅ No overlap between training and validation data
    • ✅ For directly comparable validation metrics, point the 99k run at the same validation slice
  • --num_train_epochs=1 (vs 2)
    • For 3M samples, 1 epoch is usually sufficient
    • Can increase to 2 if quality needs improvement
  • --checkpointing_steps=5000 (vs 1,500)
    • More frequent checkpoints would create too many files
    • 5000 steps = every ~480k samples
    • Total checkpoints: ~6-7 for full run
  • --validation_steps=5000 (matches checkpointing)
    • Run validation at each checkpoint

Resumption:

  • --resume_from_checkpoint="latest"
    • CRITICAL for multi-day training
    • If training crashes, automatically resumes from last checkpoint
    • Saves days of retraining if interrupted

Training Math

Steps Calculation:

  • Training samples: 2,999,000 (validation: 1,000)
  • Effective batch size: 96 (or 128 with batch_size=32)
  • Steps per epoch: 2,999,000 ÷ 96 = 31,240 steps
    • With batch_size=32: 2,999,000 ÷ 128 = ~23,430 steps
  • For 1 epoch: 31,240 steps total
  • For 2 epochs: 62,480 steps total
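The arithmetic above, as a small helper (the function name is mine; it rounds the final partial batch up, matching the step counts quoted):

```python
import math

def steps_for(samples: int, effective_batch: int, epochs: int = 1) -> int:
    """Total optimizer steps for a run, rounding the last partial batch up."""
    return math.ceil(samples / effective_batch) * epochs

one_epoch = steps_for(2_999_000, 96)        # 31,240 steps, matching the math above
two_epochs = steps_for(2_999_000, 96, 2)    # 62,480 steps
```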

Checkpoints:

  • Saved every 5,000 steps
  • Checkpoint locations: steps 5000, 10000, 15000, 20000, 25000, 30000, 31240 (final)
  • Each checkpoint: ~2.5GB (ControlNet weights)
  • Total storage: ~20GB for all checkpoints + training state
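Since each checkpoint is ~2.5GB, pruning old ones keeps disk usage bounded. A minimal sketch (the helper name and `keep` default are mine; be aware it also deletes checkpoints you might want for the early-evaluation comparisons described later):

```python
import shutil
from pathlib import Path

def prune_checkpoints(output_dir: str, keep: int = 2) -> list:
    """Delete all but the `keep` newest checkpoint-<step> directories."""
    ckpts = sorted(
        (p for p in Path(output_dir).glob("checkpoint-*") if p.is_dir()),
        key=lambda p: int(p.name.split("-")[1]),  # sort numerically by step
    )
    for old in (ckpts[:-keep] if keep else ckpts):
        shutil.rmtree(old)
    return [p.name for p in ckpts[-keep:]] if keep else []
```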

VRAM Usage (H100 80GB)

With batch_size=24:

  • SDXL base model (FP16): ~6-7GB
  • ControlNet model: ~2.5GB
  • 8-bit Adam optimizer: ~3-4GB
  • Gradients (with checkpointing): ~3-4GB
  • Activations (batch 24): ~15-20GB
  • Total: ~35-40GB ✅ Fits comfortably in 80GB

With batch_size=32 (max):

  • Activations increase to ~20-25GB
  • Total: ~42-48GB ✅ Still fits with headroom

Recommended: Start with batch_size=24, monitor VRAM in W&B, can increase to 32 if using <60GB.

Risk Mitigation for Long Training

Strategy 1: Incremental Training

# Start with 500k samples to validate approach
--max_train_samples=500000
# Cost: ~$10, Duration: ~4 hours on a single H100
# If results good, continue to full 3M

Strategy 2: Early Checkpoint Evaluation

# Evaluate quality at checkpoints (single-H100 estimates):
# - checkpoint-5000  (~480k samples, ~4 hours, ~$10)
# - checkpoint-10000 (~960k samples, ~8 hours, ~$19)
# - checkpoint-15000 (~1.4M samples, ~12 hours, ~$29)
# Can stop early if quality plateaus

Strategy 3: Use Spot Instances

  • Many cloud providers offer H100 spot instances at 50-70% discount
  • Cost could drop to ~$0.75-$1.25/hr (~$18-$30 for the full 24-hour run)
  • Requires --resume_from_checkpoint="latest" (already included)
  • Risk: Training may be interrupted, but will resume automatically

When to Use Full 3M Training

Use 99k samples if:

  • ✅ First time training ControlNet
  • ✅ Testing hyperparameters
  • ✅ Budget constrained (<$50)
  • ✅ Need results quickly (1-2 days)

Use 3M samples if:

  • ✅ 99k results are good but want better quality
  • ✅ Commercial production use (worth the investment)
  • ✅ Training other ControlNet types (can reuse knowledge)
  • ✅ Contributing to research/community (publishable results)
  • ✅ Budget allows ($60-$82 on the H100 plans)

Phase 4: Training Monitoring

Setup Weights & Biases:

wandb login
# Use wandb to track:
# - Loss curves
# - Validation images every 1,500 steps
# - Learning rate schedule
# - GPU utilization

Checkpoints:

  • Saved every 1,500 steps to $OUTPUT_DIR/checkpoint-{step}
  • With ~3,094 total steps, will get checkpoints at:
    • checkpoint-1500 (~97% of epoch 1)
    • checkpoint-3000 (~94% of epoch 2)
    • Final model at end of training
  • Can resume training if interrupted: --resume_from_checkpoint="./controlnet-brightness-sdxl/checkpoint-1500"

Validation:

  • Uses 1,000 validation samples from train[99000:100000]
  • Runs every 1,500 steps (at checkpoints)
  • W&B logs validation images and metrics
  • No need for manual validation prompts/images

Validation Metrics (Automatic)

No configuration needed! The training script automatically computes validation metrics:

Loss Function (Automatic):

  • Default: MSE (Mean Squared Error) loss between predicted and target images
  • Optional: Huber loss - add --loss_type="huber" to training command
  • Formula: loss = F.mse_loss(model_pred.float(), target.float())
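For intuition, here is a pure-Python sketch of the two loss functions (the training script itself computes them with torch.nn.functional on tensors; this scalar version is only illustrative):

```python
def mse_loss(pred, target):
    """Mean squared error over paired values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def huber_loss(pred, target, delta=1.0):
    """Quadratic near zero, linear beyond delta -> less sensitive to outliers."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        total += 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)
    return total / len(pred)
```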

What Gets Logged to W&B:

  1. Training loss (every step)
  2. Validation loss (every --validation_steps=1500 steps)
  3. Validation images (generated samples at validation time)
  4. Learning rate (schedule tracking)
  5. GPU utilization (hardware monitoring)

Validation Process:

  1. Every 1500 steps, training pauses
  2. Model generates images from validation set
  3. Same MSE/Huber loss computed on validation samples
  4. Loss + images logged to W&B
  5. Training resumes

No manual metrics needed - everything is handled by the training script!

Phase 5: Model Evaluation & Publishing

Test Inference:

First, install QR code library if needed:

pip install qrcode[pil]

Then run inference:

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
import torch
import qrcode
from PIL import Image

# Generate QR code for testing
print("Generating QR code for https://google.com...")
qr = qrcode.QRCode(
    version=1,
    error_correction=qrcode.constants.ERROR_CORRECT_H,
    box_size=10,
    border=4,
)
qr.add_data("https://google.com")
qr.make(fit=True)

# Create QR code image and resize to 1024x1024
qr_image = qr.make_image(fill_color="black", back_color="white").convert("RGB")
qr_image = qr_image.resize((1024, 1024), Image.LANCZOS)
print(f"QR code generated: {qr_image.size}")

# Load trained ControlNet
print("Loading ControlNet model...")
controlnet = ControlNetModel.from_pretrained(
    "./controlnet-brightness-sdxl/checkpoint-3000",  # or checkpoint-1500
    torch_dtype=torch.float16
)

# Load SDXL pipeline with ControlNet
print("Loading SDXL pipeline...")
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16
)
pipe.enable_xformers_memory_efficient_attention()
pipe.to("cuda")

# Generate artistic QR code
print("Generating artistic QR code...")
image = pipe(
    prompt="a beautiful garden scene with flowers, highly detailed, professional photography",
    negative_prompt="blurry, low quality, distorted",
    image=qr_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.45,  # Adjust 0.3-0.6 for balance
    guidance_scale=7.5,
).images[0]

# Save results
qr_image.save("original_qr.png")
image.save("artistic_qr_result.png")
print("✅ Done! Check artistic_qr_result.png")
print("📱 Scan with phone to verify QR code still works!")

Testing Different Conditioning Scales:

# Test multiple conditioning scales to find best balance
for scale in [0.3, 0.4, 0.5, 0.6]:
    print(f"Testing conditioning_scale={scale}...")
    image = pipe(
        prompt="a beautiful garden scene with flowers",
        image=qr_image,
        num_inference_steps=30,
        controlnet_conditioning_scale=scale,
    ).images[0]
    image.save(f"result_scale_{scale}.png")

Publish to HuggingFace Hub:

# After validation
huggingface-cli login
python scripts/upload_to_hub.py \
  --model_path="./controlnet-brightness-sdxl-3M/checkpoint-30000" \
  --repo_name="Oysiyl/controlnet-brightness-sdxl"

Cost-Benefit Analysis

Investment Required (Updated for Single H100)

Strategy A: Free Tier (99k Quick Test)

| Component | Cost/Time |
|---|---|
| GPU Credits (99k samples, 2 epochs, single H100) | $1.88 |
| Setup Time | 1-2 hours |
| Training Duration | 45 minutes ⚡ |
| Testing & Validation | 2-3 hours |
| Total Time | ~4-6 hours (same day) |
| Total Cost | $1.88 |

Strategy B: Pro Plan (Full 3M Training)

| Component | Cost/Time |
|---|---|
| Pro Subscription (can cancel after) | $20/month |
| Included credits value | -$13 (240 credits/year) |
| GPU Credits (3M samples, 1 epoch, 6× H100) | $60 |
| Setup Time | 1-2 hours |
| Training Duration | 4 hours ⚡ |
| Testing & Validation | 2-3 hours |
| Total Time | ~8 hours (same day) |
| Total Cost | $80 ($20 sub + $60 GPU) |
| Net Cost | $67 (after annual credit value) |

Strategy C: All-in-One (Pro Plan, Test Everything)

| Component | Cost/Time |
|---|---|
| Pro Subscription | $20/month |
| 99k test (6× H100) | $1.88 (7.5 min) |
| 500k training (6× H100) | $10 (40 min) |
| 3M training (6× H100) | $60 (4 hours) |
| Total GPU Time | ~5 hours |
| Total GPU Cost | $71.88 |
| Total with Sub | $91.88 |
| Net after credits | $78.88 |

Recommendation: Start with Strategy A ($1.88), upgrade to Strategy B if promising

Value Delivered

  1. Unblocks SDXL Migration: Enables upgrade from SD 1.5 to higher quality SDXL
  2. Better Image Quality: SDXL produces superior 1024×1024 images vs SD 1.5's 512×512
  3. Community Value: First public SDXL brightness ControlNet (potential citations/recognition)
  4. No Alternatives: Cannot proceed with SDXL QR code generation without this model
  5. Reusable Asset: Once trained, can be used indefinitely

Risk Mitigation

  • Start Small: Train on the 99k subset first (~$2, ~45 minutes on a single H100)
  • Evaluate Early: Check quality at checkpoint-5000, checkpoint-10000
  • Iterative Approach: Extend training only if initial results are promising
  • Fallback: Can continue using SD 1.5 if SDXL training fails

Alternative Approaches Considered

Option 1: Train Brightness ControlNet for SDXL (RECOMMENDED)

  • Pros:
    • Proven training pipeline (diffusers script exists)
    • Same dataset as original SD 1.5 model
    • Good quality/cost balance
    • Community support and documentation
    • License-friendly (SDXL is permissive)
  • Cons:
    • Requires GPU time investment ($62-$82 on Lightning.ai H100s)
    • Up to 24 hours training on a single H100 (~4 hours on 6× H100)
    • Still requires 24GB+ VRAM for inference
  • Cost: ~$60 for the full 3M dataset on a single H100 (recommended)
  • Risk: Low - well-documented process
  • Verdict: ✅ Best choice for production use

Option 2: Train Brightness ControlNet for Flux Schnell

  • Pros:
    • Apache 2.0 license (fully commercial)
    • Faster inference than Flux Dev (3× speedup)
    • Same architecture as Dev (12B parameters)
    • Would be first-of-its-kind community contribution
  • Cons:
    • ⚠️ No existing training scripts for Schnell
    • Would need to adapt Flux Dev training code
    • Unknown if distillation affects ControlNet training
    • Still requires 32-40GB VRAM (heavier than SDXL)
    • Higher risk and uncertainty
    • Longer training time due to larger model
  • Cost: $200-$500 (estimated, higher due to larger model)
  • Risk: High - experimental, no precedent
  • Verdict: 🔬 Experimental - only if willing to pioneer new territory

Option 3: Use SDXL LoRA for Brightness Control

  • Pros: No training required, immediate availability
  • Cons: Less precise control than dedicated ControlNet, may not work well for QR codes
  • Verdict: Worth testing but likely insufficient for QR code use case

Option 4: Latent Initialization Approach

  • Pros: Architecture-agnostic, works with both SDXL and Flux
  • Cons: Less control over brightness distribution, requires experimentation
  • Verdict: Good fallback but not as reliable as ControlNet

Option 5: Wait for Community Release

  • Pros: Zero cost, zero effort
  • Cons: No timeline, may never happen, blocks project progress
  • Verdict: Not viable for active development

Option 6: Hybrid Tile ControlNet + Post-Processing

  • Pros: Tile ControlNet available for SDXL
  • Cons: Doesn't address brightness control directly
  • Verdict: Complementary but not a replacement

Conclusion: Training SDXL ControlNet is the most reliable solution. Flux Schnell is interesting for research but carries significant execution risk.

Recommended Action Plan

Immediate Setup (Day 1)

  1. Launch Lightning AI Instance: H100 80GB GPU
  2. Run Setup Commands: Install all dependencies (see Phase 3 above)
  3. Authenticate: HuggingFace and W&B login
  4. Clone Diffusers: Get training scripts

Training Phase (Day 1 - Morning) ⚑

  1. Start Training: Launch training with 99k samples (~45 minutes on a single H100)
  2. Monitor W&B: Track loss curves and validation images in real-time
  3. First Checkpoint: Review checkpoint-1500 (~25 minutes in)
  4. Training Complete: Total ~45 minutes for full 2-epoch run

Evaluation Phase (Day 1 - Afternoon)

  1. Post-Training Validation: Run inference on 1k validation set
  2. QR Code Testing: Test with actual QR codes, measure scannability
  3. Quality Assessment: Compare to SD 1.5 brightness ControlNet
  4. Decision Point:
    • If quality good: Publish and integrate (move to next phase)
    • If needs improvement: Launch 2nd training run with adjusted hyperparameters (~45 min)
    • Can try 3-4 different configurations in same day!

Optional: Full Dataset Training (Day 1 - Evening)

  • 12a. If 99k results promising: Launch full 3M training (~4 hours on 6× H100, or ~24 hours on a single H100)
  • 12b. Monitor overnight: W&B tracks progress automatically
  • 12c. Next morning: Evaluate final model quality

Integration Phase (Day 2)

  1. Publish to HuggingFace: Upload best checkpoint
  2. Update app_sdxl.py: Integrate new ControlNet model
  3. Production Testing: End-to-end QR code generation tests
  4. Documentation: Update README with SDXL support

Total Timeline: 1-2 days (vs previous estimate of 5 days)

Success Metrics

  1. QR Code Scannability: 95%+ scan rate on generated images
  2. Visual Quality: Subjective improvement over SD 1.5 outputs
  3. Control Precision: Ability to adjust brightness strength (0.0-1.0 range)
  4. Training Loss: Convergence to < 0.1 validation loss
  5. Community Adoption: Positive feedback if published publicly

Critical Files to Modify

Once model is trained:

  • app.py:48-56 - Add SDXL ControlNet loading
  • app.py:1880-1886 - Update standard pipeline with SDXL support
  • app.py:2343-2349 - Update artistic pipeline with SDXL support
  • app_sdxl.py - Complete SDXL-specific implementation
  • comfy/sd_configs/ - Add SDXL configuration if needed

Flux Schnell Training Considerations (If Pursuing)

If you decide to pursue Flux Schnell ControlNet training despite the risks:

Required Adaptations:

  1. Training Script Modification: Adapt train_controlnet_flux.py to work with Schnell

    • Model path: black-forest-labs/FLUX.1-schnell instead of FLUX.1-dev
    • Verify architecture compatibility (distillation may affect ControlNet layers)
    • Test with small pilot run (1000 steps) before full training
  2. Hardware Requirements:

    • Minimum: H100 (80GB VRAM) - $1.99/hr
    • A100 40GB likely insufficient for Flux training
    • Estimated training: 150-250 hours on H100 (~$300-$500)
  3. Dataset Considerations:

    • Flux uses 1024×1024 resolution (same as SDXL)
    • Dataset would need upscaling from 512×512 or re-preprocessing
    • Consider starting with 100k subset for validation
  4. Verification Steps:

    • Test if Schnell's distillation preserves ControlNet training capability
    • Compare with Flux Dev training (if available for testing)
    • Validate brightness control precision matches SD 1.5 quality

Risk Assessment:

  • Technical Risk: High - no proven training path
  • Time Risk: Medium-High - debugging could extend timeline significantly
  • Cost Risk: High - may require multiple training attempts ($500+)
  • Success Probability: 50-70% (educated guess based on architecture similarity)

Recommendation: Only pursue if:

  1. SDXL training completes successfully first (de-risk approach)
  2. You're willing to contribute pioneering work to the community
  3. Budget allows for experimental work ($500-1000 total including failed attempts)

References

SDXL Training

Flux Information

Final Recommendation (Updated December 2024 - Lightning.ai)

Proceed with SDXL Brightness ControlNet Training on Single H100 (Free Tier)

Based on Lightning.ai pricing and multi-GPU requirements, the recommended path is:

Phase 1: Quick Validation (Free Tier)

  1. Start with 99k samples on single H100
    • Cost: $1.88 in GPU credits
    • Duration: 45 minutes
    • Platform: Lightning.ai Free tier
    • Purpose: Validate training pipeline and quality

Phase 2: Production Training (Choose Based on Phase 1)

Option A: Budget Approach (Free Tier)

  • Run full 3M dataset on single H100
  • Cost: $60 GPU credits, $0 subscription
  • Duration: 24 hours
  • Total: $60
  • Best for: One-time training, have patience

Option B: Speed Approach (Pro Plan)

  • Upgrade to Pro plan ($20/month annual)
  • Run full 3M dataset on 6× H100
  • Cost: $60 GPU + $20 subscription = $80
  • Net cost: $67 (after $13 annual credit value)
  • Duration: 4 hours
  • Best for: Need results same day, may iterate

Recommended Strategy

Most Cost-Effective Path:

  1. Day 1 Morning: Run 99k test on Free tier ($1.88, 45 min)
  2. Day 1 Afternoon: Evaluate results
  3. If promising:
    • Budget route: Start 3M on Free tier ($60, 24 hrs) → Total: $61.88
    • Speed route: Upgrade to Pro, run 3M ($80, 4 hrs) → Total: $81.88
  4. Cancel Pro after training if using speed route

Why This Path

  • Low Risk Entry: Only $1.88 to validate entire pipeline
  • Flexible Scaling: Choose speed vs cost based on results
  • Proven Pipeline: HuggingFace Diffusers battle-tested script
  • Reference Success: Original SD 1.5 model trained on same dataset
  • H100 Advantage: 6.3× faster than A100 even on single GPU
  • Cost-Effective: $62-$82 total (vs $900+ on older plans)
  • Unblocks Migration: Enables full SDXL upgrade from SD 1.5

Cost Breakdown Comparison

| Approach | Hardware | Duration | GPU Cost | Sub Cost | Total | Timeline |
|---|---|---|---|---|---|---|
| Old Plan (A100) | Single A100 | 180 hours | $900-1,200 | $0 | $900-1,200 | 1 week |
| NEW: Free Tier | Single H100 | 24.75 hours | $61.88 | $0 | $61.88 | 2 days |
| NEW: Pro Plan | 6× H100 | 4.75 hours | $61.88 | $20 | $81.88 | 1 day |

Savings vs Old Plan:

  • Free tier: Save $838-$1,138 and 6 days
  • Pro plan: Save $818-$1,118 and 6 days

Pro Plan ROI Analysis

When is Pro worth it?

  • $20 extra to save 20 hours (24h → 4h)
  • = $1/hour saved
  • Plus: Can test multiple hyperparameters same day
  • Plus: Includes $13/year in credits

Get Pro if:

  • ✅ You value time over $1/hour
  • ✅ Planning to iterate on hyperparameters
  • ✅ Need results urgently
  • ✅ Want to test 99k + 500k + 3M in one session

Skip Pro if:

  • ✅ Doing one-time training only
  • ✅ Can wait 24 hours
  • ✅ Budget constrained
  • ✅ 99k test was sufficient

Next Steps

Once plan is approved:

  1. Set up Lightning AI account with H100 GPU access
  2. Clone diffusers repository and install requirements
  3. Verify dataset access and download capabilities
  4. Prepare validation QR codes for quality testing
  5. Launch training with recommended hyperparameters
  6. Monitor via Weights & Biases for loss curves and validation images
  7. Evaluate checkpoints every 5k steps (e.g., 5k, 15k, 30k)
  8. Complete training and publish to HuggingFace Hub
  9. Integrate into app_sdxl.py for production use

Flux Schnell remains an option for future exploration once SDXL is production-ready, but is deprioritized due to experimental nature and higher resource requirements.