# Training ControlNet Brightness for SDXL - Feasibility Analysis
## Executive Summary
Training a brightness ControlNet for SDXL is **technically feasible and recommended** as the critical upgrade path from SD 1.5 to SDXL for QR code generation. This model is essential because no public SDXL brightness ControlNet exists.
**Key Estimates (Updated December 2024 - Single H100 GPU):**
- **Time**: 45 minutes (99k samples) to 24 hours (3M samples) on single H100
- **Cost**: ~$2 (99k) to $60-$80 (3M) in GPU credits
- **Platform**: Lightning.ai with optional Pro plan ($20/month for multi-GPU)
- **Priority**: High - enables SDXL migration for QR code generation
- **Complexity**: Medium - well-documented training pipeline with reference implementation
**Recommended Path:**
- Start with a single H100 for 99k samples (~45 min, ~$2)
- If successful, optionally upgrade to Pro plan for faster 3M training
- Total investment: ~$2-$82 depending on training size and plan choice
## Background Context
### Current Implementation (SD 1.5)
- **Location**: `app.py:1880-1886, 2343-2349`
- **Model**: `control_v1p_sd15_brightness.safetensors` from latentcat/latentcat-controlnet
- **Purpose**: Controls QR code pattern visibility via brightness conditioning
- **Critical**: Essential for QR code readability - cannot be removed
### Why SDXL Brightness ControlNet is Needed
1. **No Public Alternative**: No SDXL-equivalent brightness ControlNet exists on HuggingFace
2. **Migration Blocker**: Current SD 1.5 brightness ControlNet incompatible with SDXL architecture
3. **QR Readability**: Brightness control is core to balancing aesthetic quality with QR scannability
4. **Flux is Too Heavy**: SDXL is the practical upgrade path (Flux requires 32-40GB VRAM)
### Flux Model Landscape (Updated Analysis)
**Flux Schnell (Apache 2.0 License)**
- **License**: Fully open for commercial use - no restrictions
- **Architecture**: Same 12B parameters as Flux Dev, but distilled for speed (3× faster)
- **Quality**: Lower than Dev due to aggressive distillation trading detail for speed
- **VRAM**: Still requires 32-40GB (same as Dev)
- **ControlNet Status**: ⚠️ **No existing ControlNet models or training scripts**
- **Training Risk**: Would require adapting Flux Dev training script - pioneering work
- **Community**: Active requests for Schnell ControlNets but no official releases
**Flux Dev (Non-Commercial License)**
- **License**: Non-commercial only - cannot be used for commercial QR code generation
- **ControlNet Status**: ✅ Extensive support (XLabs-AI, InstantX collections)
- **Training Scripts**: Available from XLabs-AI and HuggingFace Diffusers
- **Quality**: Superior to Schnell, but license restrictions make it unsuitable
**Flux Pro (Commercial API)**
- **License**: API-only, commercial pricing
- **Status**: Not suitable for self-hosted training
**Assessment**: While Flux Schnell has an attractive license, the lack of proven ControlNet training pipeline makes it **high-risk**. SDXL remains the **proven, practical choice**.
## Hardware Selection & Platform Strategy
### Lightning.ai Pricing Tiers (December 2024)
Lightning.ai offers different tiers with varying multi-GPU capabilities:
| Plan | Cost | Multi-GPU | Max GPUs | Credits Included | Best For |
|------|------|-----------|----------|------------------|----------|
| **Free** | $0 | ❌ No | 1 | 15/month | Quick 99k test |
| **Pro** | **$20/month** (annual) | ✅ Yes | 6 | 240/year (~$13/mo) | **Recommended** |
| Teams | $119/month (annual) | ✅ Yes | 12 | 600/year | Large teams |
**Pro Plan Benefits:**
- Only **$20/month** if paid annually ($240/year vs $600 monthly)
- Includes **240 credits/year** = ~$13 of free GPU time
- **Net cost: ~$7/month** after credits
- Multi-GPU training up to 6 GPUs
- Can cancel after training completes
### GPU Comparison Analysis (Lightning.ai)
**Single GPU Performance:**
| GPU | TFLOPs | Memory | Cost/hr | 99k Time | 99k Cost | 3M Time | 3M Cost |
|-----|--------|--------|---------|----------|----------|---------|---------|
| A100 | 312 | 40GB | ~$1.50 | 4-6 hours | $6-9 | 120-180 hours | $180-270 |
| **H100** | **1979** | **80GB** | **~$2.50** | **45 min** | **$1.88** | **24 hours** | **$60** |
**Cost Efficiency:**
- H100 is **6.3× faster** than A100 (1979 vs 312 TFLOPs)
- H100 costs **1.67× more** per hour on Lightning.ai
- **Net result: ~3.8× better cost efficiency**
### Single vs Multi-GPU: Should You Get Pro Plan?
#### Option A: Free Plan (Single H100)
| Training Size | Duration | GPU Cost | Total Cost | Timeline |
|---------------|----------|----------|------------|----------|
| 99k samples | 45 min | $1.88 | **$1.88** | Same day |
| 500k samples | 4 hours | $10 | **$10** | Same day |
| 3M samples | 24 hours | $60 | **$60** | 1-2 days |
**Pros:**
- ✅ $0 subscription cost
- ✅ Very cheap for 99k testing
- ✅ Good for one-off training
**Cons:**
- ❌ 24 hours for 3M training (must babysit)
- ❌ Can't test multiple hyperparameters quickly
- ❌ Limited to 15 free credits/month
#### Option B: Pro Plan (6× H100)
| Training Size | Duration | GPU Cost | Subscription | Total Cost | Timeline |
|---------------|----------|----------|--------------|------------|----------|
| 99k samples | **7.5 min** | $1.88 | $20 | **$21.88** | Minutes |
| 500k samples | **40 min** | $10 | $20 | **$30** | Same hour |
| 3M samples | **4 hours** | $60 | $20 | **$80** | Same day |
**Multi-GPU costs the same because:**
- 6× GPUs = 6× faster
- 6× GPUs = 6× more expensive per hour
- Net: same total GPU cost, much faster completion
**Pros:**
- ✅ 3M training finishes in 4 hours (vs 24)
- ✅ Can test 3-4 hyperparameter configs in one day
- ✅ Includes 240 credits/year (~$13 value)
- ✅ Real net cost: $7/month after credits
- ✅ Can cancel after training is done
**Cons:**
- ❌ $20 upfront cost (annual commitment)
### Recommendation Matrix
**If you're doing ONE 99k training run:**
- ✅ **Use Free tier** ($1.88 total, 45 min)
- Skip Pro plan - not worth $20 to cut 45 min down to 7.5 min
**If you're doing 500k OR 3M training:**
- ✅ **Get Pro plan** ($20/month)
- 3M: 4 hours vs 24 hours = worth it
- Can test multiple configs the same day
- Net cost after credits: ~$7/month
**If you're doing multiple experiments:**
- ✅ **Definitely get Pro plan**
- Test 99k + 500k + 3M all in one day
- Total time: ~5 hours vs 30+ hours
- Total cost: $20 + ~$72 GPU = $92
- Cancel Pro after training is complete
**Most Cost-Effective Strategy:**
1. Start with **Free tier** for the 99k test ($1.88, 45 min)
2. If results are promising, upgrade to **Pro** for 3M training
3. Run full training in 4 hours
4. Cancel Pro when done
5. Total: $20 Pro + $60 GPU + $1.88 test = **$81.88**
### Updated Training Timeline Estimates
**Single H100 (Free Tier):**
| Training Size | Duration | Total Cost | When to Use |
|---------------|----------|------------|-------------|
| **99k samples** | 45 min | $1.88 | Quick validation, hyperparameter testing |
| **500k samples** | 4 hours | $10 | Medium quality, budget option |
| **3M samples** | 24 hours | $60 | Max quality, have patience |
**6× H100 (Pro Plan at $20/month):**
| Training Size | Duration | Total Cost | When to Use |
|---------------|----------|------------|-------------|
| **99k samples** | 7.5 min | $21.88 | Ultra-fast iteration |
| **500k samples** | 40 min | $30 | Production ready, same day |
| **3M samples** | 4 hours | $80 | Best quality, same day results |
## Training Strategy
### Dataset: latentcat/grayscale_image_aesthetic_3M
- **Size**: 3 million images at 512×512 resolution
- **Format**: Parquet files with image/conditioning_image/text columns
- **Same Dataset**: Used for original SD 1.5 brightness ControlNet training
- **License**: Latent Cat (check license before commercial use)
- **Quality**: Pre-processed grayscale images with aesthetic filtering
### Reference Training Results (from latentcat article)
| Configuration | Samples | Hardware | Duration | Cost Estimate |
|--------------|---------|----------|----------|---------------|
| Original SD 1.5 | 100k | A6000 | 13 hours | ~$20 (est.) |
| Original SD 1.5 | 3M | TPU v4-8 | 25 hours | N/A (TPU) |
### SDXL Training Scaling Estimates
**Updated Based on Latentcat Article:**
- Training at 512×512 resolution (NOT 1024×1024) - matches dataset and original training
- SDXL has a larger UNet architecture (~2.5GB vs 1.7GB for SD 1.5)
- Expected slowdown: 2-3× compared to SD 1.5 training
**Time Estimates for 99k Training Samples (Lightning.ai Single H100):**
## Calculation Methodology
**Baseline Reference:**
- Latentcat article: 100k samples on A6000 = 13 hours (SD 1.5)
- SDXL overhead: 13h × 2.5 (larger architecture) = ~32.5 hours for 100k
- A6000 ≈ A100 in tensor performance (~300-312 TFLOPs)
**Scaling to H100:**
- A100 (assuming the fp16/xformers/8-bit-Adam optimizations used here, vs the original unoptimized run): ~4-6 hours for 99k samples
- H100: 1979 TFLOPs → ~6.3× faster
- **H100 single GPU: ~38-57 minutes for 99k samples**
**Multi-GPU Scaling (Pro Plan):**
- 6× H100 GPUs = 6× faster = ~7.5 minutes for 99k
- Total cost stays the same (6× faster but 6× more expensive per hour)
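This scaling arithmetic is easy to sanity-check in a few lines; a minimal sketch (the baseline hours, TFLOPs figures, and the ~$2.50/hr rate are this document's estimates, not measured benchmarks):
```python
# Sanity-check of the scaling math above. All constants are estimates
# from this document, not benchmarks.
A100_TFLOPS, H100_TFLOPS = 312, 1979
H100_RATE_USD_PER_HR = 2.50  # approximate Lightning.ai rate

def h100_estimate(samples: int, a100_hours_per_99k: float = 5.0):
    """Scale an A100-class time estimate to a single H100 and price it."""
    a100_hours = a100_hours_per_99k * samples / 99_000
    h100_hours = a100_hours / (H100_TFLOPS / A100_TFLOPS)  # ~6.3x speedup
    return h100_hours, h100_hours * H100_RATE_USD_PER_HR

for n in (99_000, 500_000, 3_000_000):
    hours, cost = h100_estimate(n)
    print(f"{n:>9,} samples: ~{hours:.1f} h on 1x H100, ~${cost:.0f}")
# -> ~0.8 h / $2, ~4.0 h / $10, ~23.9 h / $60
```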
## Recommended Configurations
**OPTION 1: Free Tier (Single H100) - Best for Testing**
- **99k samples**: 45 min, $1.88
- **500k samples**: 4 hours, $10
- **3M samples**: 24 hours, $60
- **Best for:** One-off training, budget-conscious, have patience
**OPTION 2: Pro Plan (6× H100) - Best for Production**
- **Subscription**: $20/month (annual), includes ~$13 in credits = **$7 net cost**
- **99k samples**: 7.5 min, $21.88 total ($1.88 GPU + $20 sub)
- **500k samples**: 40 min, $30 total ($10 GPU + $20 sub)
- **3M samples**: 4 hours, $80 total ($60 GPU + $20 sub)
- **Best for:** Multiple experiments, 3M training, need results same day
**Cost Comparison Summary:**
| Scenario | Free Tier | Pro Plan | Savings (Pro) |
|----------|-----------|----------|---------------|
| Single 99k test | $1.88 | $21.88 | ❌ $20 more |
| Single 3M training | $60 | $80 | ❌ $20 more |
| 99k + 500k + 3M | $71.88 (30 hours) | $92 (5 hours) | ✅ Save 25 hours |
| 3+ experiments | $71.88+ (30+ hours) | $92 (5-6 hours) | ✅ Save 24+ hours |
**Recommendation:**
- For single 99k test: **Use Free Tier** (not worth $20 for speed)
- For 3M training: **Consider Pro** (4 hrs vs 24 hrs = big difference)
- For multiple runs: **Definitely Pro** (can test everything in one day)
## Technical Implementation Plan
### Dataset Verification Script
**Create this script to verify dataset before training:**
```bash
cat > verify_dataset.py << 'EOF'
#!/usr/bin/env python3
"""
Dataset verification script for SDXL ControlNet Brightness training.
Downloads a subset of the dataset and verifies structure.

Usage: python verify_dataset.py
"""
from datasets import load_dataset
from PIL import Image
import sys


def verify_dataset():
    print("=" * 60)
    print("SDXL ControlNet Brightness - Dataset Verification")
    print("=" * 60)

    print("\n[1/4] Loading dataset subset (99k samples)...")
    print("This will download ~10-15GB to cache...")
    try:
        train_dataset = load_dataset(
            "latentcat/grayscale_image_aesthetic_3M",
            split="train[:99000]",
            cache_dir="~/.cache/huggingface/datasets",
        )
        print(f"✅ Successfully loaded {len(train_dataset)} samples")
    except Exception as e:
        print(f"❌ Failed to load dataset: {e}")
        sys.exit(1)

    print("\n[2/4] Verifying dataset structure...")
    expected_columns = {"image", "conditioning_image", "text"}
    actual_columns = set(train_dataset.column_names)
    if actual_columns == expected_columns:
        print(f"✅ Columns correct: {train_dataset.column_names}")
    else:
        print("❌ Column mismatch!")
        print(f"   Expected: {expected_columns}")
        print(f"   Got: {actual_columns}")
        sys.exit(1)

    print("\n[3/4] Checking sample data...")
    sample = train_dataset[0]

    # Check images
    if isinstance(sample["image"], Image.Image):
        img_size = sample["image"].size
        print(f"✅ Image type: PIL.Image, size: {img_size}")
    else:
        print(f"❌ Unexpected image type: {type(sample['image'])}")

    if isinstance(sample["conditioning_image"], Image.Image):
        cond_size = sample["conditioning_image"].size
        print(f"✅ Conditioning image type: PIL.Image, size: {cond_size}")
    else:
        print(f"❌ Unexpected conditioning image type: {type(sample['conditioning_image'])}")

    if isinstance(sample["text"], str):
        caption_len = len(sample["text"])
        print(f"✅ Caption type: str, length: {caption_len} chars")
        print(f"   Sample caption: '{sample['text'][:100]}...'")
    else:
        print(f"❌ Unexpected caption type: {type(sample['text'])}")

    print("\n[4/4] Checking validation split (last 1000 samples)...")
    try:
        # IMPORTANT: Always use the last 1000 samples for validation.
        # This ensures consistent validation across all training sizes.
        val_dataset = load_dataset(
            "latentcat/grayscale_image_aesthetic_3M",
            split="train[2999000:3000000]",
            cache_dir="~/.cache/huggingface/datasets",
        )
        print(f"✅ Validation split loaded: {len(val_dataset)} samples")
        print("   Validation uses: train[2999000:3000000] (last 1k)")
    except Exception as e:
        print(f"❌ Failed to load validation split: {e}")
        sys.exit(1)

    print("\n" + "=" * 60)
    print("✅ ALL CHECKS PASSED!")
    print("=" * 60)
    print("\nDataset cached at: ~/.cache/huggingface/datasets/")
    print(f"Training samples: {len(train_dataset)}")
    print(f"Validation samples: {len(val_dataset)}")
    print("\n⚠️ IMPORTANT: Validation always uses samples 2,999,000-2,999,999")
    print("   This ensures consistent validation across all training sizes")
    print("   (99k, 500k, 3M all use same validation set)")
    print("\nYou can now proceed with training!")
    print("The training script will automatically use this cached data.")


if __name__ == "__main__":
    verify_dataset()
EOF
```
**Make executable and run**:
```bash
chmod +x verify_dataset.py
python verify_dataset.py
```
**Expected output**: Should confirm the dataset structure and cache the first 99k training samples plus the fixed last-1k validation slice.
### Manual Preparation Checklist (Do This First!)
**Split into two phases to minimize GPU costs:**
---
## Part A: Local Preparation (BEFORE Launching GPU Instance)
**Do these steps on your local machine or any CPU instance - no GPU needed, $0 cost:**
#### Step 1: Get Your Authentication Tokens
**Prepare these before launching GPU:**
- **HuggingFace token**: https://huggingface.co/settings/tokens (create "Read" access token)
- **W&B API key**: https://wandb.ai/authorize
Save these somewhere - you'll need them on the GPU instance.
#### Step 2: Prepare Dataset Verification Script Locally
The full `verify_dataset.py` script is provided in the "Dataset Verification Script" section above (under Technical Implementation Plan).
You can either:
- Copy that script to a file on your local machine, OR
- Recreate it directly on the GPU instance in Part B below
No need to prepare this locally if you prefer to create it on the GPU instance.
---
## Part B: GPU Instance Setup (AFTER Launching GPU, BEFORE Training)
**Complete these steps on your GPU instance to avoid wasting GPU credits on training failures:**
**Estimated time: 30-60 minutes (mostly dataset download)**
**GPU credits used: ~$0.75-$1.50** (30-60 min @ ~$1.50/hr for A100)
#### Step 1: System Dependencies
```bash
# Update system packages
sudo apt-get update && sudo apt-get install -y git git-lfs build-essential
# Initialize Git LFS
git lfs install
```
#### Step 2: Python Environment with CUDA
```bash
# Install PyTorch with CUDA 11.8 (requires GPU instance!)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install core ML libraries
pip install diffusers transformers accelerate datasets
# Install utilities
pip install huggingface_hub pillow wandb xformers bitsandbytes
```
#### Step 3: Verify CUDA (Critical!)
```bash
# Verify CUDA availability - MUST show "True"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name(0)}')"
```
**Expected output:**
```
CUDA available: True
CUDA version: 11.8
GPU: NVIDIA A100-SXM4-40GB
```
**If CUDA shows False:** Stop and troubleshoot before proceeding!
#### Step 4: Clone Training Repository
```bash
# Clone HuggingFace diffusers
git clone https://github.com/huggingface/diffusers.git
cd diffusers/examples/controlnet
# Verify training script exists
ls -la train_controlnet_sdxl.py # Should show the file
```
#### Step 5: Authentication Setup
```bash
# Login to HuggingFace (use token from Part A)
huggingface-cli login
# Paste your token when prompted
# Login to Weights & Biases (use API key from Part A)
wandb login
# Paste your API key when prompted
```
#### Step 6: Dataset Verification (CRITICAL!)
```bash
# Create the verify_dataset.py script using the code from
# "Dataset Verification Script" section at the top of this plan
# (See lines after "Technical Implementation Plan" heading)
# Once created, run it:
chmod +x verify_dataset.py
python verify_dataset.py
```
**Expected output:**
```
============================================================
SDXL ControlNet Brightness - Dataset Verification
============================================================

[1/4] Loading dataset subset (99k samples)...
This will download ~10-15GB to cache...
✅ Successfully loaded 99000 samples

[2/4] Verifying dataset structure...
✅ Columns correct: ['image', 'conditioning_image', 'text']

[3/4] Checking sample data...
✅ Image type: PIL.Image, size: (512, 512)
✅ Conditioning image type: PIL.Image, size: (512, 512)
✅ Caption type: str, length: 87 chars

[4/4] Checking validation split (last 1000 samples)...
✅ Validation split loaded: 1000 samples
   Validation uses: train[2999000:3000000] (last 1k)

============================================================
✅ ALL CHECKS PASSED!
============================================================

Dataset cached at: ~/.cache/huggingface/datasets/
Training samples: 99000
Validation samples: 1000

⚠️ IMPORTANT: Validation always uses samples 2,999,000-2,999,999
   This ensures consistent validation across all training sizes
   (99k, 500k, 3M all use same validation set)

You can now proceed with training!
```
#### Step 7: Pre-Flight Verification
```bash
# Check all packages are installed
pip list | grep -E "torch|diffusers|transformers|accelerate|datasets|xformers"
# Check disk space (need ~20GB free for checkpoints)
df -h ~
# Verify dataset cache exists
ls -lh ~/.cache/huggingface/datasets/
```
#### Step 8: Create Output Directory
```bash
# Create directory for training outputs
mkdir -p ~/controlnet-brightness-sdxl
# Return to training directory
cd ~/diffusers/examples/controlnet
```
---
## ✅ Preparation Complete!
**Once all Part B steps pass, you're ready to start GPU training.**
The training command (shown in Phase 3 below) will now:
- ✅ Use the pre-downloaded dataset from cache (no re-download)
- ✅ Have all required libraries installed with CUDA support
- ✅ Be authenticated to HuggingFace and W&B
- ✅ Save checkpoints to the prepared directory
**Total preparation cost:** ~$0.75-$1.50 (vs $60 for a full 3M run)
**Why worth it:** Catches setup issues early without wasting up to 24 hours of GPU time
**Hardware Selection (Updated for Lightning.ai):**
- **RECOMMENDED FOR TESTING**: Single H100 on Free Tier
  - 99k training in 45 min for $1.88
  - Perfect for validation and hyperparameter tuning
  - 80GB VRAM allows good batch sizes
  - No subscription required
- **RECOMMENDED FOR PRODUCTION**: 6× H100 on Pro Plan ($20/month annual)
  - 3M training in 4 hours for $80 total
  - Can test multiple configs in one day
  - Net cost: ~$7/month after included credits
  - Cancel subscription after training completes
- **Not Recommended**: A100 - H100 is faster and more cost-efficient
### Phase 2: Dataset Preparation
**Dataset Split Strategy (for 99k quick training):**
- **Training**: 99,000 samples (`split="train[:99000]"`)
- **Validation**: 1,000 samples (`split="train[2999000:3000000]"`) - **ALWAYS last 1k**
- **Total loaded**: 100,000 samples (99k + last 1k of 3M dataset)
**⚠️ CRITICAL: Validation Always Uses Last 1000 Samples**
- All training sizes (99k, 500k, 3M) use `train[2999000:3000000]` for validation
- This ensures consistent validation set across all training runs
- Allows fair comparison of model quality at different training stages
- No overlap between training and validation for any training size
**Why This Matters:**
```
❌ WRONG: Using different validation sets for different training sizes
- 99k training: train[:99000] + validation train[99000:100000]
- 500k training: train[:499000] + validation train[499000:500000]
- 3M training: train[:2999000] + validation train[2999000:3000000]
Problem: Can't compare results! Each uses different validation data.

✅ CORRECT: Same validation set for all training sizes
- 99k training: train[:99000] + validation train[2999000:3000000]
- 500k training: train[:499000] + validation train[2999000:3000000]
- 3M training: train[:2999000] + validation train[2999000:3000000]
Benefit: Fair comparison across all training runs on same validation set.
```
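A minimal sketch of the correct pattern using the `datasets` split syntax (only the training prefix varies; the validation slice is pinned):
```python
from datasets import load_dataset

DATASET = "latentcat/grayscale_image_aesthetic_3M"
VAL_SPLIT = "train[2999000:3000000]"  # fixed last-1k validation slice

def load_splits(train_size: int):
    """Load a training prefix of train_size samples plus the fixed validation set."""
    train = load_dataset(DATASET, split=f"train[:{train_size}]")
    val = load_dataset(DATASET, split=VAL_SPLIT)
    return train, val

# Same validation data regardless of training size:
train_99k, val = load_splits(99_000)
train_500k, _ = load_splits(499_000)  # the 500k run uses train[:499000]
```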
### Understanding HuggingFace Dataset Caching
**Important**: The HuggingFace `datasets` library automatically caches all downloads to `~/.cache/huggingface/datasets/`. This means:
✅ **Cache reuse is automatic**: When the training script runs, it will check the cache first and reuse any previously downloaded data
✅ **No re-downloads**: You won't download the full 3M dataset if you've already downloaded a subset
✅ **The pre-download step is OPTIONAL**: The training command can handle downloading on its own
**Pre-download Benefits**:
- Verify dataset structure before training starts
- Separate download time from training time
- Ensure dataset access works before committing GPU hours
**Pre-download is NOT required**: The training script's `--max_train_samples=99000` parameter will work whether you pre-download or not.
### Dataset Download Options
**Option A: Pre-download for verification (RECOMMENDED)**
```python
from datasets import load_dataset

# This downloads and caches ~99k samples for verification
train_dataset = load_dataset(
    "latentcat/grayscale_image_aesthetic_3M",
    split="train[:99000]",
    cache_dir="~/.cache/huggingface/datasets",  # default cache location
)

# Verify the dataset structure
print(f"Dataset size: {len(train_dataset)}")
print(f"Columns: {train_dataset.column_names}")
print(f"First sample keys: {train_dataset[0].keys()}")

# Check a sample
sample = train_dataset[0]
print(f"Image size: {sample['image'].size}")
print(f"Conditioning image size: {sample['conditioning_image'].size}")
print(f"Caption: {sample['text']}")
```
**Option B: Let training script handle download**
- Simply run the training command with `--dataset_name` and `--max_train_samples`
- The script will download to cache automatically
- Slightly riskier if there are dataset access issues
**Recommended:** Use the full `verify_dataset.py` script (see "Dataset Verification Script" section above) which implements Option A with comprehensive validation checks.
**Data Format Validation:**
- Verify columns: `image`, `conditioning_image`, `text`
- Check image resolution: 512×512 (matches the `--resolution=512` training setting; no upscaling needed)
- Validate grayscale format
**Steps Calculation (IMPORTANT):**
- Training samples: 99,000
- Batch size: 16
- Gradient accumulation: 4
- **Effective batch size**: 16 × 4 = 64 samples/step
- **Steps per epoch**: 99,000 ÷ 64 = 1,547 steps
- **For 2 epochs**: ~3,094 total steps
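The same arithmetic as a tiny helper, useful for checking checkpoint spacing before launching (hypothetical helper, not part of the training script):
```python
import math

def total_steps(samples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """One optimizer step consumes batch_size * grad_accum samples."""
    effective_batch = batch_size * grad_accum
    return math.ceil(samples / effective_batch) * epochs

print(total_steps(99_000, 16, 4, 2))   # 3094 (1547 steps/epoch)
```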
### Phase 3: Training Configuration
**Prerequisites:** Complete the "Manual Preparation Checklist" above before running this command.
**Training Command (Based on Latentcat Article):**
```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl"
accelerate launch train_controlnet_sdxl.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--dataset_name="latentcat/grayscale_image_aesthetic_3M" \
--max_train_samples=99000 \
--conditioning_image_column="conditioning_image" \
--image_column="image" \
--caption_column="text" \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=1e-5 \
--train_batch_size=16 \
--gradient_accumulation_steps=4 \
--num_train_epochs=2 \
--checkpointing_steps=1500 \
--validation_steps=1500 \
--tracker_project_name="brightness-controlnet-sdxl" \
--report_to="wandb" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam
```
**Key Parameters Explained:**
- `--max_train_samples=99000`: Limit to 99k samples (reserves 1k for validation)
- `--resolution=512`: Match dataset resolution (latentcat article used 512, not 1024)
- `--learning_rate=1e-5`: From latentcat article
- `--train_batch_size=16`: From latentcat article
- `--gradient_accumulation_steps=4`: Effective batch = 16 × 4 = 64
- `--num_train_epochs=2`: From latentcat article
- **`--checkpointing_steps=1500`**: Save every 1500 STEPS (~once per epoch)
- Total training: ~3,094 steps for 2 epochs
- Checkpoints at: 1500, 3000 steps
- **`--validation_steps=1500`**: Run validation every 1500 STEPS
- `--gradient_checkpointing`: Reduces VRAM usage
- `--use_8bit_adam`: Memory optimization
- `--enable_xformers_memory_efficient_attention`: Memory-efficient attention
**Critical Understanding - Steps vs Samples:**
- 1 STEP = processing 1 effective batch = 64 samples
- Checkpoint every 1500 steps = every 1500 × 64 = 96,000 samples (~1 epoch)
- NOT a checkpoint every 1500 samples!
- Total steps for 2 epochs: 99,000 ÷ 64 × 2 = ~3,094 steps
**VRAM Requirements with These Settings:**
The settings above are optimized for memory efficiency:
- `--mixed_precision="fp16"`: Halves memory usage
- `--gradient_checkpointing`: Trades compute for memory (~40% VRAM savings)
- `--use_8bit_adam`: Reduces optimizer state memory
- `--enable_xformers_memory_efficient_attention`: Memory-efficient attention
**Estimated VRAM usage:**
- SDXL base model (FP16): ~6-7GB
- ControlNet model: ~2.5GB
- 8-bit Adam optimizer states: ~3-4GB
- Gradients (with checkpointing): ~2-3GB
- Activations (batch 16, 512×512, gradient checkpointing): ~8-12GB
- **Total: ~22-28GB peak**
**GPU Compatibility:**
| GPU | VRAM | Will It Fit? | Batch Size | Notes |
|-----|------|--------------|------------|-------|
| **L4** | 24GB | ⚠️ Tight | 8-12 | Reduce `--train_batch_size` to 8 or 12 |
| **A100 40GB** | 40GB | ✅ Yes | 16 | Comfortable fit |
| **A100 80GB** | 80GB | ✅ Yes | 16-24 | Plenty of headroom, can increase batch |
| **H100 80GB** | 80GB | ✅ Yes | 16-24 | Fastest training, plenty of VRAM |
**Minimum comfortable fit: A100 40GB** - these settings work with batch size 16; an H100 is faster and preferred (see the H100-optimized section below).
**If using L4 24GB**, modify the command:
```bash
# Change this line:
--train_batch_size=16 \
# To:
--train_batch_size=8 \
```
This keeps effective batch size = 8 × 4 = 32 (half of 64), but still works well.
### Accelerate Configuration for Multi-GPU Training
**Important:** Multi-GPU training on Lightning.ai requires the Pro plan ($20/month annual).
#### Single GPU (Free Tier) - No Configuration Needed
For single GPU training on Free tier, `accelerate launch` works without any configuration:
```bash
# No accelerate config needed - auto-detects single GPU
accelerate launch train_controlnet_sdxl.py [args...]
```
#### Multi-GPU (Pro Plan) - Configure Before Training
For 6× H100 training on Pro plan, configure accelerate once:
```bash
# Run configuration wizard
accelerate config
```
**Configuration Options for 6× H100:**
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU # Distributed data parallel across local GPUs
num_machines: 1 # Single machine with 6 GPUs
num_processes: 6 # One process per GPU
gpu_ids: all # Use all available GPUs
mixed_precision: fp16 # Match training script
use_cpu: false
dynamo_backend: NO # Disable torch.compile for compatibility
```
**Quick Config (Non-Interactive):**
```bash
# Create accelerate config file directly
cat > ~/.cache/huggingface/accelerate/default_config.yaml << 'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 6
gpu_ids: all
mixed_precision: fp16
use_cpu: false
dynamo_backend: NO
EOF
```
**Verify Configuration:**
```bash
# Check configuration
accelerate env
# Test multi-GPU setup
accelerate test
```
**Launch Multi-GPU Training:**
```bash
# With configuration file, launch works same as single GPU
accelerate launch train_controlnet_sdxl.py [args...]
# Or specify config explicitly
accelerate launch --config_file ~/.cache/huggingface/accelerate/default_config.yaml \
train_controlnet_sdxl.py [args...]
```
### H100-Optimized Training Parameters
The H100 GPU has **80GB VRAM** and **1979 TFLOPs**, allowing for larger batch sizes and better optimization than A100.
#### Optimal Batch Size for H100
**Default settings (designed for A100 40GB):**
```bash
--train_batch_size=16
--gradient_accumulation_steps=4
# Effective batch size: 16 × 4 = 64 samples/step
# VRAM usage: ~22-28GB
```
**H100-optimized settings (80GB VRAM):**
```bash
--train_batch_size=32 # 2× larger than A100
--gradient_accumulation_steps=4
# Effective batch size: 32 × 4 = 128 samples/step
# VRAM usage: ~40-48GB (still plenty of headroom)
```
**Aggressive H100 settings (maximum throughput):**
```bash
--train_batch_size=48 # 3× larger than A100
--gradient_accumulation_steps=2 # Reduce accumulation since batch is larger
# Effective batch size: 48 × 2 = 96 samples/step
# VRAM usage: ~55-65GB
# Faster training due to fewer gradient accumulation steps
```
#### Single H100 Training Command (99k samples)
**Optimized for H100 80GB:**
```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-h100"
accelerate launch train_controlnet_sdxl.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--dataset_name="latentcat/grayscale_image_aesthetic_3M" \
--max_train_samples=99000 \
--conditioning_image_column="conditioning_image" \
--image_column="image" \
--caption_column="text" \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=1e-5 \
--train_batch_size=32 \
--gradient_accumulation_steps=4 \
--num_train_epochs=2 \
--checkpointing_steps=750 \
--validation_steps=750 \
--tracker_project_name="brightness-controlnet-sdxl-h100" \
--report_to="wandb" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam \
--dataloader_num_workers=8 \
--set_grads_to_none
```
**Key H100 Optimizations:**
- `--train_batch_size=32` (vs 16 on A100) - 2× larger batches
- `--gradient_accumulation_steps=4` - Effective batch = 128
- `--checkpointing_steps=750` - Checkpoint every 750 × 128 = 96k samples (~1 epoch)
- `--dataloader_num_workers=8` - Faster data loading (H100 instances have ample CPU cores)
- `--set_grads_to_none` - Faster than zeroing gradients on modern GPUs
**Expected Performance:**
- Steps per epoch: 99,000 ÷ 128 = ~773 steps
- Total steps (2 epochs): ~1,546 steps
- Training time: ~38-45 minutes on single H100
- Checkpoints saved at: 750, 1500 steps
#### 6× H100 Training Command (3M samples) - Pro Plan
**For Pro plan multi-GPU training:**
```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-multi-h100"
# Configure accelerate for 6 GPUs (if not done already)
accelerate config # Select MULTI_GPU, 6 processes
# Launch training
accelerate launch train_controlnet_sdxl.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--dataset_name="latentcat/grayscale_image_aesthetic_3M" \
--max_train_samples=2999000 \
--conditioning_image_column="conditioning_image" \
--image_column="image" \
--caption_column="text" \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=1e-5 \
--train_batch_size=24 \
--gradient_accumulation_steps=2 \
--num_train_epochs=1 \
--checkpointing_steps=2500 \
--validation_steps=2500 \
--tracker_project_name="brightness-controlnet-sdxl-3M" \
--report_to="wandb" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam \
--dataloader_num_workers=8 \
--set_grads_to_none \
--resume_from_checkpoint="latest"
```
**Multi-GPU Optimizations:**
- `--train_batch_size=24` per GPU × 6 GPUs = 144 samples per step (before accumulation)
- `--gradient_accumulation_steps=2` - Effective batch = 144 × 2 = 288
- `--checkpointing_steps=2500` - Save every ~720k samples
- `--resume_from_checkpoint="latest"` - Auto-resume if interrupted
**Expected Performance:**
- Effective batch size: 288 samples/step
- Steps per epoch: 2,999,000 ÷ 288 = ~10,413 steps
- Training time: ~4 hours on 6× H100
- Checkpoints: 2500, 5000, 7500, 10000 steps + final
#### Batch Size Selection Guide
| GPU Config | VRAM | Recommended batch_size | grad_accum_steps | Effective Batch | Training Speed |
|------------|------|------------------------|------------------|-----------------|----------------|
| Single L4 | 24GB | 8 | 4 | 32 | Slow (baseline) |
| Single A100 | 40GB | 16 | 4 | 64 | 2× faster than L4 |
| Single H100 | 80GB | 32 | 4 | 128 | 6× faster than L4 |
| 6× H100 (Pro) | 480GB | 24/GPU | 2 | 288 | 36× faster than L4 |
**Rule of Thumb:**
- Larger `train_batch_size` = better GPU utilization, faster training
- Larger `effective_batch_size` = more stable training, better convergence
- H100 can handle 2-3× larger batch sizes than A100 with the same settings
#### Memory Optimization Tips
**If you encounter OOM (Out of Memory) errors on H100:**
1. **Reduce batch size incrementally:**
```bash
--train_batch_size=32 # Start here
--train_batch_size=24 # If OOM
--train_batch_size=16 # If still OOM
```
2. **Enable additional memory optimizations:**
```bash
--gradient_checkpointing                     # Already enabled
--use_8bit_adam                              # Already enabled
--enable_xformers_memory_efficient_attention # Already enabled
--set_grads_to_none                          # Use instead of zeroing gradients
```
3. **Use gradient accumulation to maintain effective batch size:**
```bash
# If reducing from batch_size=32 to batch_size=16
--train_batch_size=16
--gradient_accumulation_steps=8 # Double accumulation to keep effective=128
```
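The relationship in tip 3 can be made explicit with a small helper (hypothetical, just to keep the effective batch size constant while stepping the per-GPU batch down):
```python
def grad_accum_for(target_effective_batch: int, train_batch_size: int) -> int:
    """Pick gradient_accumulation_steps so batch * accum hits the target."""
    assert target_effective_batch % train_batch_size == 0, "batch must divide target"
    return target_effective_batch // train_batch_size

print(grad_accum_for(128, 32))  # 4 - the H100 default above
print(grad_accum_for(128, 16))  # 8 - OOM fallback, same effective batch
```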
### Full 3M Dataset Training Options
**For maximum quality training on the complete dataset:**
#### Option A: Single H100 (Free Tier)
| Metric | Value |
|--------|-------|
| GPU | 1× H100 80GB (~$2.50/hr on Lightning.ai) |
| Dataset | 2,999,000 training + 1,000 validation |
| Estimated Duration | **~24 hours** |
| Estimated Cost | **$60 GPU credits** |
| Subscription Cost | **$0** (Free tier) |
| **Total Cost** | **$60** |
| Checkpoints | Every 5000 steps (~every 480k samples) |
**Pros:**
- ✅ Lowest total cost
- ✅ No subscription required
- ✅ Good for one-time training
**Cons:**
- ❌ 24 hours training time (must monitor)
- ❌ Can't quickly iterate if issues arise
#### Option B: 6× H100 (Pro Plan - $20/month)
| Metric | Value |
|--------|-------|
| GPU | 6× H100 80GB (~$2.50/hr × 6 = $15/hr) |
| Dataset | 2,999,000 training + 1,000 validation |
| Estimated Duration | **~4 hours** |
| Estimated Cost | **$60 GPU credits** |
| Subscription Cost | **$20/month** (annual billing) |
| **Total Cost** | **$80** |
| **Net Cost** | **$67** (after $13 annual credit value) |
| Checkpoints | Every 2500 steps (~every 720k samples) |
**Pros:**
- ✅ Completes in 4 hours vs 24 hours
- ✅ Can run same-day if needed
- ✅ Can test multiple configs quickly
- ✅ Net cost only $7/month after credits
- ✅ Can cancel after training
**Cons:**
- ❌ $20 upfront subscription cost
**Scaling Math:**
- Single H100: 99k in 45 min → 3M in 45 min × 30.3 = ~24 hours
- 6× H100: 24 hours ÷ 6 = ~4 hours
**Cost Comparison:**
- Free tier: $60, 24 hours wait
- Pro plan: $80, 4 hours wait
- **Price difference: $20 to save 20 hours**
#### Adjusted Training Command
```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-3M"
accelerate launch train_controlnet_sdxl.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--dataset_name="latentcat/grayscale_image_aesthetic_3M" \
--max_train_samples=2999000 \
--conditioning_image_column="conditioning_image" \
--image_column="image" \
--caption_column="text" \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=1e-5 \
--train_batch_size=24 \
--gradient_accumulation_steps=4 \
--num_train_epochs=1 \
--checkpointing_steps=5000 \
--validation_steps=5000 \
--validation_prompt "a beautiful garden scene" "modern city street" "abstract art pattern" \
--tracker_project_name="brightness-controlnet-sdxl-3M" \
--report_to="wandb" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam \
--resume_from_checkpoint="latest"
```
#### Key Adjustments Explained
**Batch Size Scaling:**
- **`--train_batch_size=24`** (increased from 16)
- H100 80GB has 2x VRAM of A100 40GB
- Can safely increase batch size by 50%
- Alternative: `--train_batch_size=32` if you have headroom
- **`--gradient_accumulation_steps=4`** (kept same)
- Effective batch size: 24 × 4 = **96 samples/step**
- If using batch_size=32: 32 × 4 = **128 samples/step**
**Dataset & Checkpointing:**
- **`--max_train_samples=2999000`** (vs 99,000 for quick training)
- Training split: `train[:2999000]` (first 2,999,000 samples)
- **Validation split: `train[2999000:3000000]` (SAME as 99k training!)**
- ✅ This allows direct comparison of validation metrics between 99k and 3M training
- ✅ No overlap between training and validation data
- **`--num_train_epochs=1`** (vs 2)
- For 3M samples, 1 epoch is usually sufficient
- Can increase to 2 if quality needs improvement
- **`--checkpointing_steps=5000`** (vs 1,500)
- More frequent checkpoints would create too many files
- 5000 steps = every ~480k samples
- Total checkpoints: ~6-7 for full run
- **`--validation_steps=5000`** (matches checkpointing)
- Run validation at each checkpoint (note: the diffusers script expects a matching `--validation_image`, one conditioning image per prompt, for validation logging)
**Resumption:**
- **`--resume_from_checkpoint="latest"`**
- CRITICAL for multi-day training
- If training crashes, automatically resumes from last checkpoint
- Saves days of retraining if interrupted
#### Training Math
**Steps Calculation:**
- Training samples: 2,999,000 (validation: 1,000)
- Effective batch size: 96 (or 128 with batch_size=32)
- Steps per epoch: 2,999,000 ÷ 96 = **~31,240 steps**
- With batch_size=32: 2,999,000 ÷ 128 = **~23,429 steps**
- For 1 epoch: 31,240 steps total
- For 2 epochs: 62,480 steps total
**Checkpoints:**
- Saved every 5,000 steps
- Checkpoint locations: steps 5000, 10000, 15000, 20000, 25000, 30000, 31240 (final)
- Each checkpoint: ~2.5GB (ControlNet weights)
- Total storage: ~20GB for all checkpoints + training state
#### VRAM Usage (H100 80GB)
With batch_size=24:
- SDXL base model (FP16): ~6-7GB
- ControlNet model: ~2.5GB
- 8-bit Adam optimizer: ~3-4GB
- Gradients (with checkpointing): ~3-4GB
- Activations (batch 24): ~15-20GB
- **Total: ~35-40GB** ✅ Fits comfortably in 80GB
With batch_size=32 (max):
- Activations increase to ~20-25GB
- **Total: ~42-48GB** ✅ Still fits with headroom
**Recommended:** Start with batch_size=24, monitor VRAM in W&B, can increase to 32 if using <60GB.
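To verify actual headroom rather than relying on the estimates above, peak allocation can be read from PyTorch's built-in counters after a few training steps; a minimal sketch:
```python
import torch

# Run on the training node after a handful of steps (or one forward/backward).
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"Peak VRAM: {peak_gb:.1f} GB of {total_gb:.1f} GB")
if peak_gb < 60:
    print("Headroom available - batch_size=32 should be safe")
```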
#### Risk Mitigation for Long Training
**Strategy 1: Incremental Training**
```bash
# Start with 500k samples to validate the approach
--max_train_samples=500000
# Cost: ~$10, Duration: ~4 hours on a single H100
# If results are good, continue to full 3M
```
**Strategy 2: Early Checkpoint Evaluation**
```bash
# Evaluate quality at checkpoints (single-H100 3M run: ~24 hours, ~$60 total):
# - checkpoint-5000 (~480k samples, ~4 hours, ~$10)
# - checkpoint-10000 (~960k samples, ~8 hours, ~$20)
# - checkpoint-15000 (~1.4M samples, ~12 hours, ~$30)
# Can stop early if quality plateaus
```
**Strategy 3: Use Spot Instances**
- Many cloud providers offer H100 spot instances at 50-70% discount
- Cost could drop to ~$0.75-$1.25/hr (~$18-$30 for the 24-hour 3M run)
- Requires `--resume_from_checkpoint="latest"` (already included)
- Risk: Training may be interrupted, but will resume automatically
#### When to Use Full 3M Training
**Use 99k samples if:**
- ✅ First time training ControlNet
- ✅ Testing hyperparameters
- ✅ Budget constrained (<$50)
- ✅ Need results quickly (1-2 days)
**Use 3M samples if:**
- ✅ 99k results are good but you want better quality
- ✅ Commercial production use (worth the investment)
- ✅ Training other ControlNet types (can reuse knowledge)
- ✅ Contributing to research/community (publishable results)
- ✅ Budget allows ($60-$80)
### Phase 4: Training Monitoring
**Setup Weights & Biases:**
```bash
wandb login
# Use wandb to track:
# - Loss curves
# - Validation images at each validation step (every 1500 steps here)
# - Learning rate schedule
# - GPU utilization
```
**Checkpoints:**
- Saved every 1,500 steps to `$OUTPUT_DIR/checkpoint-{step}`
- With ~3,094 total steps, will get checkpoints at:
- `checkpoint-1500` (~97% of epoch 1)
- `checkpoint-3000` (~94% of epoch 2)
- Final model at end of training
- Can resume training if interrupted: `--resume_from_checkpoint="./controlnet-brightness-sdxl/checkpoint-1500"`
**Validation:**
- Uses the 1,000 validation samples from `train[2999000:3000000]` (the fixed last-1k slice)
- Runs every 1,500 steps (at checkpoints)
- W&B logs validation images and metrics
- Note: the stock diffusers script logs validation images only when `--validation_prompt` and `--validation_image` are supplied
### Validation Metrics (Automatic)
**No configuration needed!** The training script automatically computes validation metrics:
**Loss Function (Automatic)**:
- **Default**: MSE (Mean Squared Error) between the model's noise prediction and the sampled noise
- **Formula**: `loss = F.mse_loss(model_pred.float(), target.float())`
**What Gets Logged to W&B**:
1. **Training loss** (every step)
2. **Validation images** (generated samples every `--validation_steps=1500` steps)
3. **Learning rate** (schedule tracking)
4. **GPU utilization** (hardware monitoring)
**Validation Process**:
1. Every 1500 steps, training pauses
2. The model generates images from the validation prompts and conditioning images
3. Generated images are logged to W&B alongside the training-loss curve
4. Training resumes
**No manual metric code is needed** - logging is handled by the training script!
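For intuition, the objective boils down to predicting the sampled noise; a runnable toy sketch with dummy tensors standing in for the real VAE latents and UNet/ControlNet outputs (shapes are illustrative):
```python
import torch
import torch.nn.functional as F

# Dummy stand-ins: in the real script, latents come from the VAE and
# model_pred from the frozen SDXL UNet fed with ControlNet residuals.
latents = torch.randn(4, 4, 64, 64)                 # batch of 512px SDXL latents
noise = torch.randn_like(latents)                   # epsilon sampled each step
model_pred = noise + 0.1 * torch.randn_like(noise)  # pretend UNet output

# Epsilon-prediction objective: the target is the sampled noise itself.
loss = F.mse_loss(model_pred.float(), noise.float())
print(f"training loss: {loss.item():.4f}")
```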
### Phase 5: Model Evaluation & Publishing
**Test Inference:**
First, install QR code library if needed:
```bash
pip install qrcode[pil]
```
Then run inference:
```python
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
import torch
import qrcode
from PIL import Image

# Generate QR code for testing
print("Generating QR code for https://google.com...")
qr = qrcode.QRCode(
    version=1,
    error_correction=qrcode.constants.ERROR_CORRECT_H,
    box_size=10,
    border=4,
)
qr.add_data("https://google.com")
qr.make(fit=True)

# Create QR code image and resize to 1024x1024
qr_image = qr.make_image(fill_color="black", back_color="white")
qr_image = qr_image.resize((1024, 1024), Image.LANCZOS)
print(f"QR code generated: {qr_image.size}")

# Load trained ControlNet
print("Loading ControlNet model...")
controlnet = ControlNetModel.from_pretrained(
    "./controlnet-brightness-sdxl",  # final model; intermediate saves live under checkpoint-XXXX/controlnet
    torch_dtype=torch.float16,
)

# Load SDXL pipeline with ControlNet
print("Loading SDXL pipeline...")
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.enable_xformers_memory_efficient_attention()
pipe.to("cuda")

# Generate artistic QR code
print("Generating artistic QR code...")
image = pipe(
    prompt="a beautiful garden scene with flowers, highly detailed, professional photography",
    negative_prompt="blurry, low quality, distorted",
    image=qr_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.45,  # Adjust 0.3-0.6 for balance
    guidance_scale=7.5,
).images[0]

# Save results
qr_image.save("original_qr.png")
image.save("artistic_qr_result.png")
print("✅ Done! Check artistic_qr_result.png")
print("📱 Scan with phone to verify the QR code still works!")
```
**Testing Different Conditioning Scales:**
```python
# Test multiple conditioning scales to find the best balance
for scale in [0.3, 0.4, 0.5, 0.6]:
    print(f"Testing conditioning_scale={scale}...")
    image = pipe(
        prompt="a beautiful garden scene with flowers",
        image=qr_image,
        num_inference_steps=30,
        controlnet_conditioning_scale=scale,
    ).images[0]
    image.save(f"result_scale_{scale}.png")
```
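Scannability can also be checked programmatically instead of with a phone; a minimal sketch using OpenCV's built-in QR detector (assumes `pip install opencv-python`; note it is stricter than most phone scanners on heavily stylized codes):
```python
import cv2

# Decode each generated image and compare against the encoded URL.
detector = cv2.QRCodeDetector()
for scale in [0.3, 0.4, 0.5, 0.6]:
    img = cv2.imread(f"result_scale_{scale}.png")
    decoded, points, _ = detector.detectAndDecode(img)
    status = "scannable" if decoded == "https://google.com" else "NOT scannable"
    print(f"scale={scale}: {status} (decoded: {decoded!r})")
```
Lower scales favor aesthetics; higher scales favor scannability. The lowest scale that decodes reliably is a good production default.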
**Publish to HuggingFace Hub:**
```bash
# After validation
huggingface-cli login
python scripts/upload_to_hub.py \
  --model_path="./controlnet-brightness-sdxl" \
  --repo_name="Oysiyl/controlnet-brightness-sdxl"
```
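`scripts/upload_to_hub.py` is a project-specific helper; if it isn't available, a minimal equivalent using `huggingface_hub` directly (repo id taken from the example above):
```python
from huggingface_hub import HfApi

api = HfApi()  # uses the token from `huggingface-cli login`
repo_id = "Oysiyl/controlnet-brightness-sdxl"
api.create_repo(repo_id, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./controlnet-brightness-sdxl",  # final model directory
    repo_id=repo_id,
    repo_type="model",
)
```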
## Cost-Benefit Analysis
### Investment Required (Updated for Single H100)
**Strategy A: Free Tier (99k Quick Test)**
| Component | Cost/Time |
|-----------|-----------|
| GPU Credits (99k samples, 2 epochs, single H100) | $1.88 |
| Setup Time | 1-2 hours |
| Training Duration | **45 minutes** β‘ |
| Testing & Validation | 2-3 hours |
| **Total Time** | **~4-6 hours** (same day) |
| **Total Cost** | **$1.88** |
**Strategy B: Pro Plan (Full 3M Training)**
| Component | Cost/Time |
|-----------|-----------|
| Pro Subscription (can cancel after) | $20/month |
| Included credits value | -$13 (240 credits/year) |
| GPU Credits (3M samples, 1 epoch, 6× H100) | $60 |
| Setup Time | 1-2 hours |
| Training Duration | **4 hours** β‘ |
| Testing & Validation | 2-3 hours |
| **Total Time** | **~8 hours** (same day) |
| **Total Cost** | **$80** ($20 sub + $60 GPU) |
| **Net Cost** | **$67** (after annual credit value) |
**Strategy C: All-in-One (Pro Plan, Test Everything)**
| Component | Cost/Time |
|-----------|-----------|
| Pro Subscription | $20/month |
| 99k test (6× H100) | $1.88 (7.5 min) |
| 500k training (6× H100) | $10 (40 min) |
| 3M training (6× H100) | $60 (4 hours) |
| **Total GPU Time** | **~5 hours** |
| **Total GPU Cost** | **$71.88** |
| **Total with Sub** | **$91.88** |
| **Net after credits** | **$78.88** |
**Recommendation:** Start with Strategy A ($1.88), upgrade to Strategy B if promising
### Value Delivered
1. **Unblocks SDXL Migration**: Enables upgrade from SD 1.5 to higher quality SDXL
2. **Better Image Quality**: SDXL produces superior 1024×1024 images vs SD 1.5's 512×512
3. **Community Value**: First public SDXL brightness ControlNet (potential citations/recognition)
4. **No Alternatives**: Cannot proceed with SDXL QR code generation without this model
5. **Reusable Asset**: Once trained, can be used indefinitely
### Risk Mitigation
- **Start Small**: Train on 99k samples first (~$2, under an hour on an H100)
- **Evaluate Early**: Check quality at intermediate checkpoints (1500/3000 for the 99k run; 5000/10000 for 3M)
- **Iterative Approach**: Extend training only if initial results are promising
- **Fallback**: Can continue using SD 1.5 if SDXL training fails
## Alternative Approaches Considered
### Option 1: Train Brightness ControlNet for SDXL (RECOMMENDED)
- **Pros**:
- Proven training pipeline (diffusers script exists)
- Same dataset as original SD 1.5 model
- Good quality/cost balance
- Community support and documentation
- License-friendly (SDXL is permissive)
- **Cons**:
- Requires GPU time investment (~$2-$80)
- Training takes 45 minutes to 24 hours depending on size and GPU count
- Still requires 24GB+ VRAM for inference
- **Cost**: ~$10 for 500k samples on a single H100
- **Risk**: Low - well-documented process
- **Verdict**: ✅ **Best choice for production use**
### Option 2: Train Brightness ControlNet for Flux Schnell
- **Pros**:
- Apache 2.0 license (fully commercial)
- Faster inference than Flux Dev (3× speedup)
- Same architecture as Dev (12B parameters)
- Would be first-of-its-kind community contribution
- **Cons**:
- ⚠️ **No existing training scripts for Schnell**
- Would need to adapt Flux Dev training code
- Unknown if distillation affects ControlNet training
- Still requires 32-40GB VRAM (heavier than SDXL)
- Higher risk and uncertainty
- Longer training time due to larger model
- **Cost**: $200-$500 (estimated, higher due to larger model)
- **Risk**: High - experimental, no precedent
- **Verdict**: **Experimental** - only if willing to pioneer new territory
### Option 3: Use SDXL LoRA for Brightness Control
- **Pros**: No training required, immediate availability
- **Cons**: Less precise control than dedicated ControlNet, may not work well for QR codes
- **Verdict**: Worth testing but likely insufficient for QR code use case
### Option 4: Latent Initialization Approach
- **Pros**: Architecture-agnostic, works with both SDXL and Flux
- **Cons**: Less control over brightness distribution, requires experimentation
- **Verdict**: Good fallback but not as reliable as ControlNet
### Option 5: Wait for Community Release
- **Pros**: Zero cost, zero effort
- **Cons**: No timeline, may never happen, blocks project progress
- **Verdict**: Not viable for active development
### Option 6: Hybrid Tile ControlNet + Post-Processing
- **Pros**: Tile ControlNet available for SDXL
- **Cons**: Doesn't address brightness control directly
- **Verdict**: Complementary but not a replacement
**Conclusion**: Training SDXL ControlNet is the most reliable solution. Flux Schnell is interesting for research but carries significant execution risk.
## Recommended Action Plan
### Immediate Setup (Day 1)
1. **Launch Lightning AI Instance**: single H100 (Free tier) or 6× H100 (Pro plan)
2. **Run Setup Commands**: Install all dependencies (see Phase 3 above)
3. **Authenticate**: HuggingFace and W&B login
4. **Clone Diffusers**: Get training scripts
### Training Phase (Day 1 - Morning) β‘
5. **Start Training**: Launch training with 99k samples (~45 minutes on a single H100)
6. **Monitor W&B**: Track loss curves and validation images in real-time
7. **First Checkpoint**: Review checkpoint-1500 (~25 minutes in)
8. **Training Complete**: Total ~45 minutes for full 2-epoch run
### Evaluation Phase (Day 1 - Afternoon)
9. **Post-Training Validation**: Run inference on 1k validation set
10. **QR Code Testing**: Test with actual QR codes and measure scannability (see the sketch after this list)
11. **Quality Assessment**: Compare to SD 1.5 brightness ControlNet
12. **Decision Point**:
- If quality good: Publish and integrate (move to next phase)
- If needs improvement: Launch 2nd training run with adjusted hyperparameters (~45 min)
- Can try 3-4 different configurations in the same day!
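For step 10, scan rate can be measured automatically rather than by eye. A hedged sketch using OpenCV's `QRCodeDetector`; the output directory layout is a placeholder, not something the project defines:

```python
# Measure QR scan rate over a directory of generated validation images.
from pathlib import Path

import cv2

detector = cv2.QRCodeDetector()
images = sorted(Path("validation_outputs").glob("*.png"))  # placeholder path
decoded = 0
for path in images:
    payload, _, _ = detector.detectAndDecode(cv2.imread(str(path)))
    if payload:  # empty string means the code failed to decode
        decoded += 1
rate = decoded / max(len(images), 1)
print(f"scan rate: {rate:.1%} ({decoded}/{len(images)})")  # target: 95%+
```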
### Optional: Full Dataset Training (Day 1 - Evening)
12a. **If 99k results promising**: Launch full 3M training (~2 hours on 8×H100)
12b. **Monitor overnight**: W&B tracks progress automatically
12c. **Next morning**: Evaluate final model quality
### Integration Phase (Day 2)
13. **Publish to HuggingFace**: Upload best checkpoint (see the sketch after this list)
14. **Update app_sdxl.py**: Integrate new ControlNet model
15. **Production Testing**: End-to-end QR code generation tests
16. **Documentation**: Update README with SDXL support
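Step 13 can be done with `huggingface_hub` directly. A minimal sketch, assuming the best checkpoint was exported to `./controlnet-sdxl-brightness`; the repo id is a placeholder:

```python
# Upload the trained ControlNet checkpoint to the Hub.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/controlnet-sdxl-brightness", exist_ok=True)
api.upload_folder(
    folder_path="./controlnet-sdxl-brightness",  # exported checkpoint dir
    repo_id="your-username/controlnet-sdxl-brightness",
    commit_message="SDXL brightness ControlNet (grayscale_image_aesthetic_3M)",
)
```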
**Total Timeline: 1-2 days** (vs previous estimate of 5 days)
## Success Metrics
1. **QR Code Scannability**: 95%+ scan rate on generated images
2. **Visual Quality**: Subjective improvement over SD 1.5 outputs
3. **Control Precision**: Ability to adjust brightness strength across the 0.0-1.0 range (see the sweep sketch after this list)
4. **Training Loss**: Convergence to < 0.1 validation loss
5. **Community Adoption**: Positive feedback if published publicly
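Metric 3 can be verified with a conditioning-scale sweep; `controlnet_conditioning_scale` is the standard diffusers pipeline argument for this. A sketch assuming a `pipe` constructed as in the loading example under "Critical Files to Modify" below, plus a grayscale `brightness_map` conditioning image (both assumptions, not existing project code):

```python
# Sweep brightness strength across the 0.0-1.0 range and save outputs
# for side-by-side comparison.
for scale in (0.0, 0.25, 0.5, 0.75, 1.0):
    image = pipe(
        prompt="a sunlit forest clearing, detailed illustration",
        image=brightness_map,                 # grayscale conditioning image
        controlnet_conditioning_scale=scale,  # brightness control strength
        num_inference_steps=30,
    ).images[0]
    image.save(f"sweep_scale_{scale:.2f}.png")
```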
## Critical Files to Modify
Once model is trained:
- `app.py:48-56` - Add SDXL ControlNet loading (see the loading sketch after this list)
- `app.py:1880-1886` - Update standard pipeline with SDXL support
- `app.py:2343-2349` - Update artistic pipeline with SDXL support
- `app_sdxl.py` - Complete SDXL-specific implementation
- `comfy/sd_configs/` - Add SDXL configuration if needed
## Flux Schnell Training Considerations (If Pursuing)
If you decide to pursue Flux Schnell ControlNet training despite the risks:
**Required Adaptations:**
1. **Training Script Modification**: Adapt `train_controlnet_flux.py` to work with Schnell
- Model path: `black-forest-labs/FLUX.1-schnell` instead of `FLUX.1-dev`
- Verify architecture compatibility (distillation may affect ControlNet layers); see the config-diff sketch after this list
- Test with small pilot run (1000 steps) before full training
2. **Hardware Requirements**:
- Minimum: H100 (80GB VRAM) - $1.99/hr
- A100 40GB likely insufficient for Flux training
- Estimated training: 150-250 hours on H100 (~$300-$500)
3. **Dataset Considerations**:
- Flux uses 1024×1024 resolution (same as SDXL)
- Dataset would need upscaling from 512×512 or re-preprocessing
- Consider starting with 100k subset for validation
4. **Verification Steps**:
- Test if Schnell's distillation preserves ControlNet training capability
- Compare with Flux Dev training (if available for testing)
- Validate brightness control precision matches SD 1.5 quality
**Risk Assessment**:
- **Technical Risk**: High - no proven training path
- **Time Risk**: Medium-High - debugging could extend timeline significantly
- **Cost Risk**: High - may require multiple training attempts ($500+)
- **Success Probability**: 50-70% (educated guess based on architecture similarity)
**Recommendation**: Only pursue if:
1. SDXL training completes successfully first (de-risk approach)
2. You're willing to contribute pioneering work to the community
3. Budget allows for experimental work ($500-1000 total including failed attempts)
## References
### SDXL Training
- **SDXL Training Script**: https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet_sdxl.py
- **Dataset**: https://huggingface.co/datasets/latentcat/grayscale_image_aesthetic_3M
- **Reference Article**: https://latentcat.com/en/blog/brightness-controlnet
- **Original SD 1.5 Model**: https://huggingface.co/latentcat/latentcat-controlnet
- **Lightning AI**: https://lightning.ai/
### Flux Information
- **Flux Schnell Model**: https://huggingface.co/black-forest-labs/FLUX.1-schnell
- **Flux Dev Training Script**: https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet_flux.py
- **XLabs-AI Flux ControlNets**: https://huggingface.co/XLabs-AI/flux-controlnet-collections
- **Flux Comparison Guide**: [Flux Dev vs Schnell Comparison](https://www.stablediffusiontutorials.com/2025/04/flux-schnell-dev-pro.html)
- **Flux Architecture Discussion**: [GitHub Issue #408](https://github.com/black-forest-labs/flux/issues/408)
- **License Comparison**: [Flux Model Guide](https://stable-diffusion-art.com/flux/)
## Final Recommendation (Updated December 2024 - Lightning.ai)
**Proceed with SDXL Brightness ControlNet Training on Single H100 (Free Tier)**
Based on Lightning.ai pricing and multi-GPU requirements, the recommended path is:
### Phase 1: Quick Validation (Free Tier)
1. **Start with 99k samples on single H100**
- Cost: $1.88 in GPU credits
- Duration: 45 minutes
- Platform: Lightning.ai Free tier
- Purpose: Validate training pipeline and quality
### Phase 2: Production Training (Choose Based on Phase 1)
**Option A: Budget Approach (Free Tier)**
- Run full 3M dataset on single H100
- Cost: $60 GPU credits, $0 subscription
- Duration: 24 hours
- Total: $60
- Best for: One-time training, have patience
**Option B: Speed Approach (Pro Plan)**
- Upgrade to Pro plan ($20/month annual)
- Run full 3M dataset on 6× H100
- Cost: $60 GPU + $20 subscription = $80
- Net cost: $67 (after $13 annual credit value)
- Duration: 4 hours
- Best for: Need results same day, may iterate
### Recommended Strategy
**Most Cost-Effective Path:**
1. **Day 1 Morning**: Run 99k test on Free tier ($1.88, 45 min)
2. **Day 1 Afternoon**: Evaluate results
3. **If promising**:
- **Budget route**: Start 3M on Free tier ($60, 24 hrs) → Total: $61.88
- **Speed route**: Upgrade to Pro, run 3M ($80, 4 hrs) → Total: $81.88
4. **Cancel Pro** after training if using speed route
### Why This Path
- **Low Risk Entry**: Only $1.88 to validate entire pipeline
- **Flexible Scaling**: Choose speed vs cost based on results
- **Proven Pipeline**: HuggingFace Diffusers battle-tested script
- **Reference Success**: Original SD 1.5 model trained on same dataset
- **H100 Advantage**: 6.3× faster than A100 even on single GPU
- **Cost-Effective**: $62-$82 total (vs $900+ on older plans)
- **Unblocks Migration**: Enables full SDXL upgrade from SD 1.5
### Cost Breakdown Comparison
| Approach | Hardware | Duration | GPU Cost | Sub Cost | Total | Timeline |
|----------|----------|----------|----------|----------|-------|----------|
| **Old Plan (A100)** | Single A100 | 180 hours | $900-1,200 | $0 | $900-1,200 | 1 week |
| **NEW: Free Tier** | Single H100 | 24.75 hours | $61.88 | $0 | **$61.88** | 2 days |
| **NEW: Pro Plan** | 6× H100 | 4.75 hours | $61.88 | $20 | **$81.88** | 1 day |
**Savings vs Old Plan:**
- Free tier: Save $838-$1,138 and 6 days
- Pro plan: Save $818-$1,118 and 6 days
### Pro Plan ROI Analysis
**When is Pro worth it?**
- $20 extra to save 20 hours (24h → 4h)
- = **$1/hour saved**
- Plus: Can test multiple hyperparameters same day
- Plus: Includes $13/year in credits
**Get Pro if:**
- ✅ You value time over $1/hour
- ✅ Planning to iterate on hyperparameters
- ✅ Need results urgently
- ✅ Want to test 99k + 500k + 3M in one session
**Skip Pro if:**
- ❌ Doing one-time training only
- ❌ Can wait 24 hours
- ❌ Budget constrained
- ❌ 99k test was sufficient
### Next Steps
Once plan is approved:
1. Set up Lightning AI account with H100 GPU access (Free tier)
2. Clone diffusers repository and install requirements
3. Verify dataset access and download capabilities
4. Prepare validation QR codes for quality testing
5. Launch training with recommended hyperparameters
6. Monitor via Weights & Biases for loss curves and validation images
7. Evaluate checkpoints at 10k, 25k, 50k steps
8. Complete training and publish to HuggingFace Hub
9. Integrate into `app_sdxl.py` for production use
**Flux Schnell** remains an option for future exploration once SDXL is production-ready, but is deprioritized due to experimental nature and higher resource requirements.