# Training ControlNet Brightness for SDXL - Feasibility Analysis

## Executive Summary

Training a brightness ControlNet for SDXL is **technically feasible and recommended** as the critical upgrade path from SD 1.5 to SDXL for QR code generation. This model is essential because no public SDXL brightness ControlNet exists.

**Key Estimates (Updated December 2024 - Single H100 GPU):**
- **Time**: 45 minutes (99k samples) to 24 hours (3M samples) on a single H100
- **Cost**: ~$2 (99k) to $60-$80 (3M) in GPU credits
- **Platform**: Lightning.ai with optional Pro plan ($20/month for multi-GPU)
- **Priority**: High - enables SDXL migration for QR code generation
- **Complexity**: Medium - well-documented training pipeline with reference implementation

**Recommended Path:**
- Start with a single H100 for 99k samples (~45 min, ~$2)
- If results are promising, optionally upgrade to the Pro plan for faster 3M training
- Total investment: ~$2-$82 depending on training size and plan choice
## Background Context

### Current Implementation (SD 1.5)
- **Location**: `app.py:1880-1886, 2343-2349`
- **Model**: `control_v1p_sd15_brightness.safetensors` from latentcat/latentcat-controlnet
- **Purpose**: Controls QR code pattern visibility via brightness conditioning
- **Critical**: Essential for QR code readability - cannot be removed

### Why SDXL Brightness ControlNet is Needed
1. **No Public Alternative**: No SDXL-equivalent brightness ControlNet exists on HuggingFace
2. **Migration Blocker**: The current SD 1.5 brightness ControlNet is incompatible with the SDXL architecture
3. **QR Readability**: Brightness control is core to balancing aesthetic quality with QR scannability
4. **Flux is Too Heavy**: SDXL is the practical upgrade path (Flux requires 32-40GB VRAM)
### Flux Model Landscape (Updated Analysis)

**Flux Schnell (Apache 2.0 License)**
- **License**: Fully open for commercial use - no restrictions
- **Architecture**: Same 12B parameters as Flux Dev, but distilled for speed (3× faster)
- **Quality**: Lower than Dev due to aggressive distillation trading detail for speed
- **VRAM**: Still requires 32-40GB (same as Dev)
- **ControlNet Status**: ⚠️ **No existing ControlNet models or training scripts**
- **Training Risk**: Would require adapting the Flux Dev training script - pioneering work
- **Community**: Active requests for Schnell ControlNets but no official releases

**Flux Dev (Non-Commercial License)**
- **License**: Non-commercial only - cannot be used for commercial QR code generation
- **ControlNet Status**: ✅ Extensive support (XLabs-AI, InstantX collections)
- **Training Scripts**: Available from XLabs-AI and HuggingFace Diffusers
- **Quality**: Superior to Schnell, but license restrictions make it unsuitable

**Flux Pro (Commercial API)**
- **License**: API-only, commercial pricing
- **Status**: Not suitable for self-hosted training

**Assessment**: While Flux Schnell has an attractive license, the lack of a proven ControlNet training pipeline makes it **high-risk**. SDXL remains the **proven, practical choice**.
## Hardware Selection & Platform Strategy

### Lightning.ai Pricing Tiers (December 2024)
Lightning.ai offers different tiers with varying multi-GPU capabilities:

| Plan | Cost | Multi-GPU | Max GPUs | Credits Included | Best For |
|------|------|-----------|----------|------------------|----------|
| **Free** | $0 | ❌ No | 1 | 15/month | Quick 99k test |
| **Pro** | **$20/month** (annual) | ✅ Yes | 6 | 240/year (~$13 value) | **Recommended** |
| Teams | $119/month (annual) | ✅ Yes | 12 | 600/year | Large teams |

**Pro Plan Benefits:**
- Only **$20/month** if paid annually ($240/year vs $600 if billed monthly)
- Includes **240 credits/year** = ~$13 of free GPU time
- **Effective first-month cost: ~$7** after applying the credit value
- Multi-GPU training up to 6 GPUs
- Can cancel after training completes
### GPU Comparison Analysis (Lightning.ai)

**Single GPU Performance:**

| GPU | TFLOPs | Memory | Cost/hr | 99k Time | 99k Cost | 3M Time | 3M Cost |
|-----|--------|--------|---------|----------|----------|---------|---------|
| A100 | 312 | 40GB | ~$1.50 | 4-6 hours | $6-9 | 120-180 hours | $180-270 |
| **H100** | **1979** | **80GB** | **~$2.50** | **45 min** | **$1.88** | **24 hours** | **$60** |

**Cost Efficiency:**
- H100 is **6.3× faster** than A100 (1979 vs 312 TFLOPs)
- H100 costs **1.67× more** per hour on Lightning.ai
- **Net result: 3.8× better cost efficiency** (6.3 ÷ 1.67 ≈ 3.8)
### Single vs Multi-GPU: Should You Get Pro Plan?

#### Option A: Free Plan (Single H100)

| Training Size | Duration | GPU Cost | Total Cost | Timeline |
|---------------|----------|----------|------------|----------|
| 99k samples | 45 min | $1.88 | **$1.88** | Same day |
| 500k samples | 4 hours | $10 | **$10** | Same day |
| 3M samples | 24 hours | $60 | **$60** | 1-2 days |

**Pros:**
- ✅ $0 subscription cost
- ✅ Very cheap for 99k testing
- ✅ Good for one-off training

**Cons:**
- ❌ 24 hours for 3M training (must babysit)
- ❌ Can't test multiple hyperparameters quickly
- ❌ Limited to 15 free credits/month

#### Option B: Pro Plan (6× H100)

| Training Size | Duration | GPU Cost | Subscription | Total Cost | Timeline |
|---------------|----------|----------|--------------|------------|----------|
| 99k samples | **7.5 min** | $1.88 | $20 | **$21.88** | Minutes |
| 500k samples | **40 min** | $10 | $20 | **$30** | Same hour |
| 3M samples | **4 hours** | $60 | $20 | **$80** | Same day |

**Multi-GPU costs the same because:**
- 6× GPUs = 6× faster
- 6× GPUs = 6× more expensive per hour
- Net: same total GPU cost, much faster completion

**Pros:**
- ✅ 3M training finishes in 4 hours (vs 24)
- ✅ Can test 3-4 hyperparameter configs in one day
- ✅ Includes 240 credits/year (~$13 value)
- ✅ Effective first-month cost ~$7 after credits
- ✅ Can cancel after training is done

**Cons:**
- ❌ $20 upfront cost (annual commitment)
### Recommendation Matrix

**If you're doing ONE 99k training run:**
- ✅ **Use the Free tier** ($1.88 total, 45 min)
- Skip the Pro plan - not worth $20 to go from 45 min to 7.5 min

**If you're doing 500k OR 3M training:**
- ✅ **Get the Pro plan** ($20/month)
- 3M: 4 hours vs 24 hours = worth it
- Can test multiple configs the same day
- Effective first-month cost after credits: ~$7

**If you're doing multiple experiments:**
- ✅ **Definitely get the Pro plan**
- Test 99k + 500k + 3M all in one day
- Total time: ~5 hours vs 30+ hours
- Total cost: $20 + ~$72 GPU = $92
- Cancel Pro after training is complete

**Most Cost-Effective Strategy:**
1. Start with the **Free tier** for the 99k test ($1.88, 45 min)
2. If results are promising, upgrade to **Pro** for 3M training
3. Run full training in 4 hours
4. Cancel Pro when done
5. Total: $20 Pro + $60 GPU + $1.88 test = **$81.88**
### Updated Training Timeline Estimates

**Single H100 (Free Tier):**

| Training Size | Duration | Total Cost | When to Use |
|---------------|----------|------------|-------------|
| **99k samples** | 45 min | $1.88 | Quick validation, hyperparameter testing |
| **500k samples** | 4 hours | $10 | Medium quality, budget option |
| **3M samples** | 24 hours | $60 | Max quality, have patience |

**6× H100 (Pro Plan at $20/month):**

| Training Size | Duration | Total Cost | When to Use |
|---------------|----------|------------|-------------|
| **99k samples** | 7.5 min | $21.88 | Ultra-fast iteration |
| **500k samples** | 40 min | $30 | Production ready, same day |
| **3M samples** | 4 hours | $80 | Best quality, same-day results |
## Training Strategy

### Dataset: latentcat/grayscale_image_aesthetic_3M
- **Size**: 3 million images at 512×512 resolution
- **Format**: Parquet files with image/conditioning_image/text columns
- **Same Dataset**: Used for the original SD 1.5 brightness ControlNet training
- **License**: Latent Cat (check license before commercial use)
- **Quality**: Pre-processed grayscale images with aesthetic filtering

### Reference Training Results (from latentcat article)

| Configuration | Samples | Hardware | Duration | Cost Estimate |
|--------------|---------|----------|----------|---------------|
| Original SD 1.5 | 100k | A6000 | 13 hours | ~$20 (est.) |
| Original SD 1.5 | 3M | TPU v4-8 | 25 hours | N/A (TPU) |
### SDXL Training Scaling Estimates

**Updated Based on Latentcat Article:**
- Training at 512×512 resolution (NOT 1024×1024) - matches the dataset and the original training
- SDXL has a larger UNet architecture (~2.5GB vs 1.7GB for SD 1.5)
- Expected slowdown: 2-3× compared to SD 1.5 training

**Time Estimates for 99k Training Samples (Lightning.ai Single H100):**

## Calculation Methodology

**Baseline Reference:**
- Latentcat article: 100k samples on A6000 = 13 hours (SD 1.5)
- SDXL overhead: 13h × 2.5 (larger architecture) = ~32.5 hours for 100k on A6000-class hardware
- A6000 ≈ A100 in peak throughput (~300-312 TFLOPs)

**Scaling to H100:**
- A100 estimate: ~4-6 hours for 99k samples. This assumes modern optimizations (fp16, xFormers, 8-bit Adam, larger batches) that the original A6000 run did not use, so treat it as an optimistic figure rather than a direct scaling of the 32.5-hour estimate.
- H100: 1979 TFLOPs → 6.3× faster than A100
- **H100 single GPU: ~38-57 minutes for 99k samples**

**Multi-GPU Scaling (Pro Plan):**
- 6× H100 GPUs = 6× faster = ~7.5 minutes for 99k
- Total cost stays the same (6× faster but 6× more expensive per hour)
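The scaling arithmetic above can be captured in a small helper. This is a back-of-envelope sketch: the throughput and price constants are this document's assumptions, not measured values, and multi-GPU scaling is assumed to be near-linear.

```python
# Back-of-envelope time/cost estimator using this document's assumptions.
H100_HOURS_PER_99K = 0.75   # ~45 min for 99k samples on one H100 (assumed)
H100_PRICE_PER_HOUR = 2.50  # Lightning.ai on-demand rate (assumed)

def estimate(samples: int, num_gpus: int = 1) -> tuple[float, float]:
    """Return (hours, usd) for training `samples` samples on `num_gpus` H100s."""
    hours_single = H100_HOURS_PER_99K * samples / 99_000
    hours = hours_single / num_gpus               # near-linear multi-GPU scaling assumed
    usd = hours * H100_PRICE_PER_HOUR * num_gpus  # total cost is unchanged by GPU count
    return hours, usd

for n, gpus in [(99_000, 1), (3_000_000, 1), (3_000_000, 6)]:
    h, usd = estimate(n, gpus)
    print(f"{n:>9,} samples on {gpus}x H100: {h:5.1f} h, ${usd:.2f}")
```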
## Recommended Configurations

**OPTION 1: Free Tier (Single H100) - Best for Testing**
- **99k samples**: 45 min, $1.88
- **500k samples**: 4 hours, $10
- **3M samples**: 24 hours, $60
- **Best for:** One-off training, budget-conscious, have patience

**OPTION 2: Pro Plan (6× H100) - Best for Production**
- **Subscription**: $20/month (annual), includes ~$13 of credits = **~$7 effective first-month cost**
- **99k samples**: 7.5 min, $21.88 total ($1.88 GPU + $20 sub)
- **500k samples**: 40 min, $30 total ($10 GPU + $20 sub)
- **3M samples**: 4 hours, $80 total ($60 GPU + $20 sub)
- **Best for:** Multiple experiments, 3M training, need results the same day

**Cost Comparison Summary:**

| Scenario | Free Tier | Pro Plan | Savings (Pro) |
|----------|-----------|----------|---------------|
| Single 99k test | $1.88 | $21.88 | ❌ $20 more |
| Single 3M training | $60 | $80 | ❌ $20 more |
| 99k + 500k + 3M | $71.88 (30 hours) | $92 (5 hours) | ✅ Save 25 hours |
| 3+ experiments | $71.88+ (30+ hours) | $92 (5-6 hours) | ✅ Save 24+ hours |

**Recommendation:**
- For a single 99k test: **use the Free tier** (not worth $20 for speed)
- For 3M training: **consider Pro** (4 hrs vs 24 hrs = big difference)
- For multiple runs: **definitely Pro** (can test everything in one day)
## Technical Implementation Plan

### Dataset Verification Script

**Create this script to verify the dataset before training:**

```bash
cat > verify_dataset.py << 'EOF'
#!/usr/bin/env python3
"""
Dataset verification script for SDXL ControlNet Brightness training.
Downloads a subset of the dataset and verifies structure.
Usage: python verify_dataset.py
"""
import os
import sys

from datasets import load_dataset
from PIL import Image

# Expand "~" explicitly; load_dataset does not reliably expand it.
CACHE_DIR = os.path.expanduser("~/.cache/huggingface/datasets")


def verify_dataset():
    print("=" * 60)
    print("SDXL ControlNet Brightness - Dataset Verification")
    print("=" * 60)

    print("\n[1/4] Loading dataset subset (99k samples)...")
    print("This will download ~10-15GB to cache...")
    try:
        train_dataset = load_dataset(
            "latentcat/grayscale_image_aesthetic_3M",
            split="train[:99000]",
            cache_dir=CACHE_DIR,
        )
        print(f"✓ Successfully loaded {len(train_dataset)} samples")
    except Exception as e:
        print(f"✗ Failed to load dataset: {e}")
        sys.exit(1)

    print("\n[2/4] Verifying dataset structure...")
    expected_columns = {"image", "conditioning_image", "text"}
    actual_columns = set(train_dataset.column_names)
    if actual_columns == expected_columns:
        print(f"✓ Columns correct: {train_dataset.column_names}")
    else:
        print("✗ Column mismatch!")
        print(f"  Expected: {expected_columns}")
        print(f"  Got: {actual_columns}")
        sys.exit(1)

    print("\n[3/4] Checking sample data...")
    sample = train_dataset[0]

    if isinstance(sample["image"], Image.Image):
        print(f"✓ Image type: PIL.Image, size: {sample['image'].size}")
    else:
        print(f"✗ Unexpected image type: {type(sample['image'])}")

    if isinstance(sample["conditioning_image"], Image.Image):
        print(f"✓ Conditioning image type: PIL.Image, size: {sample['conditioning_image'].size}")
    else:
        print(f"✗ Unexpected conditioning image type: {type(sample['conditioning_image'])}")

    if isinstance(sample["text"], str):
        print(f"✓ Caption type: str, length: {len(sample['text'])} chars")
        print(f"  Sample caption: '{sample['text'][:100]}...'")
    else:
        print(f"✗ Unexpected caption type: {type(sample['text'])}")

    print("\n[4/4] Checking validation split (last 1000 samples)...")
    try:
        # IMPORTANT: Always use the last 1000 samples for validation.
        # This ensures consistent validation across all training sizes.
        val_dataset = load_dataset(
            "latentcat/grayscale_image_aesthetic_3M",
            split="train[2999000:3000000]",
            cache_dir=CACHE_DIR,
        )
        print(f"✓ Validation split loaded: {len(val_dataset)} samples")
        print("  Validation uses: train[2999000:3000000] (last 1k)")
    except Exception as e:
        print(f"✗ Failed to load validation split: {e}")
        sys.exit(1)

    print("\n" + "=" * 60)
    print("✓ ALL CHECKS PASSED!")
    print("=" * 60)
    print("\nDataset cached at: ~/.cache/huggingface/datasets/")
    print(f"Training samples: {len(train_dataset)}")
    print(f"Validation samples: {len(val_dataset)}")
    print("\nIMPORTANT: Validation always uses samples 2,999,000-2,999,999")
    print("  This ensures consistent validation across all training sizes")
    print("  (99k, 500k, 3M all use the same validation set)")
    print("\nYou can now proceed with training!")
    print("The training script will automatically use this cached data.")


if __name__ == "__main__":
    verify_dataset()
EOF
```

**Make executable and run**:
```bash
chmod +x verify_dataset.py
python verify_dataset.py
```

**Expected output**: Should confirm the dataset structure and cache the first 99k training samples plus the 1k validation slice.
### Manual Preparation Checklist (Do This First!)

**Split into two phases to minimize GPU costs:**

---

## Part A: Local Preparation (BEFORE Launching GPU Instance)

**Do these steps on your local machine or any CPU instance - no GPU needed, $0 cost:**

#### Step 1: Get Your Authentication Tokens
**Prepare these before launching the GPU:**
- **HuggingFace token**: https://huggingface.co/settings/tokens (create a "Read" access token)
- **W&B API key**: https://wandb.ai/authorize

Save these somewhere - you'll need them on the GPU instance.

#### Step 2: Prepare Dataset Verification Script Locally
The full `verify_dataset.py` script is provided in the "Dataset Verification Script" section above (under Technical Implementation Plan).

You can either:
- Copy that script to a file on your local machine, OR
- Recreate it directly on the GPU instance in Part B below

There is no need to prepare this locally if you prefer to create it on the GPU instance.

---

## Part B: GPU Instance Setup (AFTER Launching GPU, BEFORE Training)

**Complete these steps on your GPU instance to avoid wasting GPU credits on training failures:**

**Estimated time: 30-60 minutes (mostly dataset download)**
**GPU credits used: ~$0.75-$2.50** (30-60 min at $1.50-$2.50/hr depending on GPU)
#### Step 1: System Dependencies
```bash
# Update system packages
sudo apt-get update && sudo apt-get install -y git git-lfs build-essential
# Initialize Git LFS
git lfs install
```

#### Step 2: Python Environment with CUDA
```bash
# Install PyTorch with CUDA 11.8 (requires a GPU instance!)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install core ML libraries
pip install diffusers transformers accelerate datasets
# Install utilities
pip install huggingface_hub pillow wandb xformers bitsandbytes
```

#### Step 3: Verify CUDA (Critical!)
```bash
# Verify CUDA availability - MUST show "True"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name(0)}')"
```
**Expected output:**
```
CUDA available: True
CUDA version: 11.8
GPU: NVIDIA A100-SXM4-40GB
```
**If CUDA shows False:** stop and troubleshoot before proceeding!

#### Step 4: Clone Training Repository
```bash
# Clone HuggingFace diffusers
git clone https://github.com/huggingface/diffusers.git
cd diffusers/examples/controlnet
# Verify the training script exists
ls -la train_controlnet_sdxl.py # Should show the file
```

#### Step 5: Authentication Setup
```bash
# Login to HuggingFace (use the token from Part A)
huggingface-cli login
# Paste your token when prompted

# Login to Weights & Biases (use the API key from Part A)
wandb login
# Paste your API key when prompted
```
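If you prefer a scripted, non-interactive setup, both CLIs accept credentials as arguments. The environment variable names below are just a convention for this sketch, not something the tools require:

```bash
# Non-interactive alternative (handy for setup scripts).
# HF_TOKEN and WANDB_KEY are placeholder variable names; set them yourself.
export HF_TOKEN="hf_..."   # from https://huggingface.co/settings/tokens
export WANDB_KEY="..."     # from https://wandb.ai/authorize

huggingface-cli login --token "$HF_TOKEN"
wandb login "$WANDB_KEY"
```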
#### Step 6: Dataset Verification (CRITICAL!)
```bash
# Create the verify_dataset.py script using the code from the
# "Dataset Verification Script" section at the top of this plan
# (under the "Technical Implementation Plan" heading)
# Once created, run it:
chmod +x verify_dataset.py
python verify_dataset.py
```
**Expected output:**
```
============================================================
SDXL ControlNet Brightness - Dataset Verification
============================================================
[1/4] Loading dataset subset (99k samples)...
This will download ~10-15GB to cache...
✓ Successfully loaded 99000 samples
[2/4] Verifying dataset structure...
✓ Columns correct: ['image', 'conditioning_image', 'text']
[3/4] Checking sample data...
✓ Image type: PIL.Image, size: (512, 512)
✓ Conditioning image type: PIL.Image, size: (512, 512)
✓ Caption type: str, length: 87 chars
[4/4] Checking validation split (last 1000 samples)...
✓ Validation split loaded: 1000 samples
  Validation uses: train[2999000:3000000] (last 1k)
============================================================
✓ ALL CHECKS PASSED!
============================================================
Dataset cached at: ~/.cache/huggingface/datasets/
Training samples: 99000
Validation samples: 1000
IMPORTANT: Validation always uses samples 2,999,000-2,999,999
  This ensures consistent validation across all training sizes
  (99k, 500k, 3M all use the same validation set)
You can now proceed with training!
```

#### Step 7: Pre-Flight Verification
```bash
# Check that all packages are installed
pip list | grep -E "torch|diffusers|transformers|accelerate|datasets|xformers"
# Check disk space (need ~20GB free for checkpoints)
df -h ~
# Verify the dataset cache exists
ls -lh ~/.cache/huggingface/datasets/
```

#### Step 8: Create Output Directory
```bash
# Create a directory for training outputs
mkdir -p ~/controlnet-brightness-sdxl
# Return to the training directory
cd ~/diffusers/examples/controlnet
```

---

## ✅ Preparation Complete!

**Once all Part B steps pass, you're ready to start GPU training.**

The training command (shown in Phase 3 below) will now:
- ✅ Use the pre-downloaded dataset from cache (no re-download)
- ✅ Have all required libraries installed with CUDA support
- ✅ Be authenticated to HuggingFace and W&B
- ✅ Save checkpoints to the prepared directory

**Total preparation cost:** ~$0.75-$2.50 (vs $60 for the full 3M training run)
**Why it's worth it:** catches setup issues early without wasting up to 24 hours of GPU time
**Hardware Selection (Updated for Lightning.ai):**
- **RECOMMENDED FOR TESTING**: Single H100 on the Free tier
  - 99k training in 45 min for $1.88
  - Perfect for validation and hyperparameter tuning
  - 80GB VRAM allows good batch sizes
  - No subscription required
- **RECOMMENDED FOR PRODUCTION**: 6× H100 on the Pro plan ($20/month annual)
  - 3M training in 4 hours for $80 total
  - Can test multiple configs in one day
  - Effective first-month cost ~$7 after included credits
  - Cancel the subscription after training is complete
- **Not Recommended**: A100 - the H100 is faster and more cost-efficient
### Phase 2: Dataset Preparation

**Dataset Split Strategy (for 99k quick training):**
- **Training**: 99,000 samples (`split="train[:99000]"`)
- **Validation**: 1,000 samples (`split="train[2999000:3000000]"`) - **ALWAYS the last 1k**
- **Total loaded**: 100,000 samples (99k + the last 1k of the 3M dataset)

**⚠️ CRITICAL: Validation Always Uses the Last 1000 Samples**
- All training sizes (99k, 500k, 3M) use `train[2999000:3000000]` for validation
- This ensures a consistent validation set across all training runs
- Allows fair comparison of model quality at different training stages
- No overlap between training and validation for any training size

**Why This Matters:**
```
❌ WRONG: Using different validation sets for different training sizes
- 99k training: train[:99000] + validation train[99000:100000]
- 500k training: train[:499000] + validation train[499000:500000]
- 3M training: train[:2999000] + validation train[2999000:3000000]
Problem: Can't compare results! Each uses different validation data.

✅ CORRECT: Same validation set for all training sizes
- 99k training: train[:99000] + validation train[2999000:3000000]
- 500k training: train[:499000] + validation train[2999000:3000000]
- 3M training: train[:2999000] + validation train[2999000:3000000]
Benefit: Fair comparison across all training runs on the same validation set.
```
### Understanding HuggingFace Dataset Caching

**Important**: The HuggingFace `datasets` library automatically caches all downloads to `~/.cache/huggingface/datasets/`. This means:

✅ **Cache reuse is automatic**: When the training script runs, it checks the cache first and reuses any previously downloaded data
✅ **No re-downloads**: You won't download the full 3M dataset if you've already downloaded a subset
✅ **The pre-download step is OPTIONAL**: The training command can handle downloading on its own

**Pre-download Benefits**:
- Verify the dataset structure before training starts
- Separate download time from training time
- Ensure dataset access works before committing GPU hours

**Pre-download is NOT required**: The training script's `--max_train_samples=99000` parameter works whether you pre-download or not.
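As a quick sanity check of this behavior, loading the same split twice should hit the cache on the second call. A minimal sketch (the first call may still take a while, since the library downloads and prepares the underlying data files before slicing):

```python
import time

from datasets import load_dataset

# The second call with identical arguments is served from the local cache.
for attempt in range(2):
    start = time.time()
    ds = load_dataset(
        "latentcat/grayscale_image_aesthetic_3M",
        split="train[:1000]",  # small slice, just to demonstrate cache reuse
    )
    print(f"Attempt {attempt + 1}: {len(ds)} rows in {time.time() - start:.1f}s")
```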
### Dataset Download Options

**Option A: Pre-download for verification (RECOMMENDED)**
```python
import os

from datasets import load_dataset

# This downloads and caches ~100k samples for verification.
# Expand "~" explicitly rather than relying on the library to do it.
train_dataset = load_dataset(
    "latentcat/grayscale_image_aesthetic_3M",
    split="train[:99000]",
    cache_dir=os.path.expanduser("~/.cache/huggingface/datasets"),  # default cache location
)

# Verify the dataset structure
print(f"Dataset size: {len(train_dataset)}")
print(f"Columns: {train_dataset.column_names}")
print(f"First sample keys: {train_dataset[0].keys()}")

# Check a sample
sample = train_dataset[0]
print(f"Image size: {sample['image'].size}")
print(f"Conditioning image size: {sample['conditioning_image'].size}")
print(f"Caption: {sample['text']}")
```

**Option B: Let the training script handle the download**
- Simply run the training command with `--dataset_name` and `--max_train_samples`
- The script will download to the cache automatically
- Slightly riskier if there are dataset access issues

**Recommended:** Use the full `verify_dataset.py` script (see the "Dataset Verification Script" section above), which implements Option A with comprehensive validation checks.

**Data Format Validation:**
- Verify columns: `image`, `conditioning_image`, `text`
- Check image resolution: 512×512 (training runs at `--resolution=512`, so no upscaling is needed)
- Validate grayscale format
**Steps Calculation (IMPORTANT):**
- Training samples: 99,000
- Batch size: 16
- Gradient accumulation: 4
- **Effective batch size**: 16 × 4 = 64 samples/step
- **Steps per epoch**: 99,000 ÷ 64 = 1,547 steps
- **For 2 epochs**: ~3,094 total steps
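The step counts follow directly from the effective batch size; this tiny helper (names are illustrative) makes the arithmetic explicit and is handy when trying other batch settings:

```python
import math

def total_steps(samples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Optimizer steps for a run: ceil(samples / effective_batch) * epochs."""
    effective_batch = batch_size * grad_accum
    return math.ceil(samples / effective_batch) * epochs

# 99k samples, batch 16, accumulation 4, 2 epochs -> ~3,094 steps
print(total_steps(99_000, 16, 4, 2))  # 3094
print(total_steps(99_000, 32, 4, 2))  # 1548, matching the ~1,546-step H100 estimate below
```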
### Phase 3: Training Configuration

**Prerequisites:** Complete the "Manual Preparation Checklist" above before running this command.

**Training Command (Based on Latentcat Article):**
```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl"
accelerate launch train_controlnet_sdxl.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--dataset_name="latentcat/grayscale_image_aesthetic_3M" \
--max_train_samples=99000 \
--conditioning_image_column="conditioning_image" \
--image_column="image" \
--caption_column="text" \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=1e-5 \
--train_batch_size=16 \
--gradient_accumulation_steps=4 \
--num_train_epochs=2 \
--checkpointing_steps=1500 \
--validation_steps=1500 \
--tracker_project_name="brightness-controlnet-sdxl" \
--report_to="wandb" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam
```

**Key Parameters Explained:**
- `--max_train_samples=99000`: Limit to 99k samples (the last 1k is reserved for validation)
- `--resolution=512`: Match the dataset resolution (the latentcat article used 512, not 1024)
- `--learning_rate=1e-5`: From the latentcat article
- `--train_batch_size=16`: From the latentcat article
- `--gradient_accumulation_steps=4`: Effective batch = 16 × 4 = 64
- `--num_train_epochs=2`: From the latentcat article
- **`--checkpointing_steps=1500`**: Save every 1500 STEPS (~once per epoch)
  - Total training: ~3,094 steps for 2 epochs
  - Checkpoints at: 1500, 3000 steps
- **`--validation_steps=1500`**: Run validation every 1500 STEPS
- `--gradient_checkpointing`: Reduces VRAM usage
- `--use_8bit_adam`: Memory optimization
- `--enable_xformers_memory_efficient_attention`: Memory-efficient attention

**Critical Understanding - Steps vs Samples:**
- 1 STEP = processing 1 effective batch = 64 samples
- Checkpointing every 1500 steps = every 1500 × 64 = 96,000 samples (~1 epoch)
- NOT a checkpoint every 1500 samples!
- Total steps for 2 epochs: 99,000 ÷ 64 × 2 = 3,094 steps
**VRAM Requirements with These Settings:**
The settings above are optimized for memory efficiency:
- `--mixed_precision="fp16"`: Halves memory usage
- `--gradient_checkpointing`: Trades compute for memory (~40% VRAM savings)
- `--use_8bit_adam`: Reduces optimizer state memory
- `--enable_xformers_memory_efficient_attention`: Memory-efficient attention

**Estimated VRAM usage:**
- SDXL base model (FP16): ~6-7GB
- ControlNet model: ~2.5GB
- 8-bit Adam optimizer states: ~3-4GB
- Gradients (with checkpointing): ~2-3GB
- Activations (batch 16, 512×512, gradient checkpointing): ~8-12GB
- **Total: ~22-28GB peak**

**GPU Compatibility:**

| GPU | VRAM | Will It Fit? | Batch Size | Notes |
|-----|------|--------------|------------|-------|
| **L4** | 24GB | ⚠️ Tight | 8-12 | Reduce `--train_batch_size` to 8 or 12 |
| **A100 40GB** | 40GB | ✅ Yes | 16 | Comfortable fit with the default settings |
| **A100 80GB** | 80GB | ✅ Yes | 16-24 | Plenty of headroom, can increase batch |
| **H100 80GB** | 80GB | ✅ Yes | 16-24 | Fastest training, plenty of VRAM |

**Note:** These defaults target the A100 40GB and fit comfortably at batch size 16; on an H100 you can raise the batch size (see the H100-optimized section below).

**If using an L4 24GB**, modify the command:
```bash
# Change this line:
--train_batch_size=16 \
# To:
--train_batch_size=8 \
```
This keeps the effective batch size at 8 × 4 = 32 (half of 64), which still trains well.
### Accelerate Configuration for Multi-GPU Training

**Important:** Multi-GPU training on Lightning.ai requires the Pro plan ($20/month annual).

#### Single GPU (Free Tier) - No Configuration Needed
For single GPU training on the Free tier, `accelerate launch` works without any configuration:
```bash
# No accelerate config needed - auto-detects single GPU
accelerate launch train_controlnet_sdxl.py [args...]
```

#### Multi-GPU (Pro Plan) - Configure Before Training
For 6× H100 training on the Pro plan, configure accelerate once:
```bash
# Run configuration wizard
accelerate config
```

**Configuration Options for 6× H100:**
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU # DistributedDataParallel across the GPUs
num_machines: 1 # Single machine with 6 GPUs
num_processes: 6 # One process per GPU
gpu_ids: all # Use all available GPUs
mixed_precision: fp16 # Match training script
use_cpu: false
dynamo_backend: NO # Disable torch.compile for compatibility
```

**Quick Config (Non-Interactive):**
```bash
# Create the accelerate config file directly
mkdir -p ~/.cache/huggingface/accelerate
cat > ~/.cache/huggingface/accelerate/default_config.yaml << 'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 6
gpu_ids: all
mixed_precision: fp16
use_cpu: false
dynamo_backend: NO
EOF
```

**Verify Configuration:**
```bash
# Check configuration
accelerate env
# Test the multi-GPU setup
accelerate test
```

**Launch Multi-GPU Training:**
```bash
# With a configuration file, launch works the same as single GPU
accelerate launch train_controlnet_sdxl.py [args...]
# Or specify the config explicitly
accelerate launch --config_file ~/.cache/huggingface/accelerate/default_config.yaml \
train_controlnet_sdxl.py [args...]
```
### H100-Optimized Training Parameters

The H100 GPU has **80GB VRAM** and **1979 TFLOPs**, allowing larger batch sizes and better optimization than the A100.

#### Optimal Batch Size for H100

**Default settings (designed for A100 40GB):**
```bash
--train_batch_size=16
--gradient_accumulation_steps=4
# Effective batch size: 16 x 4 = 64 samples/step
# VRAM usage: ~22-28GB
```

**H100-optimized settings (80GB VRAM):**
```bash
--train_batch_size=32 # 2x larger than A100
--gradient_accumulation_steps=4
# Effective batch size: 32 x 4 = 128 samples/step
# VRAM usage: ~40-48GB (still plenty of headroom)
```

**Aggressive H100 settings (maximum throughput):**
```bash
--train_batch_size=48 # 3x larger than A100
--gradient_accumulation_steps=2 # Reduce accumulation since the batch is larger
# Effective batch size: 48 x 2 = 96 samples/step
# VRAM usage: ~55-65GB
# Faster training due to fewer gradient accumulation steps
```
#### Single H100 Training Command (99k samples)

**Optimized for H100 80GB:**
```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-h100"
accelerate launch train_controlnet_sdxl.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--dataset_name="latentcat/grayscale_image_aesthetic_3M" \
--max_train_samples=99000 \
--conditioning_image_column="conditioning_image" \
--image_column="image" \
--caption_column="text" \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=1e-5 \
--train_batch_size=32 \
--gradient_accumulation_steps=4 \
--num_train_epochs=2 \
--checkpointing_steps=750 \
--validation_steps=750 \
--tracker_project_name="brightness-controlnet-sdxl-h100" \
--report_to="wandb" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam \
--dataloader_num_workers=8 \
--set_grads_to_none
```

**Key H100 Optimizations:**
- `--train_batch_size=32` (vs 16 on A100) - 2× larger batches
- `--gradient_accumulation_steps=4` - Effective batch = 128
- `--checkpointing_steps=750` - Saves every ~96k samples, i.e. roughly once per epoch
- `--dataloader_num_workers=8` - Faster data loading (assuming the instance has ample CPU cores)
- `--set_grads_to_none` - Faster than zeroing gradients on modern GPUs

**Expected Performance:**
- Steps per epoch: 99,000 ÷ 128 = 773 steps
- Total steps (2 epochs): ~1,546 steps
- Training time: ~38-45 minutes on a single H100
- Checkpoints saved at: 750, 1500 steps
#### 6× H100 Training Command (3M samples) - Pro Plan

**For Pro plan multi-GPU training:**
```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-multi-h100"
# Configure accelerate for 6 GPUs (if not done already)
accelerate config # Select MULTI_GPU, 6 processes
# Launch training
accelerate launch train_controlnet_sdxl.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--dataset_name="latentcat/grayscale_image_aesthetic_3M" \
--max_train_samples=2999000 \
--conditioning_image_column="conditioning_image" \
--image_column="image" \
--caption_column="text" \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=1e-5 \
--train_batch_size=24 \
--gradient_accumulation_steps=2 \
--num_train_epochs=1 \
--checkpointing_steps=2500 \
--validation_steps=2500 \
--tracker_project_name="brightness-controlnet-sdxl-3M" \
--report_to="wandb" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam \
--dataloader_num_workers=8 \
--set_grads_to_none \
--resume_from_checkpoint="latest"
```

**Multi-GPU Optimizations:**
- `--train_batch_size=24` per GPU × 6 GPUs = 144 samples per step (before accumulation)
- `--gradient_accumulation_steps=2` - Effective batch = 144 × 2 = 288
- `--checkpointing_steps=2500` - Save every ~720k samples
- `--resume_from_checkpoint="latest"` - Auto-resume if interrupted (starts fresh when no checkpoint exists yet)

**Expected Performance:**
- Effective batch size: 288 samples/step
- Steps per epoch: 2,999,000 ÷ 288 = ~10,413 steps
- Training time: ~4 hours on 6× H100
- Checkpoints: 2500, 5000, 7500, 10000 steps + final
#### Batch Size Selection Guide

| GPU Config | VRAM | Recommended batch_size | grad_accum_steps | Effective Batch | Training Speed |
|------------|------|------------------------|------------------|-----------------|----------------|
| Single L4 | 24GB | 8 | 4 | 32 | Slow (baseline) |
| Single A100 | 40GB | 16 | 4 | 64 | 2× faster than L4 |
| Single H100 | 80GB | 32 | 4 | 128 | 6× faster than L4 |
| 6× H100 (Pro) | 480GB | 24/GPU | 2 | 288 | 36× faster than L4 |

**Rule of Thumb:**
- Larger `train_batch_size` = better GPU utilization, faster training
- Larger effective batch size = more stable training, better convergence
- An H100 can handle 2-3× larger batch sizes than an A100 with the same settings
#### Memory Optimization Tips

**If you encounter OOM (Out of Memory) errors on H100:**

1. **Reduce the batch size incrementally:**
```bash
--train_batch_size=32 # Start here
--train_batch_size=24 # If OOM
--train_batch_size=16 # If still OOM
```

2. **Enable additional memory optimizations** (most are already in the commands above):
```bash
--gradient_checkpointing                      # already enabled
--use_8bit_adam                               # already enabled
--enable_xformers_memory_efficient_attention  # already enabled
--set_grads_to_none                           # cheaper than zeroing gradients
```

3. **Use gradient accumulation to maintain the effective batch size:**
```bash
# If reducing from batch_size=32 to batch_size=16
--train_batch_size=16
--gradient_accumulation_steps=8 # Double accumulation to keep effective=128
```
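To see how close a given batch size comes to the 80GB limit, you can log peak allocations with PyTorch's built-in counters. A minimal sketch (in practice the training run and W&B already report this; the function name here is illustrative):

```python
import torch

def report_peak_vram(tag: str) -> None:
    """Print the peak VRAM allocated so far on GPU 0, in GB."""
    peak_gb = torch.cuda.max_memory_allocated(0) / 1024**3
    print(f"[{tag}] peak VRAM: {peak_gb:.1f} GB")

torch.cuda.reset_peak_memory_stats(0)
# ... run a few training steps at the candidate batch size ...
report_peak_vram("after warmup steps")
```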
### Full 3M Dataset Training Options

**For maximum quality training on the complete dataset:**

#### Option A: Single H100 (Free Tier)

| Metric | Value |
|--------|-------|
| GPU | 1× H100 80GB (~$2.50/hr on Lightning.ai) |
| Dataset | 2,999,000 training + 1,000 validation |
| Estimated Duration | **~24 hours** |
| Estimated Cost | **$60 GPU credits** |
| Subscription Cost | **$0** (Free tier) |
| **Total Cost** | **$60** |
| Checkpoints | Every 5000 steps (~every 480k samples) |

**Pros:**
- ✅ Lowest total cost
- ✅ No subscription required
- ✅ Good for one-time training

**Cons:**
- ❌ 24 hours training time (must monitor)
- ❌ Can't quickly iterate if issues arise

#### Option B: 6× H100 (Pro Plan - $20/month)

| Metric | Value |
|--------|-------|
| GPU | 6× H100 80GB (~$2.50/hr × 6 = $15/hr) |
| Dataset | 2,999,000 training + 1,000 validation |
| Estimated Duration | **~4 hours** |
| Estimated Cost | **$60 GPU credits** |
| Subscription Cost | **$20/month** (annual billing) |
| **Total Cost** | **$80** |
| **Net Cost** | **$67** (after the ~$13 annual credit value) |
| Checkpoints | Every 5000 steps (~every 480k samples) |

**Pros:**
- ✅ Completes in 4 hours vs 24 hours
- ✅ Can run same-day if needed
- ✅ Can test multiple configs quickly
- ✅ Effective cost drops by the ~$13 credit value
- ✅ Can cancel after training

**Cons:**
- ❌ $20 upfront subscription cost

**Scaling Math:**
- Single H100: 99k in 45 min → 3M in 45 min × 30.3 ≈ 24 hours
- 6× H100: 24 hours ÷ 6 ≈ 4 hours

**Cost Comparison:**
- Free tier: $60, 24-hour wait
- Pro plan: $80, 4-hour wait
- **Price difference: $20 to save 20 hours**
#### Adjusted Training Command

```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-3M"
accelerate launch train_controlnet_sdxl.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--dataset_name="latentcat/grayscale_image_aesthetic_3M" \
--max_train_samples=2999000 \
--conditioning_image_column="conditioning_image" \
--image_column="image" \
--caption_column="text" \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=1e-5 \
--train_batch_size=24 \
--gradient_accumulation_steps=4 \
--num_train_epochs=1 \
--checkpointing_steps=5000 \
--validation_steps=5000 \
--validation_prompt "a beautiful garden scene" "modern city street" "abstract art pattern" \
--validation_image "./val_cond_1.png" "./val_cond_2.png" "./val_cond_3.png" \
--tracker_project_name="brightness-controlnet-sdxl-3M" \
--report_to="wandb" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam \
--resume_from_checkpoint="latest"
```
#### Key Adjustments Explained

**Batch Size Scaling:**
- **`--train_batch_size=24`** (increased from 16)
  - H100 80GB has 2× the VRAM of the A100 40GB
  - Can safely increase the batch size by 50%
  - Alternative: `--train_batch_size=32` if you have headroom
- **`--gradient_accumulation_steps=4`** (kept the same)
  - Effective batch size: 24 × 4 = **96 samples/step**
  - If using batch_size=32: 32 × 4 = **128 samples/step**

**Validation Prompts & Images:**
- **`--validation_prompt` / `--validation_image`**: the diffusers script pairs each validation prompt with one conditioning image; the `./val_cond_*.png` paths in the command above are placeholders - point them at real conditioning images (e.g. samples from the validation split)

**Dataset & Checkpointing:**
- **`--max_train_samples=2999000`** (vs 99,000 for quick training)
  - Training split: `train[:2999000]` (the first 2,999,000 samples)
  - **Validation split: `train[2999000:3000000]` (the SAME as for 99k training!)**
  - ✅ This allows direct comparison of validation metrics between 99k and 3M training
  - ✅ No overlap between training and validation data
- **`--num_train_epochs=1`** (vs 2)
  - For 3M samples, 1 epoch is usually sufficient
  - Can increase to 2 if quality needs improvement
- **`--checkpointing_steps=5000`** (vs 1,500)
  - More frequent checkpoints would create too many files
  - 5000 steps = every ~480k samples
  - Total checkpoints: ~6-7 for the full run
- **`--validation_steps=5000`** (matches checkpointing)
  - Run validation at each checkpoint

**Resumption:**
- **`--resume_from_checkpoint="latest"`**
  - CRITICAL for long training runs
  - If training crashes, it automatically resumes from the last checkpoint
  - Saves up to a day of retraining if interrupted
#### Training Math

**Steps Calculation:**
- Training samples: 2,999,000 (validation: 1,000)
- Effective batch size: 96 (or 128 with batch_size=32)
- Steps per epoch: 2,999,000 ÷ 96 = **31,240 steps**
- With batch_size=32: 2,999,000 ÷ 128 = **23,429 steps**
- For 1 epoch: 31,240 steps total
- For 2 epochs: 62,480 steps total

**Checkpoints:**
- Saved every 5,000 steps
- Checkpoint locations: steps 5000, 10000, 15000, 20000, 25000, 30000, 31240 (final)
- Each checkpoint: ~2.5GB (ControlNet weights)
- Total storage: ~20GB for all checkpoints + training state

#### VRAM Usage (H100 80GB)

With batch_size=24:
- SDXL base model (FP16): ~6-7GB
- ControlNet model: ~2.5GB
- 8-bit Adam optimizer: ~3-4GB
- Gradients (with checkpointing): ~3-4GB
- Activations (batch 24): ~15-20GB
- **Total: ~35-40GB** ✅ Fits comfortably in 80GB

With batch_size=32 (max):
- Activations increase to ~20-25GB
- **Total: ~42-48GB** ✅ Still fits with headroom

**Recommended:** Start with batch_size=24, monitor VRAM in W&B, and increase to 32 if usage stays under ~60GB.
#### Risk Mitigation for Long Training

**Strategy 1: Incremental Training**
```bash
# Start with 500k samples to validate the approach
--max_train_samples=500000
# Cost: ~$10, Duration: ~4 hours on a single H100
# If results are good, continue to the full 3M
```

**Strategy 2: Early Checkpoint Evaluation**
```bash
# Evaluate quality at checkpoints (single H100, ~125k samples/hour):
# - checkpoint-5000  (~480k samples,  ~4 hours,    ~$10)
# - checkpoint-10000 (~960k samples,  ~8 hours,    ~$19)
# - checkpoint-15000 (~1.44M samples, ~11.5 hours, ~$29)
# Can stop early if quality plateaus
```

**Strategy 3: Use Spot Instances**
- Many cloud providers offer H100 spot instances at a 50-70% discount
- Cost could drop to ~$0.75-$1.25/hr (~$18-$30 for the full 24-hour 3M run)
- Requires `--resume_from_checkpoint="latest"` (already included)
- Risk: training may be interrupted, but it will resume automatically
#### When to Use Full 3M Training

**Use 99k samples if:**
- ✅ First time training a ControlNet
- ✅ Testing hyperparameters
- ✅ Budget constrained (<$50)
- ✅ Need results quickly (same day)

**Use 3M samples if:**
- ✅ 99k results are good but you want better quality
- ✅ Commercial production use (worth the investment)
- ✅ Training other ControlNet types (can reuse knowledge)
- ✅ Contributing to research/community (publishable results)
- ✅ Budget allows ($60-$80)
### Phase 4: Training Monitoring

**Setup Weights & Biases:**
```bash
wandb login
# Use wandb to track:
# - Loss curves
# - Validation images every 1500 steps
# - Learning rate schedule
# - GPU utilization
```

**Checkpoints:**
- Saved every 1,500 steps to `$OUTPUT_DIR/checkpoint-{step}`
- With ~3,094 total steps, you will get checkpoints at:
  - `checkpoint-1500` (~97% of epoch 1)
  - `checkpoint-3000` (~94% through epoch 2)
  - Final model at the end of training
- Can resume training if interrupted: `--resume_from_checkpoint="./controlnet-brightness-sdxl/checkpoint-1500"`

**Validation:**
- Uses the held-out validation slice `train[2999000:3000000]` (1,000 samples)
- Runs every 1,500 steps (at checkpoints)
- W&B logs validation images and metrics
- Supply `--validation_prompt`/`--validation_image` if you want generated validation images (see the 3M command above)
### Validation Metrics (Automatic)

**Minimal configuration needed.** The training script computes and logs metrics as it goes:

**Loss Function (Automatic)**:
- **Default**: MSE (Mean Squared Error) loss between the predicted and target noise
- **Optional**: some diffusers training scripts expose alternative losses (e.g. Huber); check whether your version of `train_controlnet_sdxl.py` supports a `--loss_type` flag before relying on it
- **Formula**: `loss = F.mse_loss(model_pred.float(), target.float())`

**What Gets Logged to W&B**:
1. **Training loss** (every step)
2. **Validation outputs** (every `--validation_steps=1500` steps)
3. **Validation images** (generated samples at validation time, when validation prompts/images are supplied)
4. **Learning rate** (schedule tracking)
5. **GPU utilization** (hardware monitoring)

**Validation Process**:
1. Every 1500 steps, training pauses
2. The model generates images from the validation prompts and conditioning images
3. Outputs are logged to W&B alongside the training loss
4. Training resumes

**No manual metrics needed** - the training script handles logging end to end.
### Phase 5: Model Evaluation & Publishing

**Test Inference:**
First, install the QR code library if needed:
```bash
pip install qrcode[pil]
```
Then run inference:
```python
import qrcode
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from PIL import Image

# Generate a QR code for testing
print("Generating QR code for https://google.com...")
qr = qrcode.QRCode(
    version=1,
    error_correction=qrcode.constants.ERROR_CORRECT_H,
    box_size=10,
    border=4,
)
qr.add_data("https://google.com")
qr.make(fit=True)

# Create the QR code image, resize to 1024x1024, and convert to RGB for the pipeline
qr_image = qr.make_image(fill_color="black", back_color="white")
qr_image = qr_image.resize((1024, 1024), Image.LANCZOS).convert("RGB")
print(f"QR code generated: {qr_image.size}")

# Load the trained ControlNet.
# Note: accelerate checkpoints keep the weights in a controlnet/ subfolder;
# the final model is saved at the output directory root.
print("Loading ControlNet model...")
controlnet = ControlNetModel.from_pretrained(
    "./controlnet-brightness-sdxl/checkpoint-3000/controlnet",  # or checkpoint-1500/controlnet
    torch_dtype=torch.float16
)

# Load the SDXL pipeline with the ControlNet
print("Loading SDXL pipeline...")
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16
)
pipe.enable_xformers_memory_efficient_attention()
pipe.to("cuda")

# Generate an artistic QR code
print("Generating artistic QR code...")
image = pipe(
    prompt="a beautiful garden scene with flowers, highly detailed, professional photography",
    negative_prompt="blurry, low quality, distorted",
    image=qr_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.45,  # Adjust 0.3-0.6 for balance
    guidance_scale=7.5,
).images[0]

# Save the results
qr_image.save("original_qr.png")
image.save("artistic_qr_result.png")
print("Done! Check artistic_qr_result.png")
print("Scan with a phone to verify the QR code still works!")
```

**Testing Different Conditioning Scales:**
```python
# Test multiple conditioning scales to find the best balance
for scale in [0.3, 0.4, 0.5, 0.6]:
    print(f"Testing conditioning_scale={scale}...")
    image = pipe(
        prompt="a beautiful garden scene with flowers",
        image=qr_image,
        num_inference_steps=30,
        controlnet_conditioning_scale=scale,
    ).images[0]
    image.save(f"result_scale_{scale}.png")
```
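To go beyond eyeballing the results, you can programmatically verify that each output still decodes. This is a minimal sketch using OpenCV's `QRCodeDetector` (install `opencv-python`; the helper name is illustrative); treat a failed decode as a sign to raise the conditioning scale:

```python
import cv2

def qr_decodes(path: str, expected: str = "https://google.com") -> bool:
    """Return True if the image at `path` contains a QR code with the expected payload."""
    img = cv2.imread(path)
    data, points, _ = cv2.QRCodeDetector().detectAndDecode(img)
    return points is not None and data == expected

for scale in [0.3, 0.4, 0.5, 0.6]:
    ok = qr_decodes(f"result_scale_{scale}.png")
    print(f"scale={scale}: {'scannable' if ok else 'NOT scannable'}")
```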
**Publish to HuggingFace Hub:**
```bash
# After validation
huggingface-cli login
python scripts/upload_to_hub.py \
--model_path="./controlnet-brightness-sdxl" \
--repo_name="Oysiyl/controlnet-brightness-sdxl"
```
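If you don't have an upload helper script handy, the `huggingface_hub` API can do the same thing directly. A minimal sketch (the repo name is this document's example; the folder is assumed to be the final `save_pretrained` output):

```python
from huggingface_hub import HfApi

api = HfApi()  # uses the token stored by `huggingface-cli login`
repo_id = "Oysiyl/controlnet-brightness-sdxl"

api.create_repo(repo_id, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./controlnet-brightness-sdxl",  # final ControlNet weights
    repo_id=repo_id,
    repo_type="model",
)
```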
## Cost-Benefit Analysis

### Investment Required (Updated for Single H100)

**Strategy A: Free Tier (99k Quick Test)**

| Component | Cost/Time |
|-----------|-----------|
| GPU Credits (99k samples, 2 epochs, single H100) | $1.88 |
| Setup Time | 1-2 hours |
| Training Duration | **45 minutes** ⚡ |
| Testing & Validation | 2-3 hours |
| **Total Time** | **~4-6 hours** (same day) |
| **Total Cost** | **$1.88** |

**Strategy B: Pro Plan (Full 3M Training)**

| Component | Cost/Time |
|-----------|-----------|
| Pro Subscription (can cancel after) | $20/month |
| Included credits value | -$13 (240 credits/year) |
| GPU Credits (3M samples, 1 epoch, 6× H100) | $60 |
| Setup Time | 1-2 hours |
| Training Duration | **4 hours** ⚡ |
| Testing & Validation | 2-3 hours |
| **Total Time** | **~8 hours** (same day) |
| **Total Cost** | **$80** ($20 sub + $60 GPU) |
| **Net Cost** | **$67** (after annual credit value) |

**Strategy C: All-in-One (Pro Plan, Test Everything)**

| Component | Cost/Time |
|-----------|-----------|
| Pro Subscription | $20/month |
| 99k test (6× H100) | $1.88 (7.5 min) |
| 500k training (6× H100) | $10 (40 min) |
| 3M training (6× H100) | $60 (4 hours) |
| **Total GPU Time** | **~5 hours** |
| **Total GPU Cost** | **$71.88** |
| **Total with Sub** | **$91.88** |
| **Net after credits** | **$78.88** |

**Recommendation:** Start with Strategy A ($1.88) and upgrade to Strategy B if results are promising.
### Value Delivered
1. **Unblocks SDXL Migration**: Enables the upgrade from SD 1.5 to higher-quality SDXL
2. **Better Image Quality**: SDXL produces superior 1024×1024 images vs SD 1.5's 512×512
3. **Community Value**: First public SDXL brightness ControlNet (potential citations/recognition)
4. **No Alternatives**: Cannot proceed with SDXL QR code generation without this model
5. **Reusable Asset**: Once trained, it can be used indefinitely

### Risk Mitigation
- **Start Small**: Train on 99k samples first (~$2, under an hour)
- **Evaluate Early**: Check quality at intermediate checkpoints (checkpoint-1500, checkpoint-3000)
- **Iterative Approach**: Extend training only if initial results are promising
- **Fallback**: Can continue using SD 1.5 if SDXL training fails
## Alternative Approaches Considered

### Option 1: Train Brightness ControlNet for SDXL (RECOMMENDED)
- **Pros**:
  - Proven training pipeline (diffusers script exists)
  - Same dataset as the original SD 1.5 model
  - Good quality/cost balance
  - Community support and documentation
  - License-friendly (SDXL is permissive)
- **Cons**:
  - Requires a GPU time investment (~$2-$80)
  - 45 minutes to 24 hours training duration
  - Still requires a high-VRAM GPU for inference
- **Cost**: ~$10 for 500k samples, $60-$80 for the full 3M, on H100
- **Risk**: Low - well-documented process
- **Verdict**: ✅ **Best choice for production use**

### Option 2: Train Brightness ControlNet for Flux Schnell
- **Pros**:
  - Apache 2.0 license (fully commercial)
  - Faster inference than Flux Dev (3× speedup)
  - Same architecture as Dev (12B parameters)
  - Would be a first-of-its-kind community contribution
- **Cons**:
  - ⚠️ **No existing training scripts for Schnell**
  - Would need to adapt Flux Dev training code
  - Unknown whether distillation affects ControlNet training
  - Still requires 32-40GB VRAM (heavier than SDXL)
  - Higher risk and uncertainty
  - Longer training time due to the larger model
- **Cost**: $200-$500 (estimated, higher due to the larger model)
- **Risk**: High - experimental, no precedent
- **Verdict**: **Experimental - only if willing to pioneer new territory**

### Option 3: Use SDXL LoRA for Brightness Control
- **Pros**: No training required, immediate availability
- **Cons**: Less precise control than a dedicated ControlNet; may not work well for QR codes
- **Verdict**: Worth testing but likely insufficient for the QR code use case

### Option 4: Latent Initialization Approach
- **Pros**: Architecture-agnostic, works with both SDXL and Flux
- **Cons**: Less control over the brightness distribution, requires experimentation
- **Verdict**: Good fallback but not as reliable as a ControlNet

### Option 5: Wait for Community Release
- **Pros**: Zero cost, zero effort
- **Cons**: No timeline, may never happen, blocks project progress
- **Verdict**: Not viable for active development

### Option 6: Hybrid Tile ControlNet + Post-Processing
- **Pros**: A Tile ControlNet is available for SDXL
- **Cons**: Doesn't address brightness control directly
- **Verdict**: Complementary but not a replacement

**Conclusion**: Training an SDXL ControlNet is the most reliable solution. Flux Schnell is interesting for research but carries significant execution risk.
| ## Recommended Action Plan | |
| ### Immediate Setup (Day 1) | |
| 1. **Launch Lightning AI Instance**: A100 40GB GPU | |
| 2. **Run Setup Commands**: Install all dependencies (see Phase 3 above) | |
| 3. **Authenticate**: HuggingFace and W&B login | |
| 4. **Clone Diffusers**: Get training scripts | |
| ### Training Phase (Day 1 - Morning) β‘ | |
| 5. **Start Training**: Launch training with 99k samples (~45 minutes on 8ΓH100) | |
| 6. **Monitor W&B**: Track loss curves and validation images in real-time | |
| 7. **First Checkpoint**: Review checkpoint-1500 (~25 minutes in) | |
| 8. **Training Complete**: Total ~45 minutes for full 2-epoch run | |
| ### Evaluation Phase (Day 1 - Afternoon) | |
| 9. **Post-Training Validation**: Run inference on 1k validation set | |
| 10. **QR Code Testing**: Test with actual QR codes, measure scannability | |
| 11. **Quality Assessment**: Compare to SD 1.5 brightness ControlNet | |
| 12. **Decision Point**: | |
| - If quality good: Publish and integrate (move to next phase) | |
| - If needs improvement: Launch 2nd training run with adjusted hyperparameters (~45 min) | |
| - Can try 3-4 different configurations in same day! | |
### Optional: Full Dataset Training (Day 1 - Evening)
12a. **If 99k results are promising**: Launch full 3M training (~4 hours on 6× H100 with the Pro plan, or ~24 hours on a single H100)
12b. **Monitor overnight**: W&B tracks progress automatically
12c. **Next morning**: Evaluate final model quality
### Integration Phase (Day 2)
13. **Publish to HuggingFace**: Upload the best checkpoint
14. **Update app_sdxl.py**: Integrate the new ControlNet model (loading sketch below)
15. **Production Testing**: End-to-end QR code generation tests
16. **Documentation**: Update README with SDXL support
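A minimal sketch of what the `app_sdxl.py` integration might look like; the repo id is a placeholder for whatever is published in step 13, and the conditioning scale value is illustrative.
```python
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

# Placeholder repo id - substitute the checkpoint published in step 13.
controlnet = ControlNetModel.from_pretrained(
    "your-org/sdxl-controlnet-brightness", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
qr = load_image("qr_code.png").resize((1024, 1024))
image = pipe(
    "a watercolor cityscape",
    image=qr,
    controlnet_conditioning_scale=0.6,  # the brightness-strength knob
).images[0]
image.save("sdxl_qr_test.png")
```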
**Total Timeline: 1-2 days** (vs previous estimate of 5 days)
## Success Metrics
1. **QR Code Scannability**: 95%+ scan rate on generated images (measurement sketch below)
2. **Visual Quality**: Subjective improvement over SD 1.5 outputs
3. **Control Precision**: Ability to adjust brightness strength (0.0-1.0 range)
4. **Training Loss**: Convergence to < 0.1 validation loss
5. **Community Adoption**: Positive feedback if published publicly
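Metric 1 can be approximated with OpenCV's built-in QR detector (a stricter real-world check would use pyzbar or an actual phone camera); the output directory is an assumption.
```python
import glob
import cv2

# Decode every validation output and report the share that scans.
detector = cv2.QRCodeDetector()
paths = glob.glob("outputs/*.png")  # assumed location of generated images
decoded = sum(
    1 for p in paths
    if detector.detectAndDecode(cv2.imread(p))[0]  # empty string if unreadable
)
print(f"Scan rate: {decoded / len(paths):.1%} ({decoded}/{len(paths)})")
```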
## Critical Files to Modify
Once the model is trained:
- `app.py:48-56` - Add SDXL ControlNet loading
- `app.py:1880-1886` - Update standard pipeline with SDXL support
- `app.py:2343-2349` - Update artistic pipeline with SDXL support
- `app_sdxl.py` - Complete SDXL-specific implementation
- `comfy/sd_configs/` - Add SDXL configuration if needed
## Flux Schnell Training Considerations (If Pursuing)
If you decide to pursue Flux Schnell ControlNet training despite the risks:
**Required Adaptations:**
1. **Training Script Modification**: Adapt `train_controlnet_flux.py` to work with Schnell
   - Model path: `black-forest-labs/FLUX.1-schnell` instead of `FLUX.1-dev`
   - Verify architecture compatibility (distillation may affect ControlNet layers)
   - Test with a small pilot run (1,000 steps) before full training (sketched after this list)
2. **Hardware Requirements**:
   - Minimum: H100 (80GB VRAM) - $1.99/hr
   - A100 40GB is likely insufficient for Flux training
   - Estimated training: 150-250 hours on H100 (~$300-$500)
3. **Dataset Considerations**:
   - Flux uses 1024×1024 resolution (same as SDXL)
   - The dataset would need upscaling from 512×512 or re-preprocessing
   - Consider starting with a 100k subset for validation
4. **Verification Steps**:
   - Test whether Schnell's distillation preserves ControlNet training capability
   - Compare with Flux Dev training (if available for testing)
   - Validate that brightness control precision matches SD 1.5 quality
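An experimental sketch of that pilot run, reusing the diffusers Flux training script with the model path swapped to Schnell; the flag names follow `train_controlnet_flux.py`, but every value here is an untested assumption.
```python
import subprocess

# Experimental: the same diffusers script linked below for Flux Dev, pointed
# at Schnell and capped at a 1,000-step pilot. Unproven - expect to debug.
subprocess.run([
    "accelerate", "launch", "train_controlnet_flux.py",
    "--pretrained_model_name_or_path", "black-forest-labs/FLUX.1-schnell",
    "--dataset_name", "latentcat/grayscale_image_aesthetic_3M",
    "--resolution", "1024",
    "--max_train_steps", "1000",           # pilot run only
    "--train_batch_size", "1",             # Flux is memory-hungry even on H100
    "--gradient_accumulation_steps", "4",
    "--mixed_precision", "bf16",
    "--output_dir", "flux-schnell-brightness-pilot",
], check=True)
```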
**Risk Assessment**:
- **Technical Risk**: High - no proven training path
- **Time Risk**: Medium-High - debugging could extend the timeline significantly
- **Cost Risk**: High - may require multiple training attempts ($500+)
- **Success Probability**: 50-70% (educated guess based on architecture similarity)
**Recommendation**: Only pursue if:
1. SDXL training completes successfully first (de-risking approach)
2. You're willing to contribute pioneering work to the community
3. Budget allows for experimental work ($500-$1,000 total, including failed attempts)
## References
### SDXL Training
- **SDXL Training Script**: https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet_sdxl.py
- **Dataset**: https://huggingface.co/datasets/latentcat/grayscale_image_aesthetic_3M
- **Reference Article**: https://latentcat.com/en/blog/brightness-controlnet
- **Original SD 1.5 Model**: https://huggingface.co/latentcat/latentcat-controlnet
- **Lightning AI**: https://lightning.ai/
### Flux Information
- **Flux Schnell Model**: https://huggingface.co/black-forest-labs/FLUX.1-schnell
- **Flux Dev Training Script**: https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet_flux.py
- **XLabs-AI Flux ControlNets**: https://huggingface.co/XLabs-AI/flux-controlnet-collections
- **Flux Comparison Guide**: [Flux Dev vs Schnell Comparison](https://www.stablediffusiontutorials.com/2025/04/flux-schnell-dev-pro.html)
- **Flux Architecture Discussion**: [GitHub Issue #408](https://github.com/black-forest-labs/flux/issues/408)
- **License Comparison**: [Flux Model Guide](https://stable-diffusion-art.com/flux/)
## Final Recommendation (Updated December 2024 - Lightning.ai)
**Proceed with SDXL Brightness ControlNet Training on a Single H100 (Free Tier)**
Based on Lightning.ai pricing and multi-GPU requirements, the recommended path is:
### Phase 1: Quick Validation (Free Tier)
1. **Start with 99k samples on a single H100**
   - Cost: $1.88 in GPU credits
   - Duration: 45 minutes
   - Platform: Lightning.ai Free tier
   - Purpose: Validate the training pipeline and quality
### Phase 2: Production Training (Choose Based on Phase 1)
**Option A: Budget Approach (Free Tier)**
- Run the full 3M dataset on a single H100
- Cost: $60 GPU credits, $0 subscription
- Duration: 24 hours
- Total: $60
- Best for: One-time training, have patience
**Option B: Speed Approach (Pro Plan)**
- Upgrade to the Pro plan ($20/month, billed annually)
- Run the full 3M dataset on 6× H100
- Cost: $60 GPU + $20 subscription = $80
- Net cost: $67 (after $13 annual credit value)
- Duration: 4 hours
- Best for: Need results same day, may iterate
### Recommended Strategy
**Most Cost-Effective Path:**
1. **Day 1 Morning**: Run the 99k test on the Free tier ($1.88, 45 min)
2. **Day 1 Afternoon**: Evaluate results
3. **If promising**:
   - **Budget route**: Start 3M on the Free tier ($60, 24 hrs) → Total: $61.88
   - **Speed route**: Upgrade to Pro, run 3M ($80, 4 hrs) → Total: $81.88
4. **Cancel Pro** after training if using the speed route
### Why This Path
- **Low-Risk Entry**: Only $1.88 to validate the entire pipeline
- **Flexible Scaling**: Choose speed vs. cost based on results
- **Proven Pipeline**: HuggingFace Diffusers' battle-tested script
- **Reference Success**: The original SD 1.5 model was trained on the same dataset
- **H100 Advantage**: 6.3× faster than A100, even on a single GPU
- **Cost-Effective**: $62-$82 total (vs $900+ on older plans)
- **Unblocks Migration**: Enables the full SDXL upgrade from SD 1.5
### Cost Breakdown Comparison
| Approach | Hardware | Duration | GPU Cost | Sub Cost | Total | Timeline |
|----------|----------|----------|----------|----------|-------|----------|
| **Old Plan (A100)** | Single A100 | 180 hours | $900-1,200 | $0 | $900-1,200 | 1 week |
| **NEW: Free Tier** | Single H100 | 24.75 hours | $61.88 | $0 | **$61.88** | 2 days |
| **NEW: Pro Plan** | 6× H100 | 4.75 hours | $61.88 | $20 | **$81.88** | 1 day |
**Savings vs Old Plan:**
- Free tier: Save $838-$1,138 and 5 days
- Pro plan: Save $818-$1,118 and 6 days
### Pro Plan ROI Analysis
**When is Pro worth it?**
- $20 extra to save 20 hours (24h → 4h)
- = **$1/hour saved**
- Plus: Can test multiple hyperparameters the same day
- Plus: Includes $13/year in credits
**Get Pro if:**
- ✅ You value your time at more than $1/hour
- ✅ Planning to iterate on hyperparameters
- ✅ Need results urgently
- ✅ Want to test 99k + 500k + 3M in one session
**Skip Pro if:**
- ❌ Doing one-time training only
- ❌ Can wait 24 hours
- ❌ Budget constrained
- ❌ The 99k test was sufficient
### Next Steps
Once the plan is approved:
1. Set up a Lightning AI account with H100 GPU access
2. Clone the diffusers repository and install requirements
3. Verify dataset access and download capabilities
4. Prepare validation QR codes for quality testing
5. Launch training with the recommended hyperparameters
6. Monitor via Weights & Biases for loss curves and validation images
7. Evaluate checkpoints at 10k, 25k, and 50k steps
8. Complete training and publish to HuggingFace Hub (publishing sketch below)
9. Integrate into `app_sdxl.py` for production use
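A short sketch of step 8, assuming diffusers' usual checkpoint layout (the local path and repo id are placeholders); `push_to_hub` comes from diffusers' `ModelMixin`.
```python
from diffusers import ControlNetModel

# Load the best checkpoint (assumed layout under the training run's
# --output_dir) and push it to the Hub under a placeholder repo id.
controlnet = ControlNetModel.from_pretrained(
    "sdxl-brightness-controlnet/checkpoint-45000/controlnet"
)
controlnet.push_to_hub("your-org/sdxl-controlnet-brightness")
```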
**Flux Schnell** remains an option for future exploration once SDXL is production-ready, but it is deprioritized due to its experimental nature and higher resource requirements.