# Training ControlNet Brightness for SDXL - Feasibility Analysis
## Executive Summary
Training a brightness ControlNet for SDXL is **technically feasible and recommended** as the critical upgrade path from SD 1.5 to SDXL for QR code generation. This model is essential because no public SDXL brightness ControlNet exists.
**Key Estimates (Updated December 2024 - Single H100 GPU):**
- **Time**: 45 minutes (99k samples) to 24 hours (3M samples) on single H100
- **Cost**: ~$2 (99k) to $60-$80 (3M) in GPU credits
- **Platform**: Lightning.ai with optional Pro plan ($20/month for multi-GPU)
- **Priority**: High - enables SDXL migration for QR code generation
- **Complexity**: Medium - well-documented training pipeline with reference implementation
**Recommended Path:**
- Start with a single H100 for 99k samples (~45 min, ~$2)
- If successful, optionally upgrade to Pro plan for faster 3M training
- Total investment: ~$2-$82 depending on training size and plan choice
## Background Context
### Current Implementation (SD 1.5)
- **Location**: `app.py:1880-1886, 2343-2349`
- **Model**: `control_v1p_sd15_brightness.safetensors` from latentcat/latentcat-controlnet
- **Purpose**: Controls QR code pattern visibility via brightness conditioning
- **Critical**: Essential for QR code readability - cannot be removed
### Why SDXL Brightness ControlNet is Needed
1. **No Public Alternative**: No SDXL-equivalent brightness ControlNet exists on HuggingFace
2. **Migration Blocker**: Current SD 1.5 brightness ControlNet incompatible with SDXL architecture
3. **QR Readability**: Brightness control is core to balancing aesthetic quality with QR scannability
4. **Flux is Too Heavy**: SDXL is the practical upgrade path (Flux requires 32-40GB VRAM)
### Flux Model Landscape (Updated Analysis)
**Flux Schnell (Apache 2.0 License)**
- **License**: Fully open for commercial use - no restrictions
- **Architecture**: Same 12B parameters as Flux Dev, but distilled for speed (3× faster)
- **Quality**: Lower than Dev due to aggressive distillation trading detail for speed
- **VRAM**: Still requires 32-40GB (same as Dev)
- **ControlNet Status**: ⚠️ **No existing ControlNet models or training scripts**
- **Training Risk**: Would require adapting Flux Dev training script - pioneering work
- **Community**: Active requests for Schnell ControlNets but no official releases
**Flux Dev (Non-Commercial License)**
- **License**: Non-commercial only - cannot be used for commercial QR code generation
- **ControlNet Status**: ✅ Extensive support (XLabs-AI, InstantX collections)
- **Training Scripts**: Available from XLabs-AI and HuggingFace Diffusers
- **Quality**: Superior to Schnell, but license restrictions make it unsuitable
**Flux Pro (Commercial API)**
- **License**: API-only, commercial pricing
- **Status**: Not suitable for self-hosted training
**Assessment**: While Flux Schnell has an attractive license, the lack of proven ControlNet training pipeline makes it **high-risk**. SDXL remains the **proven, practical choice**.
## Hardware Selection & Platform Strategy
### Lightning.ai Pricing Tiers (December 2024)
Lightning.ai offers different tiers with varying multi-GPU capabilities:
| Plan | Cost | Multi-GPU | Max GPUs | Credits Included | Best For |
|------|------|-----------|----------|------------------|----------|
| **Free** | $0 | ❌ No | 1 | 15/month | Quick 99k test |
| **Pro** | **$20/month** (annual) | ✅ Yes | 6 | 240/year (~$13/mo) | **Recommended** |
| Teams | $119/month (annual) | ✅ Yes | 12 | 600/year | Large teams |
**Pro Plan Benefits:**
- Only **$20/month** if paid annually ($240/year vs $600 monthly)
- Includes **240 credits/year** = ~$13 of free GPU time
- **Net cost: ~$7/month** after credits
- Multi-GPU training up to 6 GPUs
- Can cancel after training completes
### GPU Comparison Analysis (Lightning.ai)
**Single GPU Performance:**
| GPU | TFLOPs | Memory | Cost/hr | 99k Time | 99k Cost | 3M Time | 3M Cost |
|-----|--------|--------|---------|----------|----------|---------|---------|
| A100 | 312 | 40GB | ~$1.50 | 4-6 hours | $6-9 | 120-180 hours | $180-270 |
| **H100** | **1979** | **80GB** | **~$2.50** | **45 min** | **$1.88** | **24 hours** | **$60** |
**Cost Efficiency:**
- H100 is **6.3× faster** than A100 (1979 vs 312 TFLOPs)
- H100 costs **1.67× more** per hour on Lightning.ai
- **Net result: ~3.8× better cost efficiency**
### Single vs Multi-GPU: Should You Get Pro Plan?
#### Option A: Free Plan (Single H100)
| Training Size | Duration | GPU Cost | Total Cost | Timeline |
|---------------|----------|----------|------------|----------|
| 99k samples | 45 min | $1.88 | **$1.88** | Same day |
| 500k samples | 4 hours | $10 | **$10** | Same day |
| 3M samples | 24 hours | $60 | **$60** | 1-2 days |
**Pros:**
- ✅ $0 subscription cost
- ✅ Very cheap for 99k testing
- ✅ Good for one-off training
**Cons:**
- ❌ 24 hours for 3M training (must babysit)
- ❌ Can't test multiple hyperparameters quickly
- ❌ Limited to 15 free credits/month
#### Option B: Pro Plan (6× H100)
| Training Size | Duration | GPU Cost | Subscription | Total Cost | Timeline |
|---------------|----------|----------|--------------|------------|----------|
| 99k samples | **7.5 min** | $1.88 | $20 | **$21.88** | Minutes |
| 500k samples | **40 min** | $10 | $20 | **$30** | Same hour |
| 3M samples | **4 hours** | $60 | $20 | **$80** | Same day |
**Multi-GPU costs the same because:**
- 6× GPUs = 6× faster
- 6× GPUs = 6× more expensive per hour
- Net: same total GPU cost, much faster completion
**Pros:**
- ✅ 3M training finishes in 4 hours (vs 24)
- ✅ Can test 3-4 hyperparameter configs in one day
- ✅ Includes 240 credits/year (~$13 value)
- ✅ Real net cost: $7/month after credits
- ✅ Can cancel after training is done
**Cons:**
- ❌ $20 upfront cost (annual commitment)
### Recommendation Matrix
**If you're doing ONE 99k training run:**
- ✅ **Use Free tier** ($1.88 total, 45 min)
- Skip Pro plan - not worth $20 to cut 45 min down to 7.5 min
**If you're doing 500k OR 3M training:**
- ✅ **Get Pro plan** ($20/month)
- 3M: 4 hours vs 24 hours = worth it
- Can test multiple configs the same day
- Net cost after credits: ~$7/month
**If you're doing multiple experiments:**
- ✅ **Definitely get Pro plan**
- Test 99k + 500k + 3M all in one day
- Total time: ~5 hours vs 30+ hours
- Total cost: $20 + ~$72 GPU = $92
- Cancel Pro after training is complete
**Most Cost-Effective Strategy:**
1. Start with **Free tier** for the 99k test ($1.88, 45 min)
2. If results are promising, upgrade to **Pro** for 3M training
3. Run full training in 4 hours
4. Cancel Pro when done
5. Total: $20 Pro + $60 GPU + $1.88 test = **$81.88**
### Updated Training Timeline Estimates
**Single H100 (Free Tier):**
| Training Size | Duration | Total Cost | When to Use |
|---------------|----------|------------|-------------|
| **99k samples** | 45 min | $1.88 | Quick validation, hyperparameter testing |
| **500k samples** | 4 hours | $10 | Medium quality, budget option |
| **3M samples** | 24 hours | $60 | Max quality, have patience |
**6× H100 (Pro Plan at $20/month):**
| Training Size | Duration | Total Cost | When to Use |
|---------------|----------|------------|-------------|
| **99k samples** | 7.5 min | $21.88 | Ultra-fast iteration |
| **500k samples** | 40 min | $30 | Production ready, same day |
| **3M samples** | 4 hours | $80 | Best quality, same day results |
## Training Strategy
### Dataset: latentcat/grayscale_image_aesthetic_3M
- **Size**: 3 million images at 512×512 resolution
- **Format**: Parquet files with image/conditioning_image/text columns
- **Same Dataset**: Used for original SD 1.5 brightness ControlNet training
- **License**: Latent Cat (check license before commercial use)
- **Quality**: Pre-processed grayscale images with aesthetic filtering
### Reference Training Results (from latentcat article)
| Configuration | Samples | Hardware | Duration | Cost Estimate |
|--------------|---------|----------|----------|---------------|
| Original SD 1.5 | 100k | A6000 | 13 hours | ~$20 (est.) |
| Original SD 1.5 | 3M | TPU v4-8 | 25 hours | N/A (TPU) |
### SDXL Training Scaling Estimates
**Updated Based on Latentcat Article:**
- Training at 512×512 resolution (NOT 1024×1024) - matches dataset and original training
- SDXL has a larger UNet architecture (~2.5GB vs 1.7GB for SD 1.5)
- Expected slowdown: 2-3× compared to SD 1.5 training
**Time Estimates for 99k Training Samples (Lightning.ai Single H100):**
## Calculation Methodology
**Baseline Reference:**
- Latentcat article: 100k samples on A6000 = 13 hours (SD 1.5)
- SDXL overhead: 13h × 2.5 (larger architecture) = ~32.5 hours for 100k
- A6000 ≈ A100 in tensor performance (~300-312 TFLOPs)
**Scaling to H100:**
- A100 (assuming the fp16/xformers/8-bit-Adam optimizations used here, vs the original unoptimized run): ~4-6 hours for 99k samples
- H100: 1979 TFLOPs → ~6.3× faster
- **H100 single GPU: ~38-57 minutes for 99k samples**
**Multi-GPU Scaling (Pro Plan):**
- 6× H100 GPUs = 6× faster = ~7.5 minutes for 99k
- Total cost stays the same (6× faster but 6× more expensive per hour)
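This scaling arithmetic is easy to sanity-check in a few lines; a minimal sketch (the baseline hours, TFLOPs figures, and the ~$2.50/hr rate are this document's estimates, not measured benchmarks):
```python
# Sanity-check of the scaling math above. All constants are estimates
# from this document, not benchmarks.
A100_TFLOPS, H100_TFLOPS = 312, 1979
H100_RATE_USD_PER_HR = 2.50  # approximate Lightning.ai rate

def h100_estimate(samples: int, a100_hours_per_99k: float = 5.0):
    """Scale an A100-class time estimate to a single H100 and price it."""
    a100_hours = a100_hours_per_99k * samples / 99_000
    h100_hours = a100_hours / (H100_TFLOPS / A100_TFLOPS)  # ~6.3x speedup
    return h100_hours, h100_hours * H100_RATE_USD_PER_HR

for n in (99_000, 500_000, 3_000_000):
    hours, cost = h100_estimate(n)
    print(f"{n:>9,} samples: ~{hours:.1f} h on 1x H100, ~${cost:.0f}")
# -> ~0.8 h / $2, ~4.0 h / $10, ~23.9 h / $60
```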
## Recommended Configurations
**OPTION 1: Free Tier (Single H100) - Best for Testing**
- **99k samples**: 45 min, $1.88
- **500k samples**: 4 hours, $10
- **3M samples**: 24 hours, $60
- **Best for:** One-off training, budget-conscious, have patience
**OPTION 2: Pro Plan (6× H100) - Best for Production**
- **Subscription**: $20/month (annual), includes ~$13 in credits = **$7 net cost**
- **99k samples**: 7.5 min, $21.88 total ($1.88 GPU + $20 sub)
- **500k samples**: 40 min, $30 total ($10 GPU + $20 sub)
- **3M samples**: 4 hours, $80 total ($60 GPU + $20 sub)
- **Best for:** Multiple experiments, 3M training, need results same day
**Cost Comparison Summary:**
| Scenario | Free Tier | Pro Plan | Savings (Pro) |
|----------|-----------|----------|---------------|
| Single 99k test | $1.88 | $21.88 | ❌ $20 more |
| Single 3M training | $60 | $80 | ❌ $20 more |
| 99k + 500k + 3M | $71.88 (30 hours) | $92 (5 hours) | ✅ Save 25 hours |
| 3+ experiments | $71.88+ (30+ hours) | $92 (5-6 hours) | ✅ Save 24+ hours |
**Recommendation:**
- For single 99k test: **Use Free Tier** (not worth $20 for speed)
- For 3M training: **Consider Pro** (4 hrs vs 24 hrs = big difference)
- For multiple runs: **Definitely Pro** (can test everything in one day)
## Technical Implementation Plan
### Dataset Verification Script
**Create this script to verify dataset before training:**
```bash
cat > verify_dataset.py << 'EOF'
#!/usr/bin/env python3
"""
Dataset verification script for SDXL ControlNet Brightness training.
Downloads a subset of the dataset and verifies structure.

Usage: python verify_dataset.py
"""
from datasets import load_dataset
from PIL import Image
import sys


def verify_dataset():
    print("=" * 60)
    print("SDXL ControlNet Brightness - Dataset Verification")
    print("=" * 60)

    print("\n[1/4] Loading dataset subset (99k samples)...")
    print("This will download ~10-15GB to cache...")
    try:
        train_dataset = load_dataset(
            "latentcat/grayscale_image_aesthetic_3M",
            split="train[:99000]",
            cache_dir="~/.cache/huggingface/datasets",
        )
        print(f"✅ Successfully loaded {len(train_dataset)} samples")
    except Exception as e:
        print(f"❌ Failed to load dataset: {e}")
        sys.exit(1)

    print("\n[2/4] Verifying dataset structure...")
    expected_columns = {"image", "conditioning_image", "text"}
    actual_columns = set(train_dataset.column_names)
    if actual_columns == expected_columns:
        print(f"✅ Columns correct: {train_dataset.column_names}")
    else:
        print("❌ Column mismatch!")
        print(f"   Expected: {expected_columns}")
        print(f"   Got: {actual_columns}")
        sys.exit(1)

    print("\n[3/4] Checking sample data...")
    sample = train_dataset[0]

    # Check images
    if isinstance(sample["image"], Image.Image):
        img_size = sample["image"].size
        print(f"✅ Image type: PIL.Image, size: {img_size}")
    else:
        print(f"❌ Unexpected image type: {type(sample['image'])}")

    if isinstance(sample["conditioning_image"], Image.Image):
        cond_size = sample["conditioning_image"].size
        print(f"✅ Conditioning image type: PIL.Image, size: {cond_size}")
    else:
        print(f"❌ Unexpected conditioning image type: {type(sample['conditioning_image'])}")

    if isinstance(sample["text"], str):
        caption_len = len(sample["text"])
        print(f"✅ Caption type: str, length: {caption_len} chars")
        print(f"   Sample caption: '{sample['text'][:100]}...'")
    else:
        print(f"❌ Unexpected caption type: {type(sample['text'])}")

    print("\n[4/4] Checking validation split (last 1000 samples)...")
    try:
        # IMPORTANT: Always use the last 1000 samples for validation.
        # This ensures consistent validation across all training sizes.
        val_dataset = load_dataset(
            "latentcat/grayscale_image_aesthetic_3M",
            split="train[2999000:3000000]",
            cache_dir="~/.cache/huggingface/datasets",
        )
        print(f"✅ Validation split loaded: {len(val_dataset)} samples")
        print("   Validation uses: train[2999000:3000000] (last 1k)")
    except Exception as e:
        print(f"❌ Failed to load validation split: {e}")
        sys.exit(1)

    print("\n" + "=" * 60)
    print("✅ ALL CHECKS PASSED!")
    print("=" * 60)
    print("\nDataset cached at: ~/.cache/huggingface/datasets/")
    print(f"Training samples: {len(train_dataset)}")
    print(f"Validation samples: {len(val_dataset)}")
    print("\n⚠️ IMPORTANT: Validation always uses samples 2,999,000-2,999,999")
    print("   This ensures consistent validation across all training sizes")
    print("   (99k, 500k, 3M all use same validation set)")
    print("\nYou can now proceed with training!")
    print("The training script will automatically use this cached data.")


if __name__ == "__main__":
    verify_dataset()
EOF
```
**Make executable and run**:
```bash
chmod +x verify_dataset.py
python verify_dataset.py
```
**Expected output**: Should confirm the dataset structure and cache the first 99k training samples plus the fixed last-1k validation slice.
### Manual Preparation Checklist (Do This First!)
**Split into two phases to minimize GPU costs:**
---
## Part A: Local Preparation (BEFORE Launching GPU Instance)
**Do these steps on your local machine or any CPU instance - no GPU needed, $0 cost:**
#### Step 1: Get Your Authentication Tokens
**Prepare these before launching GPU:**
- **HuggingFace token**: https://huggingface.co/settings/tokens (create "Read" access token)
- **W&B API key**: https://wandb.ai/authorize
Save these somewhere - you'll need them on the GPU instance.
#### Step 2: Prepare Dataset Verification Script Locally
The full `verify_dataset.py` script is provided in the "Dataset Verification Script" section above (under Technical Implementation Plan).
You can either:
- Copy that script to a file on your local machine, OR
- Recreate it directly on the GPU instance in Part B below
No need to prepare this locally if you prefer to create it on the GPU instance.
---
## Part B: GPU Instance Setup (AFTER Launching GPU, BEFORE Training)
**Complete these steps on your GPU instance to avoid wasting GPU credits on training failures:**
**Estimated time: 30-60 minutes (mostly dataset download)**
**GPU credits used: ~$0.75-$1.50** (30-60 min @ ~$1.50/hr for A100)
#### Step 1: System Dependencies
```bash
# Update system packages
sudo apt-get update && sudo apt-get install -y git git-lfs build-essential
# Initialize Git LFS
git lfs install
```
#### Step 2: Python Environment with CUDA
```bash
# Install PyTorch with CUDA 11.8 (requires GPU instance!)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install core ML libraries
pip install diffusers transformers accelerate datasets
# Install utilities
pip install huggingface_hub pillow wandb xformers bitsandbytes
```
#### Step 3: Verify CUDA (Critical!)
```bash
# Verify CUDA availability - MUST show "True"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name(0)}')"
```
**Expected output:**
```
CUDA available: True
CUDA version: 11.8
GPU: NVIDIA A100-SXM4-40GB
```
**If CUDA shows False:** Stop and troubleshoot before proceeding!
#### Step 4: Clone Training Repository
```bash
# Clone HuggingFace diffusers
git clone https://github.com/huggingface/diffusers.git
cd diffusers/examples/controlnet
# Verify training script exists
ls -la train_controlnet_sdxl.py # Should show the file
```
#### Step 5: Authentication Setup
```bash
# Login to HuggingFace (use token from Part A)
huggingface-cli login
# Paste your token when prompted
# Login to Weights & Biases (use API key from Part A)
wandb login
# Paste your API key when prompted
```
#### Step 6: Dataset Verification (CRITICAL!)
```bash
# Create the verify_dataset.py script using the code from
# "Dataset Verification Script" section at the top of this plan
# (See lines after "Technical Implementation Plan" heading)
# Once created, run it:
chmod +x verify_dataset.py
python verify_dataset.py
```
**Expected output:**
```
============================================================
SDXL ControlNet Brightness - Dataset Verification
============================================================

[1/4] Loading dataset subset (99k samples)...
This will download ~10-15GB to cache...
✅ Successfully loaded 99000 samples

[2/4] Verifying dataset structure...
✅ Columns correct: ['image', 'conditioning_image', 'text']

[3/4] Checking sample data...
✅ Image type: PIL.Image, size: (512, 512)
✅ Conditioning image type: PIL.Image, size: (512, 512)
✅ Caption type: str, length: 87 chars

[4/4] Checking validation split (last 1000 samples)...
✅ Validation split loaded: 1000 samples
   Validation uses: train[2999000:3000000] (last 1k)

============================================================
✅ ALL CHECKS PASSED!
============================================================

Dataset cached at: ~/.cache/huggingface/datasets/
Training samples: 99000
Validation samples: 1000

⚠️ IMPORTANT: Validation always uses samples 2,999,000-2,999,999
   This ensures consistent validation across all training sizes
   (99k, 500k, 3M all use same validation set)

You can now proceed with training!
```
#### Step 7: Pre-Flight Verification
```bash
# Check all packages are installed
pip list | grep -E "torch|diffusers|transformers|accelerate|datasets|xformers"
# Check disk space (need ~20GB free for checkpoints)
df -h ~
# Verify dataset cache exists
ls -lh ~/.cache/huggingface/datasets/
```
#### Step 8: Create Output Directory
```bash
# Create directory for training outputs
mkdir -p ~/controlnet-brightness-sdxl
# Return to training directory
cd ~/diffusers/examples/controlnet
```
---
## ✅ Preparation Complete!
**Once all Part B steps pass, you're ready to start GPU training.**
The training command (shown in Phase 3 below) will now:
- ✅ Use the pre-downloaded dataset from cache (no re-download)
- ✅ Have all required libraries installed with CUDA support
- ✅ Be authenticated to HuggingFace and W&B
- ✅ Save checkpoints to the prepared directory
**Total preparation cost:** ~$0.75-$1.50 (vs $60 for a full 3M run)
**Why worth it:** Catches setup issues early without wasting up to 24 hours of GPU time
**Hardware Selection (Updated for Lightning.ai):**
- **RECOMMENDED FOR TESTING**: Single H100 on Free Tier
  - 99k training in 45 min for $1.88
  - Perfect for validation and hyperparameter tuning
  - 80GB VRAM allows good batch sizes
  - No subscription required
- **RECOMMENDED FOR PRODUCTION**: 6× H100 on Pro Plan ($20/month annual)
  - 3M training in 4 hours for $80 total
  - Can test multiple configs in one day
  - Net cost: ~$7/month after included credits
  - Cancel subscription after training completes
- **Not Recommended**: A100 - H100 is faster and more cost-efficient
### Phase 2: Dataset Preparation
**Dataset Split Strategy (for 99k quick training):**
- **Training**: 99,000 samples (`split="train[:99000]"`)
- **Validation**: 1,000 samples (`split="train[2999000:3000000]"`) - **ALWAYS last 1k**
- **Total loaded**: 100,000 samples (99k + last 1k of 3M dataset)
**⚠️ CRITICAL: Validation Always Uses Last 1000 Samples**
- All training sizes (99k, 500k, 3M) use `train[2999000:3000000]` for validation
- This ensures consistent validation set across all training runs
- Allows fair comparison of model quality at different training stages
- No overlap between training and validation for any training size
**Why This Matters:**
```
❌ WRONG: Using different validation sets for different training sizes
- 99k training: train[:99000] + validation train[99000:100000]
- 500k training: train[:499000] + validation train[499000:500000]
- 3M training: train[:2999000] + validation train[2999000:3000000]
Problem: Can't compare results! Each uses different validation data.

✅ CORRECT: Same validation set for all training sizes
- 99k training: train[:99000] + validation train[2999000:3000000]
- 500k training: train[:499000] + validation train[2999000:3000000]
- 3M training: train[:2999000] + validation train[2999000:3000000]
Benefit: Fair comparison across all training runs on same validation set.
```
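A minimal sketch of the correct pattern using the `datasets` split syntax (only the training prefix varies; the validation slice is pinned):
```python
from datasets import load_dataset

DATASET = "latentcat/grayscale_image_aesthetic_3M"
VAL_SPLIT = "train[2999000:3000000]"  # fixed last-1k validation slice

def load_splits(train_size: int):
    """Load a training prefix of train_size samples plus the fixed validation set."""
    train = load_dataset(DATASET, split=f"train[:{train_size}]")
    val = load_dataset(DATASET, split=VAL_SPLIT)
    return train, val

# Same validation data regardless of training size:
train_99k, val = load_splits(99_000)
train_500k, _ = load_splits(499_000)  # the 500k run uses train[:499000]
```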
### Understanding HuggingFace Dataset Caching
**Important**: The HuggingFace `datasets` library automatically caches all downloads to `~/.cache/huggingface/datasets/`. This means:
✅ **Cache reuse is automatic**: When the training script runs, it will check the cache first and reuse any previously downloaded data
✅ **No re-downloads**: You won't download the full 3M dataset if you've already downloaded a subset
✅ **The pre-download step is OPTIONAL**: The training command can handle downloading on its own
**Pre-download Benefits**:
- Verify dataset structure before training starts
- Separate download time from training time
- Ensure dataset access works before committing GPU hours
**Pre-download is NOT required**: The training script's `--max_train_samples=99000` parameter will work whether you pre-download or not.
### Dataset Download Options
**Option A: Pre-download for verification (RECOMMENDED)**
```python
from datasets import load_dataset

# This downloads and caches ~99k samples for verification
train_dataset = load_dataset(
    "latentcat/grayscale_image_aesthetic_3M",
    split="train[:99000]",
    cache_dir="~/.cache/huggingface/datasets",  # default cache location
)

# Verify the dataset structure
print(f"Dataset size: {len(train_dataset)}")
print(f"Columns: {train_dataset.column_names}")
print(f"First sample keys: {train_dataset[0].keys()}")

# Check a sample
sample = train_dataset[0]
print(f"Image size: {sample['image'].size}")
print(f"Conditioning image size: {sample['conditioning_image'].size}")
print(f"Caption: {sample['text']}")
```
**Option B: Let training script handle download**
- Simply run the training command with `--dataset_name` and `--max_train_samples`
- The script will download to cache automatically
- Slightly riskier if there are dataset access issues
**Recommended:** Use the full `verify_dataset.py` script (see "Dataset Verification Script" section above) which implements Option A with comprehensive validation checks.
**Data Format Validation:**
- Verify columns: `image`, `conditioning_image`, `text`
- Check image resolution: 512×512 (matches the `--resolution=512` training setting; no upscaling needed)
- Validate grayscale format
**Steps Calculation (IMPORTANT):**
- Training samples: 99,000
- Batch size: 16
- Gradient accumulation: 4
- **Effective batch size**: 16 × 4 = 64 samples/step
- **Steps per epoch**: 99,000 ÷ 64 = 1,547 steps
- **For 2 epochs**: ~3,094 total steps
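The same arithmetic as a tiny helper, useful for checking checkpoint spacing before launching (hypothetical helper, not part of the training script):
```python
import math

def total_steps(samples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """One optimizer step consumes batch_size * grad_accum samples."""
    effective_batch = batch_size * grad_accum
    return math.ceil(samples / effective_batch) * epochs

print(total_steps(99_000, 16, 4, 2))   # 3094 (1547 steps/epoch)
```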
### Phase 3: Training Configuration
**Prerequisites:** Complete the "Manual Preparation Checklist" above before running this command.
**Training Command (Based on Latentcat Article):**
```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl"
accelerate launch train_controlnet_sdxl.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--dataset_name="latentcat/grayscale_image_aesthetic_3M" \
--max_train_samples=99000 \
--conditioning_image_column="conditioning_image" \
--image_column="image" \
--caption_column="text" \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=1e-5 \
--train_batch_size=16 \
--gradient_accumulation_steps=4 \
--num_train_epochs=2 \
--checkpointing_steps=1500 \
--validation_steps=1500 \
--tracker_project_name="brightness-controlnet-sdxl" \
--report_to="wandb" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam
```
**Key Parameters Explained:**
- `--max_train_samples=99000`: Limit to 99k samples (reserves 1k for validation)
- `--resolution=512`: Match dataset resolution (latentcat article used 512, not 1024)
- `--learning_rate=1e-5`: From latentcat article
- `--train_batch_size=16`: From latentcat article
- `--gradient_accumulation_steps=4`: Effective batch = 16 × 4 = 64
- `--num_train_epochs=2`: From latentcat article
- **`--checkpointing_steps=1500`**: Save every 1500 STEPS (~once per epoch)
- Total training: ~3,094 steps for 2 epochs
- Checkpoints at: 1500, 3000 steps
- **`--validation_steps=1500`**: Run validation every 1500 STEPS
- `--gradient_checkpointing`: Reduces VRAM usage
- `--use_8bit_adam`: Memory optimization
- `--enable_xformers_memory_efficient_attention`: Memory-efficient attention
**Critical Understanding - Steps vs Samples:**
- 1 STEP = processing 1 effective batch = 64 samples
- Checkpoint every 1500 steps = every 1500 × 64 = 96,000 samples (~1 epoch)
- NOT a checkpoint every 1500 samples!
- Total steps for 2 epochs: 99,000 ÷ 64 × 2 = ~3,094 steps
**VRAM Requirements with These Settings:**
The settings above are optimized for memory efficiency:
- `--mixed_precision="fp16"`: Halves memory usage
- `--gradient_checkpointing`: Trades compute for memory (~40% VRAM savings)
- `--use_8bit_adam`: Reduces optimizer state memory
- `--enable_xformers_memory_efficient_attention`: Memory-efficient attention
**Estimated VRAM usage:**
- SDXL base model (FP16): ~6-7GB
- ControlNet model: ~2.5GB
- 8-bit Adam optimizer states: ~3-4GB
- Gradients (with checkpointing): ~2-3GB
- Activations (batch 16, 512×512, gradient checkpointing): ~8-12GB
- **Total: ~22-28GB peak**
**GPU Compatibility:**
| GPU | VRAM | Will It Fit? | Batch Size | Notes |
|-----|------|--------------|------------|-------|
| **L4** | 24GB | ⚠️ Tight | 8-12 | Reduce `--train_batch_size` to 8 or 12 |
| **A100 40GB** | 40GB | ✅ Yes | 16 | Comfortable fit |
| **A100 80GB** | 80GB | ✅ Yes | 16-24 | Plenty of headroom, can increase batch |
| **H100 80GB** | 80GB | ✅ Yes | 16-24 | Fastest training, plenty of VRAM |
**Minimum comfortable fit: A100 40GB** - these settings work with batch size 16; an H100 is faster and preferred (see the H100-optimized section below).
**If using L4 24GB**, modify the command:
```bash
# Change this line:
--train_batch_size=16 \
# To:
--train_batch_size=8 \
```
This keeps effective batch size = 8 × 4 = 32 (half of 64), but still works well.
### Accelerate Configuration for Multi-GPU Training
**Important:** Multi-GPU training on Lightning.ai requires the Pro plan ($20/month annual).
#### Single GPU (Free Tier) - No Configuration Needed
For single GPU training on Free tier, `accelerate launch` works without any configuration:
```bash
# No accelerate config needed - auto-detects single GPU
accelerate launch train_controlnet_sdxl.py [args...]
```
#### Multi-GPU (Pro Plan) - Configure Before Training
For 6× H100 training on Pro plan, configure accelerate once:
```bash
# Run configuration wizard
accelerate config
```
**Configuration Options for 6× H100:**
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU # Distributed data parallel across local GPUs
num_machines: 1 # Single machine with 6 GPUs
num_processes: 6 # One process per GPU
gpu_ids: all # Use all available GPUs
mixed_precision: fp16 # Match training script
use_cpu: false
dynamo_backend: NO # Disable torch.compile for compatibility
```
**Quick Config (Non-Interactive):**
```bash
# Create accelerate config file directly
cat > ~/.cache/huggingface/accelerate/default_config.yaml << 'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 6
gpu_ids: all
mixed_precision: fp16
use_cpu: false
dynamo_backend: NO
EOF
```
**Verify Configuration:**
```bash
# Check configuration
accelerate env
# Test multi-GPU setup
accelerate test
```
**Launch Multi-GPU Training:**
```bash
# With configuration file, launch works same as single GPU
accelerate launch train_controlnet_sdxl.py [args...]
# Or specify config explicitly
accelerate launch --config_file ~/.cache/huggingface/accelerate/default_config.yaml \
train_controlnet_sdxl.py [args...]
```
### H100-Optimized Training Parameters
The H100 GPU has **80GB VRAM** and **1979 TFLOPs**, allowing for larger batch sizes and better optimization than A100.
#### Optimal Batch Size for H100
**Default settings (designed for A100 40GB):**
```bash
--train_batch_size=16
--gradient_accumulation_steps=4
# Effective batch size: 16 × 4 = 64 samples/step
# VRAM usage: ~22-28GB
```
**H100-optimized settings (80GB VRAM):**
```bash
--train_batch_size=32 # 2× larger than A100
--gradient_accumulation_steps=4
# Effective batch size: 32 × 4 = 128 samples/step
# VRAM usage: ~40-48GB (still plenty of headroom)
```
**Aggressive H100 settings (maximum throughput):**
```bash
--train_batch_size=48 # 3× larger than A100
--gradient_accumulation_steps=2 # Reduce accumulation since batch is larger
# Effective batch size: 48 × 2 = 96 samples/step
# VRAM usage: ~55-65GB
# Faster training due to fewer gradient accumulation steps
```
#### Single H100 Training Command (99k samples)
**Optimized for H100 80GB:**
```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-h100"
accelerate launch train_controlnet_sdxl.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--dataset_name="latentcat/grayscale_image_aesthetic_3M" \
--max_train_samples=99000 \
--conditioning_image_column="conditioning_image" \
--image_column="image" \
--caption_column="text" \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=1e-5 \
--train_batch_size=32 \
--gradient_accumulation_steps=4 \
--num_train_epochs=2 \
--checkpointing_steps=750 \
--validation_steps=750 \
--tracker_project_name="brightness-controlnet-sdxl-h100" \
--report_to="wandb" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam \
--dataloader_num_workers=8 \
--set_grads_to_none
```
**Key H100 Optimizations:**
- `--train_batch_size=32` (vs 16 on A100) - 2× larger batches
- `--gradient_accumulation_steps=4` - Effective batch = 128
- `--checkpointing_steps=750` - Checkpoint every 750 × 128 = 96k samples (~1 epoch)
- `--dataloader_num_workers=8` - Faster data loading (H100 instances have ample CPU cores)
- `--set_grads_to_none` - Faster than zeroing gradients on modern GPUs
**Expected Performance:**
- Steps per epoch: 99,000 ÷ 128 = ~773 steps
- Total steps (2 epochs): ~1,546 steps
- Training time: ~38-45 minutes on single H100
- Checkpoints saved at: 750, 1500 steps
#### 6× H100 Training Command (3M samples) - Pro Plan
**For Pro plan multi-GPU training:**
```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-multi-h100"
# Configure accelerate for 6 GPUs (if not done already)
accelerate config # Select MULTI_GPU, 6 processes
# Launch training
accelerate launch train_controlnet_sdxl.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--dataset_name="latentcat/grayscale_image_aesthetic_3M" \
--max_train_samples=2999000 \
--conditioning_image_column="conditioning_image" \
--image_column="image" \
--caption_column="text" \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=1e-5 \
--train_batch_size=24 \
--gradient_accumulation_steps=2 \
--num_train_epochs=1 \
--checkpointing_steps=2500 \
--validation_steps=2500 \
--tracker_project_name="brightness-controlnet-sdxl-3M" \
--report_to="wandb" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam \
--dataloader_num_workers=8 \
--set_grads_to_none \
--resume_from_checkpoint="latest"
```
**Multi-GPU Optimizations:**
- `--train_batch_size=24` per GPU × 6 GPUs = 144 samples per step (before accumulation)
- `--gradient_accumulation_steps=2` - Effective batch = 144 × 2 = 288
- `--checkpointing_steps=2500` - Save every ~720k samples
- `--resume_from_checkpoint="latest"` - Auto-resume if interrupted
**Expected Performance:**
- Effective batch size: 288 samples/step
- Steps per epoch: 2,999,000 ÷ 288 = ~10,413 steps
- Training time: ~4 hours on 6× H100
- Checkpoints: 2500, 5000, 7500, 10000 steps + final
#### Batch Size Selection Guide
| GPU Config | VRAM | Recommended batch_size | grad_accum_steps | Effective Batch | Training Speed |
|------------|------|------------------------|------------------|-----------------|----------------|
| Single L4 | 24GB | 8 | 4 | 32 | Slow (baseline) |
| Single A100 | 40GB | 16 | 4 | 64 | 2× faster than L4 |
| Single H100 | 80GB | 32 | 4 | 128 | 6× faster than L4 |
| 6× H100 (Pro) | 480GB | 24/GPU | 2 | 288 | 36× faster than L4 |
**Rule of Thumb:**
- Larger `train_batch_size` = better GPU utilization, faster training
- Larger `effective_batch_size` = more stable training, better convergence
- H100 can handle 2-3× larger batch sizes than A100 with the same settings
#### Memory Optimization Tips
**If you encounter OOM (Out of Memory) errors on H100:**
1. **Reduce batch size incrementally:**
```bash
--train_batch_size=32 # Start here
--train_batch_size=24 # If OOM
--train_batch_size=16 # If still OOM
```
2. **Enable additional memory optimizations:**
```bash
--gradient_checkpointing                     # Already enabled
--use_8bit_adam                              # Already enabled
--enable_xformers_memory_efficient_attention # Already enabled
--set_grads_to_none                          # Use instead of zeroing gradients
```
3. **Use gradient accumulation to maintain effective batch size:**
```bash
# If reducing from batch_size=32 to batch_size=16
--train_batch_size=16
--gradient_accumulation_steps=8 # Double accumulation to keep effective=128
```
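The relationship in tip 3 can be made explicit with a small helper (hypothetical, just to keep the effective batch size constant while stepping the per-GPU batch down):
```python
def grad_accum_for(target_effective_batch: int, train_batch_size: int) -> int:
    """Pick gradient_accumulation_steps so batch * accum hits the target."""
    assert target_effective_batch % train_batch_size == 0, "batch must divide target"
    return target_effective_batch // train_batch_size

print(grad_accum_for(128, 32))  # 4 - the H100 default above
print(grad_accum_for(128, 16))  # 8 - OOM fallback, same effective batch
```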
### Full 3M Dataset Training Options
**For maximum quality training on the complete dataset:**
#### Option A: Single H100 (Free Tier)
| Metric | Value |
|--------|-------|
| GPU | 1× H100 80GB (~$2.50/hr on Lightning.ai) |
| Dataset | 2,999,000 training + 1,000 validation |
| Estimated Duration | **~24 hours** |
| Estimated Cost | **$60 GPU credits** |
| Subscription Cost | **$0** (Free tier) |
| **Total Cost** | **$60** |
| Checkpoints | Every 5000 steps (~every 480k samples) |
**Pros:**
- ✅ Lowest total cost
- ✅ No subscription required
- ✅ Good for one-time training
**Cons:**
- ❌ 24 hours training time (must monitor)
- ❌ Can't quickly iterate if issues arise
#### Option B: 6× H100 (Pro Plan - $20/month)
| Metric | Value |
|--------|-------|
| GPU | 6× H100 80GB (~$2.50/hr × 6 = $15/hr) |
| Dataset | 2,999,000 training + 1,000 validation |
| Estimated Duration | **~4 hours** |
| Estimated Cost | **$60 GPU credits** |
| Subscription Cost | **$20/month** (annual billing) |
| **Total Cost** | **$80** |
| **Net Cost** | **$67** (after $13 annual credit value) |
| Checkpoints | Every 2500 steps (~every 720k samples) |
**Pros:**
- ✅ Completes in 4 hours vs 24 hours
- ✅ Can run same-day if needed
- ✅ Can test multiple configs quickly
- ✅ Net cost only $7/month after credits
- ✅ Can cancel after training
**Cons:**
- ❌ $20 upfront subscription cost
**Scaling Math:**
- Single H100: 99k in 45 min → 3M in 45 min × 30.3 = ~24 hours
- 6× H100: 24 hours ÷ 6 = ~4 hours
**Cost Comparison:**
- Free tier: $60, 24 hours wait
- Pro plan: $80, 4 hours wait
- **Price difference: $20 to save 20 hours**
#### Adjusted Training Command
```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-3M"
accelerate launch train_controlnet_sdxl.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--dataset_name="latentcat/grayscale_image_aesthetic_3M" \
--max_train_samples=2999000 \
--conditioning_image_column="conditioning_image" \
--image_column="image" \
--caption_column="text" \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=1e-5 \
--train_batch_size=24 \
--gradient_accumulation_steps=4 \
--num_train_epochs=1 \
--checkpointing_steps=5000 \
--validation_steps=5000 \
--validation_prompt "a beautiful garden scene" "modern city street" "abstract art pattern" \
--tracker_project_name="brightness-controlnet-sdxl-3M" \
--report_to="wandb" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--use_8bit_adam \
--resume_from_checkpoint="latest"
```
#### Key Adjustments Explained
**Batch Size Scaling:**
- **`--train_batch_size=24`** (increased from 16)
- H100 80GB has 2x VRAM of A100 40GB
- Can safely increase batch size by 50%
- Alternative: `--train_batch_size=32` if you have headroom
- **`--gradient_accumulation_steps=4`** (kept same)
- Effective batch size: 24 × 4 = **96 samples/step**
- If using batch_size=32: 32 × 4 = **128 samples/step**
**Dataset & Checkpointing:**
- **`--max_train_samples=2999000`** (vs 99,000 for quick training)
- Training split: `train[:2999000]` (first 2,999,000 samples)
- **Validation split: `train[2999000:3000000]` (SAME as 99k training!)**
- ✅ This allows direct comparison of validation metrics between 99k and 3M training
- ✅ No overlap between training and validation data
- **`--num_train_epochs=1`** (vs 2)
- For 3M samples, 1 epoch is usually sufficient
- Can increase to 2 if quality needs improvement
- **`--checkpointing_steps=5000`** (vs 1,500)
- More frequent checkpoints would create too many files
- 5000 steps = every ~480k samples
- Total checkpoints: ~6-7 for full run
- **`--validation_steps=5000`** (matches checkpointing)
- Run validation at each checkpoint (note: the diffusers script expects a matching `--validation_image`, one conditioning image per prompt, for validation logging)
**Resumption:**
- **`--resume_from_checkpoint="latest"`**
- CRITICAL for multi-day training
- If training crashes, automatically resumes from last checkpoint
- Saves days of retraining if interrupted
#### Training Math
**Steps Calculation:**
- Training samples: 2,999,000 (validation: 1,000)
- Effective batch size: 96 (or 128 with batch_size=32)
- Steps per epoch: 2,999,000 ÷ 96 = **~31,240 steps**
- With batch_size=32: 2,999,000 ÷ 128 = **~23,429 steps**
- For 1 epoch: 31,240 steps total
- For 2 epochs: 62,480 steps total
**Checkpoints:**
- Saved every 5,000 steps
- Checkpoint locations: steps 5000, 10000, 15000, 20000, 25000, 30000, 31240 (final)
- Each checkpoint: ~2.5GB (ControlNet weights)
- Total storage: ~20GB for all checkpoints + training state
#### VRAM Usage (H100 80GB)
With batch_size=24:
- SDXL base model (FP16): ~6-7GB
- ControlNet model: ~2.5GB
- 8-bit Adam optimizer: ~3-4GB
- Gradients (with checkpointing): ~3-4GB
- Activations (batch 24): ~15-20GB
- **Total: ~35-40GB** ✅ Fits comfortably in 80GB
With batch_size=32 (max):
- Activations increase to ~20-25GB
- **Total: ~42-48GB** ✅ Still fits with headroom
**Recommended:** Start with batch_size=24, monitor VRAM in W&B, can increase to 32 if using <60GB.
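To verify actual headroom rather than relying on the estimates above, peak allocation can be read from PyTorch's built-in counters after a few training steps; a minimal sketch:
```python
import torch

# Run on the training node after a handful of steps (or one forward/backward).
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"Peak VRAM: {peak_gb:.1f} GB of {total_gb:.1f} GB")
if peak_gb < 60:
    print("Headroom available - batch_size=32 should be safe")
```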
#### Risk Mitigation for Long Training
**Strategy 1: Incremental Training**
```bash
# Start with 500k samples to validate the approach
--max_train_samples=500000
# Cost: ~$10, Duration: ~4 hours on a single H100
# If results are good, continue to full 3M
```
**Strategy 2: Early Checkpoint Evaluation**
```bash
# Evaluate quality at checkpoints (single-H100 3M run: ~24 hours, ~$60 total):
# - checkpoint-5000 (~480k samples, ~4 hours, ~$10)
# - checkpoint-10000 (~960k samples, ~8 hours, ~$20)
# - checkpoint-15000 (~1.4M samples, ~12 hours, ~$30)
# Can stop early if quality plateaus
```
**Strategy 3: Use Spot Instances**
- Many cloud providers offer H100 spot instances at 50-70% discount
- Cost could drop to ~$0.75-$1.25/hr (~$18-$30 for the 24-hour 3M run)
- Requires `--resume_from_checkpoint="latest"` (already included)
- Risk: Training may be interrupted, but will resume automatically
#### When to Use Full 3M Training
**Use 99k samples if:**
- ✅ First time training ControlNet
- ✅ Testing hyperparameters
- ✅ Budget constrained (<$50)
- ✅ Need results quickly (1-2 days)
**Use 3M samples if:**
- ✅ 99k results are good but you want better quality
- ✅ Commercial production use (worth the investment)
- ✅ Training other ControlNet types (can reuse knowledge)
- ✅ Contributing to research/community (publishable results)
- ✅ Budget allows ($60-$80)
### Phase 4: Training Monitoring
**Setup Weights & Biases:**
```bash
wandb login
# Use wandb to track:
# - Loss curves
# - Validation images at each validation step (every 1500 steps here)
# - Learning rate schedule
# - GPU utilization
```
**Checkpoints:**
- Saved every 1,500 steps to `$OUTPUT_DIR/checkpoint-{step}`
- With ~3,094 total steps, will get checkpoints at:
- `checkpoint-1500` (~97% of epoch 1)
- `checkpoint-3000` (~94% of epoch 2)
- Final model at end of training
- Can resume training if interrupted: `--resume_from_checkpoint="./controlnet-brightness-sdxl/checkpoint-1500"`
**Validation:**
- Uses the 1,000 validation samples from `train[2999000:3000000]` (the fixed last-1k slice)
- Runs every 1,500 steps (at checkpoints)
- W&B logs validation images and metrics
- Note: the stock diffusers script logs validation images only when `--validation_prompt` and `--validation_image` are supplied
### Validation Metrics (Automatic)
**No configuration needed!** The training script automatically computes validation metrics:
**Loss Function (Automatic)**:
- **Default**: MSE (Mean Squared Error) between the model's noise prediction and the sampled noise
- **Formula**: `loss = F.mse_loss(model_pred.float(), target.float())`
**What Gets Logged to W&B**:
1. **Training loss** (every step)
2. **Validation images** (generated samples every `--validation_steps=1500` steps)
3. **Learning rate** (schedule tracking)
4. **GPU utilization** (hardware monitoring)
**Validation Process**:
1. Every 1500 steps, training pauses
2. The model generates images from the validation prompts and conditioning images
3. Generated images are logged to W&B alongside the training-loss curve
4. Training resumes
**No manual metric code is needed** - logging is handled by the training script!
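For intuition, the objective boils down to predicting the sampled noise; a runnable toy sketch with dummy tensors standing in for the real VAE latents and UNet/ControlNet outputs (shapes are illustrative):
```python
import torch
import torch.nn.functional as F

# Dummy stand-ins: in the real script, latents come from the VAE and
# model_pred from the frozen SDXL UNet fed with ControlNet residuals.
latents = torch.randn(4, 4, 64, 64)                 # batch of 512px SDXL latents
noise = torch.randn_like(latents)                   # epsilon sampled each step
model_pred = noise + 0.1 * torch.randn_like(noise)  # pretend UNet output

# Epsilon-prediction objective: the target is the sampled noise itself.
loss = F.mse_loss(model_pred.float(), noise.float())
print(f"training loss: {loss.item():.4f}")
```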
### Phase 5: Model Evaluation & Publishing
**Test Inference:**
First, install QR code library if needed:
```bash
pip install qrcode[pil]
```
Then run inference:
```python
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
import torch
import qrcode
from PIL import Image

# Generate QR code for testing
print("Generating QR code for https://google.com...")
qr = qrcode.QRCode(
    version=1,
    error_correction=qrcode.constants.ERROR_CORRECT_H,
    box_size=10,
    border=4,
)
qr.add_data("https://google.com")
qr.make(fit=True)

# Create QR code image and resize to 1024x1024
qr_image = qr.make_image(fill_color="black", back_color="white")
qr_image = qr_image.resize((1024, 1024), Image.LANCZOS)
print(f"QR code generated: {qr_image.size}")

# Load trained ControlNet
print("Loading ControlNet model...")
controlnet = ControlNetModel.from_pretrained(
    "./controlnet-brightness-sdxl",  # final model; intermediate saves live under checkpoint-XXXX/controlnet
    torch_dtype=torch.float16,
)

# Load SDXL pipeline with ControlNet
print("Loading SDXL pipeline...")
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.enable_xformers_memory_efficient_attention()
pipe.to("cuda")

# Generate artistic QR code
print("Generating artistic QR code...")
image = pipe(
    prompt="a beautiful garden scene with flowers, highly detailed, professional photography",
    negative_prompt="blurry, low quality, distorted",
    image=qr_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.45,  # Adjust 0.3-0.6 for balance
    guidance_scale=7.5,
).images[0]

# Save results
qr_image.save("original_qr.png")
image.save("artistic_qr_result.png")
print("✅ Done! Check artistic_qr_result.png")
print("📱 Scan with phone to verify the QR code still works!")
```
**Testing Different Conditioning Scales:**
```python
# Test multiple conditioning scales to find the best balance
for scale in [0.3, 0.4, 0.5, 0.6]:
    print(f"Testing conditioning_scale={scale}...")
    image = pipe(
        prompt="a beautiful garden scene with flowers",
        image=qr_image,
        num_inference_steps=30,
        controlnet_conditioning_scale=scale,
    ).images[0]
    image.save(f"result_scale_{scale}.png")
```
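Scannability can also be checked programmatically instead of with a phone; a minimal sketch using OpenCV's built-in QR detector (assumes `pip install opencv-python`; note it is stricter than most phone scanners on heavily stylized codes):
```python
import cv2

# Decode each generated image and compare against the encoded URL.
detector = cv2.QRCodeDetector()
for scale in [0.3, 0.4, 0.5, 0.6]:
    img = cv2.imread(f"result_scale_{scale}.png")
    decoded, points, _ = detector.detectAndDecode(img)
    status = "scannable" if decoded == "https://google.com" else "NOT scannable"
    print(f"scale={scale}: {status} (decoded: {decoded!r})")
```
Lower scales favor aesthetics; higher scales favor scannability. The lowest scale that decodes reliably is a good production default.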
**Publish to HuggingFace Hub:**
```bash
# After validation
huggingface-cli login
python scripts/upload_to_hub.py \
  --model_path="./controlnet-brightness-sdxl" \
  --repo_name="Oysiyl/controlnet-brightness-sdxl"
```
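`scripts/upload_to_hub.py` is a project-specific helper; if it isn't available, a minimal equivalent using `huggingface_hub` directly (repo id taken from the example above):
```python
from huggingface_hub import HfApi

api = HfApi()  # uses the token from `huggingface-cli login`
repo_id = "Oysiyl/controlnet-brightness-sdxl"
api.create_repo(repo_id, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./controlnet-brightness-sdxl",  # final model directory
    repo_id=repo_id,
    repo_type="model",
)
```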
## Cost-Benefit Analysis
### Investment Required (Updated for Single H100)
**Strategy A: Free Tier (99k Quick Test)**
| Component | Cost/Time |
|-----------|-----------|
| GPU Credits (99k samples, 2 epochs, single H100) | $1.88 |
| Setup Time | 1-2 hours |
| Training Duration | **45 minutes** β‘ |
| Testing & Validation | 2-3 hours |
| **Total Time** | **~4-6 hours** (same day) |
| **Total Cost** | **$1.88** |
**Strategy B: Pro Plan (Full 3M Training)**
| Component | Cost/Time |
|-----------|-----------|
| Pro Subscription (can cancel after) | $20/month |
| Included credits value | -$13 (240 credits/year) |
| GPU Credits (3M samples, 1 epoch, 6× H100) | $60 |
| Setup Time | 1-2 hours |
| Training Duration | **4 hours** β‘ |
| Testing & Validation | 2-3 hours |
| **Total Time** | **~8 hours** (same day) |
| **Total Cost** | **$80** ($20 sub + $60 GPU) |
| **Net Cost** | **$67** (after annual credit value) |
**Strategy C: All-in-One (Pro Plan, Test Everything)**
| Component | Cost/Time |
|-----------|-----------|
| Pro Subscription | $20/month |
| 99k test (6× H100) | $1.88 (7.5 min) |
| 500k training (6× H100) | $10 (40 min) |
| 3M training (6× H100) | $60 (4 hours) |
| **Total GPU Time** | **~5 hours** |
| **Total GPU Cost** | **$71.88** |
| **Total with Sub** | **$91.88** |
| **Net after credits** | **$78.88** |
**Recommendation:** Start with Strategy A ($1.88), upgrade to Strategy B if promising
### Value Delivered
1. **Unblocks SDXL Migration**: Enables upgrade from SD 1.5 to higher quality SDXL
2. **Better Image Quality**: SDXL produces superior 1024×1024 images vs SD 1.5's 512×512
3. **Community Value**: First public SDXL brightness ControlNet (potential citations/recognition)
4. **No Alternatives**: Cannot proceed with SDXL QR code generation without this model
5. **Reusable Asset**: Once trained, can be used indefinitely
### Risk Mitigation
- **Start Small**: Train on 99k samples first (~$2, under an hour on an H100)
- **Evaluate Early**: Check quality at intermediate checkpoints (1500/3000 for the 99k run; 5000/10000 for 3M)
- **Iterative Approach**: Extend training only if initial results are promising
- **Fallback**: Can continue using SD 1.5 if SDXL training fails
## Alternative Approaches Considered
### Option 1: Train Brightness ControlNet for SDXL (RECOMMENDED)
- **Pros**:
- Proven training pipeline (diffusers script exists)
- Same dataset as original SD 1.5 model
- Good quality/cost balance
- Community support and documentation
- License-friendly (SDXL is permissive)
- **Cons**:
- Requires GPU time investment (~$2-$80)
- Training takes 45 minutes to 24 hours depending on size and GPU count
- Still requires 24GB+ VRAM for inference
- **Cost**: ~$10 for 500k samples on a single H100
- **Risk**: Low - well-documented process
- **Verdict**: ✅ **Best choice for production use**
### Option 2: Train Brightness ControlNet for Flux Schnell
- **Pros**:
- Apache 2.0 license (fully commercial)
- Faster inference than Flux Dev (3× speedup)
- Same architecture as Dev (12B parameters)
- Would be first-of-its-kind community contribution
- **Cons**:
- ⚠️ **No existing training scripts for Schnell**
- Would need to adapt Flux Dev training code
- Unknown if distillation affects ControlNet training
- Still requires 32-40GB VRAM (heavier than SDXL)
- Higher risk and uncertainty
- Longer training time due to larger model
- **Cost**: $200-$500 (estimated, higher due to larger model)
- **Risk**: High - experimental, no precedent
- **Verdict**: **Experimental** - only if willing to pioneer new territory
### Option 3: Use SDXL LoRA for Brightness Control
- **Pros**: No training required, immediate availability
- **Cons**: Less precise control than dedicated ControlNet, may not work well for QR codes
- **Verdict**: Worth testing but likely insufficient for QR code use case
### Option 4: Latent Initialization Approach
- **Pros**: Architecture-agnostic, works with both SDXL and Flux
- **Cons**: Less control over brightness distribution, requires experimentation
- **Verdict**: Good fallback but not as reliable as ControlNet
### Option 5: Wait for Community Release
- **Pros**: Zero cost, zero effort
- **Cons**: No timeline, may never happen, blocks project progress
- **Verdict**: Not viable for active development
### Option 6: Hybrid Tile ControlNet + Post-Processing
- **Pros**: Tile ControlNet available for SDXL
- **Cons**: Doesn't address brightness control directly
- **Verdict**: Complementary but not a replacement
**Conclusion**: Training SDXL ControlNet is the most reliable solution. Flux Schnell is interesting for research but carries significant execution risk.
## Recommended Action Plan
### Immediate Setup (Day 1)
1. **Launch Lightning AI Instance**: single H100 (Free tier) or 6× H100 (Pro plan)
2. **Run Setup Commands**: Install all dependencies (see Phase 3 above)
3. **Authenticate**: HuggingFace and W&B login
4. **Clone Diffusers**: Get training scripts
### Training Phase (Day 1 - Morning) β‘
5. **Start Training**: Launch training with 99k samples (~45 minutes on a single H100)
6. **Monitor W&B**: Track loss curves and validation images in real-time
7. **First Checkpoint**: Review checkpoint-1500 (~25 minutes in)
8. **Training Complete**: Total ~45 minutes for full 2-epoch run
### Evaluation Phase (Day 1 - Afternoon)
9. **Post-Training Validation**: Run inference on 1k validation set
10. **QR Code Testing**: Test with actual QR codes and measure scannability (see the sketch after this list)
11. **Quality Assessment**: Compare to SD 1.5 brightness ControlNet
12. **Decision Point**:
- If quality good: Publish and integrate (move to next phase)
- If needs improvement: Launch 2nd training run with adjusted hyperparameters (~45 min)
- Can try 3-4 different configurations in the same day!
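For step 10, scan rate can be measured automatically rather than by eye. A hedged sketch using OpenCV's `QRCodeDetector`; the output directory layout is a placeholder, not something the project defines:

```python
# Measure QR scan rate over a directory of generated validation images.
from pathlib import Path

import cv2

detector = cv2.QRCodeDetector()
images = sorted(Path("validation_outputs").glob("*.png"))  # placeholder path
decoded = 0
for path in images:
    payload, _, _ = detector.detectAndDecode(cv2.imread(str(path)))
    if payload:  # empty string means the code failed to decode
        decoded += 1
rate = decoded / max(len(images), 1)
print(f"scan rate: {rate:.1%} ({decoded}/{len(images)})")  # target: 95%+
```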
### Optional: Full Dataset Training (Day 1 - Evening)
12a. **If 99k results promising**: Launch full 3M training (~2 hours on 8×H100)
12b. **Monitor overnight**: W&B tracks progress automatically
12c. **Next morning**: Evaluate final model quality
### Integration Phase (Day 2)
13. **Publish to HuggingFace**: Upload best checkpoint (see the sketch after this list)
14. **Update app_sdxl.py**: Integrate new ControlNet model
15. **Production Testing**: End-to-end QR code generation tests
16. **Documentation**: Update README with SDXL support
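Step 13 can be done with `huggingface_hub` directly. A minimal sketch, assuming the best checkpoint was exported to `./controlnet-sdxl-brightness`; the repo id is a placeholder:

```python
# Upload the trained ControlNet checkpoint to the Hub.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/controlnet-sdxl-brightness", exist_ok=True)
api.upload_folder(
    folder_path="./controlnet-sdxl-brightness",  # exported checkpoint dir
    repo_id="your-username/controlnet-sdxl-brightness",
    commit_message="SDXL brightness ControlNet (grayscale_image_aesthetic_3M)",
)
```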
**Total Timeline: 1-2 days** (vs previous estimate of 5 days)
## Success Metrics
1. **QR Code Scannability**: 95%+ scan rate on generated images
2. **Visual Quality**: Subjective improvement over SD 1.5 outputs
3. **Control Precision**: Ability to adjust brightness strength across the 0.0-1.0 range (see the sweep sketch after this list)
4. **Training Loss**: Convergence to < 0.1 validation loss
5. **Community Adoption**: Positive feedback if published publicly
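Metric 3 can be verified with a conditioning-scale sweep; `controlnet_conditioning_scale` is the standard diffusers pipeline argument for this. A sketch assuming a `pipe` constructed as in the loading example under "Critical Files to Modify" below, plus a grayscale `brightness_map` conditioning image (both assumptions, not existing project code):

```python
# Sweep brightness strength across the 0.0-1.0 range and save outputs
# for side-by-side comparison.
for scale in (0.0, 0.25, 0.5, 0.75, 1.0):
    image = pipe(
        prompt="a sunlit forest clearing, detailed illustration",
        image=brightness_map,                 # grayscale conditioning image
        controlnet_conditioning_scale=scale,  # brightness control strength
        num_inference_steps=30,
    ).images[0]
    image.save(f"sweep_scale_{scale:.2f}.png")
```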
## Critical Files to Modify
Once model is trained:
- `app.py:48-56` - Add SDXL ControlNet loading (see the loading sketch after this list)
- `app.py:1880-1886` - Update standard pipeline with SDXL support
- `app.py:2343-2349` - Update artistic pipeline with SDXL support
- `app_sdxl.py` - Complete SDXL-specific implementation
- `comfy/sd_configs/` - Add SDXL configuration if needed
## Flux Schnell Training Considerations (If Pursuing)
If you decide to pursue Flux Schnell ControlNet training despite the risks:
**Required Adaptations:**
1. **Training Script Modification**: Adapt `train_controlnet_flux.py` to work with Schnell
- Model path: `black-forest-labs/FLUX.1-schnell` instead of `FLUX.1-dev`
- Verify architecture compatibility (distillation may affect ControlNet layers); see the config-diff sketch after this list
- Test with small pilot run (1000 steps) before full training
2. **Hardware Requirements**:
- Minimum: H100 (80GB VRAM) - $1.99/hr
- A100 40GB likely insufficient for Flux training
- Estimated training: 150-250 hours on H100 (~$300-$500)
3. **Dataset Considerations**:
- Flux uses 1024×1024 resolution (same as SDXL)
- Dataset would need upscaling from 512×512 or re-preprocessing
- Consider starting with 100k subset for validation
4. **Verification Steps**:
- Test if Schnell's distillation preserves ControlNet training capability
- Compare with Flux Dev training (if available for testing)
- Validate brightness control precision matches SD 1.5 quality
**Risk Assessment**:
- **Technical Risk**: High - no proven training path
- **Time Risk**: Medium-High - debugging could extend timeline significantly
- **Cost Risk**: High - may require multiple training attempts ($500+)
- **Success Probability**: 50-70% (educated guess based on architecture similarity)
**Recommendation**: Only pursue if:
1. SDXL training completes successfully first (de-risk approach)
2. You're willing to contribute pioneering work to the community
3. Budget allows for experimental work ($500-1000 total including failed attempts)
## References
### SDXL Training
- **SDXL Training Script**: https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet_sdxl.py
- **Dataset**: https://huggingface.co/datasets/latentcat/grayscale_image_aesthetic_3M
- **Reference Article**: https://latentcat.com/en/blog/brightness-controlnet
- **Original SD 1.5 Model**: https://huggingface.co/latentcat/latentcat-controlnet
- **Lightning AI**: https://lightning.ai/
### Flux Information
- **Flux Schnell Model**: https://huggingface.co/black-forest-labs/FLUX.1-schnell
- **Flux Dev Training Script**: https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet_flux.py
- **XLabs-AI Flux ControlNets**: https://huggingface.co/XLabs-AI/flux-controlnet-collections
- **Flux Comparison Guide**: [Flux Dev vs Schnell Comparison](https://www.stablediffusiontutorials.com/2025/04/flux-schnell-dev-pro.html)
- **Flux Architecture Discussion**: [GitHub Issue #408](https://github.com/black-forest-labs/flux/issues/408)
- **License Comparison**: [Flux Model Guide](https://stable-diffusion-art.com/flux/)
## Final Recommendation (Updated December 2024 - Lightning.ai)
**Proceed with SDXL Brightness ControlNet Training on Single H100 (Free Tier)**
Based on Lightning.ai pricing and multi-GPU requirements, the recommended path is:
### Phase 1: Quick Validation (Free Tier)
1. **Start with 99k samples on single H100**
- Cost: $1.88 in GPU credits
- Duration: 45 minutes
- Platform: Lightning.ai Free tier
- Purpose: Validate training pipeline and quality
### Phase 2: Production Training (Choose Based on Phase 1)
**Option A: Budget Approach (Free Tier)**
- Run full 3M dataset on single H100
- Cost: $60 GPU credits, $0 subscription
- Duration: 24 hours
- Total: $60
- Best for: One-time training, have patience
**Option B: Speed Approach (Pro Plan)**
- Upgrade to Pro plan ($20/month annual)
- Run full 3M dataset on 6× H100
- Cost: $60 GPU + $20 subscription = $80
- Net cost: $67 (after $13 annual credit value)
- Duration: 4 hours
- Best for: Need results same day, may iterate
### Recommended Strategy
**Most Cost-Effective Path:**
1. **Day 1 Morning**: Run 99k test on Free tier ($1.88, 45 min)
2. **Day 1 Afternoon**: Evaluate results
3. **If promising**:
- **Budget route**: Start 3M on Free tier ($60, 24 hrs) → Total: $61.88
- **Speed route**: Upgrade to Pro, run 3M ($80, 4 hrs) → Total: $81.88
4. **Cancel Pro** after training if using speed route
### Why This Path
- **Low Risk Entry**: Only $1.88 to validate entire pipeline
- **Flexible Scaling**: Choose speed vs cost based on results
- **Proven Pipeline**: HuggingFace Diffusers battle-tested script
- **Reference Success**: Original SD 1.5 model trained on same dataset
- **H100 Advantage**: 6.3× faster than A100 even on single GPU
- **Cost-Effective**: $62-$82 total (vs $900+ on older plans)
- **Unblocks Migration**: Enables full SDXL upgrade from SD 1.5
### Cost Breakdown Comparison
| Approach | Hardware | Duration | GPU Cost | Sub Cost | Total | Timeline |
|----------|----------|----------|----------|----------|-------|----------|
| **Old Plan (A100)** | Single A100 | 180 hours | $900-1,200 | $0 | $900-1,200 | 1 week |
| **NEW: Free Tier** | Single H100 | 24.75 hours | $61.88 | $0 | **$61.88** | 2 days |
| **NEW: Pro Plan** | 6× H100 | 4.75 hours | $61.88 | $20 | **$81.88** | 1 day |
**Savings vs Old Plan:**
- Free tier: Save $838-$1,138 and 6 days
- Pro plan: Save $818-$1,118 and 6 days
### Pro Plan ROI Analysis
**When is Pro worth it?**
- $20 extra to save 20 hours (24h → 4h)
- = **$1/hour saved**
- Plus: Can test multiple hyperparameters same day
- Plus: Includes $13/year in credits
**Get Pro if:**
- ✅ You value time over $1/hour
- ✅ Planning to iterate on hyperparameters
- ✅ Need results urgently
- ✅ Want to test 99k + 500k + 3M in one session
**Skip Pro if:**
- ❌ Doing one-time training only
- ❌ Can wait 24 hours
- ❌ Budget constrained
- ❌ 99k test was sufficient
### Next Steps
Once plan is approved:
1. Set up Lightning AI account with H100 GPU access (Free tier)
2. Clone diffusers repository and install requirements
3. Verify dataset access and download capabilities
4. Prepare validation QR codes for quality testing
5. Launch training with recommended hyperparameters
6. Monitor via Weights & Biases for loss curves and validation images
7. Evaluate checkpoints at 10k, 25k, 50k steps
8. Complete training and publish to HuggingFace Hub
9. Integrate into `app_sdxl.py` for production use
**Flux Schnell** remains an option for future exploration once SDXL is production-ready, but is deprioritized due to experimental nature and higher resource requirements.