Add comprehensive accelerate and H100 optimization guide
Added sections:
- Accelerate configuration for single vs multi-GPU
- H100-optimized training parameters and batch sizes
- Single H100 optimized command (batch_size=32, 99k in 45min)
- 6× H100 multi-GPU command (batch_size=24/GPU, 3M in 4hrs)
- Batch size selection guide for different GPU configs
- Memory optimization tips and OOM troubleshooting
Key recommendations:
- Single H100: batch_size=32, grad_accum=4 (effective=128)
- 6× H100: batch_size=24/GPU, grad_accum=2 (effective=288)
- Added dataloader_num_workers=8 for faster data loading
- Added set_grads_to_none for faster gradient zeroing
- More frequent checkpointing for H100 (every 750 steps)
SDXL_ControlNet_Brightness_Training_Plan.md (changed)

@@ -724,6 +724,245 @@ The settings above are optimized for memory efficiency:
This keeps effective batch size = 8 × 4 = 32 (half of 64), but still works well.

### Accelerate Configuration for Multi-GPU Training

**Important:** Multi-GPU training on Lightning.ai requires the Pro plan ($20/month, billed annually).

#### Single GPU (Free Tier) - No Configuration Needed

For single GPU training on Free tier, `accelerate launch` works without any configuration:

```bash
# No accelerate config needed - auto-detects single GPU
accelerate launch train_controlnet_sdxl.py [args...]
```

#### Multi-GPU (Pro Plan) - Configure Before Training

For 6× H100 training on Pro plan, configure accelerate once:

```bash
# Run configuration wizard
accelerate config
```

**Configuration Options for 6× H100:**

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU  # Uses DistributedDataParallel (DDP) across the GPUs
num_machines: 1              # Single machine with 6 GPUs
num_processes: 6             # One process per GPU
gpu_ids: all                 # Use all available GPUs
mixed_precision: fp16        # Match training script
use_cpu: false
dynamo_backend: NO           # Disable torch.compile for compatibility
```

**Quick Config (Non-Interactive):**

```bash
# Create accelerate config file directly
cat > ~/.cache/huggingface/accelerate/default_config.yaml << 'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 6
gpu_ids: all
mixed_precision: fp16
use_cpu: false
dynamo_backend: NO
EOF
```

**Verify Configuration:**

```bash
# Check configuration
accelerate env

# Test multi-GPU setup
accelerate test
```

**Launch Multi-GPU Training:**

```bash
# With configuration file, launch works same as single GPU
accelerate launch train_controlnet_sdxl.py [args...]

# Or specify config explicitly
accelerate launch --config_file ~/.cache/huggingface/accelerate/default_config.yaml \
  train_controlnet_sdxl.py [args...]
```

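Before launching, it can be worth confirming that all six GPUs are actually visible. A minimal sanity check, assuming PyTorch is installed in the same environment as the training script:

```python
# List the GPUs PyTorch can see before a multi-GPU launch.
# If this prints fewer than 6 devices, check CUDA_VISIBLE_DEVICES and the machine type.
import torch

count = torch.cuda.device_count()
print(f"Visible GPUs: {count}")
for i in range(count):
    props = torch.cuda.get_device_properties(i)
    print(f"  cuda:{i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")
```
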
### H100-Optimized Training Parameters

The H100 has **80GB VRAM** and roughly 1,979 FP16 Tensor Core TFLOPS (with sparsity), allowing larger batch sizes and higher throughput than the A100.

#### Optimal Batch Size for H100

**Default settings (designed for A100 40GB):**
```bash
--train_batch_size=16
--gradient_accumulation_steps=4
# Effective batch size: 16 × 4 = 64 samples/step
# VRAM usage: ~22-28GB
```

**H100-optimized settings (80GB VRAM):**
```bash
--train_batch_size=32            # 2× larger than A100
--gradient_accumulation_steps=4
# Effective batch size: 32 × 4 = 128 samples/step
# VRAM usage: ~40-48GB (still plenty of headroom)
```

**Aggressive H100 settings (maximum throughput):**
```bash
--train_batch_size=48            # 3× larger than A100
--gradient_accumulation_steps=2  # Reduce accumulation since batch is larger
# Effective batch size: 48 × 2 = 96 samples/step
# VRAM usage: ~55-65GB
# Faster training due to fewer gradient accumulation steps
```

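The effective batch sizes and step counts quoted below are simple arithmetic: per-GPU batch size × gradient accumulation steps × number of GPUs. The illustrative helper below (not part of the training script; `effective_batch` and `steps_per_epoch` are made-up names) reproduces the numbers used in this guide:

```python
# Illustrative arithmetic behind the effective batch sizes and step counts in this guide.
# Exact step counts reported by the training script may differ slightly due to dataloader rounding.

def effective_batch(per_gpu_batch: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return per_gpu_batch * grad_accum * num_gpus

def steps_per_epoch(num_samples: int, per_gpu_batch: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Approximate optimizer steps to see the dataset once (ignores the final partial batch)."""
    return num_samples // effective_batch(per_gpu_batch, grad_accum, num_gpus)

# Single H100, 99k subset: effective batch 128, ~773 steps per epoch
print(effective_batch(32, 4), steps_per_epoch(99_000, 32, 4))
# 6× H100 (Pro), ~3M samples: effective batch 288, ~10,413 steps per epoch
print(effective_batch(24, 2, 6), steps_per_epoch(2_999_000, 24, 2, 6))
```
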
#### Single H100 Training Command (99k samples)

**Optimized for H100 80GB:**

```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-h100"

accelerate launch train_controlnet_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --dataset_name="latentcat/grayscale_image_aesthetic_3M" \
  --max_train_samples=99000 \
  --conditioning_image_column="conditioning_image" \
  --image_column="image" \
  --caption_column="text" \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --train_batch_size=32 \
  --gradient_accumulation_steps=4 \
  --num_train_epochs=2 \
  --checkpointing_steps=750 \
  --validation_steps=750 \
  --tracker_project_name="brightness-controlnet-sdxl-h100" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam \
  --dataloader_num_workers=8 \
  --set_grads_to_none
```

**Key H100 Optimizations:**
- `--train_batch_size=32` (vs 16 on A100) - 2× larger batches
- `--gradient_accumulation_steps=4` - Effective batch = 128
- `--checkpointing_steps=750` - More frequent checkpoints (every ~96k samples)
- `--dataloader_num_workers=8` - Faster data loading (H100 nodes typically have plenty of CPU cores; see the sketch after this list)
- `--set_grads_to_none` - Setting gradients to None is faster than zeroing them in place

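As a rough illustration of what `--dataloader_num_workers` controls: the PyTorch `DataLoader` uses worker processes to decode and transform samples in parallel so the GPU is not starved for data. A toy sketch with a stand-in dataset (not the real pipeline):

```python
# Toy illustration of worker-parallel data loading; the dataset here is a stand-in.
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(256, 3, 64, 64))  # placeholder for the real image dataset
    loader = DataLoader(
        dataset,
        batch_size=32,    # mirrors --train_batch_size
        shuffle=True,
        num_workers=8,    # mirrors --dataloader_num_workers=8
        pin_memory=True,  # speeds up host-to-GPU copies
    )
    for (batch,) in loader:
        pass  # a training step would consume `batch` here

if __name__ == "__main__":
    main()
```
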
**Expected Performance:**
- Steps per epoch: 99,000 ÷ 128 ≈ 773 steps
- Total steps (2 epochs): ~1,546 steps
- Training time: ~38-45 minutes on a single H100
- Checkpoints saved at: 750 and 1,500 steps

#### 6× H100 Training Command (3M samples) - Pro Plan

**For Pro plan multi-GPU training:**

```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="./controlnet-brightness-sdxl-multi-h100"

# Configure accelerate for 6 GPUs (if not done already)
accelerate config  # Select MULTI_GPU, 6 processes

# Launch training
accelerate launch train_controlnet_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --dataset_name="latentcat/grayscale_image_aesthetic_3M" \
  --max_train_samples=2999000 \
  --conditioning_image_column="conditioning_image" \
  --image_column="image" \
  --caption_column="text" \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --train_batch_size=24 \
  --gradient_accumulation_steps=2 \
  --num_train_epochs=1 \
  --checkpointing_steps=2500 \
  --validation_steps=2500 \
  --tracker_project_name="brightness-controlnet-sdxl-3M" \
  --report_to="wandb" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam \
  --dataloader_num_workers=8 \
  --set_grads_to_none \
  --resume_from_checkpoint="latest"
```

**Multi-GPU Optimizations:**
- `--train_batch_size=24` per GPU × 6 GPUs = 144 samples per step (before accumulation)
- `--gradient_accumulation_steps=2` - Effective batch = 144 × 2 = 288
- `--checkpointing_steps=2500` - Save every ~720k samples
- `--resume_from_checkpoint="latest"` - Auto-resume if interrupted

**Expected Performance:**
- Effective batch size: 288 samples/step
- Steps per epoch: 2,999,000 ÷ 288 ≈ 10,413 steps
- Training time: ~4 hours on 6× H100
- Checkpoints: 2,500, 5,000, 7,500, 10,000 steps + final

#### Batch Size Selection Guide

| GPU Config | VRAM | Recommended batch_size | grad_accum_steps | Effective Batch | Training Speed |
|------------|------|------------------------|------------------|-----------------|----------------|
| Single L4 | 24GB | 8 | 4 | 32 | Slow (baseline) |
| Single A100 | 40GB | 16 | 4 | 64 | 2× faster than L4 |
| Single H100 | 80GB | 32 | 4 | 128 | 6× faster than L4 |
| 6× H100 (Pro) | 480GB | 24/GPU | 2 | 288 | 36× faster than L4 |

**Rule of Thumb:**
- Larger `train_batch_size` = better GPU utilization, faster training
- Larger effective batch size = more stable training, better convergence
- H100 can typically handle 2-3× larger batch sizes than A100 with the same settings

#### Memory Optimization Tips

**If you encounter OOM (Out of Memory) errors on H100:**

1. **Reduce batch size incrementally:**
```bash
--train_batch_size=32   # Start here
--train_batch_size=24   # If OOM
--train_batch_size=16   # If still OOM
```

2. **Enable additional memory optimizations:**
```bash
--gradient_checkpointing                      # Already enabled
--use_8bit_adam                               # Already enabled
--enable_xformers_memory_efficient_attention  # Already enabled
--set_grads_to_none                           # Use this instead of zeroing grads in place (see note after this list)
```

3. **Use gradient accumulation to maintain effective batch size:**
```bash
# If reducing from batch_size=32 to batch_size=16
--train_batch_size=16
--gradient_accumulation_steps=8   # Double accumulation to keep effective=128
```

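For context on `--set_grads_to_none` (tip 2 above): in PyTorch this flag is typically wired to `optimizer.zero_grad(set_to_none=True)`, which drops the gradient tensors instead of overwriting them with zeros. A minimal sketch of the difference, outside the actual training loop:

```python
# Minimal sketch of "set grads to None" vs. zeroing in place (not the actual training loop).
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()

# Zeroing in place: writes zeros into every .grad tensor (extra memory traffic each step).
optimizer.zero_grad(set_to_none=False)

# With set_to_none=True, .grad is released and lazily re-allocated on the next backward
# pass - slightly lower memory use and less work per step.
optimizer.zero_grad(set_to_none=True)
```
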
### Full 3M Dataset Training Options

**For maximum quality training on the complete dataset:**