Oysiyl committed
Commit 82b2bf3 · 1 Parent(s): 4670d25

Add comprehensive accelerate and H100 optimization guide


Added sections:
- Accelerate configuration for single vs multi-GPU
- H100-optimized training parameters and batch sizes
- Single H100 optimized command (batch_size=32, 99k in 45min)
- 6× H100 multi-GPU command (batch_size=24/GPU, 3M in 4hrs)
- Batch size selection guide for different GPU configs
- Memory optimization tips and OOM troubleshooting

Key recommendations:
- Single H100: batch_size=32, grad_accum=4 (effective=128)
- 6× H100: batch_size=24/GPU, grad_accum=2 (effective=288)
- Added dataloader_num_workers=8 for faster data loading
- Added set_grads_to_none for faster gradient zeroing
- More frequent checkpointing for H100 (every 750 steps)

SDXL_ControlNet_Brightness_Training_Plan.md CHANGED
@@ -724,6 +724,245 @@ The settings above are optimized for memory efficiency:
  ```
  This keeps effective batch size = 8 × 4 = 32 (half of 64), but still works well.
 
+ ### Accelerate Configuration for Multi-GPU Training
+
+ **Important:** Multi-GPU training on Lightning.ai requires the Pro plan ($20/month, billed annually).
+
+ #### Single GPU (Free Tier) - No Configuration Needed
+
+ For single-GPU training on the Free tier, `accelerate launch` works without any configuration:
+
+ ```bash
+ # No accelerate config needed - auto-detects single GPU
+ accelerate launch train_controlnet_sdxl.py [args...]
+ ```
+
+ #### Multi-GPU (Pro Plan) - Configure Before Training
+
+ For 6× H100 training on the Pro plan, configure accelerate once:
+
+ ```bash
+ # Run configuration wizard
+ accelerate config
+ ```
+
+ **Configuration Options for 6× H100:**
+
+ ```yaml
+ compute_environment: LOCAL_MACHINE
+ distributed_type: MULTI_GPU    # Uses DistributedDataParallel across GPUs
+ num_machines: 1                # Single machine with 6 GPUs
+ num_processes: 6               # One process per GPU
+ gpu_ids: all                   # Use all available GPUs
+ mixed_precision: fp16          # Match training script
+ use_cpu: false
+ dynamo_backend: NO             # Disable torch.compile for compatibility
+ ```
+
+ **Quick Config (Non-Interactive):**
+
+ ```bash
+ # Create the accelerate config file directly
+ cat > ~/.cache/huggingface/accelerate/default_config.yaml << 'EOF'
+ compute_environment: LOCAL_MACHINE
+ distributed_type: MULTI_GPU
+ num_machines: 1
+ num_processes: 6
+ gpu_ids: all
+ mixed_precision: fp16
+ use_cpu: false
+ dynamo_backend: NO
+ EOF
+ ```
+
+ **Verify Configuration:**
+
+ ```bash
+ # Check configuration
+ accelerate env
+
+ # Test multi-GPU setup
+ accelerate test
+ ```
+
+ **Launch Multi-GPU Training:**
+
+ ```bash
+ # With a configuration file, launch works the same as single GPU
+ accelerate launch train_controlnet_sdxl.py [args...]
+
+ # Or specify the config explicitly
+ accelerate launch --config_file ~/.cache/huggingface/accelerate/default_config.yaml \
+   train_controlnet_sdxl.py [args...]
+ ```
+
+ ### H100-Optimized Training Parameters
+
+ The H100 GPU has **80GB VRAM** and **1979 TFLOPS** (FP16 Tensor Core, with sparsity), allowing larger batch sizes and higher throughput than the A100.
+
+ #### Optimal Batch Size for H100
+
+ **Default settings (designed for A100 40GB):**
+ ```bash
+ --train_batch_size=16
+ --gradient_accumulation_steps=4
+ # Effective batch size: 16 × 4 = 64 samples/step
+ # VRAM usage: ~22-28GB
+ ```
+
+ **H100-optimized settings (80GB VRAM):**
+ ```bash
+ --train_batch_size=32           # 2× larger than A100
+ --gradient_accumulation_steps=4
+ # Effective batch size: 32 × 4 = 128 samples/step
+ # VRAM usage: ~40-48GB (still plenty of headroom)
+ ```
+
+ **Aggressive H100 settings (maximum throughput):**
+ ```bash
+ --train_batch_size=48           # 3× larger than A100
+ --gradient_accumulation_steps=2 # Less accumulation since the batch is larger
+ # Effective batch size: 48 × 2 = 96 samples/step
+ # VRAM usage: ~55-65GB
+ # Faster training due to fewer gradient accumulation steps
+ ```
+
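As a quick sanity check on the three presets above, the effective batch sizes follow directly from batch size × accumulation steps (a minimal arithmetic sketch; the labels are illustrative, not script flags):

```python
# Effective batch size = train_batch_size x gradient_accumulation_steps
presets = {
    "A100 default":    (16, 4),
    "H100 optimized":  (32, 4),
    "H100 aggressive": (48, 2),
}
for name, (batch, accum) in presets.items():
    print(f"{name}: {batch} x {accum} = {batch * accum} samples/step")
# A100 default: 16 x 4 = 64 samples/step
# H100 optimized: 32 x 4 = 128 samples/step
# H100 aggressive: 48 x 2 = 96 samples/step
```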
+ #### Single H100 Training Command (99k samples)
+
+ **Optimized for H100 80GB:**
+
+ ```bash
+ export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
+ export OUTPUT_DIR="./controlnet-brightness-sdxl-h100"
+
+ accelerate launch train_controlnet_sdxl.py \
+   --pretrained_model_name_or_path=$MODEL_DIR \
+   --dataset_name="latentcat/grayscale_image_aesthetic_3M" \
+   --max_train_samples=99000 \
+   --conditioning_image_column="conditioning_image" \
+   --image_column="image" \
+   --caption_column="text" \
+   --output_dir=$OUTPUT_DIR \
+   --mixed_precision="fp16" \
+   --resolution=512 \
+   --learning_rate=1e-5 \
+   --train_batch_size=32 \
+   --gradient_accumulation_steps=4 \
+   --num_train_epochs=2 \
+   --checkpointing_steps=750 \
+   --validation_steps=750 \
+   --tracker_project_name="brightness-controlnet-sdxl-h100" \
+   --report_to="wandb" \
+   --enable_xformers_memory_efficient_attention \
+   --gradient_checkpointing \
+   --use_8bit_adam \
+   --dataloader_num_workers=8 \
+   --set_grads_to_none
+ ```
+
+ **Key H100 Optimizations:**
+ - `--train_batch_size=32` (vs 16 on A100) - 2× larger batches
+ - `--gradient_accumulation_steps=4` - Effective batch = 128
+ - `--checkpointing_steps=750` - More frequent saves (every ~96k samples)
+ - `--dataloader_num_workers=8` - Faster data loading (the H100 node has 192 CPUs)
+ - `--set_grads_to_none` - Faster than `zero_grad()` on modern GPUs
+
+ **Expected Performance:**
+ - Steps per epoch: 99,000 ÷ 128 ≈ 773 steps
+ - Total steps (2 epochs): ~1,546 steps
+ - Training time: ~38-45 minutes on a single H100
+ - Checkpoints saved at: 750, 1500 steps
+
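The step counts above follow from the effective batch size; a minimal sketch of the arithmetic (assuming drop-last batching, so the partial final batch is discarded):

```python
samples, effective_batch, epochs = 99_000, 32 * 4, 2
steps_per_epoch = samples // effective_batch        # drop-last: partial batch discarded
total_steps = steps_per_epoch * epochs
checkpoints = list(range(750, total_steps + 1, 750))
print(steps_per_epoch, total_steps, checkpoints)    # 773 1546 [750, 1500]
```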
+ #### 6× H100 Training Command (3M samples) - Pro Plan
+
+ **For Pro plan multi-GPU training:**
+
+ ```bash
+ export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
+ export OUTPUT_DIR="./controlnet-brightness-sdxl-multi-h100"
+
+ # Configure accelerate for 6 GPUs (if not done already)
+ accelerate config  # Select MULTI_GPU, 6 processes
+
+ # Launch training
+ accelerate launch train_controlnet_sdxl.py \
+   --pretrained_model_name_or_path=$MODEL_DIR \
+   --dataset_name="latentcat/grayscale_image_aesthetic_3M" \
+   --max_train_samples=2999000 \
+   --conditioning_image_column="conditioning_image" \
+   --image_column="image" \
+   --caption_column="text" \
+   --output_dir=$OUTPUT_DIR \
+   --mixed_precision="fp16" \
+   --resolution=512 \
+   --learning_rate=1e-5 \
+   --train_batch_size=24 \
+   --gradient_accumulation_steps=2 \
+   --num_train_epochs=1 \
+   --checkpointing_steps=2500 \
+   --validation_steps=2500 \
+   --tracker_project_name="brightness-controlnet-sdxl-3M" \
+   --report_to="wandb" \
+   --enable_xformers_memory_efficient_attention \
+   --gradient_checkpointing \
+   --use_8bit_adam \
+   --dataloader_num_workers=8 \
+   --set_grads_to_none \
+   --resume_from_checkpoint="latest"
+ ```
+
+ **Multi-GPU Optimizations:**
+ - `--train_batch_size=24` per GPU × 6 GPUs = 144 samples per step (before accumulation)
+ - `--gradient_accumulation_steps=2` - Effective batch = 144 × 2 = 288
+ - `--checkpointing_steps=2500` - Save every ~720k samples
+ - `--resume_from_checkpoint="latest"` - Auto-resume if interrupted
+
+ **Expected Performance:**
+ - Effective batch size: 288 samples/step
+ - Steps per epoch: 2,999,000 ÷ 288 ≈ 10,413 steps
+ - Training time: ~4 hours on 6× H100
+ - Checkpoints: 2500, 5000, 7500, 10000 steps + final
+
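The same arithmetic applies to the multi-GPU run: the effective batch scales with the GPU count, and the checkpoint cadence in samples is steps × effective batch (a minimal sketch, drop-last batching assumed):

```python
per_gpu_batch, num_gpus, grad_accum = 24, 6, 2
effective = per_gpu_batch * num_gpus * grad_accum    # per-GPU batch scaled by GPUs and accumulation
steps = 2_999_000 // effective                       # steps for one epoch over ~3M samples
samples_per_checkpoint = 2_500 * effective           # samples seen between checkpoint saves
print(effective, steps, samples_per_checkpoint)      # 288 10413 720000
```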
+ #### Batch Size Selection Guide
+
+ | GPU Config | VRAM | Recommended batch_size | grad_accum_steps | Effective Batch | Training Speed |
+ |------------|------|------------------------|------------------|-----------------|----------------|
+ | Single L4 | 24GB | 8 | 4 | 32 | Slow (baseline) |
+ | Single A100 | 40GB | 16 | 4 | 64 | 2× faster than L4 |
+ | Single H100 | 80GB | 32 | 4 | 128 | 6× faster than L4 |
+ | 6× H100 (Pro) | 480GB | 24/GPU | 2 | 288 | 36× faster than L4 |
+
+ **Rule of Thumb:**
+ - Larger `train_batch_size` = better GPU utilization, faster training
+ - Larger effective batch size = more stable training, better convergence
+ - H100 can handle 2-3× larger batch sizes than A100 with the same settings
+
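Every row of the table above follows one formula; a tiny helper (illustrative only, the function name is an assumption) makes it explicit:

```python
def effective_batch(per_gpu_batch: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Effective batch = per-GPU batch x number of GPUs x accumulation steps."""
    return per_gpu_batch * num_gpus * grad_accum

# The rows of the table above:
print(effective_batch(8, 4))      # 32  (Single L4)
print(effective_batch(16, 4))     # 64  (Single A100)
print(effective_batch(32, 4))     # 128 (Single H100)
print(effective_batch(24, 2, 6))  # 288 (6x H100)
```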
+ #### Memory Optimization Tips
+
+ **If you encounter OOM (Out of Memory) errors on H100:**
+
+ 1. **Reduce batch size incrementally:**
+ ```bash
+ --train_batch_size=32   # Start here
+ --train_batch_size=24   # If OOM
+ --train_batch_size=16   # If still OOM
+ ```
+
+ 2. **Enable additional memory optimizations:**
+ ```bash
+ # Already enabled in the commands above:
+ --gradient_checkpointing \
+ --use_8bit_adam \
+ --enable_xformers_memory_efficient_attention \
+ --set_grads_to_none   # Use instead of zero_grad()
+ ```
+
+ 3. **Use gradient accumulation to maintain the effective batch size:**
+ ```bash
+ # If reducing from batch_size=32 to batch_size=16
+ --train_batch_size=16
+ --gradient_accumulation_steps=8   # Double the accumulation to keep effective=128
+ ```
+
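Step 3 generalizes: whenever the per-device batch shrinks, the accumulation steps scale inversely to hold the effective batch fixed. A minimal sketch (the helper name is an assumption, not part of the training script):

```python
def accum_for(target_effective: int, new_batch: int) -> int:
    """Accumulation steps that keep the effective batch at target_effective."""
    assert target_effective % new_batch == 0, "effective batch must stay divisible"
    return target_effective // new_batch

print(accum_for(128, 16))  # 8: halving the batch doubles the accumulation
print(accum_for(128, 32))  # 4: the original single-H100 setting
```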
  ### Full 3M Dataset Training Options
 
  **For maximum quality training on the complete dataset:**