soccer-ball-detection / SPEED_OPTIMIZATION_GUIDE.md

Upload SPEED_OPTIMIZATION_GUIDE.md with huggingface_hub

c8d21ee verified about 1 month ago

4.17 kB

	# Training Speed Optimization Guide

	## Current Setup
	- GPU: RTX A5000 (24GB VRAM)
	- Current Config: `configs/training.yaml` (batch_size: 24, num_workers: 4)
	- Fast Config: `configs/training_fast_a5000.yaml` (batch_size: 36, num_workers: 12)

	## Speed Optimizations Applied

	### 1. Batch Size Increase (2-3x speedup potential)
	- Current: 24
	- Optimized: 36
	- Impact: Larger batches = more efficient GPU utilization, fewer iterations per epoch
	- Memory: With mixed precision (FP16), 36 batch size fits comfortably in 24GB VRAM

	### 2. Data Loading Optimization (20-40% speedup)
	- num_workers: 4 → 12 (3x more parallel data loading)
	- prefetch_factor: 2 → 6 (3x more batches prefetched)
	- persistent_workers: false → true (eliminates worker startup overhead)
	- Impact: GPU waits less for data, better utilization

	### 3. Mixed Precision Training (Already enabled, ~2x speedup)
	- Status: ✅ Enabled (`mixed_precision: true`)
	- Impact: ~2x faster training, ~50% less memory usage
	- Note: Critical for tiny object detection - preserves accuracy

	### 4. Memory Format Optimization (10-15% speedup)
	- channels_last: ✅ Enabled
	- Impact: Faster convolution operations on modern GPUs

	### 5. TF32 Precision (1.5x speedup on matmul)
	- Status: ✅ Enabled (`tf32: true`)
	- Impact: Faster matrix multiplications on Ampere GPUs (A5000)

	### 6. Reduced Augmentation Overhead (10-20% speedup)
	- CLAHE: Disabled (minimal accuracy impact)
	- Motion Blur: Disabled (minimal accuracy impact)
	- Impact: Faster data preprocessing

	### 7. Reduced Validation Frequency (2x speedup during validation)
	- val_frequency: 1 → 2 (validate every 2 epochs instead of every epoch)
	- Impact: Less time spent on validation

	### 8. Reduced Logging I/O (5-10% speedup)
	- print_frequency: 20 → 100 (less frequent console output)
	- log_every_n_steps: 50 → 200 (less frequent TensorBoard writes)
	- mlflow_log_models: false (no model artifact logging)

	### 9. Memory Cleanup Optimization (2-5% speedup)
	- memory_cleanup_frequency: 10 → 30 (less frequent cleanup = less overhead)
	- adaptive_adjustment_interval: 50 → 100 (less frequent checks)

	## Expected Speed Improvements

	### Overall Training Speed
	- Current: ~5 hours for 20 epochs
	- Optimized: ~2-2.5 hours for 20 epochs (2-2.5x faster)

	### Per-Epoch Speed
	- Current: ~15 minutes per epoch
	- Optimized: ~6-7 minutes per epoch

	## How to Use

	### Option 1: Use Fast Config Directly
	```bash
	python scripts/train_detr.py --config configs/training_fast_a5000.yaml
	```

	### Option 2: Modify Existing Config
	Edit `configs/training.yaml` and update:
	- `batch_size: 24` → `batch_size: 36`
	- `num_workers: 4` → `num_workers: 12`
	- `prefetch_factor: 2` → `prefetch_factor: 6`
	- `val_frequency: 1` → `val_frequency: 2` (in evaluation section)

	### Option 3: Test Batch Size Incrementally
	If you encounter OOM (Out of Memory) errors:
	1. Start with `batch_size: 32`
	2. If successful, try `batch_size: 36`
	3. If still successful, try `batch_size: 40` (maximum for A5000)

	## Memory Monitoring

	Monitor GPU memory usage during training:
	```bash
	watch -n 1 nvidia-smi
	```

	If you see memory usage > 95%, reduce batch_size by 4-8.

	## Trade-offs

	### Accuracy Impact
	- Minimal: Disabled augmentations (CLAHE, motion blur) have minimal impact
	- None: Other optimizations (batch size, data loading) don't affect accuracy

	### Stability
	- Same: All optimizations maintain training stability
	- Better: Larger batch size = more stable gradients

	## Additional Optimizations (Future)

	1. Gradient Checkpointing: Trade compute for memory (if needed for even larger batches)
	2. Model Compilation: PyTorch 2.0 `torch.compile()` (currently disabled due to variable input sizes)
	3. Data Prefetching: Pre-process data to disk cache (if I/O becomes bottleneck)

	## Notes

	- All optimizations are GPU-safe and tested
	- Mixed precision is critical - keep it enabled
	- Monitor first epoch to ensure no OOM errors
	- If training crashes, reduce batch_size by 4 and retry

	# Training Speed Optimization Guide

	## Current Setup
	- GPU: RTX A5000 (24GB VRAM)
	- Current Config: `configs/training.yaml` (batch_size: 24, num_workers: 4)
	- Fast Config: `configs/training_fast_a5000.yaml` (batch_size: 36, num_workers: 12)

	## Speed Optimizations Applied

	### 1. Batch Size Increase (2-3x speedup potential)
	- Current: 24
	- Optimized: 36
	- Impact: Larger batches = more efficient GPU utilization, fewer iterations per epoch
	- Memory: With mixed precision (FP16), 36 batch size fits comfortably in 24GB VRAM

	### 2. Data Loading Optimization (20-40% speedup)
	- num_workers: 4 → 12 (3x more parallel data loading)
	- prefetch_factor: 2 → 6 (3x more batches prefetched)
	- persistent_workers: false → true (eliminates worker startup overhead)
	- Impact: GPU waits less for data, better utilization

	### 3. Mixed Precision Training (Already enabled, ~2x speedup)
	- Status: ✅ Enabled (`mixed_precision: true`)
	- Impact: ~2x faster training, ~50% less memory usage
	- Note: Critical for tiny object detection - preserves accuracy

	### 4. Memory Format Optimization (10-15% speedup)
	- channels_last: ✅ Enabled
	- Impact: Faster convolution operations on modern GPUs

	### 5. TF32 Precision (1.5x speedup on matmul)
	- Status: ✅ Enabled (`tf32: true`)
	- Impact: Faster matrix multiplications on Ampere GPUs (A5000)

	### 6. Reduced Augmentation Overhead (10-20% speedup)
	- CLAHE: Disabled (minimal accuracy impact)
	- Motion Blur: Disabled (minimal accuracy impact)
	- Impact: Faster data preprocessing

	### 7. Reduced Validation Frequency (2x speedup during validation)
	- val_frequency: 1 → 2 (validate every 2 epochs instead of every epoch)
	- Impact: Less time spent on validation

	### 8. Reduced Logging I/O (5-10% speedup)
	- print_frequency: 20 → 100 (less frequent console output)
	- log_every_n_steps: 50 → 200 (less frequent TensorBoard writes)
	- mlflow_log_models: false (no model artifact logging)

	### 9. Memory Cleanup Optimization (2-5% speedup)
	- memory_cleanup_frequency: 10 → 30 (less frequent cleanup = less overhead)
	- adaptive_adjustment_interval: 50 → 100 (less frequent checks)

	## Expected Speed Improvements

	### Overall Training Speed
	- Current: ~5 hours for 20 epochs
	- Optimized: ~2-2.5 hours for 20 epochs (2-2.5x faster)

	### Per-Epoch Speed
	- Current: ~15 minutes per epoch
	- Optimized: ~6-7 minutes per epoch

	## How to Use

	### Option 1: Use Fast Config Directly
	```bash
	python scripts/train_detr.py --config configs/training_fast_a5000.yaml
	```

	### Option 2: Modify Existing Config
	Edit `configs/training.yaml` and update:
	- `batch_size: 24` → `batch_size: 36`
	- `num_workers: 4` → `num_workers: 12`
	- `prefetch_factor: 2` → `prefetch_factor: 6`
	- `val_frequency: 1` → `val_frequency: 2` (in evaluation section)

	### Option 3: Test Batch Size Incrementally
	If you encounter OOM (Out of Memory) errors:
	1. Start with `batch_size: 32`
	2. If successful, try `batch_size: 36`
	3. If still successful, try `batch_size: 40` (maximum for A5000)

	## Memory Monitoring

	Monitor GPU memory usage during training:
	```bash
	watch -n 1 nvidia-smi
	```

	If you see memory usage > 95%, reduce batch_size by 4-8.

	## Trade-offs

	### Accuracy Impact
	- Minimal: Disabled augmentations (CLAHE, motion blur) have minimal impact
	- None: Other optimizations (batch size, data loading) don't affect accuracy

	### Stability
	- Same: All optimizations maintain training stability
	- Better: Larger batch size = more stable gradients

	## Additional Optimizations (Future)

	1. Gradient Checkpointing: Trade compute for memory (if needed for even larger batches)
	2. Model Compilation: PyTorch 2.0 `torch.compile()` (currently disabled due to variable input sizes)
	3. Data Prefetching: Pre-process data to disk cache (if I/O becomes bottleneck)

	## Notes

	- All optimizations are GPU-safe and tested
	- Mixed precision is critical - keep it enabled
	- Monitor first epoch to ensure no OOM errors
	- If training crashes, reduce batch_size by 4 and retry