Adaptive Resource Optimization
Overview
The adaptive optimizer monitors real-time resource usage (GPU utilization, RAM usage, data loading vs processing times) and provides recommendations for optimal training parameters.
How It Works
Monitoring
- GPU Utilization: Tracks GPU usage percentage
- RAM Usage: Monitors system RAM consumption
- Timing Metrics: Tracks data loading time vs GPU processing time
- Bottleneck Detection: Identifies if data loading or GPU processing is the bottleneck
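The timing comparison above can be sketched in plain Python. This is a minimal illustration, not the optimizer's actual code: `timed_batches` and `classify_bottleneck` are hypothetical names, and the loader can be any iterable.

```python
import time
from typing import Any, Iterable, Iterator, Tuple


def timed_batches(loader: Iterable[Any]) -> Iterator[Tuple[Any, float]]:
    """Yield (batch, seconds spent waiting for that batch)."""
    iterator = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def classify_bottleneck(data_time: float, gpu_time: float) -> str:
    """Name the stage that took longer for one step (hypothetical labels)."""
    return "data_loading" if data_time > gpu_time else "gpu_processing"


if __name__ == "__main__":
    for batch, wait in timed_batches([[1, 2], [3, 4]]):
        print(f"batch={batch} waited {wait:.6f}s")
    print(classify_bottleneck(data_time=0.234, gpu_time=0.189))
```

CUDA kernels run asynchronously, so a real GPU processing time would normally need a `torch.cuda.synchronize()` call before the clock is read; the sketch leaves that out to stay dependency-free.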
Adjustment Strategies
Low GPU Utilization + Available RAM
- Increases num_workers (up to max)
- Increases prefetch_factor (up to max)
- Goal: Keep GPU busy by prefetching more data
Data Loading Bottleneck
- If data loading time > GPU processing time
- Increases prefetch_factor to reduce GPU idle time
High RAM Usage
- Reduces num_workers and prefetch_factor
- Prevents RAM overflow
High GPU Utilization (>95%)
- May indicate data-limited scenario
- Increases prefetch to ensure GPU stays fed
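The four strategies above could be combined into a single rule function along these lines. This is a sketch under stated assumptions: the step size of 1, the limits `MAX_WORKERS` and `MAX_PREFETCH`, the floor of 1, and the order in which the rules are checked (RAM first) are all hypothetical. Only the 85% target, the 80% RAM threshold, and the 95% cutoff come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoaderSettings:
    num_workers: int
    prefetch_factor: int


# Hypothetical upper limits; the real "max" values are not given here.
MAX_WORKERS = 12
MAX_PREFETCH = 8
TARGET_GPU_UTIL = 0.85   # target_gpu_utilization
MAX_RAM_USAGE = 0.80     # max_ram_usage
HIGH_GPU_UTIL = 0.95     # "High GPU Utilization (>95%)"


def recommend(current: LoaderSettings, gpu_util: float, ram_usage: float,
              data_time: float, gpu_time: float) -> LoaderSettings:
    """Return recommended settings; utilization values are fractions in [0, 1]."""
    workers, prefetch = current.num_workers, current.prefetch_factor
    if ram_usage > MAX_RAM_USAGE:
        # High RAM usage: back off (assumed to take priority over the rest).
        workers = max(1, workers - 1)
        prefetch = max(1, prefetch - 1)
    elif gpu_util < TARGET_GPU_UTIL:
        # Low GPU utilization with RAM available: feed the GPU more data.
        workers = min(MAX_WORKERS, workers + 1)
        prefetch = min(MAX_PREFETCH, prefetch + 1)
    elif data_time > gpu_time or gpu_util > HIGH_GPU_UTIL:
        # Data loading bottleneck, or very high utilization: prefetch more.
        prefetch = min(MAX_PREFETCH, prefetch + 1)
    return LoaderSettings(workers, prefetch)


if __name__ == "__main__":
    print(recommend(LoaderSettings(6, 2), gpu_util=0.723, ram_usage=0.452,
                    data_time=0.234, gpu_time=0.189))
```

Checking RAM first is a design choice in this sketch: it keeps the function from recommending more workers when memory is already over the threshold.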
Limitations
Important: num_workers and prefetch_factor are fixed when a PyTorch DataLoader is constructed, and the DataLoader is not recreated during training. The adaptive optimizer:
- ✅ Monitors and logs recommendations
- ✅ Provides real-time metrics
- ✅ Logs optimal settings to MLflow
- ❌ Cannot change DataLoader settings mid-training
Recommendations
The optimizer logs recommended settings. For the next training run, use these optimized values in configs/training.yaml:
dataset:
  num_workers: <recommended_value>  # From adaptive optimizer
  prefetch_factor: <recommended_value>  # From adaptive optimizer
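As an illustration of how the dataset section might reach the DataLoader on the next run, the sketch below turns it into keyword arguments. The helper name `loader_kwargs` is hypothetical, and the project's real loading code may differ. PyTorch uses prefetch_factor only with worker processes (num_workers > 0), so the sketch omits it otherwise.

```python
from typing import Any, Dict, Mapping


def loader_kwargs(dataset_cfg: Mapping[str, Any]) -> Dict[str, Any]:
    """Build DataLoader keyword arguments from the `dataset` config section."""
    num_workers = int(dataset_cfg.get("num_workers", 0))
    kwargs: Dict[str, Any] = {"num_workers": num_workers}
    # prefetch_factor is meaningful only when worker processes are used.
    if num_workers > 0 and "prefetch_factor" in dataset_cfg:
        kwargs["prefetch_factor"] = int(dataset_cfg["prefetch_factor"])
    return kwargs


if __name__ == "__main__":
    # Example values taken from the workflow in this document.
    print(loader_kwargs({"num_workers": 10, "prefetch_factor": 4}))
```

The result would then be passed along the lines of `DataLoader(dataset, batch_size=..., **loader_kwargs(cfg["dataset"]))`, where `cfg` is the parsed configs/training.yaml.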
Configuration
Enable in configs/training.yaml:
training:
  adaptive_optimization: true  # Enable adaptive monitoring
  target_gpu_utilization: 0.85  # Target GPU utilization (85%)
  max_ram_usage: 0.80  # Maximum RAM usage threshold (80%)
  adaptive_adjustment_interval: 50  # Check every N batches
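A small sketch of how these four keys might be validated and used to decide when to run a check. The class `AdaptiveConfig`, its default values, and the choice to skip batch 0 are assumptions made for illustration; only the key names and example values come from the configuration above.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class AdaptiveConfig:
    # Defaults mirror the example values above; the real defaults are unknown.
    adaptive_optimization: bool = False
    target_gpu_utilization: float = 0.85
    max_ram_usage: float = 0.80
    adaptive_adjustment_interval: int = 50

    @classmethod
    def from_training_section(cls, training: Mapping[str, Any]) -> "AdaptiveConfig":
        """Pick the adaptive keys out of the `training` section and validate them."""
        known = {f.name for f in fields(cls)}
        cfg = cls(**{k: v for k, v in training.items() if k in known})
        for name in ("target_gpu_utilization", "max_ram_usage"):
            value = getattr(cfg, name)
            if not 0.0 < value <= 1.0:
                raise ValueError(f"{name} must be a fraction in (0, 1], got {value}")
        if cfg.adaptive_adjustment_interval < 1:
            raise ValueError("adaptive_adjustment_interval must be >= 1")
        return cfg

    def should_check(self, batch_idx: int) -> bool:
        """True every N batches when adaptive monitoring is enabled."""
        return (self.adaptive_optimization
                and batch_idx > 0
                and batch_idx % self.adaptive_adjustment_interval == 0)


if __name__ == "__main__":
    cfg = AdaptiveConfig.from_training_section(
        {"adaptive_optimization": True, "adaptive_adjustment_interval": 50})
    print([b for b in range(201) if cfg.should_check(b)])
```

Rejecting values outside (0, 1] catches the common mistake of writing 80 instead of 0.80 for a threshold.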
Output
During training, you'll see:
🔧 Adaptive Optimization Adjustment (batch 150):
num_workers: 9
prefetch_factor: 4
GPU util: 72.3%, RAM: 45.2%
Data load: 0.234s, GPU process: 0.189s
At end of each epoch:
📊 Adaptive Optimization Stats:
Adjustments made: 3
Current workers: 9, prefetch: 4
Avg GPU util: 78.5%, Avg RAM: 48.3%
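The end-of-epoch summary could be produced by a small accumulator like the one below. This is an illustrative sketch: `EpochStats` and its methods are hypothetical names, the sample values are made up, and the output format only approximates the log shown above.

```python
from dataclasses import dataclass, field
from statistics import fmean
from typing import List


@dataclass
class EpochStats:
    gpu_utils: List[float] = field(default_factory=list)   # percentages
    ram_usages: List[float] = field(default_factory=list)  # percentages
    adjustments: int = 0

    def record(self, gpu_util: float, ram_usage: float, adjusted: bool = False) -> None:
        """Store one monitoring sample; count it if it changed the recommendation."""
        self.gpu_utils.append(gpu_util)
        self.ram_usages.append(ram_usage)
        if adjusted:
            self.adjustments += 1

    def summary(self, workers: int, prefetch: int) -> str:
        """Format the per-epoch report; averages are 0.0 when nothing was recorded."""
        avg_gpu = fmean(self.gpu_utils) if self.gpu_utils else 0.0
        avg_ram = fmean(self.ram_usages) if self.ram_usages else 0.0
        return (
            "Adaptive Optimization Stats:\n"
            f"Adjustments made: {self.adjustments}\n"
            f"Current workers: {workers}, prefetch: {prefetch}\n"
            f"Avg GPU util: {avg_gpu:.1f}%, Avg RAM: {avg_ram:.1f}%"
        )


if __name__ == "__main__":
    stats = EpochStats()
    stats.record(72.3, 45.2, adjusted=True)
    stats.record(84.7, 51.4)
    print(stats.summary(workers=9, prefetch=4))
```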
Benefits
- Real-time Monitoring: See actual resource usage during training
- Bottleneck Identification: Know if data loading or GPU is the bottleneck
- Optimization Recommendations: Get optimal settings for next run
- MLflow Integration: All metrics logged for analysis
Best Practices
- First Run: Let adaptive optimizer monitor and recommend
- Second Run: Use recommended settings from first run
- Iterate: Continue optimizing based on recommendations
- Monitor: Check MLflow for adaptive optimization metrics
Example Workflow
- Start training with conservative settings (6 workers, prefetch 2)
- Adaptive optimizer monitors and recommends (e.g., 10 workers, prefetch 4)
- Check recommendations in logs or MLflow
- Update configs/training.yaml with recommended values
- Restart training with optimized settings
- Repeat until optimal balance is found