Adaptive Resource Optimization

Overview

The adaptive optimizer monitors real-time resource usage (GPU utilization, RAM usage, data loading vs processing times) and provides recommendations for optimal training parameters.

How It Works

Monitoring

GPU Utilization: Tracks GPU usage percentage
RAM Usage: Monitors system RAM consumption
Timing Metrics: Tracks data loading time vs GPU processing time
Bottleneck Detection: Identifies if data loading or GPU processing is the bottleneck

Adjustment Strategies

Low GPU Utilization + Available RAM
- Increases num_workers (up to max)
- Increases prefetch_factor (up to max)
- Goal: Keep GPU busy by prefetching more data
Data Loading Bottleneck
- If data loading time > GPU processing time
- Increases prefetch_factor to reduce GPU idle time
High RAM Usage
- Reduces num_workers and prefetch_factor
- Prevents RAM overflow
High GPU Utilization (>95%)
- May indicate data-limited scenario
- Increases prefetch to ensure GPU stays fed

Limitations

Important: PyTorch DataLoaders cannot be recreated during training. The adaptive optimizer:

✅ Monitors and logs recommendations
✅ Provides real-time metrics
✅ Logs optimal settings to MLflow
❌ Cannot change DataLoader settings mid-training

Recommendations

The optimizer logs recommended settings. For the next training run, use these optimized values in configs/training.yaml:

dataset:
  num_workers: <recommended_value>  # From adaptive optimizer
  prefetch_factor: <recommended_value>  # From adaptive optimizer

Configuration

Enable in configs/training.yaml:

training:
  adaptive_optimization: true  # Enable adaptive monitoring
  target_gpu_utilization: 0.85  # Target GPU utilization (85%)
  max_ram_usage: 0.80  # Maximum RAM usage threshold (80%)
  adaptive_adjustment_interval: 50  # Check every N batches

Output

During training, you'll see:

🔧 Adaptive Optimization Adjustment (batch 150):
   num_workers: 9
   prefetch_factor: 4
   GPU util: 72.3%, RAM: 45.2%
   Data load: 0.234s, GPU process: 0.189s

At end of each epoch:

📊 Adaptive Optimization Stats:
   Adjustments made: 3
   Current workers: 9, prefetch: 4
   Avg GPU util: 78.5%, Avg RAM: 48.3%

Benefits

Real-time Monitoring: See actual resource usage during training
Bottleneck Identification: Know if data loading or GPU is the bottleneck
Optimization Recommendations: Get optimal settings for next run
MLflow Integration: All metrics logged for analysis

Best Practices

First Run: Let adaptive optimizer monitor and recommend
Second Run: Use recommended settings from first run
Iterate: Continue optimizing based on recommendations
Monitor: Check MLflow for adaptive optimization metrics

Example Workflow

Start training with conservative settings (6 workers, prefetch 2)
Adaptive optimizer monitors and recommends (e.g., 10 workers, prefetch 4)
Check recommendations in logs or MLflow
Update configs/training.yaml with recommended values
Restart training with optimized settings
Repeat until optimal balance is found