# GPU/CPU Optimal Placement Guide

## Overview

GPUs are expensive, and not all operations in this pipeline can be GPU-accelerated. This guide shows how to place work optimally across GPU and CPU to maximize utilization and minimize cost.

## Component Analysis

### GPU-Accelerated Operations

| Component | GPU Time | CPU Time | Notes |
|-----------|----------|----------|-------|
| **DA3 Inference** | 10-30 sec | N/A | ✅ Must be GPU (PyTorch model) |
| **Feature Extraction (SuperPoint)** | 1-2 min | N/A | ✅ Can be GPU (via hloc) |
| **Feature Matching (LightGlue)** | 2-5 min | N/A | ✅ Can be GPU (via hloc) |
| **Training** | Hours | N/A | ✅ Must be GPU (PyTorch) |

### CPU-Only Operations

| Component | GPU Time | CPU Time | Notes |
|-----------|----------|----------|-------|
| **COLMAP BA** | N/A | 2-8 min | ❌ CPU-only (no GPU support) |
| **Early Filtering** | N/A | <1 sec | ✅ CPU (negligible) |
| **Data Loading** | N/A | <1 sec | ✅ CPU (I/O bound) |
| **Cache Operations** | N/A | <1 sec | ✅ CPU (I/O bound) |

## Current Pipeline Flow

### Sequential (Inefficient):

```
Sequence 1:
  GPU: DA3 inference (30s) → Feature extraction (2min) → Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Sequence 2:
  GPU: DA3 inference (30s) → Feature extraction (2min) → Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Total: 24 min (GPU idle during BA, CPU idle during GPU ops)
```

### Optimized Pipeline:

```
Parallel Execution:
  GPU: DA3 inference (Sequence 1) → Feature extraction → Matching
  CPU: BA (Sequence 2) [from cache or previous run]

  GPU: DA3 inference (Sequence 2) → Feature extraction → Matching
  CPU: BA (Sequence 3) [from cache or previous run]

  GPU: Training (batched)
  CPU: BA (other sequences in parallel)
```

## Optimal Placement Strategy

### Strategy 1: Separate GPU and CPU Workflows (Recommended)

**Phase 1: Dataset Building (GPU + CPU in parallel)**

```
GPU Pipeline (one sequence at a time):
1. DA3 inference (30s)
2. Feature extraction (2min)
3. Feature matching (5min)
Total: ~7-8 min per sequence

CPU Pipeline (parallel workers):
1. BA validation (5-8 min per sequence)
2. Can run 4-8 BA jobs in parallel (CPU cores)

Key: GPU and CPU work on different sequences simultaneously
```

**Phase 2: Training (GPU only)**

```
GPU: Training (hours)
CPU: Idle (or can pre-process the next batch)
```

### Strategy 2: Pre-Compute BA on CPU Cluster

**Use cheap CPU instances for BA:**

```
1. Run BA on a CPU cluster (spot instances, cheaper)
   - 100 sequences × 5 min ≈ 8 hours
   - Cost: ~$10-20 (vs $100+ on a GPU instance)

2. Run DA3 + Training on a GPU instance
   - DA3 inference: 50 min
   - Training: 20-40 hours
   - Cost: GPU instance time

Total cost savings: 80-90% for the BA phase
```

### Strategy 3: Hybrid Approach (Best for Development)

**Single GPU instance with smart scheduling:**

```python
from concurrent.futures import ThreadPoolExecutor

# Pseudo-code for optimal scheduling
def process_sequences_optimally(sequences):
    gpu_queue = []
    cpu_queue = []

    for seq in sequences:
        # Check cache first
        if ba_cached(seq):
            # Only DA3 inference (GPU) is needed
            gpu_queue.append(seq)
        else:
            # Full pipeline: schedule GPU work ...
            gpu_queue.append(seq)
            # ... and BA on CPU (can run in parallel)
            cpu_queue.append(seq)

    # Process GPU queue (one sequence at a time, holds the GPU)
    for seq in gpu_queue:
        da3_inference(seq)
        extract_features(seq)
        match_features(seq)

    # Process CPU queue (parallel workers)
    with ThreadPoolExecutor(max_workers=8) as executor:
        for seq in cpu_queue:
            executor.submit(run_ba, seq)  # CPU-only
```
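The scheduling idea above can be exercised end-to-end with stub workloads. This is a minimal sketch, not the project's actual API: `gpu_pipeline` and `run_ba` are hypothetical stand-ins, and the key point is that BA futures are submitted *before* the GPU loop starts, so CPU work overlaps GPU work.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub workloads standing in for the real pipeline stages
# (gpu_pipeline, run_ba are hypothetical names, not the project's API).
def gpu_pipeline(seq):
    return f"{seq}:features"

def run_ba(seq):
    return f"{seq}:ba"

def process_sequences_optimally(sequences, cached, cpu_workers=8):
    """Submit CPU-only BA jobs to a thread pool, then run GPU work
    sequentially on the main thread while BA proceeds in the background."""
    results = {}
    with ThreadPoolExecutor(max_workers=cpu_workers) as executor:
        # CPU work starts immediately for every uncached sequence ...
        ba_futures = {seq: executor.submit(run_ba, seq)
                      for seq in sequences if seq not in cached}
        # ... while the main thread keeps the GPU busy, one sequence at a time.
        for seq in sequences:
            results[seq] = {"gpu": gpu_pipeline(seq)}
        # Merge BA results; cached sequences skip BA entirely.
        for seq, fut in ba_futures.items():
            results[seq]["ba"] = fut.result()
    return results

out = process_sequences_optimally(["s1", "s2", "s3"], cached={"s2"})
```

Because COLMAP BA is an external CPU-bound process, threads (rather than processes) are enough here; the GIL is released while the subprocess runs.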
## Implementation Recommendations

### 1. Separate GPU and CPU Workers

**The current implementation uses a ThreadPoolExecutor over sequences, but does not separate GPU and CPU work.**

**Recommended changes:**

```python
# In pretrain.py
class ARKitPretrainPipeline:
    def __init__(self, ..., gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def process_arkit_sequence(self, ...):
        # GPU work
        with torch.cuda.device(self.gpu_device):
            da3_output = self.model.inference(images)
            features = extract_features_gpu(images)
            matches = match_features_gpu(features)

        # CPU work (can run in parallel with other sequences)
        ba_result = self._run_ba_cpu(images, features, matches)

        return sample
```

### 2. Pipeline Separation

**Separate dataset building into GPU and CPU phases:**

```python
from concurrent.futures import ThreadPoolExecutor

# Phase 1: GPU work (sequential, one sequence at a time)
def build_dataset_gpu_phase(sequences):
    results = []
    for seq in sequences:
        if ba_cached(seq):
            # Only GPU inference is needed
            da3_output = model.inference(seq.images)
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'needs_ba': False,
            })
        else:
            # Full GPU pipeline
            features = extract_features(seq.images)
            matches = match_features(features)
            results.append({
                'seq': seq,
                'features': features,
                'matches': matches,
                'needs_ba': True,
            })
    return results

# Phase 2: CPU work (parallel, many sequences at once)
def build_dataset_cpu_phase(gpu_results):
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [
            executor.submit(run_ba_cpu, r['seq'], r['features'], r['matches'])
            for r in gpu_results
            if r['needs_ba']
        ]
        # Collect results
        ba_results = [f.result() for f in futures]
    return ba_results
```
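Phase 2 can also merge each BA result back into the record the GPU phase produced, rather than collecting them into a separate list. This sketch uses a hypothetical `run_ba_cpu` stub and keys each future by its record so results land in the right place as they finish:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for the CPU-only COLMAP BA step.
def run_ba_cpu(seq, features, matches):
    return {"poses": f"{seq}-refined"}

def build_dataset_cpu_phase(gpu_results, workers=8):
    """Phase 2 sketch: run BA for every record that needs it and merge each
    result back into the record produced by the GPU phase."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Key each future by its record so results can be merged back.
        futures = {
            executor.submit(run_ba_cpu, r["seq"], r["features"], r["matches"]): r
            for r in gpu_results
            if r["needs_ba"]
        }
        for fut in as_completed(futures):
            futures[fut]["ba"] = fut.result()  # merge as each BA job finishes
    return gpu_results

records = [
    {"seq": "a", "features": [], "matches": [], "needs_ba": True},
    {"seq": "b", "da3_output": None, "needs_ba": False},
]
merged = build_dataset_cpu_phase(records)
```

Keying futures by record avoids a second pass to match BA outputs to sequences, which matters once jobs finish out of order.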
### 3. Resource-Aware Scheduling

**Schedule based on resource availability:**

```python
class ResourceAwareScheduler:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_queue = Queue()
        self.cpu_queue = Queue()
        self.gpu_busy = False
        self.cpu_slots = cpu_workers

    def schedule(self, sequence):
        if ba_cached(sequence):
            # Only GPU work
            self.gpu_queue.put(('inference_only', sequence))
        else:
            # GPU work first ...
            self.gpu_queue.put(('full_pipeline', sequence))
            # ... then CPU work
            self.cpu_queue.put(('ba', sequence))

    def process(self):
        # GPU worker (sequential)
        while not self.gpu_queue.empty():
            task_type, seq = self.gpu_queue.get()
            if task_type == 'inference_only':
                da3_inference(seq)
            else:
                features = extract_features(seq)
                matches = match_features(features)
                self.cpu_queue.put(('ba_with_data', seq, features, matches))

        # CPU workers (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_slots) as executor:
            while not self.cpu_queue.empty():
                task = self.cpu_queue.get()
                if task[0] == 'ba':
                    executor.submit(run_ba, task[1])
                else:
                    executor.submit(run_ba, task[1], task[2], task[3])
```

## Cost Optimization Strategies

### 1. Use Spot Instances for BA

**BA is CPU-only and can run on cheap spot instances:**

```bash
# Run BA on a spot instance (~10x cheaper than an on-demand GPU instance);
# the max spot price (e.g. $0.10/hr) is set inside spot-config.json
aws ec2 run-instances \
    --instance-type c5.4xlarge \
    --instance-market-options file://spot-config.json

# Cost: ~$0.10/hour vs $1.00/hour for a GPU instance
# 100 sequences × 5 min ≈ 8 hours of BA = $0.80 vs $8.00
```

### 2. Pre-Compute BA Offline

**Run BA on a local machine or cheap CPU cluster:**

```bash
# On a local machine or CPU cluster: just build the dataset (no training)
ylff train pretrain data/arkit_sequences \
    --epochs 0 \
    --num-workers 8 \
    --cache-dir data/pretrain_cache

# Then train on the (expensive) GPU instance; cached BA results are reused
ylff train pretrain data/arkit_sequences \
    --epochs 20 \
    --cache-dir data/pretrain_cache
```
### 3. Hybrid Cloud Strategy

**Use different instance types for different phases:**

```
Phase 1: Dataset Building
- GPU instance (1x): DA3 inference, feature extraction/matching
- CPU instances (8x spot): BA validation
- Cost: GPU ($2/hr) + CPU ($0.80/hr) = $2.80/hr
- Time: 2-3 hours
- Total: ~$8-10

Phase 2: Training
- GPU instance (1x): Training only
- Cost: $2/hr × 20 hours = $40

Total: ~$50 (vs $100+ if everything runs on the GPU instance)
```

## Recommended Implementation

### For Development (Single Machine):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Optimal single-machine setup
class OptimalPretrainPipeline:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def build_dataset(self, sequences):
        # Separate GPU and CPU work
        gpu_results = []
        cpu_tasks = []

        # Phase 1: GPU work (sequential)
        for seq in sequences:
            if self._is_ba_cached(seq):
                # Only inference needed
                result = self._run_da3_inference(seq)
                gpu_results.append(result)
            else:
                # Full GPU pipeline
                result = self._run_gpu_pipeline(seq)
                gpu_results.append(result)
                cpu_tasks.append(result)  # Needs BA

        # Phase 2: CPU work (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_workers) as executor:
            ba_futures = {
                executor.submit(self._run_ba_cpu, task): task
                for task in cpu_tasks
            }

            # Collect BA results as they finish
            for future in as_completed(ba_futures):
                ba_result = future.result()
                # Merge with the corresponding GPU result
                self._merge_results(ba_futures[future], ba_result)

        return gpu_results
```

### For Production (Distributed):

```python
# Distributed setup
class DistributedPretrainPipeline:
    def __init__(self, gpu_nodes=1, cpu_nodes=8):
        self.gpu_nodes = gpu_nodes
        self.cpu_nodes = cpu_nodes

    def build_dataset(self, sequences):
        # GPU nodes: DA3 inference, feature extraction, matching
        gpu_tasks = self._distribute_gpu_work(sequences)

        # CPU nodes: BA validation (parallel)
        cpu_tasks = self._distribute_cpu_work(sequences)

        # Collect and merge
        return self._collect_results(gpu_tasks, cpu_tasks)
```
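The hybrid-cloud cost estimate can be reproduced with a few lines. The rates and node counts below are the assumptions from the breakdown above ($2/hr GPU, $0.10/hr per spot CPU node, 8 CPU nodes), not measured prices:

```python
# Rates and node counts are assumptions taken from the hybrid-cloud breakdown.
GPU_RATE = 2.00        # $/hr, GPU instance
CPU_SPOT_RATE = 0.10   # $/hr, per spot CPU instance
CPU_NODES = 8

def hybrid_cost(build_hours, train_hours):
    """Phase 1 runs the GPU plus 8 spot CPU nodes; phase 2 runs the GPU alone."""
    phase1 = build_hours * (GPU_RATE + CPU_NODES * CPU_SPOT_RATE)  # $2.80/hr combined
    phase2 = train_hours * GPU_RATE
    return phase1, phase2

p1, p2 = hybrid_cost(build_hours=3, train_hours=20)
# p1 ≈ $8.40 (matches the "~$8-10" estimate), p2 = $40, total ≈ $48
```

Spot prices fluctuate, so treat these numbers as an order-of-magnitude check rather than a quote.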
## Performance Comparison

### Current (Sequential):

```
100 sequences:
  GPU: 7 min × 100 = 700 min (11.7 hours)
  CPU: 5 min × 100 = 500 min (8.3 hours)
  Total: 20 hours (sequential)

  GPU utilization: ~58% (idle during BA)
  CPU utilization: ~42% (idle during GPU ops)
```

### Optimized (Parallel):

```
100 sequences:
  GPU: 7 min × 100 = 700 min (11.7 hours) [sequential]
  CPU: 5 min × 100 / 8 workers = 62.5 min (~1 hour) [parallel]
  Total: ~12.7 hours (CPU BA overlapped with GPU work)

  GPU utilization: ~90%
  CPU: 8 workers busy for only ~1 hour total
  Speedup: ~1.6x
```

### With Caching:

```
100 sequences (50 cached):
  GPU: 7 min × 50 = 350 min (5.8 hours)
  CPU: 5 min × 50 / 8 workers ≈ 31 min [parallel]
  Total: ~6 hours

  Speedup: ~3.3x vs sequential
```

## Recommendations

### Immediate Actions:
1. ✅ **Separate GPU and CPU work** in the pipeline
2. ✅ **Use parallel CPU workers** for BA (already done)
3. ✅ **Pre-compute BA** on cheap CPU instances
4. ✅ **Cache aggressively** (already done)

### Future Optimizations:
1. **Batch GPU operations**: Process multiple sequences on the GPU simultaneously
2. **Pipeline overlap**: Start CPU BA while the GPU processes the next sequence
3. **Distributed BA**: Run BA on multiple CPU nodes
4. **GPU feature extraction**: Ensure SuperPoint/LightGlue actually run on the GPU

### Cost Savings:
- **Current**: Everything on a GPU instance = $100-200 for 100 sequences
- **Optimized**: GPU + CPU spot = $20-40 for 100 sequences
- **Savings**: 80-90% reduction in compute costs
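The wall-clock figures in the comparison above follow a simple additive model: GPU work runs sequentially, BA is spread over a CPU pool, and the CPU tail is added to the GPU time. A small sketch of that model (the per-sequence times are the guide's estimates, not measurements):

```python
def wall_clock_min(n_seq, gpu_min=7, cpu_min=5, cpu_workers=8, cached=0):
    """Additive model from the comparison above: GPU work is sequential,
    BA runs on a CPU pool, and only the pooled BA tail adds to wall clock.
    Cached sequences skip both GPU re-work and BA."""
    todo = n_seq - cached
    gpu_total = gpu_min * todo
    cpu_total = cpu_min * todo / cpu_workers
    return gpu_total + cpu_total

sequential = (7 + 5) * 100                    # 1200 min = 20 hours
parallel = wall_clock_min(100)                # 762.5 min ≈ 12.7 hours
with_cache = wall_clock_min(100, cached=50)   # 381.25 min ≈ 6.4 hours
```

In practice most BA jobs finish while the GPU is still busy, so this model is a conservative upper bound; the true wall clock approaches the GPU time alone.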