# GPU/CPU Optimal Placement Guide

## Overview

GPUs are expensive, and not all operations can be GPU-accelerated. This guide shows how to place work across GPU and CPU to maximize utilization and minimize cost.

## Component Analysis

### GPU-Accelerated Operations

| Component | GPU Time | CPU Time | Notes |
|-----------|----------|----------|-------|
| **DA3 Inference** | 10-30 sec | N/A | ✅ Must be GPU (PyTorch model) |
| **Feature Extraction (SuperPoint)** | 1-2 min | N/A | ✅ Can be GPU (via hloc) |
| **Feature Matching (LightGlue)** | 2-5 min | N/A | ✅ Can be GPU (via hloc) |
| **Training** | Hours | N/A | ✅ Must be GPU (PyTorch) |

### CPU-Only Operations

| Component | GPU Time | CPU Time | Notes |
|-----------|----------|----------|-------|
| **COLMAP BA** | N/A | 2-8 min | ✅ CPU-only (no GPU support) |
| **Early Filtering** | N/A | <1 sec | ✅ CPU (negligible) |
| **Data Loading** | N/A | <1 sec | ✅ CPU (I/O bound) |
| **Cache Operations** | N/A | <1 sec | ✅ CPU (I/O bound) |
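Actual stage timings vary with sequence length and hardware, so it is worth measuring them on your own data. A small standard-library helper can do this; the timed stage below is a trivial stand-in, not a real pipeline stage:

```python
import time
from typing import Any, Callable

def time_stage(name: str, fn: Callable[..., Any], *args, **kwargs):
    """Run one pipeline stage and report its wall-clock time."""
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    print(f"{name}: {elapsed:.2f}s")
    return out, elapsed

# Example with a dummy stage standing in for e.g. DA3 inference:
out, dt = time_stage("dummy_stage", lambda: sum(range(1000)))
```

Wrapping each stage this way makes it easy to confirm which column of the tables above a given operation really belongs in.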
## Current Pipeline Flow

### Sequential (Inefficient):

```
Sequence 1:
  GPU: DA3 inference (30s) → Feature extraction (2min) → Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Sequence 2:
  GPU: DA3 inference (30s) → Feature extraction (2min) → Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Total: ~24 min (GPU idle during BA, CPU idle during GPU ops)
```
### Optimized Pipeline:

```
Parallel Execution:
  GPU: DA3 inference (Sequence 1) → Feature extraction → Matching
  CPU: BA (Sequence 2) [from cache or previous run]

  GPU: DA3 inference (Sequence 2) → Feature extraction → Matching
  CPU: BA (Sequence 3) [from cache or previous run]

  GPU: Training (batched)
  CPU: BA (other sequences in parallel)
```
## Optimal Placement Strategy

### Strategy 1: Separate GPU and CPU Workflows (Recommended)

**Phase 1: Dataset Building (GPU + CPU in parallel)**

```
GPU Pipeline (one sequence at a time):
  1. DA3 inference (30s)
  2. Feature extraction (2min)
  3. Feature matching (5min)
  Total: ~7-8 min per sequence

CPU Pipeline (parallel workers):
  1. BA validation (5-8 min per sequence)
  2. Can run 4-8 BA jobs in parallel (one per CPU core)

Key: GPU and CPU work on different sequences simultaneously
```

**Phase 2: Training (GPU only)**

```
GPU: Training (hours)
CPU: Idle (or pre-processing the next batch)
```
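The "different sequences simultaneously" idea can be sketched with a background BA pool: each sequence's BA is submitted as soon as its GPU work finishes, so BA for sequence N overlaps GPU work for sequence N+1. All stage functions below are trivial stand-ins, not the real pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def run_gpu_pipeline(seq):
    # Stand-in for DA3 inference + feature extraction + matching.
    return {"seq": seq, "features": f"features_{seq}"}

def run_ba(gpu_out):
    # Stand-in for CPU-only COLMAP bundle adjustment.
    return {"seq": gpu_out["seq"], "ba_ok": True}

def build_dataset(sequences, cpu_workers=8):
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        futures = []
        for seq in sequences:
            gpu_out = run_gpu_pipeline(seq)               # sequential GPU work
            futures.append(pool.submit(run_ba, gpu_out))  # BA overlaps next seq
        return [f.result() for f in futures]
```

The GPU loop never blocks on BA; it only waits at the very end for the last few BA jobs to drain.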
### Strategy 2: Pre-Compute BA on CPU Cluster

**Use cheap CPU instances for BA:**

```
1. Run BA on a CPU cluster (spot instances, cheaper)
   - 100 sequences × 5 min = 500 min ≈ 8.3 hours
   - Cost: ~$10-20 (vs $100+ on a GPU instance)

2. Run DA3 + Training on a GPU instance
   - DA3 inference: ~50 min
   - Training: 20-40 hours
   - Cost: GPU instance time

Total cost savings: 80-90% for the BA phase
```
### Strategy 3: Hybrid Approach (Best for Development)

**Single GPU instance with smart scheduling:**

```python
from concurrent.futures import ThreadPoolExecutor

# Pseudo-code for optimal scheduling
def process_sequences_optimally(sequences):
    gpu_queue = []
    cpu_queue = []

    for seq in sequences:
        # Check the cache first
        if ba_cached(seq):
            # Only DA3 inference (GPU) is needed
            gpu_queue.append(seq)
        else:
            # Full pipeline: GPU work first, BA on CPU afterwards
            gpu_queue.append(seq)
            cpu_queue.append(seq)

    # Process the GPU queue sequentially (one sequence at a time)
    for seq in gpu_queue:
        da3_inference(seq)
        extract_features(seq)
        match_features(seq)

    # Process the CPU queue with parallel workers
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(run_ba, seq) for seq in cpu_queue]  # CPU-only
        for f in futures:
            f.result()
```
## Implementation Recommendations

### 1. Separate GPU and CPU Workers

**The current implementation uses a ThreadPoolExecutor over sequences but does not separate GPU and CPU work.**

**Recommended changes:**

```python
# In pretrain.py
import torch

class ARKitPretrainPipeline:
    def __init__(self, ..., gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def process_arkit_sequence(self, ...):
        # GPU work
        with torch.cuda.device(self.gpu_device):
            da3_output = self.model.inference(images)
            features = extract_features_gpu(images)
            matches = match_features_gpu(features)

        # CPU work (can run in parallel with other sequences)
        ba_result = self._run_ba_cpu(images, features, matches)
        return sample
```
### 2. Pipeline Separation

**Separate dataset building into GPU and CPU phases:**

```python
from concurrent.futures import ThreadPoolExecutor

# Phase 1: GPU work (sequential, one sequence at a time)
def build_dataset_gpu_phase(sequences):
    results = []
    for seq in sequences:
        da3_output = model.inference(seq.images)  # always needed
        if ba_cached(seq):
            # Only GPU work needed
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'needs_ba': False,
            })
        else:
            # Full GPU pipeline
            features = extract_features(seq.images)
            matches = match_features(features)
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'features': features,
                'matches': matches,
                'needs_ba': True,
            })
    return results

# Phase 2: CPU work (parallel, many sequences at once)
def build_dataset_cpu_phase(gpu_results):
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [
            executor.submit(run_ba_cpu, r['seq'], r['features'], r['matches'])
            for r in gpu_results
            if r['needs_ba']
        ]
        # Collect results
        return [f.result() for f in futures]
```
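A self-contained toy run of the same two-phase split, with every stage function stubbed out just to show how data flows between the phases:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the cache check and pipeline stages.
def ba_cached(seq): return seq.endswith("_cached")
def extract_features(seq): return f"feat({seq})"
def match_features(feat): return f"match({feat})"
def run_ba_cpu(seq, feat, matches): return {"seq": seq, "ba_ok": True}

def build_dataset_demo(sequences, workers=8):
    # Phase 1: GPU work, one sequence at a time.
    gpu_results = []
    for seq in sequences:
        if ba_cached(seq):
            gpu_results.append({"seq": seq, "needs_ba": False})
        else:
            feat = extract_features(seq)
            gpu_results.append({"seq": seq, "features": feat,
                                "matches": match_features(feat),
                                "needs_ba": True})
    # Phase 2: BA in parallel on the CPU.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        futures = [ex.submit(run_ba_cpu, r["seq"], r["features"], r["matches"])
                   for r in gpu_results if r["needs_ba"]]
        ba_results = [f.result() for f in futures]
    return gpu_results, ba_results
```

Cached sequences skip phase 2 entirely, which is where the caching speedup in the performance comparison below comes from.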
### 3. Resource-Aware Scheduling

**Schedule based on resource availability:**

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

class ResourceAwareScheduler:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_queue = Queue()
        self.cpu_queue = Queue()
        self.cpu_slots = cpu_workers

    def schedule(self, sequence):
        if ba_cached(sequence):
            # Only GPU work
            self.gpu_queue.put(('inference_only', sequence))
        else:
            # GPU work first; BA is enqueued with its data once the GPU
            # stage finishes (enqueueing it here too would run BA twice)
            self.gpu_queue.put(('full_pipeline', sequence))

    def process(self):
        # GPU worker (sequential)
        while not self.gpu_queue.empty():
            task_type, seq = self.gpu_queue.get()
            if task_type == 'inference_only':
                da3_inference(seq)
            else:
                features = extract_features(seq)
                matches = match_features(features)
                self.cpu_queue.put((seq, features, matches))

        # CPU workers (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_slots) as executor:
            futures = []
            while not self.cpu_queue.empty():
                seq, features, matches = self.cpu_queue.get()
                futures.append(executor.submit(run_ba, seq, features, matches))
            for f in futures:
                f.result()
```
## Cost Optimization Strategies

### 1. Use Spot Instances for BA

**BA is CPU-only and can run on cheap spot instances:**

```bash
# Run BA on a spot instance (roughly 10x cheaper); the spot market
# settings, including the max price, live in spot-config.json
aws ec2 run-instances \
    --instance-type c5.4xlarge \
    --instance-market-options file://spot-config.json

# Cost: ~$0.10/hour vs ~$1.00/hour for a GPU instance
# 100 sequences × 5 min ≈ 8 hours → $0.80 vs $8.00
```
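For reference, the `spot-config.json` passed to `--instance-market-options` follows the EC2 `InstanceMarketOptions` schema; a minimal example might look like this (the max price here is an assumption, set it to your budget):

```json
{
  "MarketType": "spot",
  "SpotOptions": {
    "MaxPrice": "0.10",
    "SpotInstanceType": "one-time"
  }
}
```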
### 2. Pre-Compute BA Offline

**Run BA on a local machine or cheap CPU cluster:**

```bash
# On a local machine or CPU cluster: --epochs 0 just builds the dataset
ylff train pretrain data/arkit_sequences \
    --epochs 0 \
    --num-workers 8 \
    --cache-dir data/pretrain_cache

# Then train on the (expensive) GPU instance, reusing the cached BA
ylff train pretrain data/arkit_sequences \
    --epochs 20 \
    --cache-dir data/pretrain_cache
```
### 3. Hybrid Cloud Strategy

**Use different instance types for different phases:**

```
Phase 1: Dataset Building
  - GPU instance (1x): DA3 inference, feature extraction/matching
  - CPU instances (8x spot): BA validation
  - Cost: GPU ($2/hr) + CPU ($0.80/hr) = $2.80/hr
  - Time: 2-3 hours
  - Total: ~$8-10

Phase 2: Training
  - GPU instance (1x): training only
  - Cost: $2/hr × 20 hours = $40

Total: ~$50 (vs $100+ if everything runs on the GPU instance)
```
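These totals are easy to recompute for other providers; a toy calculator using the assumed rates and durations from the breakdown above:

```python
def phase_cost(rate_per_hour, hours):
    """Cost of running one phase at a flat hourly rate."""
    return rate_per_hour * hours

# Phase 1: 1x GPU at ~$2/hr plus 8x CPU spot at ~$0.10/hr each, ~3 hours.
build = phase_cost(2.00 + 8 * 0.10, 3)
# Phase 2: 1x GPU at ~$2/hr for ~20 hours of training.
train = phase_cost(2.00, 20)
total = build + train  # ~$48, consistent with the ~$50 estimate above
```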
## Recommended Implementation

### For Development (Single Machine):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Optimal single-machine setup
class OptimalPretrainPipeline:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def build_dataset(self, sequences):
        gpu_results = []
        cpu_tasks = []

        # Phase 1: GPU work (sequential)
        for seq in sequences:
            if self._is_ba_cached(seq):
                # Only inference needed
                result = self._run_da3_inference(seq)
            else:
                # Full GPU pipeline
                result = self._run_gpu_pipeline(seq)
                cpu_tasks.append(result)  # needs BA
            gpu_results.append(result)

        # Phase 2: CPU work (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_workers) as executor:
            ba_futures = {
                executor.submit(self._run_ba_cpu, task): task
                for task in cpu_tasks
            }
            # Merge BA results as they complete
            for future in as_completed(ba_futures):
                self._merge_results(ba_futures[future], future.result())

        return gpu_results
```
### For Production (Distributed):

```python
# Distributed setup (sketch; the _distribute/_collect helpers are the
# integration points for your cluster scheduler)
class DistributedPretrainPipeline:
    def __init__(self, gpu_nodes=1, cpu_nodes=8):
        self.gpu_nodes = gpu_nodes
        self.cpu_nodes = cpu_nodes

    def build_dataset(self, sequences):
        # GPU nodes: DA3 inference, feature extraction, matching
        gpu_tasks = self._distribute_gpu_work(sequences)
        # CPU nodes: BA validation (parallel)
        cpu_tasks = self._distribute_cpu_work(sequences)
        # Collect and merge
        return self._collect_results(gpu_tasks, cpu_tasks)
```
## Performance Comparison

### Current (Sequential):

```
100 sequences:
  GPU: 7 min × 100 = 700 min (11.7 hours)
  CPU: 5 min × 100 = 500 min (8.3 hours)
  Total: 20 hours (sequential)

GPU utilization: ~58% (idle during BA)
CPU utilization: ~42% (idle during GPU ops)
```

### Optimized (Parallel):

```
100 sequences:
  GPU: 7 min × 100 = 700 min (11.7 hours) [sequential]
  CPU: 5 min × 100 / 8 workers = 62.5 min (~1 hour) [overlapped with GPU work]
  Total: ~12 hours (GPU-bound)

GPU utilization: near 100% (the GPU is now the bottleneck)
Speedup: ~1.7x
```
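The overlapped total can be sanity-checked with a toy simulation: a sequential GPU emits one sequence every 7 min while 8 CPU workers each take 5 min per BA, so the CPU pool never backlogs and the makespan stays GPU-bound (all timings are the rough figures above):

```python
import heapq

def makespan_min(n_seq, gpu_min=7.0, ba_min=5.0, cpu_workers=8):
    """Overlapped completion time: a sequential GPU feeding a parallel BA pool."""
    worker_free = [0.0] * cpu_workers  # min-heap of worker-free times
    heapq.heapify(worker_free)
    t_gpu, finish = 0.0, 0.0
    for _ in range(n_seq):
        t_gpu += gpu_min                               # GPU finishes this sequence
        start = max(heapq.heappop(worker_free), t_gpu)  # BA needs data + a free worker
        end = start + ba_min
        heapq.heappush(worker_free, end)
        finish = max(finish, end)
    return finish

overlapped = makespan_min(100)   # 705 min ≈ 11.8 hours
sequential = 100 * (7.0 + 5.0)   # 1200 min = 20 hours
```

With these assumed timings the overlapped schedule finishes just one BA (5 min) after the last GPU sequence, which is where the ~12-hour, ~1.7x figures come from.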
### With Caching:

```
100 sequences (50 cached):
  GPU: 7 min × 50 = 350 min (5.8 hours)
  CPU: 5 min × 50 / 8 workers = ~31 min [parallel]
  Total: ~6 hours

Speedup: ~3.3x vs sequential
```
## Recommendations

### Immediate Actions:

1. ✅ **Separate GPU and CPU work** in the pipeline
2. ✅ **Use parallel CPU workers** for BA (already done)
3. ✅ **Pre-compute BA** on cheap CPU instances
4. ✅ **Cache aggressively** (already done)

### Future Optimizations:

1. **Batch GPU operations**: process multiple sequences on the GPU simultaneously
2. **Pipeline overlap**: start CPU BA while the GPU processes the next sequence
3. **Distributed BA**: run BA on multiple CPU nodes
4. **GPU feature extraction**: ensure SuperPoint/LightGlue run on the GPU

### Cost Savings:

- **Current**: everything on a GPU instance = $100-200 for 100 sequences
- **Optimized**: GPU + CPU spot = $20-40 for 100 sequences
- **Savings**: 80-90% reduction in compute costs