# GPU/CPU Optimal Placement Guide

## Overview

GPUs are expensive, and not all operations can be GPU-accelerated. This guide shows how to optimally place work across GPU and CPU to maximize efficiency and minimize cost.
## Component Analysis

### GPU-Accelerated Operations

| Component | GPU Time | CPU Time | Notes |
|---|---|---|---|
| DA3 Inference | 10-30 sec | N/A | ✅ Must be GPU (PyTorch model) |
| Feature Extraction (SuperPoint) | 1-2 min | N/A | ✅ Can be GPU (via hloc) |
| Feature Matching (LightGlue) | 2-5 min | N/A | ✅ Can be GPU (via hloc) |
| Training | Hours | N/A | ✅ Must be GPU (PyTorch) |
### CPU-Only Operations

| Component | GPU Time | CPU Time | Notes |
|---|---|---|---|
| COLMAP BA | N/A | 2-8 min | ✅ CPU-only (no GPU support) |
| Early Filtering | N/A | <1 sec | ✅ CPU (negligible) |
| Data Loading | N/A | <1 sec | ✅ CPU (I/O bound) |
| Cache Operations | N/A | <1 sec | ✅ CPU (I/O bound) |
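The tables above can be captured as a small placement map so a scheduler never sends a CPU-only job to the GPU. This is an illustrative sketch; the component names and the `device_for` helper are inventions for this example, not part of the codebase:

```python
# Device placement derived from the tables above; names are illustrative.
PLACEMENT = {
    "da3_inference": "gpu",        # must be GPU (PyTorch model)
    "feature_extraction": "gpu",   # SuperPoint via hloc
    "feature_matching": "gpu",     # LightGlue via hloc
    "training": "gpu",             # must be GPU
    "colmap_ba": "cpu",            # no GPU support
    "early_filtering": "cpu",
    "data_loading": "cpu",
    "cache_ops": "cpu",
}

def device_for(component: str) -> str:
    """Look up where a pipeline component should run; default to CPU."""
    return PLACEMENT.get(component, "cpu")
```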
## Current Pipeline Flow

**Sequential (inefficient):**

```
Sequence 1:
  GPU: DA3 inference (30s) → Feature extraction (2min) → Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Sequence 2:
  GPU: DA3 inference (30s) → Feature extraction (2min) → Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Total: ~24 min (GPU idle during BA, CPU idle during GPU ops)
```
**Optimized (parallel execution):**

```
GPU: DA3 inference (Sequence 1) → Feature extraction → Matching
CPU: BA (Sequence 2) [from cache or previous run]
GPU: DA3 inference (Sequence 2) → Feature extraction → Matching
CPU: BA (Sequence 3) [from cache or previous run]
GPU: Training (batched)
CPU: BA (other sequences in parallel)
```
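A minimal sketch of this overlap, assuming BA inputs for each sequence already exist (from cache or a previous run, as in the flow above). `gpu_step` and `cpu_step` are hypothetical stand-ins for DA3 + feature work and BA respectively:

```python
from concurrent.futures import ThreadPoolExecutor

def overlap_pipeline(sequences, gpu_step, cpu_step, cpu_workers=8):
    """Run gpu_step sequentially while cpu_step runs in background threads.

    BA (cpu_step) for every sequence is submitted up front, so the CPU
    workers stay busy while the single GPU walks through the sequences.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        ba_futures = {seq: pool.submit(cpu_step, seq) for seq in sequences}
        for seq in sequences:
            gpu_out = gpu_step(seq)  # GPU stays saturated
            results[seq] = (gpu_out, ba_futures[seq].result())
    return results
```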
## Optimal Placement Strategy

### Strategy 1: Separate GPU and CPU Workflows (Recommended)

**Phase 1: Dataset building (GPU + CPU in parallel)**

GPU pipeline (one sequence at a time):
1. DA3 inference (30s)
2. Feature extraction (2min)
3. Feature matching (5min)

Total: ~7-8 min per sequence

CPU pipeline (parallel workers):
1. BA validation (5-8 min per sequence)
2. Can run 4-8 BA jobs in parallel (one per CPU core)

Key: GPU and CPU work on different sequences simultaneously.

**Phase 2: Training (GPU only)**

- GPU: Training (hours)
- CPU: Idle (or can pre-process the next batch)
### Strategy 2: Pre-Compute BA on a CPU Cluster

Use cheap CPU instances for BA:

1. Run BA on a CPU cluster (spot instances, cheaper)
   - 100 sequences × 5 min ≈ 8 hours
   - Cost: ~$10-20 (vs $100+ on a GPU instance)
2. Run DA3 + training on a GPU instance
   - DA3 inference: 50 min
   - Training: 20-40 hours
   - Cost: GPU instance time

Total cost savings: 80-90% for the BA phase.
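To sanity-check numbers like these, a tiny cost model helps (the hourly rates below are illustrative assumptions, not quotes):

```python
def phase_cost(n_sequences, minutes_per_seq, hourly_rate, parallel_workers=1):
    """Cost of one pipeline phase: wall-clock hours times the hourly rate.

    parallel_workers models running the jobs concurrently on one machine,
    which shrinks wall time (and cost) but not total CPU-hours.
    """
    hours = n_sequences * minutes_per_seq / 60.0 / parallel_workers
    return hours * hourly_rate

# 100 BA runs at 5 min each (assumed rates: ~$1.50/hr CPU, ~$12/hr GPU box)
cpu_cost = phase_cost(100, 5, hourly_rate=1.50)   # ≈ $12.50
gpu_cost = phase_cost(100, 5, hourly_rate=12.0)   # ≈ $100
```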
### Strategy 3: Hybrid Approach (Best for Development)

Single GPU instance with smart scheduling:

```python
# Pseudo-code for optimal scheduling
def process_sequences_optimally(sequences):
    gpu_queue = []
    cpu_queue = []

    for seq in sequences:
        # Check the cache first
        if ba_cached(seq):
            # Only DA3 inference (GPU) is needed
            gpu_queue.append(seq)
        else:
            # Full pipeline: schedule GPU work first,
            # then BA on the CPU (can run in parallel)
            gpu_queue.append(seq)
            cpu_queue.append(seq)

    # Process the GPU queue (one sequence at a time)
    with GPU():
        for seq in gpu_queue:
            da3_inference(seq)
            extract_features(seq)
            match_features(seq)

    # Process the CPU queue (parallel workers)
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(run_ba, seq) for seq in cpu_queue]  # CPU-only
        for f in futures:
            f.result()
```
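The `ba_cached` check above is left abstract. A minimal sketch, assuming a hypothetical cache layout of one BA artifact per sequence under the cache directory (the filename pattern is an invention for this example; adapt it to whatever the real cache writes):

```python
from pathlib import Path

def ba_cached(seq_name: str, cache_dir: str = "data/pretrain_cache") -> bool:
    """Return True if a BA result for this sequence is already on disk.

    Assumes a hypothetical `<sequence>_ba.npz` file per sequence.
    """
    return (Path(cache_dir) / f"{seq_name}_ba.npz").exists()
```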
## Implementation Recommendations

### 1. Separate GPU and CPU Workers

The current implementation uses a ThreadPoolExecutor over sequences, but it does not separate GPU and CPU work.

Recommended changes:

```python
# In pretrain.py
class ARKitPretrainPipeline:
    def __init__(self, ..., gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def process_arkit_sequence(self, ...):
        # GPU work
        with torch.cuda.device(self.gpu_device):
            da3_output = self.model.inference(images)
            features = extract_features_gpu(images)
            matches = match_features_gpu(features)

        # CPU work (can run in parallel with other sequences)
        ba_result = self._run_ba_cpu(images, features, matches)
        return sample
```
### 2. Pipeline Separation

Separate dataset building into GPU and CPU phases:

```python
# Phase 1: GPU work (sequential, one sequence at a time)
def build_dataset_gpu_phase(sequences):
    results = []
    for seq in sequences:
        # DA3 inference is needed in both branches
        da3_output = model.inference(seq.images)
        if ba_cached(seq):
            # Only GPU inference needed
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'needs_ba': False,
            })
        else:
            # Full GPU pipeline: also compute features/matches for BA
            features = extract_features(seq.images)
            matches = match_features(features)
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'features': features,
                'matches': matches,
                'needs_ba': True,
            })
    return results

# Phase 2: CPU work (parallel, many sequences at once)
def build_dataset_cpu_phase(gpu_results):
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = []
        for result in gpu_results:
            if result['needs_ba']:
                futures.append(executor.submit(
                    run_ba_cpu,
                    result['seq'],
                    result['features'],
                    result['matches'],
                ))
        # Collect results
        ba_results = [f.result() for f in futures]
    return ba_results
```
### 3. Resource-Aware Scheduling

Schedule based on resource availability:

```python
class ResourceAwareScheduler:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_queue = Queue()
        self.cpu_queue = Queue()
        self.cpu_slots = cpu_workers

    def schedule(self, sequence):
        if ba_cached(sequence):
            # Only GPU work
            self.gpu_queue.put(('inference_only', sequence))
        else:
            # GPU work first; the BA task is enqueued by process()
            # once features and matches are available
            self.gpu_queue.put(('full_pipeline', sequence))

    def process(self):
        # GPU worker (sequential)
        while not self.gpu_queue.empty():
            task_type, seq = self.gpu_queue.get()
            if task_type == 'inference_only':
                da3_inference(seq)
            else:
                features = extract_features(seq)
                matches = match_features(features)
                self.cpu_queue.put(('ba', seq, features, matches))

        # CPU workers (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_slots) as executor:
            futures = []
            while not self.cpu_queue.empty():
                _, seq, features, matches = self.cpu_queue.get()
                futures.append(executor.submit(run_ba, seq, features, matches))
            for f in futures:
                f.result()
```
## Cost Optimization Strategies

### 1. Use Spot Instances for BA

BA is CPU-only and can run on cheap spot instances:

```bash
# Run BA on a spot instance (roughly 10x cheaper); the spot market
# options (including max price) live in the referenced JSON file
aws ec2 run-instances \
    --instance-type c5.4xlarge \
    --instance-market-options file://spot-config.json

# Cost: ~$0.10/hour vs ~$1.00/hour for a GPU instance
# 100 sequences × 5 min ≈ 8 hours → $0.80 vs $8.00
```
### 2. Pre-Compute BA Offline

Run BA on a local machine or cheap CPU cluster:

```bash
# On a local machine or CPU cluster: just build the dataset (no training)
ylff train pretrain data/arkit_sequences \
    --epochs 0 \
    --num-workers 8 \
    --cache-dir data/pretrain_cache

# Then train on the (expensive) GPU instance, reusing the cached BA
ylff train pretrain data/arkit_sequences \
    --epochs 20 \
    --cache-dir data/pretrain_cache
```
### 3. Hybrid Cloud Strategy

Use different instance types for different phases:

**Phase 1: Dataset building**
- GPU instance (1x): DA3 inference, feature extraction/matching
- CPU instances (8x spot): BA validation
- Cost: GPU ($2/hr) + CPU ($0.80/hr) = $2.80/hr
- Time: 2-3 hours
- Total: ~$8-10

**Phase 2: Training**
- GPU instance (1x): Training only
- Cost: $2/hr × 20 hours = $40
- Total: $40

Total: ~$50 (vs $100+ if everything runs on the GPU instance)
## Recommended Implementation

### For Development (Single Machine)

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Optimal single-machine setup
class OptimalPretrainPipeline:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def build_dataset(self, sequences):
        gpu_results = []
        cpu_tasks = []

        # Phase 1: GPU work (sequential)
        for seq in sequences:
            if self._is_ba_cached(seq):
                # Only inference needed
                result = self._run_da3_inference(seq)
                gpu_results.append(result)
            else:
                # Full GPU pipeline
                result = self._run_gpu_pipeline(seq)
                gpu_results.append(result)
                cpu_tasks.append(result)  # Needs BA

        # Phase 2: CPU work (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_workers) as executor:
            ba_futures = {
                executor.submit(self._run_ba_cpu, task): task
                for task in cpu_tasks
            }
            # Collect BA results and merge them with the GPU results
            for future in as_completed(ba_futures):
                ba_result = future.result()
                self._merge_results(ba_futures[future], ba_result)

        return gpu_results
```
### For Production (Distributed)

```python
# Distributed setup
class DistributedPretrainPipeline:
    def __init__(self, gpu_nodes=1, cpu_nodes=8):
        self.gpu_nodes = gpu_nodes
        self.cpu_nodes = cpu_nodes

    def build_dataset(self, sequences):
        # GPU nodes: DA3 inference, feature extraction, matching
        gpu_tasks = self._distribute_gpu_work(sequences)
        # CPU nodes: BA validation (parallel)
        cpu_tasks = self._distribute_cpu_work(sequences)
        # Collect and merge
        return self._collect_results(gpu_tasks, cpu_tasks)
```
## Performance Comparison

**Current (sequential):**

```
100 sequences:
  GPU: 7 min × 100 = 700 min (11.7 hours)
  CPU: 5 min × 100 = 500 min (8.3 hours)
  Total: 20 hours (sequential)
  GPU utilization: ~58% (idle during BA)
  CPU utilization: ~42% (idle during GPU ops)
```

**Optimized (parallel):**

```
100 sequences:
  GPU: 7 min × 100 = 700 min (11.7 hours) [sequential]
  CPU: 5 min × 100 / 8 workers = 62.5 min (~1 hour) [parallel, overlapped]
  Total: ~11.7 hours (GPU-bound; CPU work hides inside GPU time)
  GPU utilization: ~100%
  Speedup: ~1.7x
```

**With caching:**

```
100 sequences (50 cached):
  GPU: 7 min × 50 = 350 min (5.8 hours)
  CPU: 5 min × 50 / 8 workers = 31 min [parallel]
  Total: ~6 hours
  Speedup: ~3.3x vs sequential
```
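The arithmetic above can be reproduced in a few lines (the per-sequence timings are this guide's estimates, not measurements):

```python
def sequential_hours(n, gpu_min, cpu_min):
    """Total wall time when GPU and CPU steps run back to back."""
    return n * (gpu_min + cpu_min) / 60.0

def overlapped_hours(n, gpu_min, cpu_min, cpu_workers=8):
    """Wall time when parallel CPU BA overlaps the sequential GPU work."""
    gpu = n * gpu_min / 60.0
    cpu = n * cpu_min / 60.0 / cpu_workers
    return max(gpu, cpu)

baseline = sequential_hours(100, 7, 5)   # 20.0 hours
overlap = overlapped_hours(100, 7, 5)    # ~11.7 hours (GPU-bound)
speedup = baseline / overlap             # ~1.7x
```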
## Recommendations

### Immediate Actions

- ✅ Separate GPU and CPU work in the pipeline
- ✅ Use parallel CPU workers for BA (already done)
- ✅ Pre-compute BA on cheap CPU instances
- ✅ Cache aggressively (already done)

### Future Optimizations

- **Batch GPU operations:** process multiple sequences on the GPU simultaneously
- **Pipeline overlap:** start CPU BA while the GPU processes the next sequence
- **Distributed BA:** run BA on multiple CPU nodes
- **GPU feature extraction:** ensure SuperPoint/LightGlue actually run on the GPU

### Cost Savings

- Current: all on a GPU instance = $100-200 for 100 sequences
- Optimized: GPU + CPU spot = $20-40 for 100 sequences
- Savings: 80-90% reduction in compute costs