# GPU/CPU Optimal Placement Guide
## Overview
GPUs are expensive, and not all operations can be GPU-accelerated. This guide shows how to optimally place work across GPU and CPU to maximize efficiency and minimize cost.
## Component Analysis
### GPU-Accelerated Operations
| Component | GPU Time | CPU Time | Notes |
|-----------|----------|---------|-------|
| **DA3 Inference** | 10-30 sec | N/A | βœ… Must be GPU (PyTorch model) |
| **Feature Extraction (SuperPoint)** | 1-2 min | N/A | βœ… Can be GPU (via hloc) |
| **Feature Matching (LightGlue)** | 2-5 min | N/A | βœ… Can be GPU (via hloc) |
| **Training** | Hours | N/A | βœ… Must be GPU (PyTorch) |
### CPU-Only Operations
| Component | GPU Time | CPU Time | Notes |
|-----------|----------|---------|-------|
| **COLMAP BA** | N/A | 2-8 min | ❌ CPU-only (no GPU support) |
| **Early Filtering** | N/A | <1 sec | βœ… CPU (negligible) |
| **Data Loading** | N/A | <1 sec | βœ… CPU (I/O bound) |
| **Cache Operations** | N/A | <1 sec | βœ… CPU (I/O bound) |
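One way to make the tables above actionable in code is to encode them as a placement map and partition a pipeline's stages by device. This is a minimal sketch; the stage names and `split_by_device` helper are illustrative, not part of the actual pipeline API:

```python
# Placement table distilled from the component analysis above
PLACEMENT = {
    "da3_inference": "gpu",       # PyTorch model, must be GPU
    "feature_extraction": "gpu",  # SuperPoint via hloc
    "feature_matching": "gpu",    # LightGlue via hloc
    "training": "gpu",            # PyTorch
    "colmap_ba": "cpu",           # no GPU support
    "early_filtering": "cpu",     # negligible cost
    "data_loading": "cpu",        # I/O bound
    "cache_ops": "cpu",           # I/O bound
}

def split_by_device(stages):
    """Partition pipeline stages into GPU and CPU work lists."""
    gpu = [s for s in stages if PLACEMENT[s] == "gpu"]
    cpu = [s for s in stages if PLACEMENT[s] == "cpu"]
    return gpu, cpu
```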
## Current Pipeline Flow
### Sequential (Inefficient):
```
Sequence 1:
GPU: DA3 inference (30s) β†’ Feature extraction (2min) β†’ Matching (5min)
CPU: BA (5min) [waits for GPU]
Total: ~12 min
Sequence 2:
GPU: DA3 inference (30s) β†’ Feature extraction (2min) β†’ Matching (5min)
CPU: BA (5min) [waits for GPU]
Total: ~12 min
Total: 24 min (GPU idle during BA, CPU idle during GPU ops)
```
### Optimized Pipeline:
```
Parallel Execution:
GPU: DA3 inference (Sequence 1) β†’ Feature extraction β†’ Matching
CPU: BA (Sequence 2) [from cache or previous run]
GPU: DA3 inference (Sequence 2) β†’ Feature extraction β†’ Matching
CPU: BA (Sequence 3) [from cache or previous run]
GPU: Training (batched)
CPU: BA (other sequences in parallel)
```
## Optimal Placement Strategy
### Strategy 1: Separate GPU and CPU Workflows (Recommended)
**Phase 1: Dataset Building (GPU + CPU in parallel)**
```
GPU Pipeline (one sequence at a time):
1. DA3 inference (30s)
2. Feature extraction (2min)
3. Feature matching (5min)
Total: ~7-8 min per sequence
CPU Pipeline (parallel workers):
1. BA validation (5-8 min per sequence)
2. Can run 4-8 BA jobs in parallel (CPU cores)
Key: GPU and CPU work on different sequences simultaneously
```
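The CPU side of Phase 1 can be driven with a small worker pool. A minimal, self-contained sketch; `run_ba` here is a stand-in for the real BA step, which would normally shell out to COLMAP:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_ba(sequence_id):
    # Stand-in for real BA validation, which would typically launch an
    # external, CPU-bound COLMAP process. Threads are adequate here
    # because each worker mostly waits on that subprocess.
    return {"sequence": sequence_id, "valid": True}

def validate_sequences(sequence_ids, workers=8):
    """Run BA validation for many sequences across a pool of CPU workers."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(run_ba, sid): sid for sid in sequence_ids}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

With 8 workers, 100 five-minute BA jobs finish in roughly an hour of wall-clock time instead of 8+ hours serially.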
**Phase 2: Training (GPU only)**
```
GPU: Training (hours)
CPU: Idle (or can pre-process next batch)
```
### Strategy 2: Pre-Compute BA on CPU Cluster
**Use cheap CPU instances for BA:**
```
1. Run BA on CPU cluster (spot instances, cheaper)
- 100 sequences Γ— 5 min = 8 hours
- Cost: ~$10-20 (vs $100+ on GPU instance)
2. Run DA3 + Training on GPU instance
- DA3 inference: 50 min
- Training: 20-40 hours
- Cost: GPU instance time
Total cost savings: 80-90% for BA phase
```
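The cost arithmetic behind this strategy is a simple rate-times-time product. A small helper makes it explicit (the hourly rates are the illustrative figures used in this guide, not quoted prices):

```python
def ba_cost(num_sequences, minutes_per_seq, hourly_rate):
    """Total cost of running BA serially at a given hourly instance rate."""
    hours = num_sequences * minutes_per_seq / 60.0
    return hours * hourly_rate

# 100 sequences x 5 min ~= 8.3 hours of BA
cpu_spot = ba_cost(100, 5, 0.10)   # cheap CPU spot instance
gpu_rate = ba_cost(100, 5, 1.00)   # same work left on a GPU instance
savings = 1 - cpu_spot / gpu_rate  # fraction saved by moving BA off GPU
```

The savings fraction depends only on the rate ratio, which is why the 80-90% figure holds across instance sizes.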
### Strategy 3: Hybrid Approach (Best for Development)
**Single GPU instance with smart scheduling:**
```python
# Pseudo-code for optimal scheduling (da3_inference, extract_features,
# match_features, run_ba, and ba_cached stand in for the real pipeline calls)
from concurrent.futures import ThreadPoolExecutor

def process_sequences_optimally(sequences):
    gpu_queue = []
    cpu_queue = []
    for seq in sequences:
        # Check cache first
        if ba_cached(seq):
            # Only DA3 inference (GPU) is needed
            gpu_queue.append(seq)
        else:
            # Full pipeline: GPU work first, BA on CPU afterwards
            gpu_queue.append(seq)
            cpu_queue.append(seq)

    # Process GPU queue (one sequence at a time)
    for seq in gpu_queue:
        da3_inference(seq)
        extract_features(seq)
        match_features(seq)

    # Process CPU queue (parallel workers); submit() is required here --
    # calling run_ba directly inside the `with` block would run serially
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(run_ba, seq) for seq in cpu_queue]
        for future in futures:
            future.result()  # run_ba is CPU-only
```
## Implementation Recommendations
### 1. Separate GPU and CPU Workers
**Current implementation uses ThreadPoolExecutor for sequences, but doesn't separate GPU/CPU work.**
**Recommended changes:**
```python
# In pretrain.py
class ARKitPretrainPipeline:
    def __init__(self, ..., gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def process_arkit_sequence(self, ...):
        # GPU work
        with torch.cuda.device(self.gpu_device):
            da3_output = self.model.inference(images)
            features = extract_features_gpu(images)
            matches = match_features_gpu(features)
        # CPU work (can run in parallel with other sequences)
        ba_result = self._run_ba_cpu(images, features, matches)
        return sample
```
### 2. Pipeline Separation
**Separate dataset building into GPU and CPU phases:**
```python
# Phase 1: GPU work (sequential, one sequence at a time)
def build_dataset_gpu_phase(sequences):
    results = []
    for seq in sequences:
        if ba_cached(seq):
            # Only GPU work needed
            da3_output = model.inference(seq.images)
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'needs_ba': False,
            })
        else:
            # Full GPU pipeline: DA3 inference plus the features and
            # matches that BA will need
            da3_output = model.inference(seq.images)
            features = extract_features(seq.images)
            matches = match_features(features)
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'features': features,
                'matches': matches,
                'needs_ba': True,
            })
    return results

# Phase 2: CPU work (parallel, many sequences at once)
def build_dataset_cpu_phase(gpu_results):
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = []
        for result in gpu_results:
            if result['needs_ba']:
                futures.append(executor.submit(
                    run_ba_cpu,
                    result['seq'],
                    result['features'],
                    result['matches'],
                ))
        # Collect results
        ba_results = [f.result() for f in futures]
    return ba_results
```
### 3. Resource-Aware Scheduling
**Schedule based on resource availability:**
```python
from queue import Queue
from concurrent.futures import ThreadPoolExecutor

class ResourceAwareScheduler:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_queue = Queue()
        self.cpu_queue = Queue()
        self.cpu_slots = cpu_workers

    def schedule(self, sequence):
        if ba_cached(sequence):
            # Only GPU work
            self.gpu_queue.put(('inference_only', sequence))
        else:
            # GPU work first; process() enqueues the BA task once the
            # features and matches exist (enqueuing BA here as well
            # would run it twice)
            self.gpu_queue.put(('full_pipeline', sequence))

    def process(self):
        # GPU worker (sequential)
        while not self.gpu_queue.empty():
            task_type, seq = self.gpu_queue.get()
            if task_type == 'inference_only':
                da3_inference(seq)
            else:
                features = extract_features(seq)
                matches = match_features(features)
                self.cpu_queue.put(('ba', seq, features, matches))

        # CPU workers (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_slots) as executor:
            futures = []
            while not self.cpu_queue.empty():
                _, seq, features, matches = self.cpu_queue.get()
                futures.append(executor.submit(run_ba, seq, features, matches))
            for future in futures:
                future.result()
```
## Cost Optimization Strategies
### 1. Use Spot Instances for BA
**BA is CPU-only and can run on cheap spot instances:**
```bash
# Run BA on a spot instance (~10x cheaper). Note: run-instances has no
# --spot-price flag; the max price goes in the instance-market-options
# JSON (MarketType=spot, SpotOptions.MaxPrice)
aws ec2 run-instances \
  --instance-type c5.4xlarge \
  --instance-market-options file://spot-config.json

# Cost: ~$0.10/hour vs ~$1.00/hour for a GPU instance
# 100 sequences Γ— 5 min β‰ˆ 8 hours β†’ $0.80 vs $8.00
```
### 2. Pre-Compute BA Offline
**Run BA on local machine or cheap CPU cluster:**
```bash
# On local machine or CPU cluster (--epochs 0 just builds the dataset;
# an inline comment after a backslash would break the line continuation)
ylff train pretrain data/arkit_sequences \
  --epochs 0 \
  --num-workers 8 \
  --cache-dir data/pretrain_cache

# Then train on the GPU instance (expensive)
ylff train pretrain data/arkit_sequences \
  --epochs 20 \
  --cache-dir data/pretrain_cache  # Uses cached BA
```
### 3. Hybrid Cloud Strategy
**Use different instance types for different phases:**
```
Phase 1: Dataset Building
- GPU instance (1x): DA3 inference, feature extraction/matching
- CPU instance (8x spot): BA validation
- Cost: GPU ($2/hr) + CPU ($0.80/hr) = $2.80/hr
- Time: 2-3 hours
- Total: ~$8-10
Phase 2: Training
- GPU instance (1x): Training only
- Cost: $2/hr Γ— 20 hours = $40
- Total: $40
Total: $50 (vs $100+ if all on GPU)
```
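The phase totals above are straightforward rate-times-duration products; a short sketch using the same illustrative rates:

```python
def phase_cost(hours, *hourly_rates):
    """Cost of one phase: every listed resource runs for the full duration."""
    return hours * sum(hourly_rates)

build = phase_cost(3, 2.00, 0.80)   # GPU $2/hr + CPU spot $0.80/hr, ~3 hours
train = phase_cost(20, 2.00)        # GPU only, 20 hours
total = build + train               # within the ~$50 ballpark quoted above
```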
## Recommended Implementation
### For Development (Single Machine):
```python
# Optimal single-machine setup
from concurrent.futures import ThreadPoolExecutor, as_completed

class OptimalPretrainPipeline:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def build_dataset(self, sequences):
        # Separate GPU and CPU work
        gpu_results = []
        cpu_tasks = []

        # Phase 1: GPU work (sequential)
        for seq in sequences:
            if self._is_ba_cached(seq):
                # Only inference needed
                result = self._run_da3_inference(seq)
                gpu_results.append(result)
            else:
                # Full GPU pipeline
                result = self._run_gpu_pipeline(seq)
                gpu_results.append(result)
                cpu_tasks.append(result)  # Needs BA

        # Phase 2: CPU work (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_workers) as executor:
            ba_futures = {
                executor.submit(self._run_ba_cpu, task): task
                for task in cpu_tasks
            }
            # Collect BA results and merge each with its GPU result
            for future in as_completed(ba_futures):
                ba_result = future.result()
                self._merge_results(ba_futures[future], ba_result)

        return gpu_results
```
### For Production (Distributed):
```python
# Distributed setup
class DistributedPretrainPipeline:
    def __init__(self, gpu_nodes=1, cpu_nodes=8):
        self.gpu_nodes = gpu_nodes
        self.cpu_nodes = cpu_nodes

    def build_dataset(self, sequences):
        # GPU nodes: DA3 inference, features, matching
        gpu_tasks = self._distribute_gpu_work(sequences)
        # CPU nodes: BA validation (parallel)
        cpu_tasks = self._distribute_cpu_work(sequences)
        # Collect and merge
        return self._collect_results(gpu_tasks, cpu_tasks)
```
## Performance Comparison
### Current (Sequential):
```
100 sequences:
GPU: 7 min Γ— 100 = 700 min (11.7 hours)
CPU: 5 min Γ— 100 = 500 min (8.3 hours)
Total: 20 hours (sequential)
GPU utilization: 50%
CPU utilization: 50%
```
### Optimized (Parallel):
```
100 sequences:
GPU: 7 min Γ— 100 = 700 min (11.7 hours) [sequential]
CPU: 5 min Γ— 100 / 8 workers = 62.5 min (1 hour) [parallel]
Total: 12.7 hours (overlapped)
GPU utilization: 90%
CPU utilization: 90%
Speedup: 1.6x
```
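These wall-clock figures follow from a simple model of the two-phase schedule: GPU work runs serially, then CPU BA runs across the worker pool. A sketch reproducing the numbers above (caching is modeled as dropping sequences from both phases, matching the figures in the next section):

```python
def wall_clock_minutes(num_seqs, gpu_min_per_seq, cpu_min_per_seq,
                       cpu_workers, cached=0):
    """Wall-clock time of a serial GPU phase followed by a parallel CPU phase."""
    active = num_seqs - cached
    gpu_phase = active * gpu_min_per_seq                # one sequence at a time
    cpu_phase = active * cpu_min_per_seq / cpu_workers  # BA jobs split across workers
    return gpu_phase + cpu_phase

sequential = 100 * (7 + 5)                   # 1200 min = 20 hours
parallel = wall_clock_minutes(100, 7, 5, 8)  # 762.5 min ~= 12.7 hours
speedup = sequential / parallel              # ~1.6x
```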
### With Caching:
```
100 sequences (50 cached):
GPU: 7 min Γ— 50 = 350 min (5.8 hours)
CPU: 5 min Γ— 50 / 8 workers = 31 min [parallel]
Total: 6 hours
Speedup: 3.3x vs sequential
```
## Recommendations
### Immediate Actions:
1. βœ… **Separate GPU and CPU work** in pipeline
2. βœ… **Use parallel CPU workers** for BA (already done)
3. βœ… **Pre-compute BA** on cheap CPU instances
4. βœ… **Cache aggressively** (already done)
### Future Optimizations:
1. **Batch GPU operations**: Process multiple sequences on GPU simultaneously
2. **Pipeline overlap**: Start CPU BA while GPU processes next sequence
3. **Distributed BA**: Run BA on multiple CPU nodes
4. **GPU feature extraction**: Ensure SuperPoint/LightGlue use GPU
### Cost Savings:
- **Current**: All on GPU instance = $100-200 for 100 sequences
- **Optimized**: GPU + CPU spot = $20-40 for 100 sequences
- **Savings**: 80-90% reduction in compute costs