# GPU/CPU Optimal Placement Guide
## Overview
GPUs are expensive, and not all operations can be GPU-accelerated. This guide shows how to optimally place work across GPU and CPU to maximize efficiency and minimize cost.
## Component Analysis
### GPU-Accelerated Operations
| Component | GPU Time | CPU Time | Notes |
|-----------|----------|---------|-------|
| **DA3 Inference** | 10-30 sec | N/A | βœ… Must be GPU (PyTorch model) |
| **Feature Extraction (SuperPoint)** | 1-2 min | N/A | βœ… Can be GPU (via hloc) |
| **Feature Matching (LightGlue)** | 2-5 min | N/A | βœ… Can be GPU (via hloc) |
| **Training** | Hours | N/A | βœ… Must be GPU (PyTorch) |
### CPU-Only Operations
| Component | GPU Time | CPU Time | Notes |
|-----------|----------|---------|-------|
| **COLMAP BA** | N/A | 2-8 min | ❌ CPU-only (no GPU support) |
| **Early Filtering** | N/A | <1 sec | βœ… CPU (negligible) |
| **Data Loading** | N/A | <1 sec | βœ… CPU (I/O bound) |
| **Cache Operations** | N/A | <1 sec | βœ… CPU (I/O bound) |
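One way to make the tables above actionable in code is to encode them as a placement map and partition a pipeline's stages by device. This is a minimal sketch; the stage names and `split_by_device` helper are illustrative, not part of the actual pipeline API:

```python
# Placement table distilled from the component analysis above
PLACEMENT = {
    "da3_inference": "gpu",       # PyTorch model, must be GPU
    "feature_extraction": "gpu",  # SuperPoint via hloc
    "feature_matching": "gpu",    # LightGlue via hloc
    "training": "gpu",            # PyTorch
    "colmap_ba": "cpu",           # no GPU support
    "early_filtering": "cpu",     # negligible cost
    "data_loading": "cpu",        # I/O bound
    "cache_ops": "cpu",           # I/O bound
}

def split_by_device(stages):
    """Partition pipeline stages into GPU and CPU work lists."""
    gpu = [s for s in stages if PLACEMENT[s] == "gpu"]
    cpu = [s for s in stages if PLACEMENT[s] == "cpu"]
    return gpu, cpu
```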
## Current Pipeline Flow
### Sequential (Inefficient):
```
Sequence 1:
GPU: DA3 inference (30s) β†’ Feature extraction (2min) β†’ Matching (5min)
CPU: BA (5min) [waits for GPU]
Total: ~12 min
Sequence 2:
GPU: DA3 inference (30s) β†’ Feature extraction (2min) β†’ Matching (5min)
CPU: BA (5min) [waits for GPU]
Total: ~12 min
Total: 24 min (GPU idle during BA, CPU idle during GPU ops)
```
### Optimized Pipeline:
```
Parallel Execution:
GPU: DA3 inference (Sequence 1) β†’ Feature extraction β†’ Matching
CPU: BA (Sequence 2) [from cache or previous run]
GPU: DA3 inference (Sequence 2) β†’ Feature extraction β†’ Matching
CPU: BA (Sequence 3) [from cache or previous run]
GPU: Training (batched)
CPU: BA (other sequences in parallel)
```
## Optimal Placement Strategy
### Strategy 1: Separate GPU and CPU Workflows (Recommended)
**Phase 1: Dataset Building (GPU + CPU in parallel)**
```
GPU Pipeline (one sequence at a time):
1. DA3 inference (30s)
2. Feature extraction (2min)
3. Feature matching (5min)
Total: ~7-8 min per sequence
CPU Pipeline (parallel workers):
1. BA validation (5-8 min per sequence)
2. Can run 4-8 BA jobs in parallel (CPU cores)
Key: GPU and CPU work on different sequences simultaneously
```
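The CPU side of Phase 1 can be driven with a small worker pool. A minimal, self-contained sketch; `run_ba` here is a stand-in for the real BA step, which would normally shell out to COLMAP:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_ba(sequence_id):
    # Stand-in for real BA validation, which would typically launch an
    # external, CPU-bound COLMAP process. Threads are adequate here
    # because each worker mostly waits on that subprocess.
    return {"sequence": sequence_id, "valid": True}

def validate_sequences(sequence_ids, workers=8):
    """Run BA validation for many sequences across a pool of CPU workers."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(run_ba, sid): sid for sid in sequence_ids}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

With 8 workers, 100 five-minute BA jobs finish in roughly an hour of wall-clock time instead of 8+ hours serially.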
**Phase 2: Training (GPU only)**
```
GPU: Training (hours)
CPU: Idle (or can pre-process next batch)
```
### Strategy 2: Pre-Compute BA on CPU Cluster
**Use cheap CPU instances for BA:**
```
1. Run BA on CPU cluster (spot instances, cheaper)
- 100 sequences Γ— 5 min = 8 hours
- Cost: ~$10-20 (vs $100+ on GPU instance)
2. Run DA3 + Training on GPU instance
- DA3 inference: 50 min
- Training: 20-40 hours
- Cost: GPU instance time
Total cost savings: 80-90% for BA phase
```
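The cost arithmetic behind this strategy is a simple rate-times-time product. A small helper makes it explicit (the hourly rates are the illustrative figures used in this guide, not quoted prices):

```python
def ba_cost(num_sequences, minutes_per_seq, hourly_rate):
    """Total cost of running BA serially at a given hourly instance rate."""
    hours = num_sequences * minutes_per_seq / 60.0
    return hours * hourly_rate

# 100 sequences x 5 min ~= 8.3 hours of BA
cpu_spot = ba_cost(100, 5, 0.10)   # cheap CPU spot instance
gpu_rate = ba_cost(100, 5, 1.00)   # same work left on a GPU instance
savings = 1 - cpu_spot / gpu_rate  # fraction saved by moving BA off GPU
```

The savings fraction depends only on the rate ratio, which is why the 80-90% figure holds across instance sizes.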
### Strategy 3: Hybrid Approach (Best for Development)
**Single GPU instance with smart scheduling:**
```python
# Pseudo-code for optimal scheduling (da3_inference, extract_features,
# match_features, run_ba, and ba_cached stand in for the real pipeline calls)
from concurrent.futures import ThreadPoolExecutor

def process_sequences_optimally(sequences):
    gpu_queue = []
    cpu_queue = []
    for seq in sequences:
        # Check cache first
        if ba_cached(seq):
            # Only DA3 inference (GPU) is needed
            gpu_queue.append(seq)
        else:
            # Full pipeline: GPU work first, BA on CPU afterwards
            gpu_queue.append(seq)
            cpu_queue.append(seq)

    # Process GPU queue (one sequence at a time)
    for seq in gpu_queue:
        da3_inference(seq)
        extract_features(seq)
        match_features(seq)

    # Process CPU queue (parallel workers); submit() is required here --
    # calling run_ba directly inside the `with` block would run serially
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(run_ba, seq) for seq in cpu_queue]
        for future in futures:
            future.result()  # run_ba is CPU-only
```
## Implementation Recommendations
### 1. Separate GPU and CPU Workers
**Current implementation uses ThreadPoolExecutor for sequences, but doesn't separate GPU/CPU work.**
**Recommended changes:**
```python
# In pretrain.py
class ARKitPretrainPipeline:
    def __init__(self, ..., gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def process_arkit_sequence(self, ...):
        # GPU work
        with torch.cuda.device(self.gpu_device):
            da3_output = self.model.inference(images)
            features = extract_features_gpu(images)
            matches = match_features_gpu(features)
        # CPU work (can run in parallel with other sequences)
        ba_result = self._run_ba_cpu(images, features, matches)
        return sample
```
### 2. Pipeline Separation
**Separate dataset building into GPU and CPU phases:**
```python
# Phase 1: GPU work (sequential, one sequence at a time)
def build_dataset_gpu_phase(sequences):
    results = []
    for seq in sequences:
        if ba_cached(seq):
            # Only GPU work needed
            da3_output = model.inference(seq.images)
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'needs_ba': False,
            })
        else:
            # Full GPU pipeline: DA3 inference plus the features and
            # matches that BA will need
            da3_output = model.inference(seq.images)
            features = extract_features(seq.images)
            matches = match_features(features)
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'features': features,
                'matches': matches,
                'needs_ba': True,
            })
    return results

# Phase 2: CPU work (parallel, many sequences at once)
def build_dataset_cpu_phase(gpu_results):
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = []
        for result in gpu_results:
            if result['needs_ba']:
                futures.append(executor.submit(
                    run_ba_cpu,
                    result['seq'],
                    result['features'],
                    result['matches'],
                ))
        # Collect results
        ba_results = [f.result() for f in futures]
    return ba_results
```
### 3. Resource-Aware Scheduling
**Schedule based on resource availability:**
```python
from queue import Queue
from concurrent.futures import ThreadPoolExecutor

class ResourceAwareScheduler:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_queue = Queue()
        self.cpu_queue = Queue()
        self.cpu_slots = cpu_workers

    def schedule(self, sequence):
        if ba_cached(sequence):
            # Only GPU work
            self.gpu_queue.put(('inference_only', sequence))
        else:
            # GPU work first; process() enqueues the BA task once the
            # features and matches exist (enqueuing BA here as well
            # would run it twice)
            self.gpu_queue.put(('full_pipeline', sequence))

    def process(self):
        # GPU worker (sequential)
        while not self.gpu_queue.empty():
            task_type, seq = self.gpu_queue.get()
            if task_type == 'inference_only':
                da3_inference(seq)
            else:
                features = extract_features(seq)
                matches = match_features(features)
                self.cpu_queue.put(('ba', seq, features, matches))

        # CPU workers (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_slots) as executor:
            futures = []
            while not self.cpu_queue.empty():
                _, seq, features, matches = self.cpu_queue.get()
                futures.append(executor.submit(run_ba, seq, features, matches))
            for future in futures:
                future.result()
```
## Cost Optimization Strategies
### 1. Use Spot Instances for BA
**BA is CPU-only and can run on cheap spot instances:**
```bash
# Run BA on a spot instance (~10x cheaper). Note: run-instances has no
# --spot-price flag; the max price goes in the instance-market-options
# JSON (MarketType=spot, SpotOptions.MaxPrice)
aws ec2 run-instances \
  --instance-type c5.4xlarge \
  --instance-market-options file://spot-config.json

# Cost: ~$0.10/hour vs ~$1.00/hour for a GPU instance
# 100 sequences Γ— 5 min β‰ˆ 8 hours β†’ $0.80 vs $8.00
```
### 2. Pre-Compute BA Offline
**Run BA on local machine or cheap CPU cluster:**
```bash
# On local machine or CPU cluster (--epochs 0 just builds the dataset;
# an inline comment after a backslash would break the line continuation)
ylff train pretrain data/arkit_sequences \
  --epochs 0 \
  --num-workers 8 \
  --cache-dir data/pretrain_cache

# Then train on the GPU instance (expensive)
ylff train pretrain data/arkit_sequences \
  --epochs 20 \
  --cache-dir data/pretrain_cache  # Uses cached BA
```
### 3. Hybrid Cloud Strategy
**Use different instance types for different phases:**
```
Phase 1: Dataset Building
- GPU instance (1x): DA3 inference, feature extraction/matching
- CPU instance (8x spot): BA validation
- Cost: GPU ($2/hr) + CPU ($0.80/hr) = $2.80/hr
- Time: 2-3 hours
- Total: ~$8-10
Phase 2: Training
- GPU instance (1x): Training only
- Cost: $2/hr Γ— 20 hours = $40
- Total: $40
Total: $50 (vs $100+ if all on GPU)
```
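The phase totals above are straightforward rate-times-duration products; a short sketch using the same illustrative rates:

```python
def phase_cost(hours, *hourly_rates):
    """Cost of one phase: every listed resource runs for the full duration."""
    return hours * sum(hourly_rates)

build = phase_cost(3, 2.00, 0.80)   # GPU $2/hr + CPU spot $0.80/hr, ~3 hours
train = phase_cost(20, 2.00)        # GPU only, 20 hours
total = build + train               # within the ~$50 ballpark quoted above
```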
## Recommended Implementation
### For Development (Single Machine):
```python
# Optimal single-machine setup
from concurrent.futures import ThreadPoolExecutor, as_completed

class OptimalPretrainPipeline:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def build_dataset(self, sequences):
        # Separate GPU and CPU work
        gpu_results = []
        cpu_tasks = []

        # Phase 1: GPU work (sequential)
        for seq in sequences:
            if self._is_ba_cached(seq):
                # Only inference needed
                result = self._run_da3_inference(seq)
                gpu_results.append(result)
            else:
                # Full GPU pipeline
                result = self._run_gpu_pipeline(seq)
                gpu_results.append(result)
                cpu_tasks.append(result)  # Needs BA

        # Phase 2: CPU work (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_workers) as executor:
            ba_futures = {
                executor.submit(self._run_ba_cpu, task): task
                for task in cpu_tasks
            }
            # Collect BA results and merge each with its GPU result
            for future in as_completed(ba_futures):
                ba_result = future.result()
                self._merge_results(ba_futures[future], ba_result)

        return gpu_results
```
### For Production (Distributed):
```python
# Distributed setup
class DistributedPretrainPipeline:
    def __init__(self, gpu_nodes=1, cpu_nodes=8):
        self.gpu_nodes = gpu_nodes
        self.cpu_nodes = cpu_nodes

    def build_dataset(self, sequences):
        # GPU nodes: DA3 inference, features, matching
        gpu_tasks = self._distribute_gpu_work(sequences)
        # CPU nodes: BA validation (parallel)
        cpu_tasks = self._distribute_cpu_work(sequences)
        # Collect and merge
        return self._collect_results(gpu_tasks, cpu_tasks)
```
## Performance Comparison
### Current (Sequential):
```
100 sequences:
GPU: 7 min Γ— 100 = 700 min (11.7 hours)
CPU: 5 min Γ— 100 = 500 min (8.3 hours)
Total: 20 hours (sequential)
GPU utilization: 50%
CPU utilization: 50%
```
### Optimized (Parallel):
```
100 sequences:
GPU: 7 min Γ— 100 = 700 min (11.7 hours) [sequential]
CPU: 5 min Γ— 100 / 8 workers = 62.5 min (1 hour) [parallel]
Total: 12.7 hours (overlapped)
GPU utilization: 90%
CPU utilization: 90%
Speedup: 1.6x
```
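These wall-clock figures follow from a simple model of the two-phase schedule: GPU work runs serially, then CPU BA runs across the worker pool. A sketch reproducing the numbers above (caching is modeled as dropping sequences from both phases, matching the figures in the next section):

```python
def wall_clock_minutes(num_seqs, gpu_min_per_seq, cpu_min_per_seq,
                       cpu_workers, cached=0):
    """Wall-clock time of a serial GPU phase followed by a parallel CPU phase."""
    active = num_seqs - cached
    gpu_phase = active * gpu_min_per_seq                # one sequence at a time
    cpu_phase = active * cpu_min_per_seq / cpu_workers  # BA jobs split across workers
    return gpu_phase + cpu_phase

sequential = 100 * (7 + 5)                   # 1200 min = 20 hours
parallel = wall_clock_minutes(100, 7, 5, 8)  # 762.5 min ~= 12.7 hours
speedup = sequential / parallel              # ~1.6x
```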
### With Caching:
```
100 sequences (50 cached):
GPU: 7 min Γ— 50 = 350 min (5.8 hours)
CPU: 5 min Γ— 50 / 8 workers = 31 min [parallel]
Total: 6 hours
Speedup: 3.3x vs sequential
```
## Recommendations
### Immediate Actions:
1. βœ… **Separate GPU and CPU work** in pipeline
2. βœ… **Use parallel CPU workers** for BA (already done)
3. βœ… **Pre-compute BA** on cheap CPU instances
4. βœ… **Cache aggressively** (already done)
### Future Optimizations:
1. **Batch GPU operations**: Process multiple sequences on GPU simultaneously
2. **Pipeline overlap**: Start CPU BA while GPU processes next sequence
3. **Distributed BA**: Run BA on multiple CPU nodes
4. **GPU feature extraction**: Ensure SuperPoint/LightGlue use GPU
### Cost Savings:
- **Current**: All on GPU instance = $100-200 for 100 sequences
- **Optimized**: GPU + CPU spot = $20-40 for 100 sequences
- **Savings**: 80-90% reduction in compute costs