# GPU/CPU Optimal Placement Guide

## Overview

GPUs are expensive, and not all operations in this pipeline can be GPU-accelerated. This guide shows how to place work optimally across GPU and CPU to maximize utilization and minimize cost.

## Component Analysis

### GPU-Accelerated Operations

| Component | GPU Time | CPU Time | Notes |
|-----------|----------|----------|-------|
| **DA3 Inference** | 10-30 sec | N/A | ✅ Must be GPU (PyTorch model) |
| **Feature Extraction (SuperPoint)** | 1-2 min | N/A | ✅ Can be GPU (via hloc) |
| **Feature Matching (LightGlue)** | 2-5 min | N/A | ✅ Can be GPU (via hloc) |
| **Training** | Hours | N/A | ✅ Must be GPU (PyTorch) |

### CPU-Only Operations

| Component | GPU Time | CPU Time | Notes |
|-----------|----------|----------|-------|
| **COLMAP BA** | N/A | 2-8 min | ❌ CPU-only (no GPU support) |
| **Early Filtering** | N/A | <1 sec | ✅ CPU (negligible) |
| **Data Loading** | N/A | <1 sec | ✅ CPU (I/O bound) |
| **Cache Operations** | N/A | <1 sec | ✅ CPU (I/O bound) |

## Current Pipeline Flow

### Sequential (Inefficient):

```
Sequence 1:
  GPU: DA3 inference (30s) → Feature extraction (2min) → Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Sequence 2:
  GPU: DA3 inference (30s) → Feature extraction (2min) → Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Total: 24 min (GPU idle during BA, CPU idle during GPU ops)
```

### Optimized Pipeline:

```
Parallel Execution:
  GPU: DA3 inference (Sequence 1) → Feature extraction → Matching
  CPU: BA (Sequence 2) [from cache or previous run]

  GPU: DA3 inference (Sequence 2) → Feature extraction → Matching
  CPU: BA (Sequence 3) [from cache or previous run]

  GPU: Training (batched)
  CPU: BA (other sequences in parallel)
```

## Optimal Placement Strategy

### Strategy 1: Separate GPU and CPU Workflows (Recommended)

**Phase 1: Dataset Building (GPU + CPU in parallel)**

```
GPU Pipeline (one sequence at a time):
1. DA3 inference (30s)
2. Feature extraction (2min)
3. Feature matching (5min)
Total: ~7-8 min per sequence

CPU Pipeline (parallel workers):
1. BA validation (5-8 min per sequence)
2. Can run 4-8 BA jobs in parallel (CPU cores)

Key: GPU and CPU work on different sequences simultaneously
```

**Phase 2: Training (GPU only)**

```
GPU: Training (hours)
CPU: Idle (or can pre-process the next batch)
```

### Strategy 2: Pre-Compute BA on CPU Cluster

**Use cheap CPU instances for BA:**

```
1. Run BA on a CPU cluster (spot instances, cheaper)
   - 100 sequences × 5 min ≈ 8 hours
   - Cost: ~$10-20 (vs $100+ on a GPU instance)

2. Run DA3 + Training on a GPU instance
   - DA3 inference: 50 min
   - Training: 20-40 hours
   - Cost: GPU instance time

Total cost savings: 80-90% for the BA phase
```

### Strategy 3: Hybrid Approach (Best for Development)

**Single GPU instance with smart scheduling:**

```python
from concurrent.futures import ThreadPoolExecutor

# Pseudo-code for optimal scheduling
def process_sequences_optimally(sequences):
    gpu_queue = []
    cpu_queue = []

    for seq in sequences:
        # Check cache first
        if ba_cached(seq):
            # Only DA3 inference (GPU) is needed
            gpu_queue.append(seq)
        else:
            # Full pipeline: schedule GPU work ...
            gpu_queue.append(seq)
            # ... and BA on CPU (can run in parallel)
            cpu_queue.append(seq)

    # Process GPU queue (one sequence at a time, holds the GPU)
    for seq in gpu_queue:
        da3_inference(seq)
        extract_features(seq)
        match_features(seq)

    # Process CPU queue (parallel workers)
    with ThreadPoolExecutor(max_workers=8) as executor:
        for seq in cpu_queue:
            executor.submit(run_ba, seq)  # CPU-only
```
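The scheduling idea above can be exercised end-to-end with stub workloads. This is a minimal sketch, not the project's actual API: `gpu_pipeline` and `run_ba` are hypothetical stand-ins, and the key point is that BA futures are submitted *before* the GPU loop starts, so CPU work overlaps GPU work.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub workloads standing in for the real pipeline stages
# (gpu_pipeline, run_ba are hypothetical names, not the project's API).
def gpu_pipeline(seq):
    return f"{seq}:features"

def run_ba(seq):
    return f"{seq}:ba"

def process_sequences_optimally(sequences, cached, cpu_workers=8):
    """Submit CPU-only BA jobs to a thread pool, then run GPU work
    sequentially on the main thread while BA proceeds in the background."""
    results = {}
    with ThreadPoolExecutor(max_workers=cpu_workers) as executor:
        # CPU work starts immediately for every uncached sequence ...
        ba_futures = {seq: executor.submit(run_ba, seq)
                      for seq in sequences if seq not in cached}
        # ... while the main thread keeps the GPU busy, one sequence at a time.
        for seq in sequences:
            results[seq] = {"gpu": gpu_pipeline(seq)}
        # Merge BA results; cached sequences skip BA entirely.
        for seq, fut in ba_futures.items():
            results[seq]["ba"] = fut.result()
    return results

out = process_sequences_optimally(["s1", "s2", "s3"], cached={"s2"})
```

Because COLMAP BA is an external CPU-bound process, threads (rather than processes) are enough here; the GIL is released while the subprocess runs.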
## Implementation Recommendations

### 1. Separate GPU and CPU Workers

**The current implementation uses a ThreadPoolExecutor over sequences, but does not separate GPU and CPU work.**

**Recommended changes:**

```python
# In pretrain.py
class ARKitPretrainPipeline:
    def __init__(self, ..., gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def process_arkit_sequence(self, ...):
        # GPU work
        with torch.cuda.device(self.gpu_device):
            da3_output = self.model.inference(images)
            features = extract_features_gpu(images)
            matches = match_features_gpu(features)

        # CPU work (can run in parallel with other sequences)
        ba_result = self._run_ba_cpu(images, features, matches)

        return sample
```

### 2. Pipeline Separation

**Separate dataset building into GPU and CPU phases:**

```python
from concurrent.futures import ThreadPoolExecutor

# Phase 1: GPU work (sequential, one sequence at a time)
def build_dataset_gpu_phase(sequences):
    results = []
    for seq in sequences:
        if ba_cached(seq):
            # Only GPU inference is needed
            da3_output = model.inference(seq.images)
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'needs_ba': False,
            })
        else:
            # Full GPU pipeline
            features = extract_features(seq.images)
            matches = match_features(features)
            results.append({
                'seq': seq,
                'features': features,
                'matches': matches,
                'needs_ba': True,
            })
    return results

# Phase 2: CPU work (parallel, many sequences at once)
def build_dataset_cpu_phase(gpu_results):
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [
            executor.submit(run_ba_cpu, r['seq'], r['features'], r['matches'])
            for r in gpu_results
            if r['needs_ba']
        ]
        # Collect results
        ba_results = [f.result() for f in futures]
    return ba_results
```
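Phase 2 can also merge each BA result back into the record the GPU phase produced, rather than collecting them into a separate list. This sketch uses a hypothetical `run_ba_cpu` stub and keys each future by its record so results land in the right place as they finish:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for the CPU-only COLMAP BA step.
def run_ba_cpu(seq, features, matches):
    return {"poses": f"{seq}-refined"}

def build_dataset_cpu_phase(gpu_results, workers=8):
    """Phase 2 sketch: run BA for every record that needs it and merge each
    result back into the record produced by the GPU phase."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Key each future by its record so results can be merged back.
        futures = {
            executor.submit(run_ba_cpu, r["seq"], r["features"], r["matches"]): r
            for r in gpu_results
            if r["needs_ba"]
        }
        for fut in as_completed(futures):
            futures[fut]["ba"] = fut.result()  # merge as each BA job finishes
    return gpu_results

records = [
    {"seq": "a", "features": [], "matches": [], "needs_ba": True},
    {"seq": "b", "da3_output": None, "needs_ba": False},
]
merged = build_dataset_cpu_phase(records)
```

Keying futures by record avoids a second pass to match BA outputs to sequences, which matters once jobs finish out of order.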
### 3. Resource-Aware Scheduling

**Schedule based on resource availability:**

```python
class ResourceAwareScheduler:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_queue = Queue()
        self.cpu_queue = Queue()
        self.gpu_busy = False
        self.cpu_slots = cpu_workers

    def schedule(self, sequence):
        if ba_cached(sequence):
            # Only GPU work
            self.gpu_queue.put(('inference_only', sequence))
        else:
            # GPU work first ...
            self.gpu_queue.put(('full_pipeline', sequence))
            # ... then CPU work
            self.cpu_queue.put(('ba', sequence))

    def process(self):
        # GPU worker (sequential)
        while not self.gpu_queue.empty():
            task_type, seq = self.gpu_queue.get()
            if task_type == 'inference_only':
                da3_inference(seq)
            else:
                features = extract_features(seq)
                matches = match_features(features)
                self.cpu_queue.put(('ba_with_data', seq, features, matches))

        # CPU workers (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_slots) as executor:
            while not self.cpu_queue.empty():
                task = self.cpu_queue.get()
                if task[0] == 'ba':
                    executor.submit(run_ba, task[1])
                else:
                    executor.submit(run_ba, task[1], task[2], task[3])
```

## Cost Optimization Strategies

### 1. Use Spot Instances for BA

**BA is CPU-only and can run on cheap spot instances:**

```bash
# Run BA on a spot instance (~10x cheaper than an on-demand GPU instance);
# the max spot price (e.g. $0.10/hr) is set inside spot-config.json
aws ec2 run-instances \
    --instance-type c5.4xlarge \
    --instance-market-options file://spot-config.json

# Cost: ~$0.10/hour vs $1.00/hour for a GPU instance
# 100 sequences × 5 min ≈ 8 hours of BA = $0.80 vs $8.00
```

### 2. Pre-Compute BA Offline

**Run BA on a local machine or cheap CPU cluster:**

```bash
# On a local machine or CPU cluster: just build the dataset (no training)
ylff train pretrain data/arkit_sequences \
    --epochs 0 \
    --num-workers 8 \
    --cache-dir data/pretrain_cache

# Then train on the (expensive) GPU instance; cached BA results are reused
ylff train pretrain data/arkit_sequences \
    --epochs 20 \
    --cache-dir data/pretrain_cache
```
### 3. Hybrid Cloud Strategy

**Use different instance types for different phases:**

```
Phase 1: Dataset Building
- GPU instance (1x): DA3 inference, feature extraction/matching
- CPU instances (8x spot): BA validation
- Cost: GPU ($2/hr) + CPU ($0.80/hr) = $2.80/hr
- Time: 2-3 hours
- Total: ~$8-10

Phase 2: Training
- GPU instance (1x): Training only
- Cost: $2/hr × 20 hours = $40

Total: ~$50 (vs $100+ if everything runs on the GPU instance)
```

## Recommended Implementation

### For Development (Single Machine):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Optimal single-machine setup
class OptimalPretrainPipeline:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def build_dataset(self, sequences):
        # Separate GPU and CPU work
        gpu_results = []
        cpu_tasks = []

        # Phase 1: GPU work (sequential)
        for seq in sequences:
            if self._is_ba_cached(seq):
                # Only inference needed
                result = self._run_da3_inference(seq)
                gpu_results.append(result)
            else:
                # Full GPU pipeline
                result = self._run_gpu_pipeline(seq)
                gpu_results.append(result)
                cpu_tasks.append(result)  # Needs BA

        # Phase 2: CPU work (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_workers) as executor:
            ba_futures = {
                executor.submit(self._run_ba_cpu, task): task
                for task in cpu_tasks
            }

            # Collect BA results as they finish
            for future in as_completed(ba_futures):
                ba_result = future.result()
                # Merge with the corresponding GPU result
                self._merge_results(ba_futures[future], ba_result)

        return gpu_results
```

### For Production (Distributed):

```python
# Distributed setup
class DistributedPretrainPipeline:
    def __init__(self, gpu_nodes=1, cpu_nodes=8):
        self.gpu_nodes = gpu_nodes
        self.cpu_nodes = cpu_nodes

    def build_dataset(self, sequences):
        # GPU nodes: DA3 inference, feature extraction, matching
        gpu_tasks = self._distribute_gpu_work(sequences)

        # CPU nodes: BA validation (parallel)
        cpu_tasks = self._distribute_cpu_work(sequences)

        # Collect and merge
        return self._collect_results(gpu_tasks, cpu_tasks)
```
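The hybrid-cloud cost estimate can be reproduced with a few lines. The rates and node counts below are the assumptions from the breakdown above ($2/hr GPU, $0.10/hr per spot CPU node, 8 CPU nodes), not measured prices:

```python
# Rates and node counts are assumptions taken from the hybrid-cloud breakdown.
GPU_RATE = 2.00        # $/hr, GPU instance
CPU_SPOT_RATE = 0.10   # $/hr, per spot CPU instance
CPU_NODES = 8

def hybrid_cost(build_hours, train_hours):
    """Phase 1 runs the GPU plus 8 spot CPU nodes; phase 2 runs the GPU alone."""
    phase1 = build_hours * (GPU_RATE + CPU_NODES * CPU_SPOT_RATE)  # $2.80/hr combined
    phase2 = train_hours * GPU_RATE
    return phase1, phase2

p1, p2 = hybrid_cost(build_hours=3, train_hours=20)
# p1 ≈ $8.40 (matches the "~$8-10" estimate), p2 = $40, total ≈ $48
```

Spot prices fluctuate, so treat these numbers as an order-of-magnitude check rather than a quote.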
## Performance Comparison

### Current (Sequential):

```
100 sequences:
  GPU: 7 min × 100 = 700 min (11.7 hours)
  CPU: 5 min × 100 = 500 min (8.3 hours)
  Total: 20 hours (sequential)

  GPU utilization: ~58% (idle during BA)
  CPU utilization: ~42% (idle during GPU ops)
```

### Optimized (Parallel):

```
100 sequences:
  GPU: 7 min × 100 = 700 min (11.7 hours) [sequential]
  CPU: 5 min × 100 / 8 workers = 62.5 min (~1 hour) [parallel]
  Total: ~12.7 hours (CPU BA overlapped with GPU work)

  GPU utilization: ~90%
  CPU: 8 workers busy for only ~1 hour total
  Speedup: ~1.6x
```

### With Caching:

```
100 sequences (50 cached):
  GPU: 7 min × 50 = 350 min (5.8 hours)
  CPU: 5 min × 50 / 8 workers ≈ 31 min [parallel]
  Total: ~6 hours

  Speedup: ~3.3x vs sequential
```

## Recommendations

### Immediate Actions:
1. ✅ **Separate GPU and CPU work** in the pipeline
2. ✅ **Use parallel CPU workers** for BA (already done)
3. ✅ **Pre-compute BA** on cheap CPU instances
4. ✅ **Cache aggressively** (already done)

### Future Optimizations:
1. **Batch GPU operations**: Process multiple sequences on the GPU simultaneously
2. **Pipeline overlap**: Start CPU BA while the GPU processes the next sequence
3. **Distributed BA**: Run BA on multiple CPU nodes
4. **GPU feature extraction**: Ensure SuperPoint/LightGlue actually run on the GPU

### Cost Savings:
- **Current**: Everything on a GPU instance = $100-200 for 100 sequences
- **Optimized**: GPU + CPU spot = $20-40 for 100 sequences
- **Savings**: 80-90% reduction in compute costs
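The wall-clock figures in the comparison above follow a simple additive model: GPU work runs sequentially, BA is spread over a CPU pool, and the CPU tail is added to the GPU time. A small sketch of that model (the per-sequence times are the guide's estimates, not measurements):

```python
def wall_clock_min(n_seq, gpu_min=7, cpu_min=5, cpu_workers=8, cached=0):
    """Additive model from the comparison above: GPU work is sequential,
    BA runs on a CPU pool, and only the pooled BA tail adds to wall clock.
    Cached sequences skip both GPU re-work and BA."""
    todo = n_seq - cached
    gpu_total = gpu_min * todo
    cpu_total = cpu_min * todo / cpu_workers
    return gpu_total + cpu_total

sequential = (7 + 5) * 100                    # 1200 min = 20 hours
parallel = wall_clock_min(100)                # 762.5 min ≈ 12.7 hours
with_cache = wall_clock_min(100, cached=50)   # 381.25 min ≈ 6.4 hours
```

In practice most BA jobs finish while the GPU is still busy, so this model is a conservative upper bound; the true wall clock approaches the GPU time alone.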