
# GPU/CPU Optimal Placement Guide

## Overview

GPUs are expensive, and not all operations can be GPU-accelerated. This guide shows how to place work across the GPU and CPU to maximize utilization of both and minimize cost.

## Component Analysis

### GPU-Accelerated Operations

| Component | GPU Time | CPU Time | Notes |
| --- | --- | --- | --- |
| DA3 Inference | 10-30 sec | N/A | ✅ Must be GPU (PyTorch model) |
| Feature Extraction (SuperPoint) | 1-2 min | N/A | ✅ Can be GPU (via hloc) |
| Feature Matching (LightGlue) | 2-5 min | N/A | ✅ Can be GPU (via hloc) |
| Training | Hours | N/A | ✅ Must be GPU (PyTorch) |

### CPU-Only Operations

| Component | GPU Time | CPU Time | Notes |
| --- | --- | --- | --- |
| COLMAP BA | N/A | 2-8 min | ❌ CPU-only (no GPU support) |
| Early Filtering | N/A | <1 sec | ✅ CPU (negligible) |
| Data Loading | N/A | <1 sec | ✅ CPU (I/O bound) |
| Cache Operations | N/A | <1 sec | ✅ CPU (I/O bound) |
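The two tables reduce to a simple placement rule; a minimal sketch, assuming the stage names below (the `pick_device` helper and the stage sets are illustrative, not part of the actual pipeline):

```python
# Illustrative placement rule derived from the tables above.
# GPU_STAGES / CPU_STAGES and pick_device() are hypothetical names.
GPU_STAGES = {"da3_inference", "feature_extraction", "feature_matching", "training"}
CPU_STAGES = {"bundle_adjustment", "early_filtering", "data_loading", "cache_ops"}

def pick_device(stage: str, cuda_available: bool = True) -> str:
    """Route GPU-capable stages to 'cuda' when a GPU exists; everything else to 'cpu'."""
    if stage in GPU_STAGES and cuda_available:
        return "cuda"
    if stage in GPU_STAGES | CPU_STAGES:
        return "cpu"
    raise ValueError(f"unknown stage: {stage}")
```

COLMAP BA always lands on the CPU, and GPU-capable stages fall back to CPU only when no GPU is present.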

## Current Pipeline Flow

**Sequential (Inefficient):**

```
Sequence 1:
  GPU: DA3 inference (30s) → Feature extraction (2min) → Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Sequence 2:
  GPU: DA3 inference (30s) → Feature extraction (2min) → Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Total: 24 min (GPU idle during BA, CPU idle during GPU ops)
```

**Optimized Pipeline:**

```
Parallel Execution:
  GPU: DA3 inference (Sequence 1) → Feature extraction → Matching
  CPU: BA (Sequence 2) [from cache or previous run]

  GPU: DA3 inference (Sequence 2) → Feature extraction → Matching
  CPU: BA (Sequence 3) [from cache or previous run]

  GPU: Training (batched)
  CPU: BA (other sequences in parallel)
```

## Optimal Placement Strategy

### Strategy 1: Separate GPU and CPU Workflows (Recommended)

**Phase 1: Dataset Building (GPU + CPU in parallel)**

```
GPU pipeline (one sequence at a time):
  1. DA3 inference (30s)
  2. Feature extraction (2min)
  3. Feature matching (5min)
  Total: ~7-8 min per sequence

CPU pipeline (parallel workers):
  1. BA validation (5-8 min per sequence)
  2. Can run 4-8 BA jobs in parallel (CPU cores)
```

Key: the GPU and CPU work on different sequences simultaneously.
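The overlap described above can be sketched with a CPU thread pool that runs BA while the GPU loop moves on to the next sequence; `gpu_pipeline` and `run_ba` below are hypothetical stand-ins for the real stages:

```python
# Sketch of the overlap: the GPU processes sequences one at a time while
# a CPU pool runs BA on sequences whose GPU outputs are already available.
from concurrent.futures import ThreadPoolExecutor

def gpu_pipeline(seq):
    # Stand-in for DA3 inference + feature extraction + matching (GPU)
    return f"gpu:{seq}"

def run_ba(gpu_out):
    # Stand-in for bundle adjustment (CPU-only)
    return f"ba:{gpu_out}"

def overlap(sequences, cpu_workers=8):
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        futures = []
        for seq in sequences:
            gpu_out = gpu_pipeline(seq)                   # GPU stays busy
            futures.append(pool.submit(run_ba, gpu_out))  # BA overlaps on CPU
        return [f.result() for f in futures]
```

With this shape, BA for sequence *i* runs while the GPU is already on sequence *i+1*, which is exactly the overlap the diagram shows.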

**Phase 2: Training (GPU only)**

```
GPU: Training (hours)
CPU: Idle (or pre-processing the next batch)
```

### Strategy 2: Pre-Compute BA on a CPU Cluster

Use cheap CPU instances for BA:

1. Run BA on a CPU cluster (spot instances, cheaper)
   - 100 sequences × 5 min ≈ 8 hours
   - Cost: ~$10-20 (vs $100+ on a GPU instance)
2. Run DA3 + training on a GPU instance
   - DA3 inference: 50 min
   - Training: 20-40 hours
   - Cost: GPU instance time

Total cost savings: 80-90% for the BA phase.

### Strategy 3: Hybrid Approach (Best for Development)

Single GPU instance with smart scheduling:

```python
# Scheduling sketch: ba_cached(), da3_inference(), extract_features(),
# match_features(), and run_ba() stand in for the real pipeline stages.
from concurrent.futures import ThreadPoolExecutor

def process_sequences_optimally(sequences):
    gpu_queue = []
    cpu_queue = []

    for seq in sequences:
        # Every sequence needs DA3 inference on the GPU
        gpu_queue.append(seq)
        # Only sequences without a cached BA result need CPU work
        if not ba_cached(seq):
            cpu_queue.append(seq)

    needs_ba = set(cpu_queue)

    # Process the GPU queue (sequential, one sequence at a time)
    for seq in gpu_queue:
        da3_inference(seq)
        if seq in needs_ba:
            extract_features(seq)   # features and matches feed BA
            match_features(seq)

    # Process the CPU queue (parallel workers)
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(run_ba, seq) for seq in cpu_queue]
        for future in futures:
            future.result()
```

## Implementation Recommendations

### 1. Separate GPU and CPU Workers

The current implementation uses a ThreadPoolExecutor across sequences but does not separate GPU work from CPU work.

Recommended changes:

```python
# In pretrain.py
class ARKitPretrainPipeline:
    def __init__(self, gpu_device="cuda", cpu_workers=8):  # plus existing args
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def process_arkit_sequence(self, images):
        # GPU work
        with torch.cuda.device(self.gpu_device):
            da3_output = self.model.inference(images)
            features = extract_features_gpu(images)
            matches = match_features_gpu(features)

        # CPU work (can run in parallel with other sequences)
        ba_result = self._run_ba_cpu(images, features, matches)

        # Combine GPU and CPU outputs into one training sample
        sample = {"da3_output": da3_output, "ba_result": ba_result}
        return sample
```

### 2. Pipeline Separation

Separate dataset building into GPU and CPU phases:

```python
from concurrent.futures import ThreadPoolExecutor

# Phase 1: GPU work (sequential, one sequence at a time)
def build_dataset_gpu_phase(sequences):
    results = []
    for seq in sequences:
        da3_output = model.inference(seq.images)  # always needed
        if ba_cached(seq):
            # BA already cached: inference is the only GPU work
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'needs_ba': False,
            })
        else:
            # Full GPU pipeline: features and matches feed BA later
            features = extract_features(seq.images)
            matches = match_features(features)
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'features': features,
                'matches': matches,
                'needs_ba': True,
            })
    return results

# Phase 2: CPU work (parallel, many sequences at once)
def build_dataset_cpu_phase(gpu_results):
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [
            executor.submit(run_ba_cpu, r['seq'], r['features'], r['matches'])
            for r in gpu_results
            if r['needs_ba']
        ]
        # Collect results
        ba_results = [f.result() for f in futures]
    return ba_results
```

### 3. Resource-Aware Scheduling

Schedule based on resource availability:

```python
from queue import Queue
from concurrent.futures import ThreadPoolExecutor

class ResourceAwareScheduler:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_queue = Queue()
        self.cpu_queue = Queue()
        self.cpu_slots = cpu_workers

    def schedule(self, sequence):
        if ba_cached(sequence):
            # Only GPU work is needed
            self.gpu_queue.put(('inference_only', sequence))
        else:
            # GPU work runs first; BA is queued once its inputs exist
            self.gpu_queue.put(('full_pipeline', sequence))

    def process(self):
        # GPU worker (sequential)
        while not self.gpu_queue.empty():
            task_type, seq = self.gpu_queue.get()
            da3_inference(seq)
            if task_type == 'full_pipeline':
                features = extract_features(seq)
                matches = match_features(features)
                self.cpu_queue.put((seq, features, matches))

        # CPU workers (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_slots) as pool:
            futures = []
            while not self.cpu_queue.empty():
                seq, features, matches = self.cpu_queue.get()
                futures.append(pool.submit(run_ba, seq, features, matches))
            for future in futures:
                future.result()
```

## Cost Optimization Strategies

### 1. Use Spot Instances for BA

BA is CPU-only and can run on cheap spot instances:

```bash
# Run BA on a spot instance (roughly 10x cheaper than an on-demand GPU box).
# The max spot price is set inside spot-config.json, not as a flag.
aws ec2 run-instances \
    --instance-type c5.4xlarge \
    --instance-market-options file://spot-config.json

# Cost: ~$0.10/hour vs $1.00/hour for GPU
# 100 sequences × 5 min ≈ 8 hours ≈ $0.80 vs $8.00
```

### 2. Pre-Compute BA Offline

Run BA on a local machine or cheap CPU cluster:

```bash
# On a local machine or CPU cluster (epochs=0 just builds the dataset)
ylff train pretrain data/arkit_sequences \
    --epochs 0 \
    --num-workers 8 \
    --cache-dir data/pretrain_cache

# Then train on the GPU instance (expensive); reuses the cached BA results
ylff train pretrain data/arkit_sequences \
    --epochs 20 \
    --cache-dir data/pretrain_cache
```

### 3. Hybrid Cloud Strategy

Use different instance types for different phases:

```
Phase 1: Dataset Building
  - GPU instance (1x): DA3 inference, feature extraction/matching
  - CPU instances (8x spot): BA validation
  - Cost: GPU ($2/hr) + CPU ($0.80/hr) = $2.80/hr
  - Time: 2-3 hours
  - Total: ~$8-10

Phase 2: Training
  - GPU instance (1x): Training only
  - Cost: $2/hr × 20 hours = $40
  - Total: $40

Total: ~$50 (vs $100+ if everything runs on the GPU instance)
```
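As a sanity check, the arithmetic above can be reproduced in a few lines; the rates and hours are the illustrative figures from this section, not measured values:

```python
# Cost-model sketch using the illustrative rates from this section.
def phase_cost(rate_per_hr, hours):
    return rate_per_hr * hours

build = phase_cost(2.00 + 0.80, 3)   # Phase 1: GPU + CPU spot, ~3 hours
train = phase_cost(2.00, 20)         # Phase 2: 20 GPU-hours of training
total = build + train                # ≈ $48, i.e. roughly the $50 quoted
```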

## Recommended Implementation

**For Development (Single Machine):**

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Optimal single-machine setup
class OptimalPretrainPipeline:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def build_dataset(self, sequences):
        # Separate GPU and CPU work
        gpu_results = []
        cpu_tasks = []

        # Phase 1: GPU work (sequential)
        for seq in sequences:
            if self._is_ba_cached(seq):
                # Only inference needed
                result = self._run_da3_inference(seq)
                gpu_results.append(result)
            else:
                # Full GPU pipeline
                result = self._run_gpu_pipeline(seq)
                gpu_results.append(result)
                cpu_tasks.append(result)  # needs BA

        # Phase 2: CPU work (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_workers) as executor:
            ba_futures = {
                executor.submit(self._run_ba_cpu, task): task
                for task in cpu_tasks
            }

            # Collect BA results as they finish
            for future in as_completed(ba_futures):
                ba_result = future.result()
                # Merge with the corresponding GPU result
                self._merge_results(ba_futures[future], ba_result)

        return gpu_results
```

**For Production (Distributed):**

```python
# Distributed setup (skeleton)
class DistributedPretrainPipeline:
    def __init__(self, gpu_nodes=1, cpu_nodes=8):
        self.gpu_nodes = gpu_nodes
        self.cpu_nodes = cpu_nodes

    def build_dataset(self, sequences):
        # GPU nodes: DA3 inference, features, matching
        gpu_tasks = self._distribute_gpu_work(sequences)

        # CPU nodes: BA validation (parallel)
        cpu_tasks = self._distribute_cpu_work(sequences)

        # Collect and merge
        return self._collect_results(gpu_tasks, cpu_tasks)
```

## Performance Comparison

**Current (Sequential):**

```
100 sequences:
  GPU: 7 min × 100 = 700 min (11.7 hours)
  CPU: 5 min × 100 = 500 min (8.3 hours)
  Total: 20 hours (sequential)
  GPU utilization: 50%
  CPU utilization: 50%
```

**Optimized (Parallel):**

```
100 sequences:
  GPU: 7 min × 100 = 700 min (11.7 hours) [sequential]
  CPU: 5 min × 100 / 8 workers = 62.5 min (~1 hour) [parallel]
  Total: ~12.7 hours (GPU phase, then parallel CPU phase)
  GPU utilization: 90%
  CPU utilization: 90%
  Speedup: 1.6x
```

**With Caching:**

```
100 sequences (50 with cached BA):
  GPU: 7 min × 50 = 350 min (5.8 hours)
  CPU: 5 min × 50 / 8 workers = ~31 min [parallel]
  Total: ~6 hours
  Speedup: ~3.3x vs sequential
```
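The optimized figure follows from a simple wall-clock model of the two-phase schedule (sequential GPU phase, then BA spread over parallel CPU workers); the helper name is illustrative:

```python
# Wall-clock model for the two-phase schedule described above.
def wall_time_hours(n_seqs, gpu_min_per_seq, cpu_min_per_seq, cpu_workers):
    gpu_total = n_seqs * gpu_min_per_seq                 # sequential GPU phase
    cpu_total = n_seqs * cpu_min_per_seq / cpu_workers   # parallel CPU phase
    return (gpu_total + cpu_total) / 60

sequential = (100 * 7 + 100 * 5) / 60        # 20 hours with no parallelism
optimized = wall_time_hours(100, 7, 5, 8)    # ≈ 12.7 hours
```

Dividing the two gives the ~1.6x speedup quoted above; adding more CPU workers barely helps, since the pipeline is GPU-bound.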

## Recommendations

**Immediate Actions:**

1. ✅ Separate GPU and CPU work in the pipeline
2. ✅ Use parallel CPU workers for BA (already done)
3. ✅ Pre-compute BA on cheap CPU instances
4. ✅ Cache aggressively (already done)

**Future Optimizations:**

1. Batch GPU operations: process multiple sequences on the GPU simultaneously
2. Pipeline overlap: start CPU BA while the GPU processes the next sequence
3. Distributed BA: run BA on multiple CPU nodes
4. GPU feature extraction: ensure SuperPoint/LightGlue actually run on the GPU

**Cost Savings:**

- Current: everything on a GPU instance = $100-200 for 100 sequences
- Optimized: GPU + CPU spot = $20-40 for 100 sequences
- Savings: 80-90% reduction in compute costs