# GPU/CPU Optimal Placement Guide

## Overview

GPUs are expensive, and not all operations can be GPU-accelerated. This guide shows how to optimally place work across GPU and CPU to maximize efficiency and minimize cost.

## Component Analysis

### GPU-Accelerated Operations

| Component | GPU Time | CPU Time | Notes |
|-----------|----------|---------|-------|
| **DA3 Inference** | 10-30 sec | N/A | βœ… Must be GPU (PyTorch model) |
| **Feature Extraction (SuperPoint)** | 1-2 min | N/A | βœ… Can be GPU (via hloc) |
| **Feature Matching (LightGlue)** | 2-5 min | N/A | βœ… Can be GPU (via hloc) |
| **Training** | Hours | N/A | βœ… Must be GPU (PyTorch) |

### CPU-Only Operations

| Component | GPU Time | CPU Time | Notes |
|-----------|----------|---------|-------|
| **COLMAP BA** | N/A | 2-8 min | ❌ CPU-only (no GPU support) |
| **Early Filtering** | N/A | <1 sec | βœ… CPU (negligible) |
| **Data Loading** | N/A | <1 sec | βœ… CPU (I/O bound) |
| **Cache Operations** | N/A | <1 sec | βœ… CPU (I/O bound) |
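
The placement rules in these two tables can be expressed as a small routing helper. This is an illustrative sketch: the stage names below mirror the tables but are assumptions, not identifiers from the real pipeline, and the CPU fallback for GPU-required stages (DA3, training) is only a degraded last resort.

```python
# Stage names mirror the tables above; they are placeholders for this sketch.
GPU_STAGES = {"da3_inference", "feature_extraction", "feature_matching", "training"}
CPU_STAGES = {"colmap_ba", "early_filtering", "data_loading", "cache_ops"}

def device_for_stage(stage, gpu_available=True):
    """Route GPU-capable stages to CUDA when a GPU exists; everything else
    (and, as a degraded fallback, GPU stages when no GPU exists) runs on CPU."""
    if stage in GPU_STAGES and gpu_available:
        return "cuda"
    if stage in GPU_STAGES or stage in CPU_STAGES:
        return "cpu"
    raise ValueError(f"unknown stage: {stage}")
```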

## Current Pipeline Flow

### Sequential (Inefficient):
```
Sequence 1:
  GPU: DA3 inference (30s) β†’ Feature extraction (2min) β†’ Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Sequence 2:
  GPU: DA3 inference (30s) β†’ Feature extraction (2min) β†’ Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Total: 24 min (GPU idle during BA, CPU idle during GPU ops)
```

### Optimized Pipeline:

```
Parallel Execution:
  GPU: DA3 inference (Sequence 1) β†’ Feature extraction β†’ Matching
  CPU: BA (Sequence 2) [from cache or previous run]

  GPU: DA3 inference (Sequence 2) β†’ Feature extraction β†’ Matching
  CPU: BA (Sequence 3) [from cache or previous run]

  GPU: Training (batched)
  CPU: BA (other sequences in parallel)
```

## Optimal Placement Strategy

### Strategy 1: Separate GPU and CPU Workflows (Recommended)

**Phase 1: Dataset Building (GPU + CPU in parallel)**

```
GPU Pipeline (one sequence at a time):
  1. DA3 inference (30s)
  2. Feature extraction (2min)
  3. Feature matching (5min)
  Total: ~7-8 min per sequence

CPU Pipeline (parallel workers):
  1. BA validation (5-8 min per sequence)
  2. Can run 4-8 BA jobs in parallel (CPU cores)

Key: GPU and CPU work on different sequences simultaneously
```

**Phase 2: Training (GPU only)**
```
GPU: Training (hours)
CPU: Idle (or can pre-process next batch)
```

### Strategy 2: Pre-Compute BA on CPU Cluster

**Use cheap CPU instances for BA:**

```
1. Run BA on CPU cluster (spot instances, cheaper)
   - 100 sequences Γ— 5 min = 8 hours
   - Cost: ~$10-20 (vs $100+ on GPU instance)

2. Run DA3 + Training on GPU instance
   - DA3 inference: 50 min
   - Training: 20-40 hours
   - Cost: GPU instance time

Total cost savings: 80-90% for BA phase
```
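
The savings estimate above can be checked with a small cost model. The per-sequence times and hourly rates are the illustrative figures from this guide, not real cloud prices.

```python
def phase_cost(n_sequences, minutes_per_seq, hourly_rate, workers=1):
    """Wall-clock hours and dollar cost for one phase, assuming perfect
    parallelism across `workers` identical machines/cores."""
    hours = n_sequences * minutes_per_seq / 60 / workers
    return hours, hours * hourly_rate

# BA for 100 sequences at 5 min each on a $0.10/hr CPU spot instance,
# versus the same CPU-bound work billed at a $1.00/hr GPU rate
cpu_hours, cpu_cost = phase_cost(100, 5, 0.10)
_, gpu_cost = phase_cost(100, 5, 1.00)
savings = 1 - cpu_cost / gpu_cost  # 90% at this rate ratio
```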

### Strategy 3: Hybrid Approach (Best for Development)

**Single GPU instance with smart scheduling:**

```python
# Pseudo-code for optimal scheduling
def process_sequences_optimally(sequences):
    gpu_queue = []
    cpu_queue = []

    for seq in sequences:
        # Check cache first
        if ba_cached(seq):
            # Only need DA3 inference (GPU)
            gpu_queue.append(seq)
        else:
            # Need full pipeline
            # Schedule GPU work first
            gpu_queue.append(seq)
            # Schedule BA on CPU (can run in parallel)
            cpu_queue.append(seq)

    # Process GPU queue (one sequence at a time; there is a single GPU)
    for seq in gpu_queue:
        da3_inference(seq)
        extract_features(seq)
        match_features(seq)

    # Process CPU queue: submit to the pool; a plain loop inside the
    # `with` block would still run sequentially on the main thread
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(run_ba, seq) for seq in cpu_queue]
        for f in futures:
            f.result()  # run_ba is CPU-only
```

## Implementation Recommendations

### 1. Separate GPU and CPU Workers

**The current implementation uses a ThreadPoolExecutor across sequences but does not separate GPU work from CPU work.**

**Recommended changes:**

```python
# In pretrain.py
class ARKitPretrainPipeline:
    def __init__(self, ..., gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def process_arkit_sequence(self, ...):
        # GPU work
        with torch.cuda.device(self.gpu_device):
            da3_output = self.model.inference(images)
            features = extract_features_gpu(images)
            matches = match_features_gpu(features)

        # CPU work (can run in parallel with other sequences)
        ba_result = self._run_ba_cpu(images, features, matches)

        # Assemble the sample from the GPU and CPU outputs
        sample = {"da3": da3_output, "ba": ba_result, "matches": matches}
        return sample
```

### 2. Pipeline Separation

**Separate dataset building into GPU and CPU phases:**

```python
# Phase 1: GPU work (sequential, one sequence at a time)
def build_dataset_gpu_phase(sequences):
    results = []
    for seq in sequences:
        if ba_cached(seq):
            # Only GPU work needed
            da3_output = model.inference(seq.images)
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'needs_ba': False
            })
        else:
            # Full GPU pipeline: DA3 inference plus features/matches for BA
            da3_output = model.inference(seq.images)
            features = extract_features(seq.images)
            matches = match_features(features)
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'features': features,
                'matches': matches,
                'needs_ba': True
            })
    return results

# Phase 2: CPU work (parallel, many sequences at once)
def build_dataset_cpu_phase(gpu_results):
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = []
        for result in gpu_results:
            if result['needs_ba']:
                future = executor.submit(
                    run_ba_cpu,
                    result['seq'],
                    result['features'],
                    result['matches']
                )
                futures.append(future)

        # Collect results
        ba_results = [f.result() for f in futures]
    return ba_results
```

### 3. Resource-Aware Scheduling

**Schedule based on resource availability:**

```python
class ResourceAwareScheduler:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_queue = Queue()
        self.cpu_queue = Queue()
        self.gpu_busy = False
        self.cpu_slots = cpu_workers

    def schedule(self, sequence):
        if ba_cached(sequence):
            # Only GPU work
            self.gpu_queue.put(('inference_only', sequence))
        else:
            # GPU work first
            self.gpu_queue.put(('full_pipeline', sequence))
            # Then CPU work
            self.cpu_queue.put(('ba', sequence))

    def process(self):
        # GPU worker (sequential)
        while not self.gpu_queue.empty():
            task_type, seq = self.gpu_queue.get()
            if task_type == 'inference_only':
                da3_inference(seq)
            else:
                features = extract_features(seq)
                matches = match_features(features)
                self.cpu_queue.put(('ba_with_data', seq, features, matches))

        # CPU workers (parallel): submit tasks to the pool and wait;
        # a bare `with ThreadPoolExecutor(...)` block runs nothing in parallel
        with ThreadPoolExecutor(max_workers=self.cpu_slots) as executor:
            futures = []
            while not self.cpu_queue.empty():
                task = self.cpu_queue.get()
                if task[0] == 'ba':
                    futures.append(executor.submit(run_ba, task[1]))
                else:
                    futures.append(executor.submit(run_ba, task[1], task[2], task[3]))
            for f in futures:
                f.result()
```

## Cost Optimization Strategies

### 1. Use Spot Instances for BA

**BA is CPU-only and can run on cheap spot instances:**

```bash
# Run BA on a spot instance (roughly 10x cheaper)
# Note: the spot max price belongs inside spot-config.json;
# `aws ec2 run-instances` has no --spot-price flag
aws ec2 run-instances \
    --instance-type c5.4xlarge \
    --instance-market-options file://spot-config.json

# Cost: ~$0.10/hour vs $1.00/hour for GPU
# 100 sequences × 5 min ≈ 8 hours = $0.80 vs $8.00
```
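
The referenced `spot-config.json` is not shown in this guide; a minimal version, assuming the standard `--instance-market-options` schema, might look like:

```json
{
  "MarketType": "spot",
  "SpotOptions": {
    "MaxPrice": "0.10",
    "SpotInstanceType": "one-time"
  }
}
```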

### 2. Pre-Compute BA Offline

**Run BA on local machine or cheap CPU cluster:**

```bash
# On a local machine or CPU cluster; --epochs 0 just builds the dataset
ylff train pretrain data/arkit_sequences \
    --epochs 0 \
    --num-workers 8 \
    --cache-dir data/pretrain_cache

# Then train on the GPU instance (expensive); reuses the cached BA results
ylff train pretrain data/arkit_sequences \
    --epochs 20 \
    --cache-dir data/pretrain_cache
```

### 3. Hybrid Cloud Strategy

**Use different instance types for different phases:**

```
Phase 1: Dataset Building
  - GPU instance (1x): DA3 inference, feature extraction/matching
  - CPU instance (8x spot): BA validation
  - Cost: GPU ($2/hr) + CPU ($0.80/hr) = $2.80/hr
  - Time: 2-3 hours
  - Total: ~$8-10

Phase 2: Training
  - GPU instance (1x): Training only
  - Cost: $2/hr Γ— 20 hours = $40
  - Total: $40

Total: $50 (vs $100+ if all on GPU)
```

## Recommended Implementation

### For Development (Single Machine):

```python
# Optimal single-machine setup
from concurrent.futures import ThreadPoolExecutor, as_completed

class OptimalPretrainPipeline:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def build_dataset(self, sequences):
        # Separate GPU and CPU work
        gpu_results = []
        cpu_tasks = []

        # Phase 1: GPU work (sequential)
        for seq in sequences:
            if self._is_ba_cached(seq):
                # Only inference needed
                result = self._run_da3_inference(seq)
                gpu_results.append(result)
            else:
                # Full GPU pipeline
                result = self._run_gpu_pipeline(seq)
                gpu_results.append(result)
                cpu_tasks.append(result)  # Needs BA

        # Phase 2: CPU work (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_workers) as executor:
            ba_futures = {
                executor.submit(self._run_ba_cpu, task): task
                for task in cpu_tasks
            }

            # Collect BA results
            for future in as_completed(ba_futures):
                ba_result = future.result()
                # Merge with GPU result
                self._merge_results(ba_futures[future], ba_result)

        return gpu_results
```

### For Production (Distributed):

```python
# Distributed setup
class DistributedPretrainPipeline:
    def __init__(self, gpu_nodes=1, cpu_nodes=8):
        self.gpu_nodes = gpu_nodes
        self.cpu_nodes = cpu_nodes

    def build_dataset(self, sequences):
        # GPU nodes: DA3 inference, features, matching
        gpu_tasks = self._distribute_gpu_work(sequences)

        # CPU nodes: BA validation (parallel)
        cpu_tasks = self._distribute_cpu_work(sequences)

        # Collect and merge
        return self._collect_results(gpu_tasks, cpu_tasks)
```

## Performance Comparison

### Current (Sequential):
```
100 sequences:
  GPU: 7 min Γ— 100 = 700 min (11.7 hours)
  CPU: 5 min Γ— 100 = 500 min (8.3 hours)
  Total: 20 hours (sequential)
  GPU utilization: 50%
  CPU utilization: 50%
```

### Optimized (Parallel):
```
100 sequences:
  GPU: 7 min Γ— 100 = 700 min (11.7 hours) [sequential]
  CPU: 5 min × 100 / 8 workers = 62.5 min (~1 hour) [parallel]
  Total: 12.7 hours (overlapped)
  GPU utilization: 90%
  CPU utilization: 90%
  Speedup: 1.6x
```

### With Caching:
```
100 sequences (50 cached):
  GPU: 7 min Γ— 50 = 350 min (5.8 hours)
  CPU: 5 min Γ— 50 / 8 workers = 31 min [parallel]
  Total: 6 hours
  Speedup: 3.3x vs sequential
```
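
The totals above follow a deliberately conservative model: GPU sequences run back-to-back, then the 8-worker CPU pool drains the BA queue. As a sketch:

```python
def pipeline_hours(n_seq, gpu_min, cpu_min, cpu_workers):
    """Sequential vs overlapped wall-clock hours, under the conservative
    model above (all GPU work first, then the CPU pool drains BA)."""
    sequential = n_seq * (gpu_min + cpu_min) / 60
    overlapped = n_seq * gpu_min / 60 + n_seq * cpu_min / 60 / cpu_workers
    return sequential, overlapped

seq_h, over_h = pipeline_hours(100, 7, 5, 8)  # the 100-sequence figures above
speedup = seq_h / over_h                      # ~1.6x
```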

## Recommendations

### Immediate Actions:

1. βœ… **Separate GPU and CPU work** in pipeline
2. βœ… **Use parallel CPU workers** for BA (already done)
3. βœ… **Pre-compute BA** on cheap CPU instances
4. βœ… **Cache aggressively** (already done)

### Future Optimizations:

1. **Batch GPU operations**: Process multiple sequences on GPU simultaneously
2. **Pipeline overlap**: Start CPU BA while GPU processes next sequence
3. **Distributed BA**: Run BA on multiple CPU nodes
4. **GPU feature extraction**: Ensure SuperPoint/LightGlue use GPU
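
The pipeline-overlap idea (optimization 2) can be sketched with a single CPU worker pool fed from the GPU loop. `run_gpu_stages` and `run_ba` are placeholders for the real DA3/extraction/matching and BA calls:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder stage functions standing in for the real pipeline steps.
def run_gpu_stages(seq):
    return f"features-{seq}"

def run_ba(seq, features):
    return f"ba-{seq}"

def overlapped_pipeline(sequences, cpu_workers=8):
    """Submit each sequence's BA to the CPU pool as soon as its GPU stages
    finish, so BA for sequence i overlaps GPU work on sequence i+1."""
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        futures = []
        for seq in sequences:  # GPU stays busy, one sequence at a time
            features = run_gpu_stages(seq)
            futures.append(pool.submit(run_ba, seq, features))
        return [f.result() for f in futures]
```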

### Cost Savings:

- **Current**: All on GPU instance = $100-200 for 100 sequences
- **Optimized**: GPU + CPU spot = $20-40 for 100 sequences
- **Savings**: 80-90% reduction in compute costs