# GPU/CPU Optimal Placement Guide
## Overview
GPUs are expensive, and not all operations can be GPU-accelerated. This guide shows how to optimally place work across GPU and CPU to maximize efficiency and minimize cost.
## Component Analysis
### GPU-Accelerated Operations
| Component | GPU Time | CPU Time | Notes |
|-----------|----------|----------|-------|
| **DA3 Inference** | 10-30 sec | N/A | ✅ Must be GPU (PyTorch model) |
| **Feature Extraction (SuperPoint)** | 1-2 min | N/A | ✅ Can be GPU (via hloc) |
| **Feature Matching (LightGlue)** | 2-5 min | N/A | ✅ Can be GPU (via hloc) |
| **Training** | Hours | N/A | ✅ Must be GPU (PyTorch) |
### CPU-Only Operations
| Component | GPU Time | CPU Time | Notes |
|-----------|----------|----------|-------|
| **COLMAP BA** | N/A | 2-8 min | ❌ CPU-only (no GPU support) |
| **Early Filtering** | N/A | <1 sec | ✅ CPU (negligible) |
| **Data Loading** | N/A | <1 sec | ✅ CPU (I/O bound) |
| **Cache Operations** | N/A | <1 sec | ✅ CPU (I/O bound) |
## Current Pipeline Flow
### Sequential (Inefficient):
```
Sequence 1:
  GPU: DA3 inference (30s) → Feature extraction (2min) → Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Sequence 2:
  GPU: DA3 inference (30s) → Feature extraction (2min) → Matching (5min)
  CPU: BA (5min) [waits for GPU]
  Total: ~12 min

Total: 24 min (GPU idle during BA, CPU idle during GPU ops)
```
### Optimized Pipeline:
```
Parallel Execution:
  GPU: DA3 inference (Sequence 1) → Feature extraction → Matching
  CPU: BA (Sequence 2) [from cache or previous run]
  GPU: DA3 inference (Sequence 2) → Feature extraction → Matching
  CPU: BA (Sequence 3) [from cache or previous run]
  GPU: Training (batched)
  CPU: BA (other sequences in parallel)
```
## Optimal Placement Strategy
### Strategy 1: Separate GPU and CPU Workflows (Recommended)
**Phase 1: Dataset Building (GPU + CPU in parallel)**
```
GPU Pipeline (one sequence at a time):
  1. DA3 inference (30s)
  2. Feature extraction (2min)
  3. Feature matching (5min)
  Total: ~7-8 min per sequence

CPU Pipeline (parallel workers):
  1. BA validation (5-8 min per sequence)
  2. Can run 4-8 BA jobs in parallel (CPU cores)

Key: GPU and CPU work on different sequences simultaneously
```
**Phase 2: Training (GPU only)**
```
GPU: Training (hours)
CPU: Idle (or can pre-process next batch)
```
### Strategy 2: Pre-Compute BA on CPU Cluster
**Use cheap CPU instances for BA:**
```
1. Run BA on CPU cluster (spot instances, cheaper)
   - 100 sequences × 5 min ≈ 8 hours
   - Cost: ~$10-20 (vs $100+ on GPU instance)
2. Run DA3 + Training on GPU instance
   - DA3 inference: 50 min
   - Training: 20-40 hours
   - Cost: GPU instance time

Total cost savings: 80-90% for BA phase
```
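The savings claimed above can be sanity-checked with a quick calculation. This is a sketch with assumed hourly rates ($0.10/hr CPU spot vs $1.00/hr on-demand GPU, as in the numbers above); `ba_cost` is a hypothetical helper, not part of the pipeline.

```python
# Sketch: rough cost comparison for the BA phase on CPU spot
# instances vs. keeping it on the GPU instance. Rates are
# illustrative assumptions, not real quotes.

def ba_cost(num_sequences, minutes_per_seq, hourly_rate, parallel_workers=1):
    """Return (wall-clock hours, dollar cost) for the BA phase."""
    total_minutes = num_sequences * minutes_per_seq / parallel_workers
    hours = total_minutes / 60
    return hours, hours * hourly_rate

# Assumed rates: $0.10/hr CPU spot vs $1.00/hr GPU instance
cpu_hours, cpu_cost = ba_cost(100, 5, 0.10)
gpu_hours, gpu_cost = ba_cost(100, 5, 1.00)
savings = 1 - cpu_cost / gpu_cost
print(f"CPU spot: {cpu_hours:.1f} h, ${cpu_cost:.2f}")
print(f"GPU:      {gpu_hours:.1f} h, ${gpu_cost:.2f}")
print(f"Savings:  {savings:.0%}")  # 90%
```

With these assumed rates the 100-sequence BA phase is ~8.3 hours either way, but the cost drops from ~$8.30 to ~$0.83, matching the 80-90% savings figure.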
### Strategy 3: Hybrid Approach (Best for Development)
**Single GPU instance with smart scheduling:**
```python
# Pseudo-code for optimal scheduling
from concurrent.futures import ThreadPoolExecutor

def process_sequences_optimally(sequences):
    gpu_queue = []
    cpu_queue = []
    for seq in sequences:
        # Check cache first
        if ba_cached(seq):
            # Only need DA3 inference (GPU)
            gpu_queue.append(seq)
        else:
            # Need full pipeline: schedule GPU work first,
            # then BA on CPU (can run in parallel)
            gpu_queue.append(seq)
            cpu_queue.append(seq)

    # Process GPU queue (one sequence at a time, on the GPU device)
    for seq in gpu_queue:
        da3_inference(seq)
        extract_features(seq)
        match_features(seq)

    # Process CPU queue (parallel workers)
    with ThreadPoolExecutor(max_workers=8) as executor:
        for seq in cpu_queue:
            executor.submit(run_ba, seq)  # CPU-only
```
## Implementation Recommendations
### 1. Separate GPU and CPU Workers
**The current implementation uses a ThreadPoolExecutor across sequences but doesn't separate GPU and CPU work.**
**Recommended changes:**
```python
# In pretrain.py
class ARKitPretrainPipeline:
    def __init__(self, ..., gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def process_arkit_sequence(self, ...):
        # GPU work
        with torch.cuda.device(self.gpu_device):
            da3_output = self.model.inference(images)
            features = extract_features_gpu(images)
            matches = match_features_gpu(features)
        # CPU work (can run in parallel with other sequences)
        ba_result = self._run_ba_cpu(images, features, matches)
        return sample
```
### 2. Pipeline Separation
**Separate dataset building into GPU and CPU phases:**
```python
from concurrent.futures import ThreadPoolExecutor

# Phase 1: GPU work (sequential, one sequence at a time)
def build_dataset_gpu_phase(sequences):
    results = []
    for seq in sequences:
        if ba_cached(seq):
            # Only GPU work needed
            da3_output = model.inference(seq.images)
            results.append({
                'seq': seq,
                'da3_output': da3_output,
                'needs_ba': False,
            })
        else:
            # Full GPU pipeline
            features = extract_features(seq.images)
            matches = match_features(features)
            results.append({
                'seq': seq,
                'features': features,
                'matches': matches,
                'needs_ba': True,
            })
    return results

# Phase 2: CPU work (parallel, many sequences at once)
def build_dataset_cpu_phase(gpu_results):
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = []
        for result in gpu_results:
            if result['needs_ba']:
                futures.append(executor.submit(
                    run_ba_cpu,
                    result['seq'],
                    result['features'],
                    result['matches'],
                ))
        # Collect results
        return [f.result() for f in futures]
```
### 3. Resource-Aware Scheduling
**Schedule based on resource availability:**
```python
from queue import Queue
from concurrent.futures import ThreadPoolExecutor

class ResourceAwareScheduler:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_queue = Queue()
        self.cpu_queue = Queue()
        self.cpu_slots = cpu_workers

    def schedule(self, sequence):
        if ba_cached(sequence):
            # Only GPU work (inference); BA comes from cache
            self.gpu_queue.put(('inference_only', sequence))
        else:
            # GPU work first; process() enqueues the BA task once
            # features and matches are available
            self.gpu_queue.put(('full_pipeline', sequence))

    def process(self):
        # GPU worker (sequential)
        while not self.gpu_queue.empty():
            task_type, seq = self.gpu_queue.get()
            if task_type == 'inference_only':
                da3_inference(seq)
            else:
                features = extract_features(seq)
                matches = match_features(features)
                self.cpu_queue.put((seq, features, matches))

        # CPU workers (parallel): submit each BA job to the pool
        with ThreadPoolExecutor(max_workers=self.cpu_slots) as executor:
            while not self.cpu_queue.empty():
                seq, features, matches = self.cpu_queue.get()
                executor.submit(run_ba, seq, features, matches)
```
## Cost Optimization Strategies
### 1. Use Spot Instances for BA
**BA is CPU-only and can run on cheap spot instances:**
```bash
# Run BA on a spot instance (~10x cheaper); the spot request,
# including the max price, is specified in spot-config.json
aws ec2 run-instances \
    --instance-type c5.4xlarge \
    --instance-market-options file://spot-config.json

# Cost: ~$0.10/hour vs $1.00/hour for GPU
# 100 sequences × 5 min ≈ 8 hours ≈ $0.80 vs $8.00
```
### 2. Pre-Compute BA Offline
**Run BA on local machine or cheap CPU cluster:**
```bash
# On local machine or CPU cluster (--epochs 0 just builds the dataset)
ylff train pretrain data/arkit_sequences \
    --epochs 0 \
    --num-workers 8 \
    --cache-dir data/pretrain_cache

# Then train on GPU instance (expensive); uses the cached BA results
ylff train pretrain data/arkit_sequences \
    --epochs 20 \
    --cache-dir data/pretrain_cache
```
### 3. Hybrid Cloud Strategy
**Use different instance types for different phases:**
```
Phase 1: Dataset Building
  - GPU instance (1x): DA3 inference, feature extraction/matching
  - CPU instance (8x spot): BA validation
  - Cost: GPU ($2/hr) + CPU ($0.80/hr) = $2.80/hr
  - Time: 2-3 hours
  - Total: ~$8-10

Phase 2: Training
  - GPU instance (1x): Training only
  - Cost: $2/hr × 20 hours = $40

Total: ~$50 (vs $100+ if all on GPU)
```
## Recommended Implementation
### For Development (Single Machine):
```python
# Optimal single-machine setup
from concurrent.futures import ThreadPoolExecutor, as_completed

class OptimalPretrainPipeline:
    def __init__(self, gpu_device="cuda", cpu_workers=8):
        self.gpu_device = gpu_device
        self.cpu_workers = cpu_workers

    def build_dataset(self, sequences):
        # Separate GPU and CPU work
        gpu_results = []
        cpu_tasks = []

        # Phase 1: GPU work (sequential)
        for seq in sequences:
            if self._is_ba_cached(seq):
                # Only inference needed
                result = self._run_da3_inference(seq)
                gpu_results.append(result)
            else:
                # Full GPU pipeline
                result = self._run_gpu_pipeline(seq)
                gpu_results.append(result)
                cpu_tasks.append(result)  # Needs BA

        # Phase 2: CPU work (parallel)
        with ThreadPoolExecutor(max_workers=self.cpu_workers) as executor:
            ba_futures = {
                executor.submit(self._run_ba_cpu, task): task
                for task in cpu_tasks
            }
            # Collect BA results and merge with the GPU results
            for future in as_completed(ba_futures):
                ba_result = future.result()
                self._merge_results(ba_futures[future], ba_result)

        return gpu_results
```
### For Production (Distributed):
```python
# Distributed setup
class DistributedPretrainPipeline:
    def __init__(self, gpu_nodes=1, cpu_nodes=8):
        self.gpu_nodes = gpu_nodes
        self.cpu_nodes = cpu_nodes

    def build_dataset(self, sequences):
        # GPU nodes: DA3 inference, features, matching
        gpu_tasks = self._distribute_gpu_work(sequences)
        # CPU nodes: BA validation (parallel)
        cpu_tasks = self._distribute_cpu_work(sequences)
        # Collect and merge
        return self._collect_results(gpu_tasks, cpu_tasks)
```
## Performance Comparison
### Current (Sequential):
```
100 sequences:
  GPU: 7 min × 100 = 700 min (11.7 hours)
  CPU: 5 min × 100 = 500 min (8.3 hours)
  Total: 20 hours (sequential)
  GPU utilization: ~50%
  CPU utilization: ~50%
```
### Optimized (Parallel):
```
100 sequences:
  GPU: 7 min × 100 = 700 min (11.7 hours) [sequential]
  CPU: 5 min × 100 / 8 workers = 62.5 min (~1 hour) [parallel]
  Total: 12.7 hours (overlapped)
  GPU utilization: ~90%
  CPU utilization: ~90%
  Speedup: 1.6x
```
### With Caching:
```
100 sequences (50 cached):
  GPU: 7 min × 50 = 350 min (5.8 hours)
  CPU: 5 min × 50 / 8 workers = 31 min [parallel]
  Total: ~6 hours
  Speedup: 3.3x vs sequential
```
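The figures above follow from a simple timing model: GPU work is strictly sequential, BA fans out over the CPU workers, and in the overlapped case only the parallel BA tail extends past the GPU phase. A sketch of that model (`pipeline_hours` is a hypothetical helper; the per-sequence times are this document's estimates):

```python
# Sketch of the timing model behind the comparison above.
# Assumptions: GPU work is sequential; BA is divided evenly
# across cpu_workers; when overlapped, only the final BA batch
# runs after the last GPU job.

def pipeline_hours(n_seq, gpu_min=7, ba_min=5, cpu_workers=8, overlap=True):
    gpu_total = n_seq * gpu_min / 60                # hours of sequential GPU work
    cpu_parallel = n_seq * ba_min / cpu_workers / 60  # hours of parallel BA
    if overlap:
        return gpu_total + cpu_parallel             # BA tail after last GPU job
    return gpu_total + n_seq * ba_min / 60          # fully sequential

print(f"sequential: {pipeline_hours(100, cpu_workers=1, overlap=False):.1f} h")  # 20.0 h
print(f"overlapped: {pipeline_hours(100):.1f} h")                                # 12.7 h
```

This reproduces the 20-hour sequential and 12.7-hour overlapped totals, i.e. the ~1.6x speedup quoted above.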
## Recommendations
### Immediate Actions:
1. ✅ **Separate GPU and CPU work** in pipeline
2. ✅ **Use parallel CPU workers** for BA (already done)
3. ✅ **Pre-compute BA** on cheap CPU instances
4. ✅ **Cache aggressively** (already done)
### Future Optimizations:
1. **Batch GPU operations**: Process multiple sequences on GPU simultaneously
2. **Pipeline overlap**: Start CPU BA while GPU processes next sequence
3. **Distributed BA**: Run BA on multiple CPU nodes
4. **GPU feature extraction**: Ensure SuperPoint/LightGlue use GPU
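Pipeline overlap (item 2) can be sketched with a producer/consumer queue: a GPU thread pushes finished feature/matching results while a CPU pool starts BA immediately, so BA for sequence N runs while the GPU works on sequence N+1. This is a minimal sketch; `gpu_pipeline` and `run_ba` are string-returning placeholders standing in for the real pipeline steps, not actual APIs from this codebase.

```python
# Sketch: overlap CPU BA with GPU work via a hand-off queue.
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def gpu_pipeline(seq):           # placeholder: DA3 + features + matching
    return f"features-{seq}"

def run_ba(seq, features):       # placeholder: CPU-only bundle adjustment
    return f"ba-{seq}"

def process(sequences, cpu_workers=8):
    done = queue.Queue()

    def gpu_worker():
        for seq in sequences:
            done.put((seq, gpu_pipeline(seq)))  # BA can start right away
        done.put(None)                          # sentinel: GPU phase over

    threading.Thread(target=gpu_worker, daemon=True).start()
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        futures = []
        while (item := done.get()) is not None:
            seq, feats = item
            futures.append(pool.submit(run_ba, seq, feats))
        return [f.result() for f in futures]

print(process(range(3)))  # ['ba-0', 'ba-1', 'ba-2']
```

The same structure extends naturally to the distributed case by replacing the in-process queue with a message broker or task queue.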
### Cost Savings:
- **Current**: All on GPU instance = $100-200 for 100 sequences
- **Optimized**: GPU + CPU spot = $20-40 for 100 sequences
- **Savings**: 80-90% reduction in compute costs