# BA Pipeline Optimization Guide

## Current Bottlenecks Analysis

### 1. Feature Extraction (SuperPoint)
- **Current**: `num_workers=1` (sequential)
- **Bottleneck**: I/O and GPU utilization
- **Impact**: ~2-5 seconds for 20 images; ~10-25 seconds for 100 images

### 2. Feature Matching (LightGlue)
- **Current**: Sequential pair processing (`batch_size=1`)
- **Bottleneck**: GPU underutilization, sequential loop
- **Impact**: ~30-60 seconds for 190 pairs (20 images); ~15-30 minutes for 4950 pairs (100 images)

### 3. COLMAP Reconstruction
- **Current**: Sequential incremental SfM
- **Bottleneck**: Sequential nature, many failed initializations (see log)
- **Impact**: Variable, but can be slow for large sequences

### 4. Bundle Adjustment
- **Current**: CPU-based Levenberg-Marquardt
- **Bottleneck**: Sequential optimization, no GPU acceleration
- **Impact**: Usually fast (<1 s for small reconstructions), but scales poorly
---

## Optimization Strategies

### Level 1: Quick Wins (Easy, High Impact)

#### 1.1 Parallelize Feature Extraction

```python
# In ylff/ba_validator.py
def _extract_features(self, image_paths: List[str]) -> Path:
    # hloc uses num_workers=1 by default, and that isn't exposed
    # through its public API, so there are two options:
    #
    # Option A: split the image list into chunks and extract each
    # chunk in its own thread (I/O overlaps; GPU work serializes)
    from concurrent.futures import ThreadPoolExecutor

    def extract_chunk(chunk: List[str]) -> None:
        # Would require calling SuperPoint directly, or invoking
        # hloc's extraction on the per-chunk image list
        ...

    # Option B: use hloc's batch processing, if the installed
    # version supports batch_size > 1
```
**Expected Speedup**: 3-5x for feature extraction

#### 1.2 Increase Match Workers

```python
# hloc.match_features uses num_workers=5 by default
# We can't change this without modifying the hloc source,
# but we can write a wrapper that processes pairs in batches
```

**Expected Speedup**: 2-3x for matching (I/O bound)
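The wrapper can be as simple as a thread pool over a per-pair matching callable; a sketch, where `match_pairs_parallel` and the `match_pair` callable are illustrative names, not existing hloc API:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

def match_pairs_parallel(
    pairs: Sequence[Tuple[str, str]],
    match_pair: Callable[[Tuple[str, str]], dict],
    max_workers: int = 5,
) -> List[dict]:
    """Run a per-pair matching function across a thread pool.

    Threads overlap the I/O (reading cached features from disk);
    any GPU work inside `match_pair` still serializes on the device.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `pairs`
        return list(pool.map(match_pair, pairs))
```

Because the per-pair work is I/O-bound, threads (not processes) are enough here and avoid re-loading the model per worker.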
#### 1.3 Smart Pair Selection (Reduce Pairs)

Instead of exhaustive matching (N\*(N-1)/2 pairs), use:
- **Sequential pairs**: Only match consecutive frames (N-1 pairs)
- **Sparse matching**: Match every K-th frame (N/K pairs)
- **Spatial selection**: Use DA3 poses to select nearby frames

```python
def _generate_smart_pairs(
    self,
    image_paths: List[str],
    poses: np.ndarray,
    max_baseline: float = 0.3,   # Max translation distance
    min_baseline: float = 0.05,  # Min translation distance
) -> List[Tuple[str, str]]:
    """Generate pairs based on spatial proximity."""
    pairs = []
    for i in range(len(image_paths)):
        for j in range(i + 1, len(image_paths)):
            # Baseline = distance between camera centers
            t_i = poses[i][:3, 3]
            t_j = poses[j][:3, 3]
            baseline = np.linalg.norm(t_i - t_j)
            if min_baseline <= baseline <= max_baseline:
                pairs.append((image_paths[i], image_paths[j]))
    return pairs
```

**Expected Speedup**: 5-10x reduction in pairs (e.g., 190 → 20-40 pairs)
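The sequential strategy needs no poses at all; a minimal sketch (the helper name is hypothetical, and `window=1` gives the N-1 consecutive pairs mentioned above):

```python
from typing import List, Tuple

def generate_sequential_pairs(
    image_paths: List[str], window: int = 1
) -> List[Tuple[str, str]]:
    """Pair each frame with its next `window` frames.

    window=1 yields the N-1 consecutive pairs; larger windows add
    short-range loop closures at linear cost in `window`.
    """
    pairs = []
    n = len(image_paths)
    for i in range(n):
        for j in range(i + 1, min(i + 1 + window, n)):
            pairs.append((image_paths[i], image_paths[j]))
    return pairs
```

Sparse matching falls out of the same helper by subsampling first, e.g. `generate_sequential_pairs(image_paths[::K])`.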
---

### Level 2: Moderate Effort (Medium Impact)

#### 2.1 Batch Pair Matching

LightGlue can process multiple pairs in a single batch:

```python
import torch

class BatchedPairMatcher:
    def __init__(self, model, device, batch_size=4):
        self.model = model
        self.device = device
        self.batch_size = batch_size

    def match_batch(self, pairs_data):
        """Match multiple pairs in a single forward pass."""
        # Stack features (requires equal keypoint counts per image;
        # pad to a common length otherwise)
        features1 = torch.stack([p['feat1'] for p in pairs_data])
        features2 = torch.stack([p['feat2'] for p in pairs_data])
        # One batched forward pass instead of batch_size sequential ones
        matches = self.model({
            'image0': features1,
            'image1': features2,
        })
        return matches
```

**Expected Speedup**: 2-4x for matching (GPU utilization)
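`torch.stack` only works when every image contributes the same number of keypoints, which SuperPoint does not guarantee. One way to handle that is padding to a common length with a validity mask; a sketch (not LightGlue's actual batching API):

```python
from typing import List, Optional, Tuple
import torch

def pad_keypoints(
    feats: List[torch.Tensor], pad_to: Optional[int] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Pad per-image (num_kpts_i, dim) tensors to a common length.

    Returns a (batch, max_kpts, dim) tensor plus a boolean mask that
    marks which rows are real keypoints (True) vs padding (False).
    """
    if pad_to is None:
        pad_to = max(f.shape[0] for f in feats)
    dim = feats[0].shape[1]
    batch = torch.zeros(len(feats), pad_to, dim, dtype=feats[0].dtype)
    mask = torch.zeros(len(feats), pad_to, dtype=torch.bool)
    for i, f in enumerate(feats):
        n = f.shape[0]
        batch[i, :n] = f
        mask[i, :n] = True
    return batch, mask
```

The mask must then be passed to (or respected by) the matcher so padded rows never produce matches.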
#### 2.2 COLMAP Initialization from DA3 Poses

Instead of letting COLMAP find initial pairs, initialize from DA3:

```python
def _initialize_from_poses(
    self,
    reconstruction: pycolmap.Reconstruction,
    initial_poses: np.ndarray,
    image_paths: List[str],
):
    """Initialize COLMAP reconstruction with DA3 poses."""
    # COLMAP stores camera-from-world (w2c) poses, so invert the
    # DA3 poses if they are camera-to-world (c2w)
    for img_path, pose in zip(image_paths, initial_poses):
        w2c = np.linalg.inv(pose)  # assumes `pose` is c2w
        image = pycolmap.Image()
        image.name = Path(img_path).name
        # The pose setter varies across pycolmap versions; recent
        # releases use image.cam_from_world with a Rigid3d --
        # check the constructor signature of the installed version
        image.cam_from_world = pycolmap.Rigid3d(
            pycolmap.Rotation3d(w2c[:3, :3]), w2c[:3, 3]
        )
        reconstruction.add_image(image)
    # Triangulate initial points from matches, then run BA
```

**Expected Speedup**: Eliminates failed initialization attempts
#### 2.3 Feature Caching

Cache extracted features to avoid re-extraction:

```python
import hashlib
import pickle

def _get_feature_cache_key(self, image_path: str) -> str:
    """Generate cache key from a hash of the image contents."""
    with open(image_path, 'rb') as f:
        img_hash = hashlib.md5(f.read()).hexdigest()
    return f"features_{img_hash}"

def _extract_features_cached(self, image_paths: List[str]) -> Dict[str, Any]:
    """Extract features with caching."""
    cache_dir = self.work_dir / "feature_cache"
    cache_dir.mkdir(exist_ok=True)
    features = {}
    uncached_paths = []
    for img_path in image_paths:
        cache_key = self._get_feature_cache_key(img_path)
        cache_file = cache_dir / f"{cache_key}.pkl"
        if cache_file.exists():
            with open(cache_file, 'rb') as f:
                features[img_path] = pickle.load(f)
        else:
            uncached_paths.append(img_path)
    # Extract the misses, cache them, and merge into the result
    # (assumes _extract_features yields one feature dict per path)
    if uncached_paths:
        new_features = self._extract_features(uncached_paths)
        for img_path, feat in zip(uncached_paths, new_features):
            cache_key = self._get_feature_cache_key(img_path)
            cache_file = cache_dir / f"{cache_key}.pkl"
            with open(cache_file, 'wb') as f:
                pickle.dump(feat, f)
            features[img_path] = feat
    return features
```

**Expected Speedup**: 10-100x for repeated sequences
---

### Level 3: Advanced (High Impact, More Complex)

#### 3.1 GPU-Accelerated Bundle Adjustment

Use GPU-accelerated BA libraries:

**Option A: g2o (GPU)**
```python
# g2o has GPU support via CUDA
# Requires building g2o with CUDA
```

**Option B: Ceres Solver (GPU)**
```python
# Ceres has CUDA support for its dense linear solvers
# Requires CUDA and a custom build
```

**Option C: Theseus (PyTorch-based, GPU-native)**
```python
# Sketch only: Theseus builds an Objective from cost functions and
# wraps a nonlinear optimizer in a differentiable layer
import theseus as th

objective = th.Objective()
# ... add reprojection-error cost functions (e.g. via
# th.AutoDiffCostFunction) over pose and point variables ...
optimizer = th.LevenbergMarquardt(objective, max_iterations=20)
layer = th.TheseusLayer(optimizer)
```

**Expected Speedup**: 10-100x for BA (depending on problem size)
#### 3.2 Distributed Matching

Process pairs across multiple GPUs:

```python
def match_distributed(pairs, num_gpus=4):
    """Distribute pair matching across GPUs (sketch).

    A real implementation would spawn one worker per GPU, e.g. via
    torch.multiprocessing, rather than loop sequentially.
    """
    results = []
    for gpu_id in range(num_gpus):
        # Round-robin split so no pairs are dropped when len(pairs)
        # is not divisible by num_gpus
        gpu_pairs = pairs[gpu_id::num_gpus]
        # process_on_gpu is a placeholder worker bound to GPU gpu_id
        results.extend(process_on_gpu(gpu_pairs, gpu_id))
    return results
```

**Expected Speedup**: Near-linear scaling with the number of GPUs
#### 3.3 Incremental BA

Instead of full BA, use incremental updates:

```python
def incremental_ba(
    self,
    reconstruction: pycolmap.Reconstruction,
    new_images: List[str],
    new_poses: np.ndarray,
):
    """Add new images incrementally and run local BA."""
    # 1. Register the new images
    # 2. Run local BA (optimize only the new images + their neighbors)
    # 3. Run full BA only periodically
```

**Expected Speedup**: 5-10x for large sequences
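The "neighbors" for local BA are usually chosen by covisibility. A sketch of that selection, assuming a precomputed `covisibility` map from image name to `{other_image: shared_point_count}` (both the helper and the map are illustrative, not existing code):

```python
from typing import Dict, List, Set

def select_local_ba_images(
    covisibility: Dict[str, Dict[str, int]],
    new_images: List[str],
    min_shared: int = 50,
) -> Set[str]:
    """Pick images to optimize in a local BA: the new images plus any
    image sharing at least `min_shared` 3D points with one of them.
    All other poses stay fixed during the local solve."""
    local = set(new_images)
    for img in new_images:
        for other, shared in covisibility.get(img, {}).items():
            if shared >= min_shared:
                local.add(other)
    return local
```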
---

### Level 4: Research-Level (Maximum Impact)

#### 4.1 Learned Feature Matching

Explore alternative learned matchers and matching strategies:
- **LoFTR**: Detector-free, attention-based matching
- **QuadTree Attention**: More efficient attention mechanism
- **Sparse matching**: Only match high-confidence features
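The sparse-matching idea can be applied in front of any matcher: drop low-confidence detections so the (roughly quadratic) matching cost shrinks. A sketch with hypothetical threshold values:

```python
import numpy as np

def filter_by_score(
    keypoints: np.ndarray,
    descriptors: np.ndarray,
    scores: np.ndarray,
    min_score: float = 0.2,
    max_keypoints: int = 1024,
):
    """Keep only the highest-confidence detections before matching."""
    # Hard threshold on detector confidence
    keep = scores >= min_score
    keypoints, descriptors, scores = (
        keypoints[keep], descriptors[keep], scores[keep]
    )
    # Cap at the top-k scoring keypoints
    if len(scores) > max_keypoints:
        top = np.argsort(scores)[::-1][:max_keypoints]
        keypoints, descriptors, scores = (
            keypoints[top], descriptors[top], scores[top]
        )
    return keypoints, descriptors, scores
```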
#### 4.2 Differentiable BA

Train end-to-end with differentiable BA:

```python
import torch.nn as nn
from theseus import TheseusLayer

class DifferentiableBA(nn.Module):
    def __init__(self):
        super().__init__()
        self.ba_layer = TheseusLayer(...)

    def forward(self, features, initial_poses):
        # Gradients flow through the BA solve
        refined_poses = self.ba_layer(features, initial_poses)
        return refined_poses
```

**Benefit**: Can be integrated into a training loop
#### 4.3 Neural BA

Replace traditional BA with a learned optimizer:

```python
import torch.nn as nn

class NeuralBA(nn.Module):
    """Neural network that learns to optimize BA."""
    def __init__(self):
        super().__init__()
        self.optimizer_net = nn.Transformer(...)

    def forward(self, reprojection_errors, poses):
        # Predict pose updates from the current residuals
        pose_deltas = self.optimizer_net(reprojection_errors, poses)
        return poses + pose_deltas
```
---

## Implementation Priority

### Phase 1: Quick Wins (1-2 days)
1. ✅ Smart pair selection (reduce pairs by 5-10x)
2. ✅ Feature caching
3. ✅ COLMAP initialization from DA3 poses

**Expected Overall Speedup**: 5-10x

### Phase 2: Moderate (1 week)
1. Batch pair matching
2. Parallel feature extraction wrapper
3. Incremental BA

**Expected Overall Speedup**: 10-20x

### Phase 3: Advanced (2-4 weeks)
1. GPU-accelerated BA (Theseus)
2. Distributed matching
3. Learned optimizations

**Expected Overall Speedup**: 20-100x
| ## Memory Optimization | |
| ### Current Memory Usage | |
| - Features: ~1-5 MB per image (SuperPoint) | |
| - Matches: ~0.1-1 MB per pair (LightGlue) | |
| - COLMAP database: ~10-50 MB for 100 images | |
| ### Optimization Strategies | |
| 1. **Streaming Processing**: Process pairs in batches, don't load all at once | |
| 2. **Feature Compression**: Use half-precision (float16) for features | |
| 3. **Match Filtering**: Only store high-quality matches | |
| 4. **Garbage Collection**: Explicitly free memory after each stage | |
| ```python | |
| import gc | |
| import torch | |
| def process_with_memory_management(self, images): | |
| # Process features | |
| features = self._extract_features(images) | |
| del images # Free memory | |
| gc.collect() | |
| torch.cuda.empty_cache() if torch.cuda.is_available() else None | |
| # Process matches | |
| matches = self._match_features(features) | |
| del features | |
| gc.collect() | |
| return matches | |
| ``` | |
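Strategy 2 is nearly free to implement: SuperPoint descriptors are unit-norm values in a small range, so the ~1e-3 quantization error of float16 is negligible for matching. A minimal sketch:

```python
import numpy as np

def compress_descriptors(desc: np.ndarray) -> np.ndarray:
    """Store descriptors as float16 to halve memory.

    Cast back to float32 just before feeding them to the matcher.
    """
    return desc.astype(np.float16)
```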
---

## Benchmarking

Create a benchmark script to measure improvements:

```python
import time
from ylff.ba_validator import BAValidator

def benchmark_ba_pipeline(images, poses, intrinsics):
    validator = BAValidator()
    times = {}
    # Feature extraction
    start = time.perf_counter()
    features = validator._extract_features(images)
    times['features'] = time.perf_counter() - start
    # Matching
    start = time.perf_counter()
    matches = validator._match_features(images, features)
    times['matching'] = time.perf_counter() - start
    # BA
    start = time.perf_counter()
    result = validator._run_colmap_ba(images, features, matches, poses, intrinsics)
    times['ba'] = time.perf_counter() - start
    return times, result
```
---

## Recommended Implementation Order

1. **Smart pair selection** (highest ROI, easiest)
2. **Feature caching** (high ROI, easy)
3. **COLMAP initialization** (medium ROI, medium effort)
4. **Batch matching** (medium ROI, medium effort)
5. **GPU BA** (high ROI, high effort)

---

## Expected Performance

### Current (20 images, 190 pairs)
- Feature extraction: ~5s
- Matching: ~60s
- BA: ~5s
- **Total: ~70s**

### After Phase 1 (Smart pairs + caching)
- Feature extraction: ~5s (first run), ~0.1s (cached)
- Matching: ~6s (20 pairs instead of 190)
- BA: ~2s (better initialization)
- **Total: ~8s (~9x speedup)**

### After Phase 2 (Batching + incremental)
- Feature extraction: ~2s
- Matching: ~3s (batched)
- BA: ~1s (incremental)
- **Total: ~6s (~12x speedup)**

### After Phase 3 (GPU BA)
- Feature extraction: ~2s
- Matching: ~3s
- BA: ~0.1s (GPU)
- **Total: ~5s (~14x speedup)**

---

## Next Steps

1. Implement smart pair selection
2. Add feature caching
3. Improve COLMAP initialization
4. Benchmark and iterate