# BA Pipeline Optimization Guide

## Current Bottlenecks Analysis

### 1. Feature Extraction (SuperPoint)
- **Current**: `num_workers=1` (sequential)
- **Bottleneck**: I/O and GPU utilization
- **Impact**: For 20 images, ~2-5 seconds; for 100 images, ~10-25 seconds

### 2. Feature Matching (LightGlue)
- **Current**: Sequential pair processing (`batch_size=1`)
- **Bottleneck**: GPU underutilization, sequential loop
- **Impact**: For 190 pairs (20 images), ~30-60 seconds; for 4950 pairs (100 images), ~15-30 minutes

### 3. COLMAP Reconstruction
- **Current**: Sequential incremental SfM
- **Bottleneck**: Sequential nature, many failed initializations (see log)
- **Impact**: Variable, but can be slow for large sequences

### 4. Bundle Adjustment
- **Current**: CPU-based Levenberg-Marquardt
- **Bottleneck**: Sequential optimization, no GPU acceleration
- **Impact**: Usually fast (<1 s for small reconstructions), but scales poorly

---

## Optimization Strategies

### Level 1: Quick Wins (Easy, High Impact)

#### 1.1 Parallelize Feature Extraction

```python
# In ylff/ba_validator.py
def _extract_features(self, image_paths: List[str]) -> Path:
    # hloc uses num_workers=1 by default. We can't change this directly,
    # but we can:

    # Option A: process images in parallel batches ourselves
    from concurrent.futures import ThreadPoolExecutor
    import torch

    def extract_single(image_path):
        # Extract features for one image. This would require modifying
        # hloc or calling SuperPoint directly.
        pass

    # Option B: use hloc's batch processing, if available
    # (check whether hloc supports batch_size > 1)
```

**Expected Speedup**: 3-5x for feature extraction

#### 1.2 Increase Match Workers

```python
# hloc.match_features uses num_workers=5 by default. We can't change this
# without modifying the hloc source, but we can write a wrapper that
# processes pairs in batches.
```

**Expected Speedup**: 2-3x for matching (I/O bound)

#### 1.3 Smart Pair Selection (Reduce Pairs)

Instead of exhaustive matching (N\*(N-1)/2 pairs), use:

- **Sequential pairs**: Only match consecutive frames (N-1 pairs)
- **Sparse matching**: Exhaustively match every K-th frame (~N/K keyframes)
- **Spatial selection**: Use DA3 poses to select nearby frames

```python
def _generate_smart_pairs(
    self,
    image_paths: List[str],
    poses: np.ndarray,
    max_baseline: float = 0.3,   # max translation distance
    min_baseline: float = 0.05,  # min translation distance
) -> List[Tuple[str, str]]:
    """Generate pairs based on spatial proximity."""
    pairs = []
    for i in range(len(image_paths)):
        for j in range(i + 1, len(image_paths)):
            # Compute the baseline between cameras i and j
            t_i = poses[i][:3, 3]
            t_j = poses[j][:3, 3]
            baseline = np.linalg.norm(t_i - t_j)
            if min_baseline <= baseline <= max_baseline:
                pairs.append((image_paths[i], image_paths[j]))
    return pairs
```

**Expected Speedup**: 5-10x reduction in pairs (e.g., 190 → 20-40 pairs)

---

### Level 2: Moderate Effort (Medium Impact)

#### 2.1 Batch Pair Matching

LightGlue can process multiple pairs in a single batch:

```python
class BatchedPairMatcher:
    def __init__(self, model, device, batch_size=4):
        self.model = model
        self.device = device
        self.batch_size = batch_size

    def match_batch(self, pairs_data):
        """Match multiple pairs in a single forward pass."""
        # Stack features. Note: keypoint counts must be equal across
        # pairs (pad or truncate before stacking).
        features1 = torch.stack([p['feat1'] for p in pairs_data])
        features2 = torch.stack([p['feat2'] for p in pairs_data])

        # Batched matching
        matches = self.model({
            'image0': features1,
            'image1': features2,
        })
        return matches
```

**Expected Speedup**: 2-4x for matching (better GPU utilization)

#### 2.2 COLMAP Initialization from DA3 Poses

Instead of letting COLMAP find initial pairs, initialize from DA3:

```python
def _initialize_from_poses(
    self,
    reconstruction: pycolmap.Reconstruction,
    initial_poses: np.ndarray,
    image_paths: List[str],
):
    """Initialize the COLMAP reconstruction with DA3 poses."""
    # COLMAP stores world-to-camera transforms, so invert if the
    # DA3 poses are camera-to-world.
    for i, (img_path, pose) in enumerate(zip(image_paths, initial_poses)):
        w2c = np.linalg.inv(pose)
        image = pycolmap.Image()
        image.name = Path(img_path).name
        # The pose-setting API varies between pycolmap versions; recent
        # releases use `cam_from_world` with a Rigid3d transform.
        image.cam_from_world = pycolmap.Rigid3d(
            pycolmap.Rotation3d(w2c[:3, :3]), w2c[:3, 3]
        )
        reconstruction.add_image(image)
    # Triangulate initial points from the matches, then run BA.
```

**Expected Speedup**: Eliminates failed initialization attempts

#### 2.3 Feature Caching

Cache extracted features to avoid re-extraction:

```python
import hashlib
import pickle

def _get_feature_cache_key(self, image_path: str) -> str:
    """Generate a cache key from the image hash."""
    with open(image_path, 'rb') as f:
        img_hash = hashlib.md5(f.read()).hexdigest()
    return f"features_{img_hash}"

def _extract_features_cached(self, image_paths: List[str]) -> dict:
    """Extract features with caching."""
    cache_dir = self.work_dir / "feature_cache"
    cache_dir.mkdir(exist_ok=True)

    cached_features = {}
    uncached_paths = []
    for img_path in image_paths:
        cache_key = self._get_feature_cache_key(img_path)
        cache_file = cache_dir / f"{cache_key}.pkl"
        if cache_file.exists():
            with open(cache_file, 'rb') as f:
                cached_features[img_path] = pickle.load(f)
        else:
            uncached_paths.append(img_path)

    # Extract any missing features, cache them, and merge them into
    # the result so newly extracted features are not dropped.
    if uncached_paths:
        new_features = self._extract_features(uncached_paths)
        for img_path, feat in zip(uncached_paths, new_features):
            cache_key = self._get_feature_cache_key(img_path)
            cache_file = cache_dir / f"{cache_key}.pkl"
            with open(cache_file, 'wb') as f:
                pickle.dump(feat, f)
            cached_features[img_path] = feat

    return cached_features
```

**Expected Speedup**: 10-100x for repeated sequences

---

### Level 3: Advanced (High Impact, More Complex)

#### 3.1 GPU-Accelerated Bundle Adjustment

Use GPU-accelerated BA libraries:

**Option A: g2o (GPU)**

```python
# g2o has GPU support via CUDA; requires building g2o with CUDA.
```

**Option B: Ceres Solver (GPU)**

```python
# Ceres has experimental GPU support; requires CUDA and a custom build.
```

**Option C: Theseus (PyTorch-based, GPU-native)**

```python
# Sketch only -- see the Theseus documentation for the exact API.
import theseus as th

objective = th.Objective()
# objective.add(...)  # reprojection-error cost functions over poses/points
optimizer = th.LevenbergMarquardt(objective, max_iterations=20)
layer = th.TheseusLayer(optimizer)
```

**Expected Speedup**: 10-100x for BA (depending on problem size)

#### 3.2 Distributed Matching

Process pairs across multiple GPUs:

```python
def match_distributed(pairs, model, num_gpus=4):
    """Distribute pair matching across GPUs."""
    # Split pairs evenly across GPUs
    pairs_per_gpu = len(pairs) // num_gpus

    # Illustrative only: a real implementation would launch one process
    # per GPU (e.g. via torch.multiprocessing or torchrun) rather than
    # looping sequentially.
    results = []
    for gpu_id in range(num_gpus):
        gpu_pairs = pairs[gpu_id * pairs_per_gpu:(gpu_id + 1) * pairs_per_gpu]
        results.extend(process_on_gpu(gpu_pairs, gpu_id))
    return results
```

**Expected Speedup**: Linear scaling with the number of GPUs

#### 3.3 Incremental BA

Instead of full BA, use incremental updates:

```python
def incremental_ba(
    self,
    reconstruction: pycolmap.Reconstruction,
    new_images: List[str],
    new_poses: np.ndarray,
):
    """Add new images incrementally and run local BA."""
    # 1. Add the new images
    # 2. Run local BA (optimize only the new images + their neighbors)
    # 3. Run full BA only periodically
```

**Expected Speedup**: 5-10x for large sequences

---

### Level 4: Research-Level (Maximum Impact)

#### 4.1 Learned Feature Matching

Use learned matchers that are faster than LightGlue:

- **LoFTR**: Attention-based, can be faster
- **QuadTree Attention**: More efficient attention mechanism
- **Sparse Matching**: Only match high-confidence features

#### 4.2 Differentiable BA

Train end-to-end with differentiable BA:

```python
import torch.nn as nn
import theseus as th

class DifferentiableBA(nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder: wrap a Theseus optimizer in a differentiable layer
        self.ba_layer = th.TheseusLayer(...)

    def forward(self, features, initial_poses):
        # Differentiable BA refinement
        refined_poses = self.ba_layer(features, initial_poses)
        return refined_poses
```

**Benefit**: Can be integrated into a training loop

#### 4.3 Neural BA

Replace traditional BA with a learned optimizer:

```python
import torch.nn as nn

class NeuralBA(nn.Module):
    """Neural network that learns to optimize BA."""

    def __init__(self):
        super().__init__()
        self.optimizer_net = nn.Transformer(...)

    def forward(self, reprojection_errors, poses):
        # Learn to predict pose updates
        pose_deltas = self.optimizer_net(reprojection_errors, poses)
        return poses + pose_deltas
```

---

## Implementation Priority

### Phase 1: Quick Wins (1-2 days)
1. ✅ Smart pair selection (reduce pairs by 5-10x)
2. ✅ Feature caching
3. ✅ COLMAP initialization from DA3 poses

**Expected Overall Speedup**: 5-10x

### Phase 2: Moderate (1 week)
1. Batch pair matching
2. Parallel feature extraction wrapper
3. Incremental BA

**Expected Overall Speedup**: 10-20x

### Phase 3: Advanced (2-4 weeks)
1. GPU-accelerated BA (Theseus)
2. Distributed matching
3. Learned optimizations

**Expected Overall Speedup**: 20-100x

---

## Memory Optimization

### Current Memory Usage
- Features: ~1-5 MB per image (SuperPoint)
- Matches: ~0.1-1 MB per pair (LightGlue)
- COLMAP database: ~10-50 MB for 100 images

### Optimization Strategies

1. **Streaming Processing**: Process pairs in batches; don't load everything at once
2. **Feature Compression**: Use half-precision (float16) for features
3. **Match Filtering**: Only store high-quality matches
4. **Garbage Collection**: Explicitly free memory after each stage

```python
import gc
import torch

def process_with_memory_management(self, images):
    # Extract features, then free the images
    features = self._extract_features(images)
    del images
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    # Match features, then free them
    matches = self._match_features(features)
    del features
    gc.collect()
    return matches
```

---

## Benchmarking

Create a benchmark script to measure improvements:

```python
import time
from ylff.ba_validator import BAValidator

def benchmark_ba_pipeline(images, poses, intrinsics):
    validator = BAValidator()
    times = {}

    # Feature extraction
    start = time.time()
    features = validator._extract_features(images)
    times['features'] = time.time() - start

    # Matching
    start = time.time()
    matches = validator._match_features(images, features)
    times['matching'] = time.time() - start

    # BA
    start = time.time()
    result = validator._run_colmap_ba(images, features, matches, poses, intrinsics)
    times['ba'] = time.time() - start

    return times, result
```

---

## Recommended Implementation Order

1. **Smart Pair Selection** (Highest ROI, easiest)
2. **Feature Caching** (High ROI, easy)
3. **COLMAP Initialization** (Medium ROI, medium effort)
4. **Batch Matching** (Medium ROI, medium effort)
5. **GPU BA** (High ROI, high effort)

---

## Expected Performance

### Current (20 images, 190 pairs)
- Feature extraction: ~5s
- Matching: ~60s
- BA: ~5s
- **Total: ~70s**

### After Phase 1 (Smart pairs + caching)
- Feature extraction: ~5s (first time), ~0.1s (cached)
- Matching: ~6s (20 pairs instead of 190)
- BA: ~2s (better initialization)
- **Total: ~8s with cached features (~9x speedup)**

### After Phase 2 (Batching + incremental)
- Feature extraction: ~2s
- Matching: ~3s (batched)
- BA: ~1s (incremental)
- **Total: ~6s (12x speedup)**

### After Phase 3 (GPU BA)
- Feature extraction: ~2s
- Matching: ~3s
- BA: ~0.1s (GPU)
- **Total: ~5s (14x speedup)**

---

## Next Steps

1. Implement smart pair selection
2. Add feature caching
3. Improve COLMAP initialization
4. Benchmark and iterate
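
As a starting point for step 1, the two simpler pair-selection variants from §1.3 (sequential and sparse) need no pose information at all, so they can ship before the spatial version. A minimal standalone sketch — function names here are illustrative, not part of the existing `BAValidator` API:

```python
from itertools import combinations
from typing import List, Tuple

def sequential_pairs(names: List[str]) -> List[Tuple[str, str]]:
    """Match only consecutive frames: N-1 pairs instead of N*(N-1)/2."""
    return list(zip(names, names[1:]))

def sparse_pairs(names: List[str], k: int = 5) -> List[Tuple[str, str]]:
    """Exhaustively match every k-th frame: ~N/k keyframes, (N/k choose 2) pairs."""
    keyframes = names[::k]
    return list(combinations(keyframes, 2))

names = [f"frame_{i:04d}.png" for i in range(20)]
print(len(sequential_pairs(names)))   # 19 pairs instead of 190
print(len(sparse_pairs(names, k=5)))  # 6 pairs among 4 keyframes
```

Either result can be fed directly to the matcher in place of the exhaustive pair list; the spatial variant from §1.3 then becomes a drop-in refinement once DA3 poses are available.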