# BA Pipeline Optimization Guide
## Current Bottlenecks Analysis
### 1. Feature Extraction (SuperPoint)
- **Current**: `num_workers=1` (sequential)
- **Bottleneck**: I/O and GPU utilization
- **Impact**: For 20 images, ~2-5 seconds; for 100 images, ~10-25 seconds
### 2. Feature Matching (LightGlue)
- **Current**: Sequential pair processing (`batch_size=1`)
- **Bottleneck**: GPU underutilization, sequential loop
- **Impact**: For 190 pairs (20 images), ~30-60 seconds; for 4950 pairs (100 images), ~15-30 minutes
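The pair counts quoted above follow from exhaustive matching, where every unordered pair of images is matched once:

$$
P(N) = \binom{N}{2} = \frac{N(N-1)}{2}, \qquad
P(20) = \frac{20 \cdot 19}{2} = 190, \qquad
P(100) = \frac{100 \cdot 99}{2} = 4950
$$

Going from 20 to 100 images multiplies the image count by 5 but the pair count by about 26, which is why matching dominates the runtime for longer sequences.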
### 3. COLMAP Reconstruction
- **Current**: Sequential incremental SfM
- **Bottleneck**: Sequential nature, many failed initializations (see log)
- **Impact**: Variable, but can be slow for large sequences
### 4. Bundle Adjustment
- **Current**: CPU-based Levenberg-Marquardt
- **Bottleneck**: Sequential optimization, no GPU acceleration
- **Impact**: Usually fast (<1s for small reconstructions), but scales poorly
---
## Optimization Strategies
### Level 1: Quick Wins (Easy, High Impact)
#### 1.1 Parallelize Feature Extraction
```python
# In ylff/ba_validator.py
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def _extract_features(self, image_paths: List[str]) -> Path:
    # hloc's extract_features builds its DataLoader with num_workers=1.
    # We can't directly change this, but we can:

    # Option A: Process images in parallel batches (e.g. ThreadPoolExecutor)
    def extract_single(image_path):
        # Extract features for one image.
        # This would require modifying hloc or calling SuperPoint directly.
        raise NotImplementedError

    # Option B: Use hloc's batch processing if available
    # (check whether hloc supports batch_size > 1)
    raise NotImplementedError  # outline only
```
**Expected Speedup**: 3-5x for feature extraction
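As a rough illustration of Option A, the wrapper below fans a per-image extractor out over a thread pool. It is a minimal sketch: `extract_parallel` and its `extract_single` argument are hypothetical names, and the real extractor (a direct SuperPoint call, for instance) is assumed to be supplied by the caller. Threads only help where the work releases the GIL, such as image decoding, disk I/O, or GPU kernels.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def extract_parallel(
    image_paths: List[str],
    extract_single: Callable[[str], T],  # hypothetical per-image extractor
    max_workers: int = 4,
) -> Dict[str, T]:
    """Run `extract_single` on every image path using a pool of threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with image_paths.
        results = list(pool.map(extract_single, image_paths))
    return dict(zip(image_paths, results))


if __name__ == "__main__":
    # Stand-in extractor: returns the path length instead of real features.
    print(extract_parallel(["a.jpg", "bb.jpg"], extract_single=len))
```

If the extractor runs on a single GPU, a small `max_workers` is usually enough, since the threads mainly overlap image loading with inference.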
#### 1.2 Increase Match Workers
```python
# hloc.match_features uses num_workers=5 by default
# We can't directly change this without modifying hloc source
# But we can create a wrapper that processes pairs in batches
```
**Expected Speedup**: 2-3x for matching (I/O bound)
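The batching wrapper mentioned in the comments could start from a helper like the one below, which splits the pair list into fixed-size batches. This is a sketch only; `iter_pair_batches` is a hypothetical name, and how each batch is handed to the matcher is left open.

```python
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]


def iter_pair_batches(pairs: Sequence[Pair], batch_size: int) -> Iterator[List[Pair]]:
    """Yield consecutive batches of at most `batch_size` pairs."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(pairs), batch_size):
        # The final batch may be shorter than batch_size.
        yield list(pairs[start:start + batch_size])
```

Each yielded batch can then be matched on its own, which also bounds how many feature sets need to be in memory at once.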
#### 1.3 Smart Pair Selection (Reduce Pairs)
Instead of exhaustive matching (N\*(N-1)/2 pairs), use:
- **Sequential pairs**: Only match consecutive frames (N-1 pairs)
- **Sparse matching**: Match every K-th frame (N/K pairs)
- **Spatial selection**: Use DA3 poses to select nearby frames
```python
from typing import List, Tuple

import numpy as np


def _generate_smart_pairs(
    self,
    image_paths: List[str],
    poses: np.ndarray,  # (N, 4, 4); assumes camera-to-world so [:3, 3] is the camera center
    max_baseline: float = 0.3,  # Max translation distance
    min_baseline: float = 0.05,  # Min translation distance
) -> List[Tuple[str, str]]:
    """Generate pairs based on spatial proximity."""
    pairs = []
    for i in range(len(image_paths)):
        for j in range(i + 1, len(image_paths)):
            # Compute baseline
            t_i = poses[i][:3, 3]
            t_j = poses[j][:3, 3]
            baseline = np.linalg.norm(t_i - t_j)
            if min_baseline <= baseline <= max_baseline:
                pairs.append((image_paths[i], image_paths[j]))
    return pairs
```
**Expected Speedup**: 5-10x reduction in pairs (e.g., 190 → 20-40 pairs)
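The sequential and sparse strategies listed above need no pose information at all. The sketch below shows one possible reading of each; the function names and the `window` and `step` parameters are illustrative, not part of the existing pipeline.

```python
from typing import List, Tuple


def sequential_pairs(image_paths: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """Pair each frame with the next `window` frames (window=1 gives N-1 pairs)."""
    n = len(image_paths)
    pairs = []
    for i in range(n):
        for j in range(i + 1, min(i + window, n - 1) + 1):
            pairs.append((image_paths[i], image_paths[j]))
    return pairs


def keyframe_pairs(image_paths: List[str], step: int) -> List[Tuple[str, str]]:
    """Keep every `step`-th frame and pair consecutive keyframes (roughly N/step pairs)."""
    keyframes = image_paths[::step]
    return list(zip(keyframes, keyframes[1:]))


if __name__ == "__main__":
    frames = [f"frame_{i:03d}.jpg" for i in range(20)]
    print(len(sequential_pairs(frames)), len(keyframe_pairs(frames, step=4)))
```

A window larger than 1 trades a few extra pairs for more robustness when a single frame fails to match its immediate neighbor.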
---
### Level 2: Moderate Effort (Medium Impact)
#### 2.1 Batch Pair Matching
LightGlue can process multiple pairs in a single batch:
```python
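import torch


class BatchedPairMatcher:
    def __init__(self, model, device, batch_size=4):
        self.model = model
        self.device = device
        self.batch_size = batch_size

    def match_batch(self, pairs_data):
        """Match multiple pairs in a single forward pass."""
        # Stack features. torch.stack needs identical shapes, so every image
        # must have the same number of keypoints (pad or truncate beforehand).
        features1 = torch.stack([p['feat1'] for p in pairs_data]).to(self.device)
        features2 = torch.stack([p['feat2'] for p in pairs_data]).to(self.device)
        # Batch matching. The input keys are illustrative; the exact dict
        # layout depends on the LightGlue wrapper in use.
        with torch.no_grad():
            matches = self.model({
                'image0': features1,
                'image1': features2,
            })
        return matches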
```
**Expected Speedup**: 2-4x for matching (GPU utilization)
#### 2.2 COLMAP Initialization from DA3 Poses
Instead of letting COLMAP find initial pairs, initialize from DA3:
```python
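from pathlib import Path
from typing import List

import numpy as np
import pycolmap


def _initialize_from_poses(
    self,
    reconstruction: pycolmap.Reconstruction,
    initial_poses: np.ndarray,
    image_paths: List[str],
):
    """Initialize COLMAP reconstruction with DA3 poses."""
    # Add all images with initial poses
    for image_id, (img_path, pose) in enumerate(zip(image_paths, initial_poses), start=1):
        # COLMAP stores world-to-camera (cam_from_world) poses. This sketch
        # assumes `pose` is already w2c; invert it first if DA3 returns c2w.
        image = pycolmap.Image()
        image.image_id = image_id
        image.name = Path(img_path).name
        # ASSUMPTION: the pose-setting API differs between pycolmap releases;
        # Rigid3d / cam_from_world is one variant, so check the installed version.
        image.cam_from_world = pycolmap.Rigid3d(
            pycolmap.Rotation3d(pose[:3, :3]), pose[:3, 3]
        )
        reconstruction.add_image(image)  # a matching camera must also be registered
    # Triangulate initial points from matches
    # Then run BA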
```
**Expected Speedup**: Eliminates failed initialization attempts
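Pose conventions are an easy source of silent errors in this step, so it helps to write the conversion out. For a rigid transform with rotation $R$ and translation $t$, the world-to-camera (w2c) and camera-to-world (c2w) matrices are related by:

$$
T_{\text{w2c}} = \begin{bmatrix} R & t \\ 0^\top & 1 \end{bmatrix},
\qquad
T_{\text{c2w}} = T_{\text{w2c}}^{-1} = \begin{bmatrix} R^\top & -R^\top t \\ 0^\top & 1 \end{bmatrix},
\qquad
C = -R^\top t
$$

Here $C$ is the camera center in world coordinates. COLMAP stores poses in the w2c form, while baseline computations such as the one in section 1.3 need camera centers, which are the translation column of the c2w matrix. Which convention DA3 outputs use should be checked before either step.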
#### 2.3 Feature Caching
Cache extracted features to avoid re-extraction:
```python
import hashlib
import pickle
from typing import Any, Dict, List


def _get_feature_cache_key(self, image_path: str) -> str:
    """Generate cache key from image hash."""
    with open(image_path, 'rb') as f:
        img_hash = hashlib.md5(f.read()).hexdigest()
    return f"features_{img_hash}"


def _extract_features_cached(self, image_paths: List[str]) -> Dict[str, Any]:
    """Extract features with caching."""
    cache_dir = self.work_dir / "feature_cache"
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached_features = {}
    uncached = []  # (image path, cache file) pairs that still need extraction
    for img_path in image_paths:
        cache_file = cache_dir / f"{self._get_feature_cache_key(img_path)}.pkl"
        if cache_file.exists():
            with open(cache_file, 'rb') as f:
                cached_features[img_path] = pickle.load(f)
        else:
            uncached.append((img_path, cache_file))
    # Extract uncached features
    if uncached:
        # Assumes a variant of _extract_features that returns one feature
        # record per input path, in order (the hloc-based version returns a Path).
        new_features = self._extract_features([p for p, _ in uncached])
        # Cache them and include them in the result
        for (img_path, cache_file), feat in zip(uncached, new_features):
            with open(cache_file, 'wb') as f:
                pickle.dump(feat, f)
            cached_features[img_path] = feat
    return cached_features
```
**Expected Speedup**: 10-100x for repeated sequences
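One refinement worth considering for the cache key: hash the file in chunks so large images are not read into memory at once, and mix the extractor name into the hash so that features from a different extractor configuration are never reused. The sketch below is a standalone variant under those assumptions; `feature_cache_key` and `extractor_name` are hypothetical names.

```python
import hashlib


def feature_cache_key(image_path: str, extractor_name: str, chunk_size: int = 1 << 20) -> str:
    """Build a cache key from the extractor name and the image file contents."""
    digest = hashlib.md5()
    # Including the extractor name invalidates the cache when the config changes.
    digest.update(extractor_name.encode("utf-8"))
    with open(image_path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return f"features_{digest.hexdigest()}"
```

MD5 is used here only as a fast content fingerprint, matching the original snippet; it is not a security measure.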
---
### Level 3: Advanced (High Impact, More Complex)
#### 3.1 GPU-Accelerated Bundle Adjustment
Use GPU-accelerated BA libraries:
**Option A: g2o (GPU)**
```python
# As far as we know, upstream g2o ships CPU solvers only; GPU acceleration
# relies on third-party CUDA ports and a custom build
```
**Option B: Ceres Solver (GPU)**
```python
# Ceres has experimental GPU support
# Requires CUDA and custom build
```
**Option C: Theseus (PyTorch-based, GPU-native)**
```python
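import theseus as th

# Theseus solves nonlinear least squares with second-order optimizers
# (Gauss-Newton, Levenberg-Marquardt) rather than torch.optim.
objective = th.Objective()
# Add one reprojection-error cost per observation, for example with
# th.AutoDiffCostFunction(optim_vars, error_fn, dim=2, aux_vars=...).
# Variable names and the error function are left to the implementation.
optimizer = th.LevenbergMarquardt(objective, max_iterations=20)
ba_layer = th.TheseusLayer(optimizer)
ba_layer.to("cuda")  # keep variables and solves on the GPU
# solution, info = ba_layer.forward(initial_values)  # dict: name -> tensor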
```
**Expected Speedup**: 10-100x for BA (depending on problem size)
#### 3.2 Distributed Matching
Process pairs across multiple GPUs:
```python
import math


def match_distributed(pairs, model, num_gpus=4):
    """Distribute pair matching across GPUs."""
    # Split pairs across GPUs; ceil division so no trailing pairs are dropped
    pairs_per_gpu = math.ceil(len(pairs) / num_gpus)
    results = []
    # NOTE: this loop visits the GPUs one after another. For real parallelism,
    # run one worker process or thread per GPU.
    for gpu_id in range(num_gpus):
        gpu_pairs = pairs[gpu_id * pairs_per_gpu:(gpu_id + 1) * pairs_per_gpu]
        # process_on_gpu is a placeholder: it should run `model` on GPU gpu_id
        results.extend(process_on_gpu(gpu_pairs, gpu_id))
    return results
```
**Expected Speedup**: Near-linear scaling with the number of GPUs, since pairs are independent; model loading and feature transfer add some overhead
#### 3.3 Incremental BA
Instead of full BA, use incremental updates:
```python
def incremental_ba(
    self,
    reconstruction: pycolmap.Reconstruction,
    new_images: List[str],
    new_poses: np.ndarray,
):
    """Add new images incrementally and run local BA."""
    # Add new images
    # Run local BA (only optimize new images + neighbors)
    # Full BA only periodically
    raise NotImplementedError  # outline only
```
**Expected Speedup**: 5-10x for large sequences
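The "new images + neighbors" set for local BA has to be chosen somehow. One common heuristic is to rank already-registered images by how many 3D points they share with each new image. The sketch below assumes such counts are available as a nested dictionary; `local_ba_window`, `shared_points`, and `max_neighbors` are hypothetical names, not part of pycolmap.

```python
from typing import Dict, Iterable, Set


def local_ba_window(
    new_images: Iterable[str],
    shared_points: Dict[str, Dict[str, int]],  # image -> {other image: shared 3D points}
    max_neighbors: int = 5,
) -> Set[str]:
    """Return the new images plus their most strongly connected neighbors."""
    new_set = set(new_images)
    window = set(new_set)
    for name in new_set:
        counts = shared_points.get(name, {})
        # Sort by descending shared-point count; break ties by name for determinism.
        ranked = sorted(counts, key=lambda other: (-counts[other], other))
        window.update(ranked[:max_neighbors])
    return window
```

Only the poses and points seen by images in this window would be optimized; everything else stays fixed until the next periodic full BA.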
---
### Level 4: Research-Level (Maximum Impact)
#### 4.1 Learned Feature Matching
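Evaluate alternative matchers or matching schemes that may run faster than LightGlue on this data:
- **LoFTR**: Detector-free and attention-based; whether it is faster than SuperPoint + LightGlue depends on resolution and hardware, so benchmark before switching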
- **QuadTree Attention**: More efficient attention mechanism
- **Sparse Matching**: Only match high-confidence features
#### 4.2 Differentiable BA
Train end-to-end with differentiable BA:
```python
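import torch.nn as nn
from theseus import TheseusLayer


class DifferentiableBA(nn.Module):
    def __init__(self):
        super().__init__()
        self.ba_layer = TheseusLayer(...)  # wraps a Theseus optimizer

    def forward(self, features, initial_poses):
        # Differentiable BA. TheseusLayer.forward takes a dict mapping
        # variable names to tensors and returns (solution dict, info);
        # the key names here are placeholders.
        solution, info = self.ba_layer.forward(
            {"features": features, "poses": initial_poses}
        )
        return solution["poses"]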
```
**Benefit**: Can be integrated into training loop
#### 4.3 Neural BA
Replace traditional BA with a learned optimizer:
```python
import torch.nn as nn


class NeuralBA(nn.Module):
    """Neural network that learns to optimize BA."""

    def __init__(self):
        super().__init__()
        self.optimizer_net = nn.Transformer(...)

    def forward(self, reprojection_errors, poses):
        # Learn to predict pose updates
        pose_deltas = self.optimizer_net(reprojection_errors, poses)
        # Additive update is a simplification; rotations need a proper
        # parameterization (e.g. a Lie-algebra update) in practice.
        return poses + pose_deltas
```
---
## Implementation Priority
### Phase 1: Quick Wins (1-2 days)
1. ✅ Smart pair selection (reduce pairs by 5-10x)
2. ✅ Feature caching
3. ✅ COLMAP initialization from DA3 poses
**Expected Overall Speedup**: 5-10x
### Phase 2: Moderate (1 week)
1. Batch pair matching
2. Parallel feature extraction wrapper
3. Incremental BA
**Expected Overall Speedup**: 10-20x
### Phase 3: Advanced (2-4 weeks)
1. GPU-accelerated BA (Theseus)
2. Distributed matching
3. Learned optimizations
**Expected Overall Speedup**: 20-100x
---
## Memory Optimization
### Current Memory Usage
- Features: ~1-5 MB per image (SuperPoint)
- Matches: ~0.1-1 MB per pair (LightGlue)
- COLMAP database: ~10-50 MB for 100 images
### Optimization Strategies
1. **Streaming Processing**: Process pairs in batches, don't load all at once
2. **Feature Compression**: Use half-precision (float16) for features
3. **Match Filtering**: Only store high-quality matches
4. **Garbage Collection**: Explicitly free memory after each stage
```python
import gc

import torch


def process_with_memory_management(self, images):
    # Process features
    features = self._extract_features(images)
    # Drops only this function's reference; the caller must release its own
    del images
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    # Process matches
    matches = self._match_features(features)
    del features
    gc.collect()
    return matches
```
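To put a number on the feature compression strategy above, descriptor storage can be estimated directly. SuperPoint descriptors are 256-dimensional; the keypoint count of 2048 per image used below is an assumption for illustration, and `descriptor_bytes` is a hypothetical helper.

```python
def descriptor_bytes(num_keypoints: int, descriptor_dim: int = 256, bytes_per_value: int = 4) -> int:
    """Raw storage for one image's descriptors (keypoints and scores excluded)."""
    return num_keypoints * descriptor_dim * bytes_per_value


if __name__ == "__main__":
    fp32 = descriptor_bytes(2048)                      # float32: 4 bytes per value
    fp16 = descriptor_bytes(2048, bytes_per_value=2)   # float16: 2 bytes per value
    print(f"float32: {fp32 / 2**20:.1f} MiB, float16: {fp16 / 2**20:.1f} MiB")
```

Under these assumptions, float16 halves descriptor storage per image; whether the reduced precision affects match quality would need to be checked on real data.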
---
## Benchmarking
Create a benchmark script to measure improvements:
```python
import time

from ylff.ba_validator import BAValidator


def benchmark_ba_pipeline(images, poses, intrinsics):
    validator = BAValidator()
    times = {}
    # perf_counter is monotonic and higher-resolution than time.time().
    # For GPU stages, call torch.cuda.synchronize() before reading the clock.
    # Feature extraction
    start = time.perf_counter()
    features = validator._extract_features(images)
    times['features'] = time.perf_counter() - start
    # Matching
    start = time.perf_counter()
    matches = validator._match_features(images, features)
    times['matching'] = time.perf_counter() - start
    # BA
    start = time.perf_counter()
    result = validator._run_colmap_ba(images, features, matches, poses, intrinsics)
    times['ba'] = time.perf_counter() - start
    return times, result
```
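Single runs can be noisy, especially the first one, which often includes model loading and CUDA initialization. A possible complement to the script above is a small helper that repeats a stage and reports the median; `time_stage` and its parameters are hypothetical.

```python
import statistics
import time
from typing import Callable, List


def time_stage(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> float:
    """Return the median wall-clock time of `fn()` in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

A stage can be wrapped with a lambda, for example `time_stage(lambda: validator._extract_features(images))`, assuming the stage is safe to run more than once.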
---
## Recommended Implementation Order
1. **Smart Pair Selection** (Highest ROI, easiest)
2. **Feature Caching** (High ROI, easy)
3. **COLMAP Initialization** (Medium ROI, medium effort)
4. **Batch Matching** (Medium ROI, medium effort)
5. **GPU BA** (High ROI, high effort)
---
## Expected Performance
### Current (20 images, 190 pairs)
- Feature extraction: ~5s
- Matching: ~60s
- BA: ~5s
- **Total: ~70s**
### After Phase 1 (Smart pairs + caching)
- Feature extraction: ~5s (first time), ~0.1s (cached)
- Matching: ~6s (20 pairs instead of 190)
- BA: ~2s (better initialization)
- **Total: ~13s on the first run, ~8s with cached features (roughly 5-9x speedup)**
### After Phase 2 (Batching + incremental)
- Feature extraction: ~2s
- Matching: ~3s (batched)
- BA: ~1s (incremental)
- **Total: ~6s (12x speedup)**
### After Phase 3 (GPU BA)
- Feature extraction: ~2s
- Matching: ~3s
- BA: ~0.1s (GPU)
- **Total: ~5s (14x speedup)**
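The totals above are sums of per-stage estimates, so the overall speedup is bounded by whichever stage remains slowest. The helper below makes that arithmetic explicit using the estimates from this section; `total_and_speedup` is a hypothetical name, and the numbers are the guide's projections rather than measurements.

```python
from typing import Dict, Tuple


def total_and_speedup(baseline: Dict[str, float], optimized: Dict[str, float]) -> Tuple[float, float]:
    """Return (optimized total in seconds, speedup relative to the baseline total)."""
    base_total = sum(baseline.values())
    opt_total = sum(optimized.values())
    return opt_total, base_total / opt_total


current = {"features": 5.0, "matching": 60.0, "ba": 5.0}
phase1_cold = {"features": 5.0, "matching": 6.0, "ba": 2.0}
phase1_cached = {"features": 0.1, "matching": 6.0, "ba": 2.0}
phase3 = {"features": 2.0, "matching": 3.0, "ba": 0.1}

if __name__ == "__main__":
    for label, stages in [("phase 1 (cold)", phase1_cold),
                          ("phase 1 (cached)", phase1_cached),
                          ("phase 3", phase3)]:
        total, speedup = total_and_speedup(current, stages)
        print(f"{label}: {total:.1f}s total, {speedup:.1f}x")
```

Most of the gain comes from cutting the matching stage, which accounts for 60 of the 70 baseline seconds; moving BA to the GPU in Phase 3 saves under a second at this problem size.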
---
## Next Steps
1. Implement smart pair selection
2. Add feature caching
3. Improve COLMAP initialization
4. Benchmark and iterate