# BA Pipeline Optimization Guide

## Current Bottlenecks Analysis

### 1. Feature Extraction (SuperPoint)
- **Current**: `num_workers=1` (sequential)
- **Bottleneck**: I/O and GPU utilization
- **Impact**: For 20 images, ~2-5 seconds; for 100 images, ~10-25 seconds

### 2. Feature Matching (LightGlue)
- **Current**: Sequential pair processing (`batch_size=1`)
- **Bottleneck**: GPU underutilization, sequential loop
- **Impact**: For 190 pairs (20 images), ~30-60 seconds; for 4950 pairs (100 images), ~15-30 minutes

### 3. COLMAP Reconstruction
- **Current**: Sequential incremental SfM
- **Bottleneck**: Sequential nature, many failed initializations (see log)
- **Impact**: Variable, but can be slow for large sequences

### 4. Bundle Adjustment
- **Current**: CPU-based Levenberg-Marquardt
- **Bottleneck**: Sequential optimization, no GPU acceleration
- **Impact**: Usually fast (<1 s for small reconstructions), but scales poorly

---

## Optimization Strategies

### Level 1: Quick Wins (Easy, High Impact)

#### 1.1 Parallelize Feature Extraction

```python
# In ylff/ba_validator.py
def _extract_features(self, image_paths: List[str]) -> Path:
    # hloc uses num_workers=1 by default. We can't change this directly,
    # but we can:

    # Option A: process images in parallel batches ourselves
    from concurrent.futures import ThreadPoolExecutor
    import torch

    def extract_single(image_path):
        # Extract features for one image. This would require modifying
        # hloc or calling SuperPoint directly.
        pass

    # Option B: use hloc's batch processing, if available
    # (check whether hloc supports batch_size > 1)
```

**Expected Speedup**: 3-5x for feature extraction

#### 1.2 Increase Match Workers

```python
# hloc.match_features uses num_workers=5 by default. We can't change this
# without modifying the hloc source, but we can write a wrapper that
# processes pairs in batches.
```

**Expected Speedup**: 2-3x for matching (I/O bound)

#### 1.3 Smart Pair Selection (Reduce Pairs)

Instead of exhaustive matching (N\*(N-1)/2 pairs), use:

- **Sequential pairs**: Only match consecutive frames (N-1 pairs)
- **Sparse matching**: Exhaustively match every K-th frame (~N/K keyframes)
- **Spatial selection**: Use DA3 poses to select nearby frames

```python
def _generate_smart_pairs(
    self,
    image_paths: List[str],
    poses: np.ndarray,
    max_baseline: float = 0.3,   # max translation distance
    min_baseline: float = 0.05,  # min translation distance
) -> List[Tuple[str, str]]:
    """Generate pairs based on spatial proximity."""
    pairs = []
    for i in range(len(image_paths)):
        for j in range(i + 1, len(image_paths)):
            # Compute the baseline between cameras i and j
            t_i = poses[i][:3, 3]
            t_j = poses[j][:3, 3]
            baseline = np.linalg.norm(t_i - t_j)
            if min_baseline <= baseline <= max_baseline:
                pairs.append((image_paths[i], image_paths[j]))
    return pairs
```

**Expected Speedup**: 5-10x reduction in pairs (e.g., 190 → 20-40 pairs)

---

### Level 2: Moderate Effort (Medium Impact)

#### 2.1 Batch Pair Matching

LightGlue can process multiple pairs in a single batch:

```python
class BatchedPairMatcher:
    def __init__(self, model, device, batch_size=4):
        self.model = model
        self.device = device
        self.batch_size = batch_size

    def match_batch(self, pairs_data):
        """Match multiple pairs in a single forward pass."""
        # Stack features. Note: keypoint counts must be equal across
        # pairs (pad or truncate before stacking).
        features1 = torch.stack([p['feat1'] for p in pairs_data])
        features2 = torch.stack([p['feat2'] for p in pairs_data])

        # Batched matching
        matches = self.model({
            'image0': features1,
            'image1': features2,
        })
        return matches
```

**Expected Speedup**: 2-4x for matching (better GPU utilization)

#### 2.2 COLMAP Initialization from DA3 Poses

Instead of letting COLMAP find initial pairs, initialize from DA3:

```python
def _initialize_from_poses(
    self,
    reconstruction: pycolmap.Reconstruction,
    initial_poses: np.ndarray,
    image_paths: List[str],
):
    """Initialize the COLMAP reconstruction with DA3 poses."""
    # COLMAP stores world-to-camera transforms, so invert if the
    # DA3 poses are camera-to-world.
    for i, (img_path, pose) in enumerate(zip(image_paths, initial_poses)):
        w2c = np.linalg.inv(pose)
        image = pycolmap.Image()
        image.name = Path(img_path).name
        # The pose-setting API varies between pycolmap versions; recent
        # releases use `cam_from_world` with a Rigid3d transform.
        image.cam_from_world = pycolmap.Rigid3d(
            pycolmap.Rotation3d(w2c[:3, :3]), w2c[:3, 3]
        )
        reconstruction.add_image(image)
    # Triangulate initial points from the matches, then run BA.
```

**Expected Speedup**: Eliminates failed initialization attempts

#### 2.3 Feature Caching

Cache extracted features to avoid re-extraction:

```python
import hashlib
import pickle

def _get_feature_cache_key(self, image_path: str) -> str:
    """Generate a cache key from the image hash."""
    with open(image_path, 'rb') as f:
        img_hash = hashlib.md5(f.read()).hexdigest()
    return f"features_{img_hash}"

def _extract_features_cached(self, image_paths: List[str]) -> dict:
    """Extract features with caching."""
    cache_dir = self.work_dir / "feature_cache"
    cache_dir.mkdir(exist_ok=True)

    cached_features = {}
    uncached_paths = []
    for img_path in image_paths:
        cache_key = self._get_feature_cache_key(img_path)
        cache_file = cache_dir / f"{cache_key}.pkl"
        if cache_file.exists():
            with open(cache_file, 'rb') as f:
                cached_features[img_path] = pickle.load(f)
        else:
            uncached_paths.append(img_path)

    # Extract any missing features, cache them, and merge them into
    # the result so newly extracted features are not dropped.
    if uncached_paths:
        new_features = self._extract_features(uncached_paths)
        for img_path, feat in zip(uncached_paths, new_features):
            cache_key = self._get_feature_cache_key(img_path)
            cache_file = cache_dir / f"{cache_key}.pkl"
            with open(cache_file, 'wb') as f:
                pickle.dump(feat, f)
            cached_features[img_path] = feat

    return cached_features
```

**Expected Speedup**: 10-100x for repeated sequences

---

### Level 3: Advanced (High Impact, More Complex)

#### 3.1 GPU-Accelerated Bundle Adjustment

Use GPU-accelerated BA libraries:

**Option A: g2o (GPU)**

```python
# g2o has GPU support via CUDA; requires building g2o with CUDA.
```

**Option B: Ceres Solver (GPU)**

```python
# Ceres has experimental GPU support; requires CUDA and a custom build.
```

**Option C: Theseus (PyTorch-based, GPU-native)**

```python
# Sketch only -- see the Theseus documentation for the exact API.
import theseus as th

objective = th.Objective()
# objective.add(...)  # reprojection-error cost functions over poses/points
optimizer = th.LevenbergMarquardt(objective, max_iterations=20)
layer = th.TheseusLayer(optimizer)
```

**Expected Speedup**: 10-100x for BA (depending on problem size)

#### 3.2 Distributed Matching

Process pairs across multiple GPUs:

```python
def match_distributed(pairs, model, num_gpus=4):
    """Distribute pair matching across GPUs."""
    # Split pairs evenly across GPUs
    pairs_per_gpu = len(pairs) // num_gpus

    # Illustrative only: a real implementation would launch one process
    # per GPU (e.g. via torch.multiprocessing or torchrun) rather than
    # looping sequentially.
    results = []
    for gpu_id in range(num_gpus):
        gpu_pairs = pairs[gpu_id * pairs_per_gpu:(gpu_id + 1) * pairs_per_gpu]
        results.extend(process_on_gpu(gpu_pairs, gpu_id))
    return results
```

**Expected Speedup**: Linear scaling with the number of GPUs

#### 3.3 Incremental BA

Instead of full BA, use incremental updates:

```python
def incremental_ba(
    self,
    reconstruction: pycolmap.Reconstruction,
    new_images: List[str],
    new_poses: np.ndarray,
):
    """Add new images incrementally and run local BA."""
    # 1. Add the new images
    # 2. Run local BA (optimize only the new images + their neighbors)
    # 3. Run full BA only periodically
```

**Expected Speedup**: 5-10x for large sequences

---

### Level 4: Research-Level (Maximum Impact)

#### 4.1 Learned Feature Matching

Use learned matchers that are faster than LightGlue:

- **LoFTR**: Attention-based, can be faster
- **QuadTree Attention**: More efficient attention mechanism
- **Sparse Matching**: Only match high-confidence features

#### 4.2 Differentiable BA

Train end-to-end with differentiable BA:

```python
import torch.nn as nn
import theseus as th

class DifferentiableBA(nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder: wrap a Theseus optimizer in a differentiable layer
        self.ba_layer = th.TheseusLayer(...)

    def forward(self, features, initial_poses):
        # Differentiable BA refinement
        refined_poses = self.ba_layer(features, initial_poses)
        return refined_poses
```

**Benefit**: Can be integrated into a training loop

#### 4.3 Neural BA

Replace traditional BA with a learned optimizer:

```python
import torch.nn as nn

class NeuralBA(nn.Module):
    """Neural network that learns to optimize BA."""

    def __init__(self):
        super().__init__()
        self.optimizer_net = nn.Transformer(...)

    def forward(self, reprojection_errors, poses):
        # Learn to predict pose updates
        pose_deltas = self.optimizer_net(reprojection_errors, poses)
        return poses + pose_deltas
```

---

## Implementation Priority

### Phase 1: Quick Wins (1-2 days)
1. ✅ Smart pair selection (reduce pairs by 5-10x)
2. ✅ Feature caching
3. ✅ COLMAP initialization from DA3 poses

**Expected Overall Speedup**: 5-10x

### Phase 2: Moderate (1 week)
1. Batch pair matching
2. Parallel feature extraction wrapper
3. Incremental BA

**Expected Overall Speedup**: 10-20x

### Phase 3: Advanced (2-4 weeks)
1. GPU-accelerated BA (Theseus)
2. Distributed matching
3. Learned optimizations

**Expected Overall Speedup**: 20-100x

---

## Memory Optimization

### Current Memory Usage
- Features: ~1-5 MB per image (SuperPoint)
- Matches: ~0.1-1 MB per pair (LightGlue)
- COLMAP database: ~10-50 MB for 100 images

### Optimization Strategies

1. **Streaming Processing**: Process pairs in batches; don't load everything at once
2. **Feature Compression**: Use half-precision (float16) for features
3. **Match Filtering**: Only store high-quality matches
4. **Garbage Collection**: Explicitly free memory after each stage

```python
import gc
import torch

def process_with_memory_management(self, images):
    # Extract features, then free the images
    features = self._extract_features(images)
    del images
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    # Match features, then free them
    matches = self._match_features(features)
    del features
    gc.collect()
    return matches
```

---

## Benchmarking

Create a benchmark script to measure improvements:

```python
import time
from ylff.ba_validator import BAValidator

def benchmark_ba_pipeline(images, poses, intrinsics):
    validator = BAValidator()
    times = {}

    # Feature extraction
    start = time.time()
    features = validator._extract_features(images)
    times['features'] = time.time() - start

    # Matching
    start = time.time()
    matches = validator._match_features(images, features)
    times['matching'] = time.time() - start

    # BA
    start = time.time()
    result = validator._run_colmap_ba(images, features, matches, poses, intrinsics)
    times['ba'] = time.time() - start

    return times, result
```

---

## Recommended Implementation Order

1. **Smart Pair Selection** (Highest ROI, easiest)
2. **Feature Caching** (High ROI, easy)
3. **COLMAP Initialization** (Medium ROI, medium effort)
4. **Batch Matching** (Medium ROI, medium effort)
5. **GPU BA** (High ROI, high effort)

---

## Expected Performance

### Current (20 images, 190 pairs)
- Feature extraction: ~5s
- Matching: ~60s
- BA: ~5s
- **Total: ~70s**

### After Phase 1 (Smart pairs + caching)
- Feature extraction: ~5s (first time), ~0.1s (cached)
- Matching: ~6s (20 pairs instead of 190)
- BA: ~2s (better initialization)
- **Total: ~8s with cached features (~9x speedup)**

### After Phase 2 (Batching + incremental)
- Feature extraction: ~2s
- Matching: ~3s (batched)
- BA: ~1s (incremental)
- **Total: ~6s (12x speedup)**

### After Phase 3 (GPU BA)
- Feature extraction: ~2s
- Matching: ~3s
- BA: ~0.1s (GPU)
- **Total: ~5s (14x speedup)**

---

## Next Steps

1. Implement smart pair selection
2. Add feature caching
3. Improve COLMAP initialization
4. Benchmark and iterate
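
As a starting point for step 1, the two simpler pair-selection variants from §1.3 (sequential and sparse) need no pose information at all, so they can ship before the spatial version. A minimal standalone sketch — function names here are illustrative, not part of the existing `BAValidator` API:

```python
from itertools import combinations
from typing import List, Tuple

def sequential_pairs(names: List[str]) -> List[Tuple[str, str]]:
    """Match only consecutive frames: N-1 pairs instead of N*(N-1)/2."""
    return list(zip(names, names[1:]))

def sparse_pairs(names: List[str], k: int = 5) -> List[Tuple[str, str]]:
    """Exhaustively match every k-th frame: ~N/k keyframes, (N/k choose 2) pairs."""
    keyframes = names[::k]
    return list(combinations(keyframes, 2))

names = [f"frame_{i:04d}.png" for i in range(20)]
print(len(sequential_pairs(names)))   # 19 pairs instead of 190
print(len(sparse_pairs(names, k=5)))  # 6 pairs among 4 keyframes
```

Either result can be fed directly to the matcher in place of the exhaustive pair list; the spatial variant from §1.3 then becomes a drop-in refinement once DA3 poses are available.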