
Research Focus: BA-Supervised Learning for Visual Geometry

Table of Contents

  1. Executive Summary
  2. Current Implementation (YLFF)
  3. Research Questions
  4. Component Analysis
  5. Data Quality Hierarchy
  6. Differentiability & GPU Parallelization
  7. Future Research Directions
  8. Implementation Roadmap

Executive Summary

This document outlines the research focus for You Learn From Failure (YLFF), a framework for improving visual geometry models using Bundle Adjustment (BA) as an oracle teacher. The core hypothesis is that BA provides a robust, geometrically consistent supervision signal that can be used for both fine-tuning and large-scale pre-training.

Key Insights

  • BA as Oracle: Bundle Adjustment provides geometrically consistent poses and depths that can serve as high-quality supervision
  • ARKit as Data Source: Real-world ARKit captures provide diverse, natural motion data for training
  • Differentiable Components: Modern research is making traditionally non-differentiable steps (matching, RANSAC, BA) differentiable
  • GPU Parallelization: Most components can be parallelized, but BA remains a bottleneck

Research Philosophy

Don't treat ARKit poses as ground truth; they are derived. Focus on what's differentiable and GPU-parallel. Use geometric constraints as losses, not hard solves. Learn uncertainty, but calibrate it properly. LiDAR sparse depth is the best available "ground truth" for depth.


Current Implementation (YLFF)

Overview

YLFF is a complete framework for BA-supervised learning, currently implemented with:

  • βœ… BA Validation Pipeline: COLMAP-based validation using SuperPoint + LightGlue
  • βœ… ARKit Integration: Full support for ARKit video and metadata processing
  • βœ… Fine-Tuning: Train on failure cases using BA poses as pseudo-labels
  • βœ… Pre-Training: Large-scale training on ARKit sequences with BA supervision
  • βœ… Real-time Visualization: GUI for monitoring validation progress
  • βœ… Model Selection: Automatic selection of best DA3 model (DA3NESTED-GIANT-LARGE for BA workflows)

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    YLFF Pipeline                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Data Collection (Device)                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ ARKit Capture β†’ Images + LiDAR + VIO Poses + Metadata   β”‚  β”‚
β”‚  β”‚ (passive collection, all quality levels)                 β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                         ↓                                       β”‚
β”‚  BA Validation (Offline)                                        β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ 1. Run DA3 β†’ Poses_DA3, Depths_DA3                      β”‚  β”‚
β”‚  β”‚ 2. Extract SuperPoint features                           β”‚  β”‚
β”‚  β”‚ 3. Match with LightGlue                                   β”‚  β”‚
β”‚  β”‚ 4. Run COLMAP BA β†’ Poses_BA                              β”‚  β”‚
β”‚  β”‚ 5. Compare: error = ||Poses_DA3 - Poses_BA||            β”‚  β”‚
β”‚  β”‚ 6. Categorize:                                            β”‚  β”‚
β”‚  β”‚    - Accept (error < 2Β°): Model good, skip              β”‚  β”‚
β”‚  β”‚    - Reject-Learnable (2Β° < error < 30Β°): TRAIN         β”‚  β”‚
β”‚  β”‚    - Reject-Outlier (error > 30Β°): Discard              β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                         ↓                                       β”‚
β”‚  Training (GPU)                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Fine-Tuning: Train on rejected-learnable samples         β”‚  β”‚
β”‚  β”‚ Pre-Training: Train on all ARKit sequences               β”‚  β”‚
β”‚  β”‚ Loss: pose_loss(model(images), Poses_BA)                β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Current Capabilities

1. BA Validation (ylff validate)

  • Sequence Validation: Validate any image sequence with BA
  • ARKit Validation: Specialized pipeline for ARKit data with ground truth comparison
  • Real-time GUI: Monitor validation progress with live visualization
  • Feature Caching: Optimized feature extraction with caching
  • Smart Pairing: Reduce matching pairs for faster processing

2. Dataset Building (ylff dataset build)

  • Automatic Curation: Process sequences and categorize by BA agreement
  • Training Set Generation: Build training sets from rejected-learnable samples
  • Quality Filtering: Filter by BA quality, reprojection error, etc.

3. Fine-Tuning (ylff train start)

  • BA-Supervised Training: Train on failure cases using BA poses as labels
  • Weighted Loss: Weight samples by error magnitude
  • Checkpointing: Save model checkpoints during training
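
The weighted-loss idea can be sketched as follows (NumPy for clarity; the real training loop would operate on torch tensors, and the weighting scheme here is an illustrative assumption, not YLFF's exact formula):

```python
import numpy as np

def weighted_pose_loss(pred_poses, ba_poses, error_deg):
    """Pose loss against BA pseudo-labels, weighted by error magnitude.

    pred_poses, ba_poses: [B, 4, 4] homogeneous camera poses.
    error_deg: [B] DA3-vs-BA rotation errors used to weight samples
    (assumption: harder failures get more weight, clipped so outliers
    cannot dominate the batch).
    """
    R_pred, t_pred = pred_poses[:, :3, :3], pred_poses[:, :3, 3]
    R_ba, t_ba = ba_poses[:, :3, :3], ba_poses[:, :3, 3]
    # Chordal rotation distance plus translation L2, per sample.
    rot = np.linalg.norm((R_pred - R_ba).reshape(len(pred_poses), -1), axis=1)
    trans = np.linalg.norm(t_pred - t_ba, axis=1)
    weights = np.clip(np.asarray(error_deg) / 10.0, 0.2, 3.0)
    return float(np.mean(weights * (rot + trans)))
```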

4. Pre-Training (ylff train pretrain)

  • ARKit-Scale Training: Process hundreds of ARKit sequences
  • BA as Teacher: Use BA poses and depths as supervision
  • Optional Depth Supervision: Use BA depth maps or LiDAR depth
  • Quality Filtering: Filter sequences by BA quality

5. Evaluation (ylff eval)

  • BA Agreement Rate: Measure % of samples with error < threshold
  • Pose Error Metrics: Rotation and translation errors
  • Comparison Tools: Compare model predictions vs BA vs ARKit
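
The BA agreement rate metric is a simple thresholded fraction; a minimal sketch (illustrative helper, not the `ylff eval` implementation):

```python
import numpy as np

def ba_agreement_rate(rot_errors_deg, threshold_deg=2.0):
    """Fraction of validated samples whose rotation error vs BA is under
    the acceptance threshold (the 2 degree default used above)."""
    errors = np.asarray(rot_errors_deg, dtype=float)
    return float((errors < threshold_deg).mean())
```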

Implementation Status

| Component | Status | Notes |
| --- | --- | --- |
| BA Validation | ✅ Complete | COLMAP + SuperPoint + LightGlue |
| ARKit Processing | ✅ Complete | Video + metadata extraction |
| Fine-Tuning | ✅ Complete | BA-supervised training loop |
| Pre-Training | ✅ Complete | ARKit-scale training pipeline |
| Feature Caching | ✅ Complete | H5-based caching for speed |
| Smart Pairing | ✅ Complete | Sequential/spatial pairing modes |
| GUI Visualization | ✅ Complete | Real-time Tkinter GUI |
| Model Selection | ✅ Complete | Auto-select best DA3 model |
| Differentiable BA | ❌ Not Implemented | Future research direction |
| Uncertainty Calibration | ❌ Not Implemented | Future research direction |

Key Design Decisions

  1. COLMAP BA (Not Differentiable): Using traditional COLMAP BA for validation. Differentiable BA is a future research direction.

  2. ARKit as Data Source: Leveraging real-world ARKit captures for diverse training data.

  3. BA as Oracle: Treating BA as the ground truth teacher, not ARKit VIO poses.

  4. Failure-Focused Fine-Tuning: Training on cases where model fails, not all data.

  5. Scale-Aware Pre-Training: Pre-training on all ARKit sequences for large-scale learning.


Research Questions

1. Can BA Serve as an Oracle Teacher?

Question: Is BA robust enough to provide reliable supervision signals for training?

Hypothesis: Yes - BA provides geometrically consistent poses that are more reliable than VIO alone.

Validation:

  • βœ… Fine-tuning on BA-supervised data improves model accuracy
  • πŸ”„ Pre-training on ARKit sequences validates scalability
  • ❓ Long-term: Does BA-supervised model generalize better?

Status: In Progress - Fine-tuning works, pre-training being validated.

2. Can Differentiable BA Match Traditional BA Quality?

Question: Can differentiable BA (Theseus, gradSLAM) achieve the same quality as COLMAP?

Hypothesis: Potentially, but convergence and local minima are challenges.

Research Direction:

  • Differentiable BA libraries: Theseus, gradSLAM, PyPose
  • End-to-end training with BA in the loop
  • GPU-accelerated BA using PCG (Preconditioned Conjugate Gradient)

Status: Future Work - Currently using COLMAP (non-differentiable).

3. Can Uncertainty Be Properly Calibrated End-to-End?

Question: Can we learn uncertainty that is both useful and calibrated?

Current State: DA3/VGGT confidence heads are trained but not calibrated.

Research Opportunity:

  • Proper calibration requires held-out validation
  • Uncertainty that means something (not just confidence scores)
  • End-to-end uncertainty learning with calibration loss

Status: Future Work - Not yet implemented.
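
One concrete calibration check from the opportunities above: if a model's predicted standard deviations are calibrated, the empirical coverage at k sigma on held-out data should match the Gaussian expectation (about 68% at k = 1). A minimal sketch, with illustrative names:

```python
import numpy as np

def empirical_coverage(errors, sigmas, k=1.0):
    """Fraction of |error| values within k predicted standard deviations.

    For calibrated Gaussian uncertainty, coverage at k=1 should be ~0.68;
    coverage far above that means the model is under-confident, far
    below means over-confident.
    """
    errors = np.abs(np.asarray(errors, dtype=float))
    sigmas = np.asarray(sigmas, dtype=float)
    return float((errors <= k * sigmas).mean())
```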

4. What's the Right Inductive Bias?

Question: What architecture best enforces geometric constraints?

Current Approaches:

  • VGGT: Point maps (direct 3D)
  • DA3: Depth + rays (decomposed)
  • Neither explicitly enforces epipolar geometry

Research Opportunity: Architecture that enforces geometric constraints as part of the model, not just losses.

Status: Future Work - Current models don't enforce epipolar constraints.

5. Can Learned Sensor Fusion Beat ARKit's VIO?

Question: Can learned approaches outperform Apple's years of VIO engineering?

Challenge: Requires raw IMU data (ARKit gives poses, not raw IMU).

Potential: Learned approaches could generalize better across devices/scenarios.

Status: Future Work - Requires raw IMU data access.


Component Analysis

Component Comparison Matrix

| Component | ARKit | COLMAP+hloc | DA3 | VGGT | YLFF |
| --- | --- | --- | --- | --- | --- |
| Feature Extraction | Proprietary (ORB-like?) | SIFT or SuperPoint | DINOv2 (implicit) | DINOv2 (implicit) | SuperPoint |
| Feature Matching | KLT tracking + IMU | Exhaustive or SuperGlue | Cross-view attention | Cross-view attention | LightGlue |
| Relative Pose | VIO (EKF/factor graph) | Essential matrix + RANSAC | Learned (ray head) | Learned (camera head) | BA-refined |
| Global Optimization | None (drift accumulates) | Bundle Adjustment ✓ | None | None | COLMAP BA |
| Dense Depth | LiDAR (sparse) | MVS (slow) | Learned (fast) | Learned (fast) | DA3 + BA depth |
| Metric Scale | IMU ✓ | None (up to scale) | Trained with f_c=300 | Normalized | BA (metric) |
| Uncertainty | trackingState (discrete) | Covariance from BA | Confidence head | Confidence head | BA quality metrics |

What's "Raw" vs "Derived" in ARKit

| Data | Raw or Derived? | Source |
| --- | --- | --- |
| Camera pixels | Raw | CMOS sensor readout |
| IMU acceleration | Raw | Accelerometer (100 Hz+) |
| IMU angular velocity | Raw | Gyroscope (100 Hz+) |
| LiDAR ToF returns | Raw | Time-of-flight pulses |
| Camera intrinsics | Calibrated (stored) | Factory calibration |
| Poses (transform) | Derived | VIO algorithm (EKF/factor graph) |
| Tracking state | Derived | VIO internal heuristics |
| Feature points | Derived | ARKit's detector (ORB-like?) |
| Plane anchors | Derived | RANSAC plane fitting |
| Depth confidence | Derived | ARKit's confidence model |
| World mapping status | Derived | SLAM internal state |

Key Insight: ARKit poses are derived, not ground truth. BA provides a more robust signal.


Data Quality Hierarchy

Quality Pyramid

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    DATA QUALITY PYRAMID                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                           β”‚
β”‚  β”‚   LiDAR depth   β”‚  ← Strongest signal (metric, direct ToF)  β”‚
β”‚  β”‚  (high conf)    β”‚                                           β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                           β”‚
β”‚           ↓                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                   β”‚
β”‚  β”‚  BA poses (refined)     β”‚  ← YLFF uses this as teacher     β”‚
β”‚  β”‚  (geometrically          β”‚                                   β”‚
β”‚  β”‚   consistent)            β”‚                                   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                   β”‚
β”‚           ↓                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                   β”‚
β”‚  β”‚  ARKit poses (normal +   β”‚  ← Good signal (use for training)β”‚
β”‚  β”‚  worldMappingStatus=    β”‚                                   β”‚
β”‚  β”‚  extending/mapped)       β”‚                                   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                   β”‚
β”‚           ↓                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                           β”‚
β”‚  β”‚  ARKit poses (limited +         β”‚  ← Weak signal (noisy     β”‚
β”‚  β”‚  high featurePointCount)         β”‚     but still useful)     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                           β”‚
β”‚           ↓                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                     β”‚
β”‚  β”‚  ARKit poses (limited, excessiveMotionβ”‚  ← Negative signal  β”‚
β”‚  β”‚  or insufficientFeatures, low feature  β”‚    (learn to detect β”‚
β”‚  β”‚  count)                               β”‚    and reject)      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                     β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data Collection Strategy

Phase 1: Collect Everything (Passive)

  • Capture all ARKit data regardless of quality
  • Store raw video, metadata, LiDAR depth
  • No filtering at capture time

Phase 2: Automated Quality Filters

  • Reject trackingState=notAvailable
  • Reject worldMappingStatus=notAvailable
  • Reject featurePointCount < threshold
  • Reject LiDAR coverage < threshold
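
The Phase 2 filters above can be sketched as a single predicate over per-frame capture metadata. The dict keys mirror the filters listed, but they are assumptions for illustration, not ARKit's exact field names, and the default thresholds are placeholders:

```python
def passes_quality_filters(meta, min_features=50, min_lidar_coverage=0.3):
    """Automated Phase-2 quality filter for one frame's metadata dict."""
    if meta.get("trackingState") == "notAvailable":
        return False
    if meta.get("worldMappingStatus") == "notAvailable":
        return False
    if meta.get("featurePointCount", 0) < min_features:
        return False
    if meta.get("lidarCoverage", 0.0) < min_lidar_coverage:
        return False
    return True
```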

Phase 3: Human Review of Edge Cases

  • Borderline tracking quality
  • Unusual scenes (reflective, texture-poor)
  • Sequences where VIO drifted then recovered

Phase 4: Build Labeled Dataset

  • High-quality subset for supervised training
  • Noisy subset for robust/uncertainty training
  • Rejected subset for failure detection

Differentiability & GPU Parallelization

Making Traditional Pipeline Differentiable

| Step | Traditional | Differentiable Version | Status |
| --- | --- | --- | --- |
| Matching | argmax over scores | Soft matching (Sinkhorn, as in SuperGlue) | ✅ Available |
| RANSAC | Discrete sampling | Differentiable RANSAC (DSAC, NG-RANSAC) | ✅ Available |
| Essential Matrix | SVD decomposition | Differentiable SVD (PyTorch) | ✅ Available |
| Bundle Adjustment | Gauss-Newton | Differentiable BA (Theseus, gradSLAM) | ✅ Available |
| Triangulation | Linear least squares | Differentiable triangulation | ✅ Available |
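
The soft-matching row can be illustrated with plain Sinkhorn normalization: alternating row/column normalization in log-space turns a raw score matrix into an (approximately) doubly stochastic soft assignment, a differentiable stand-in for argmax matching. This is a minimal square-matrix sketch; SuperGlue's actual formulation also adds a dustbin row/column for unmatched points:

```python
import numpy as np

def sinkhorn(scores, n_iters=100):
    """Sinkhorn normalization of a square score matrix in log-space."""
    log_p = np.asarray(scores, dtype=float).copy()
    for _ in range(n_iters):
        # Row normalization, then column normalization (log-domain).
        log_p -= np.log(np.exp(log_p).sum(axis=1, keepdims=True))
        log_p -= np.log(np.exp(log_p).sum(axis=0, keepdims=True))
    return np.exp(log_p)
```

Because every step is elementwise arithmetic plus reductions, gradients flow through the assignment, unlike a hard argmax.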

GPU Parallelizability Analysis

| Operation | GPU-Parallel? | Why/Why Not | YLFF Status |
| --- | --- | --- | --- |
| Feature extraction | ✅ Yes | Per-image, embarrassingly parallel | ✅ Implemented |
| Dense feature maps (DINOv2) | ✅ Yes | Forward pass, batched | ✅ (DA3 uses this) |
| Feature matching (exhaustive) | ⚠️ Partial | O(N²) pairs, but each pair is parallel | ✅ (LightGlue) |
| Attention (cross-view) | ✅ Yes | Matrix ops, highly optimized | ✅ (DA3 uses this) |
| RANSAC | ❌ No | Sequential hypothesis testing | ❌ (Not used) |
| Essential matrix (8-point) | ⚠️ Partial | Linear algebra, but per-pair | ❌ (Not used) |
| Bundle Adjustment | ❌ No | Iterative optimization, sparse solve | ❌ (COLMAP, CPU) |
| Cost volume (MVS) | ✅ Yes | 3D convolutions, embarrassingly parallel | ❌ (Not used) |
| Dense depth (learned) | ✅ Yes | Forward pass | ✅ (DA3) |
| Differentiable rendering | ✅ Yes | Per-pixel, parallel | ❌ (Not used) |

GPU-Parallelizable BA Components

| Component | Standard | GPU Version | Status |
| --- | --- | --- | --- |
| Jacobian | Per-observation loop | Batched einsum | 🔄 Future |
| Hessian | Accumulate blocks | Scatter-add | 🔄 Future |
| Linear solve | Cholesky (sequential) | PCG (parallel) | 🔄 Future |
| Robust loss | Per-residual | Vectorized | 🔄 Future |
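
The "PCG (parallel)" linear-solve row, concretely: each preconditioned conjugate gradient iteration is only matrix-vector products and vector updates, which map to GPUs far better than a sequential sparse Cholesky factorization. A minimal dense, Jacobi-preconditioned version in NumPy (illustrative only; a real BA solver would exploit the sparse block structure):

```python
import numpy as np

def pcg(A, b, tol=1e-10, max_iters=200):
    """Jacobi-preconditioned conjugate gradient for SPD A x = b."""
    m_inv = 1.0 / np.diag(A)          # Jacobi preconditioner M^-1
    x = np.zeros_like(b)
    r = b - A @ x                     # initial residual
    z = m_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iters):
        Ap = A @ p                    # the only matrix-vector product
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = m_inv * r
        rz_next = r @ z
        p = z + (rz_next / rz) * p
        rz = rz_next
    return x
```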

BA Software Landscape

| Library | Language | GPU? | Differentiable? | Notes |
| --- | --- | --- | --- | --- |
| Ceres Solver | C++ | ❌ | ❌ | Google, most mature; COLMAP uses this |
| g2o | C++ | ❌ | ❌ | Graph-based; ORB-SLAM uses this |
| GTSAM | C++ | ❌ | ❌ | Factor graphs, iSAM2 incremental |
| Theseus | Python/C++ | ✅ | ✅ | Meta, differentiable optimization |
| gradSLAM | Python | ✅ | ✅ | Differentiable SLAM components |
| PyPose | Python | ✅ | ✅ | SE(3) operations, batched |
| lietorch | Python | ✅ | ✅ | Lie group operations for SLAM |

YLFF Current Choice: COLMAP (built on Ceres Solver). Non-differentiable and CPU-based, but proven and reliable.

Future Direction: Migrate to Theseus or gradSLAM for differentiable, GPU-accelerated BA.

BA Methods Comparison

| Method | Formula | Pros | Cons | Used By |
| --- | --- | --- | --- | --- |
| Gauss-Newton (GN) | Δx = -(JᵀJ)⁻¹Jᵀr | Quadratic convergence near minimum | Can diverge if far from minimum | Many |
| Levenberg-Marquardt (LM) | Δx = -(JᵀJ + λI)⁻¹Jᵀr | Interpolates between GN and gradient descent; more stable | λ schedule needs tuning | COLMAP |
| Dogleg (Powell) | Trust-region step | More robust than LM | More complex implementation | Some |
| Preconditioned Conjugate Gradient (PCG) | Iterative solver | Scales to very large problems, GPU-friendly | Slower convergence; needs a good preconditioner | Large-scale BA |
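
The LM update from the table, as code: a single damped step built from the Jacobian and residual. As λ → 0 this becomes the Gauss-Newton step, and for a linear residual one near-undamped step lands on the least-squares solution (illustrative NumPy sketch):

```python
import numpy as np

def lm_step(J, r, lam):
    """One Levenberg-Marquardt update: dx = -(J^T J + lam*I)^-1 J^T r."""
    JtJ = J.T @ J
    A = JtJ + lam * np.eye(JtJ.shape[0])  # damping blends GN toward gradient descent
    return -np.linalg.solve(A, J.T @ r)
```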

Computational Complexity

| Operation | Complexity | GPU-Parallel? | YLFF Status |
| --- | --- | --- | --- |
| Jacobian computation | O(K) per obs, O(N×M×k̄) total | ✅ Yes | ✅ (COLMAP) |
| Hessian blocks | O(K) per obs | ✅ Yes | ✅ (COLMAP) |
| Schur complement formation | O(N² × k̄) | ⚠️ Partial | ✅ (COLMAP) |
| Reduced system solve | O(N³) dense, O(N × k̄²) sparse | ❌ Hard | ✅ (COLMAP, CPU) |
| Point back-substitution | O(M × k̄²) | ✅ Yes | ✅ (COLMAP) |

Future Research Directions

1. Differentiable BA Integration

Goal: Replace COLMAP BA with differentiable BA for end-to-end training.

Approach:

  • Integrate Theseus or gradSLAM
  • GPU-accelerated BA using PCG
  • End-to-end training with BA in the loop

Benefits:

  • Gradient flow through BA
  • GPU acceleration
  • End-to-end optimization

Challenges:

  • Convergence guarantees
  • Local minima
  • Computational cost

2. Research-Oriented Pipeline

Vision: Fully GPU-parallel, differentiable pipeline.

# Forward pass (fully GPU-parallel)
features = backbone(images)  # [N, H, W, C]
poses_pred, pose_uncertainty = pose_head(features)  # [N, 4, 4], [N, 6, 6]
depths_pred, depth_uncertainty = depth_head(features)  # [N, H, W], [N, H, W]
correspondences = match_head(features)  # [N, N, K, 2] soft matches

# Differentiable geometric losses (still GPU-parallel)
epipolar_loss = epipolar_error(correspondences, poses_pred)
reproj_loss = reprojection_error(depths_pred, poses_pred, correspondences)
lidar_loss = l1_loss(depths_pred, lidar_sparse, weights=lidar_confidence)

# Optional: differentiable BA refinement (GPU, but iterative)
poses_refined = differentiable_ba(poses_pred, correspondences, depths_pred)

# Total loss with uncertainty weighting
loss = (epipolar_loss / pose_uncertainty.det() +
        reproj_loss / depth_uncertainty +
        lidar_loss)

Architecture:

Images β†’ Backbone (DINOv2/ViT) β†’ Dense Features
         ↓
    Cross-View Attention (all pairs)
         ↓
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    ↓             ↓             ↓
Pose Head    Depth Head   Correspondence Head
(per-frame)  (per-pixel)  (soft matches)
    ↓             ↓             ↓
Poses [N,4,4] Depths [N,H,W] Matches [N,N,K,K]
    ↓
Differentiable Triangulation
    ↓
Differentiable Reprojection Loss
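
The "Differentiable Triangulation" stage can be sketched with the classic DLT formulation. The whole computation reduces to an SVD, so the identical code written with torch.linalg.svd is differentiable end-to-end; NumPy is shown for brevity (P1, P2 are 3×4 projection matrices):

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    x1, x2: (u, v) image coordinates in each view. The point is the
    null vector of A, recovered via SVD.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]                      # homogeneous solution
    return X_h[:3] / X_h[3]
```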

3. Uncertainty Calibration

Goal: Learn properly calibrated uncertainty.

Approach:

  • Calibration loss on held-out validation set
  • Uncertainty that means something (not just confidence scores)
  • End-to-end uncertainty learning

Research Questions:

  • How to calibrate pose uncertainty?
  • How to calibrate depth uncertainty?
  • Can uncertainty predict BA failure?

4. Geometric Constraint Enforcement

Goal: Architecture that enforces geometric constraints.

Current Gap: DA3 and VGGT don't enforce epipolar geometry.

Research Direction:

  • Epipolar constraint as differentiable loss
  • Reprojection consistency as loss
  • Architecture that naturally satisfies constraints
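
The epipolar constraint as a differentiable loss can be sketched directly: build the essential matrix E = [t]ₓR from the predicted relative pose and penalize x2ᵀEx1 over predicted correspondences. Every operation is linear algebra, so gradients flow back into the pose. NumPy shown; names are illustrative:

```python
import numpy as np

def skew(t):
    """Cross-product (skew-symmetric) matrix of a 3-vector."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def epipolar_residuals(R, t, x1, x2):
    """Residual x2^T E x1 per correspondence, with E = [t]_x R.

    x1, x2: [N, 3] homogeneous normalized image coordinates. Residuals
    are zero exactly when the correspondences are consistent with (R, t),
    so the squared residuals can serve directly as a training loss.
    """
    E = skew(t) @ R
    return np.einsum("ni,ij,nj->n", x2, E, x1)
```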

5. Multi-Modal Fusion

Goal: Learn from camera + LiDAR + IMU.

Challenge: ARKit doesn't provide raw IMU data.

Research Direction:

  • LiDAR depth as supervision signal
  • IMU integration (if raw data available)
  • Learned sensor fusion

6. Self-Improving Pipeline

Vision: Pipeline that gets better over time.

1. Collect data (ARKit + LiDAR + images)
   ↓
2. Run DA3 (fast inference)
   ↓
3. Run BA validation (slow, offline)
   ↓
4. Filter/label samples:
   - DA3 agrees with BA β†’ high-quality training sample
   - DA3 disagrees β†’ use BA as pseudo-label, retrain DA3
   - Both fail β†’ reject, but save for failure analysis
   ↓
5. Retrain DA3 on curated data
   ↓
6. Repeat β†’ DA3 gets better over time

YLFF Status: βœ… Implemented for fine-tuning, πŸ”„ Being validated for pre-training.


Implementation Roadmap

Phase 1: Foundation (βœ… Complete)

  • BA validation pipeline (COLMAP + SuperPoint + LightGlue)
  • ARKit data processing
  • Fine-tuning on failure cases
  • Pre-training on ARKit sequences
  • Real-time visualization
  • Model selection and recommendations

Phase 2: Optimization (πŸ”„ In Progress)

  • Feature caching
  • Smart pair selection
  • GPU-accelerated feature extraction
  • Parallel BA processing
  • Distributed training

Phase 3: Differentiability (πŸ“‹ Planned)

  • Integrate Theseus or gradSLAM
  • Differentiable BA in training loop
  • End-to-end gradient flow
  • GPU-accelerated BA

Phase 4: Advanced Features (πŸ“‹ Planned)

  • Uncertainty calibration
  • Geometric constraint enforcement
  • Multi-modal fusion (LiDAR + IMU)
  • Self-improving pipeline automation

Phase 5: Research Contributions (πŸ“‹ Future)

  • Novel architecture with geometric constraints
  • Calibrated uncertainty learning
  • Differentiable BA improvements
  • Learned sensor fusion

Key Takeaways

  1. BA as Oracle: BA provides robust supervision for training visual geometry models.

  2. ARKit as Data Source: Real-world captures provide diverse, natural training data.

  3. Differentiability is Coming: Modern research is making traditionally non-differentiable steps differentiable.

  4. GPU Parallelization: Most components can be parallelized, but BA remains a bottleneck.

  5. YLFF is the Foundation: Current implementation provides the infrastructure for future research.

  6. Research Opportunities: Uncertainty calibration, geometric constraints, differentiable BA, multi-modal fusion.


References

  • DA3: Depth Anything 3 - Unified depth-ray representation
  • VGGT: Visual Geometry Grounded Transformer
  • COLMAP: Structure-from-Motion and Multi-View Stereo
  • Theseus: Differentiable optimization library (Meta)
  • gradSLAM: Differentiable SLAM components
  • SuperPoint/SuperGlue: Learned feature extraction and matching
  • LightGlue: Fast learned feature matching

Last Updated: 2024
Project: You Learn From Failure (YLFF)