Research Focus: BA-Supervised Learning for Visual Geometry
Table of Contents
- Executive Summary
- Current Implementation (YLFF)
- Research Questions
- Component Analysis
- Data Quality Hierarchy
- Differentiability & GPU Parallelization
- Future Research Directions
- Implementation Roadmap
Executive Summary
This document outlines the research focus for You Learn From Failure (YLFF), a framework for improving visual geometry models using Bundle Adjustment (BA) as an oracle teacher. The core hypothesis is that BA provides a robust, geometrically consistent supervision signal that can be used for both fine-tuning and large-scale pre-training.
Key Insights
- BA as Oracle: Bundle Adjustment provides geometrically consistent poses and depths that can serve as high-quality supervision
- ARKit as Data Source: Real-world ARKit captures provide diverse, natural motion data for training
- Differentiable Components: Modern research is making traditionally non-differentiable steps (matching, RANSAC, BA) differentiable
- GPU Parallelization: Most components can be parallelized, but BA remains a bottleneck
Research Philosophy
Don't treat ARKit as ground truth - it's derived. Focus on what's differentiable and GPU-parallel. Use geometric constraints as losses, not hard solves. Learn uncertainty, but calibrate it properly. LiDAR sparse depth is your best "ground truth" for depth.
Current Implementation (YLFF)
Overview
YLFF is a complete framework for BA-supervised learning, currently implemented with:
- ✅ BA Validation Pipeline: COLMAP-based validation using SuperPoint + LightGlue
- ✅ ARKit Integration: Full support for ARKit video and metadata processing
- ✅ Fine-Tuning: Train on failure cases using BA poses as pseudo-labels
- ✅ Pre-Training: Large-scale training on ARKit sequences with BA supervision
- ✅ Real-time Visualization: GUI for monitoring validation progress
- ✅ Model Selection: Automatic selection of best DA3 model (DA3NESTED-GIANT-LARGE for BA workflows)
Architecture
YLFF Pipeline

Data Collection (Device)
  ARKit Capture → Images + LiDAR + VIO Poses + Metadata
  (passive collection, all quality levels)
        ↓
BA Validation (Offline)
  1. Run DA3 → Poses_DA3, Depths_DA3
  2. Extract SuperPoint features
  3. Match with LightGlue
  4. Run COLMAP BA → Poses_BA
  5. Compare: error = ||Poses_DA3 - Poses_BA||
  6. Categorize:
     - Accept (error < 2°): Model good, skip
     - Reject-Learnable (2° < error < 30°): TRAIN
     - Reject-Outlier (error > 30°): Discard
        ↓
Training (GPU)
  Fine-Tuning: Train on rejected-learnable samples
  Pre-Training: Train on all ARKit sequences
  Loss: pose_loss(model(images), Poses_BA)
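The compare-and-categorize steps above can be sketched in a few lines. This is an illustrative sketch, not the actual YLFF code: it interprets `||Poses_DA3 - Poses_BA||` as the geodesic rotation distance between pose rotations, which is one reasonable choice of metric, and the 2°/30° thresholds are the ones stated in the diagram.

```python
import numpy as np

# Thresholds from the validation pipeline (2° accept, 30° outlier).
ACCEPT_DEG, OUTLIER_DEG = 2.0, 30.0

def rotation_error_deg(R_pred: np.ndarray, R_ba: np.ndarray) -> float:
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    R_rel = R_pred.T @ R_ba
    # trace(R) = 1 + 2*cos(theta); clip for numerical safety
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def categorize(error_deg: float) -> str:
    if error_deg < ACCEPT_DEG:
        return "accept"            # model agrees with BA, skip
    if error_deg < OUTLIER_DEG:
        return "reject-learnable"  # use BA pose as pseudo-label, train
    return "reject-outlier"        # likely catastrophic failure, discard
```

For example, a prediction that is rotated 10° about the z-axis relative to the BA pose lands in the reject-learnable bucket and becomes a training sample.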
Current Capabilities
1. BA Validation (ylff validate)
- Sequence Validation: Validate any image sequence with BA
- ARKit Validation: Specialized pipeline for ARKit data with ground truth comparison
- Real-time GUI: Monitor validation progress with live visualization
- Feature Caching: Optimized feature extraction with caching
- Smart Pairing: Reduce matching pairs for faster processing
2. Dataset Building (ylff dataset build)
- Automatic Curation: Process sequences and categorize by BA agreement
- Training Set Generation: Build training sets from rejected-learnable samples
- Quality Filtering: Filter by BA quality, reprojection error, etc.
3. Fine-Tuning (ylff train start)
- BA-Supervised Training: Train on failure cases using BA poses as labels
- Weighted Loss: Weight samples by error magnitude
- Checkpointing: Save model checkpoints during training
4. Pre-Training (ylff train pretrain)
- ARKit-Scale Training: Process hundreds of ARKit sequences
- BA as Teacher: Use BA poses and depths as supervision
- Optional Depth Supervision: Use BA depth maps or LiDAR depth
- Quality Filtering: Filter sequences by BA quality
5. Evaluation (ylff eval)
- BA Agreement Rate: Measure % of samples with error < threshold
- Pose Error Metrics: Rotation and translation errors
- Comparison Tools: Compare model predictions vs BA vs ARKit
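The BA agreement rate reduces to a one-line computation over per-sample errors. A minimal sketch (the function name and the 2° default are illustrative, not the actual `ylff eval` API):

```python
import numpy as np

def ba_agreement_rate(errors_deg, threshold_deg: float = 2.0) -> float:
    """Fraction of samples whose pose error vs BA is below the threshold."""
    errors = np.asarray(errors_deg, dtype=float)
    return float(np.mean(errors < threshold_deg))
```

With errors of [1.0, 3.0, 0.5, 10.0] degrees and a 2° threshold, the agreement rate is 0.5.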
Implementation Status
| Component | Status | Notes |
|---|---|---|
| BA Validation | ✅ Complete | COLMAP + SuperPoint + LightGlue |
| ARKit Processing | ✅ Complete | Video + metadata extraction |
| Fine-Tuning | ✅ Complete | BA-supervised training loop |
| Pre-Training | ✅ Complete | ARKit-scale training pipeline |
| Feature Caching | ✅ Complete | H5-based caching for speed |
| Smart Pairing | ✅ Complete | Sequential/spatial pairing modes |
| GUI Visualization | ✅ Complete | Real-time Tkinter GUI |
| Model Selection | ✅ Complete | Auto-select best DA3 model |
| Differentiable BA | ❌ Not Implemented | Future research direction |
| Uncertainty Calibration | ❌ Not Implemented | Future research direction |
Key Design Decisions
COLMAP BA (Not Differentiable): Using traditional COLMAP BA for validation. Differentiable BA is a future research direction.
ARKit as Data Source: Leveraging real-world ARKit captures for diverse training data.
BA as Oracle: Treating BA as the ground truth teacher, not ARKit VIO poses.
Failure-Focused Fine-Tuning: Training on cases where model fails, not all data.
Scale-Aware Pre-Training: Pre-training on all ARKit sequences for large-scale learning.
Research Questions
1. Can BA Serve as an Oracle Teacher?
Question: Is BA robust enough to provide reliable supervision signals for training?
Hypothesis: Yes - BA provides geometrically consistent poses that are more reliable than VIO alone.
Validation:
- ✅ Fine-tuning on BA-supervised data improves model accuracy
- 🔄 Pre-training on ARKit sequences validates scalability
- ❓ Long-term: Does the BA-supervised model generalize better?
Status: In Progress - Fine-tuning works, pre-training being validated.
2. Can Differentiable BA Match Traditional BA Quality?
Question: Can differentiable BA (Theseus, gradSLAM) achieve the same quality as COLMAP?
Hypothesis: Potentially, but convergence and local minima are challenges.
Research Direction:
- Differentiable BA libraries: Theseus, gradSLAM, PyPose
- End-to-end training with BA in the loop
- GPU-accelerated BA using PCG (Preconditioned Conjugate Gradient)
Status: Future Work - Currently using COLMAP (non-differentiable).
3. Can Uncertainty Be Properly Calibrated End-to-End?
Question: Can we learn uncertainty that is both useful and calibrated?
Current State: DA3/VGGT confidence heads are trained but not calibrated.
Research Opportunity:
- Proper calibration requires held-out validation
- Uncertainty that means something (not just confidence scores)
- End-to-end uncertainty learning with calibration loss
Status: Future Work - Not yet implemented.
4. What's the Right Inductive Bias?
Question: What architecture best enforces geometric constraints?
Current Approaches:
- VGGT: Point maps (direct 3D)
- DA3: Depth + rays (decomposed)
- Neither model enforces epipolar geometry
Research Opportunity: Architecture that enforces geometric constraints as part of the model, not just losses.
Status: Future Work - Current models don't enforce epipolar constraints.
5. Can Learned Sensor Fusion Beat ARKit's VIO?
Question: Can learned approaches outperform Apple's years of VIO engineering?
Challenge: Requires raw IMU data (ARKit gives poses, not raw IMU).
Potential: Learned approaches could generalize better across devices/scenarios.
Status: Future Work - Requires raw IMU data access.
Component Analysis
Component Comparison Matrix
| Component | ARKit | COLMAP+hloc | DA3 | VGGT | YLFF |
|---|---|---|---|---|---|
| Feature Extraction | Proprietary (ORB-like?) | SIFT or SuperPoint | DINOv2 (implicit) | DINOv2 (implicit) | SuperPoint |
| Feature Matching | KLT tracking + IMU | Exhaustive or SuperGlue | Cross-view attention | Cross-view attention | LightGlue |
| Relative Pose | VIO (EKF/factor graph) | Essential matrix + RANSAC | Learned (ray head) | Learned (camera head) | BA-refined |
| Global Optimization | NONE (drift accumulates) | Bundle Adjustment ✅ | NONE | NONE | COLMAP BA |
| Dense Depth | LiDAR (sparse) | MVS (slow) | Learned (fast) | Learned (fast) | DA3 + BA depth |
| Metric Scale | IMU ✅ | NONE (up to scale) | Trained with f_c=300 | Normalized | BA (metric) |
| Uncertainty | trackingState (discrete) | Covariance from BA | Confidence head | Confidence head | BA quality metrics |
What's "Raw" vs "Derived" in ARKit
| Data | Raw or Derived? | Source |
|---|---|---|
| Camera pixels | Raw | CMOS sensor readout |
| IMU acceleration | Raw | Accelerometer (100Hz+) |
| IMU angular velocity | Raw | Gyroscope (100Hz+) |
| LiDAR ToF returns | Raw | Time-of-flight pulses |
| Camera intrinsics | Calibrated (stored) | Factory calibration |
| --- | --- | --- |
| Poses (transform) | DERIVED | VIO algorithm (EKF/factor graph) |
| Tracking state | DERIVED | VIO internal heuristics |
| Feature points | DERIVED | ARKit's detector (ORB-like?) |
| Plane anchors | DERIVED | RANSAC plane fitting |
| Depth confidence | DERIVED | ARKit's confidence model |
| World mapping status | DERIVED | SLAM internal state |
Key Insight: ARKit poses are derived, not ground truth. BA provides a more robust signal.
Data Quality Hierarchy
Quality Pyramid
Data Quality Pyramid (strongest signal at the top):

1. LiDAR depth (high confidence): strongest signal (metric, direct ToF)
2. BA poses (refined, geometrically consistent): YLFF uses this as the teacher
3. ARKit poses (normal tracking, worldMappingStatus=extending/mapped): good signal (use for training)
4. ARKit poses (limited tracking, high featurePointCount): weak signal (noisy but still useful)
5. ARKit poses (limited tracking, excessiveMotion or insufficientFeatures, low feature count): negative signal (learn to detect and reject)
Data Collection Strategy
Phase 1: Collect Everything (Passive)
- Capture all ARKit data regardless of quality
- Store raw video, metadata, LiDAR depth
- No filtering at capture time
Phase 2: Automated Quality Filters
- Reject trackingState=notAvailable
- Reject worldMappingStatus=notAvailable
- Reject featurePointCount < threshold
- Reject LiDAR coverage < threshold
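The Phase 2 filters amount to a simple per-frame predicate. A sketch under assumptions: the field names mirror ARKit's API (`trackingState`, `worldMappingStatus`), but the flat-dict record layout, the `lidarCoverage` key, and the threshold defaults are hypothetical.

```python
def passes_quality_filters(frame: dict,
                           min_feature_points: int = 50,
                           min_lidar_coverage: float = 0.3) -> bool:
    """Automated Phase 2 quality gate for one captured frame."""
    if frame["trackingState"] == "notAvailable":
        return False
    if frame["worldMappingStatus"] == "notAvailable":
        return False
    if frame["featurePointCount"] < min_feature_points:
        return False
    if frame["lidarCoverage"] < min_lidar_coverage:
        return False
    return True
```

Frames rejected here are not discarded outright; per Phase 4 they can still feed the failure-detection subset.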
Phase 3: Human Review of Edge Cases
- Borderline tracking quality
- Unusual scenes (reflective, texture-poor)
- Sequences where VIO drifted then recovered
Phase 4: Build Labeled Dataset
- High-quality subset for supervised training
- Noisy subset for robust/uncertainty training
- Rejected subset for failure detection
Differentiability & GPU Parallelization
Making Traditional Pipeline Differentiable
| Step | Traditional | Differentiable Version | Status |
|---|---|---|---|
| Matching | argmax over scores | Soft matching (Sinkhorn in SuperGlue) | ✅ Available |
| RANSAC | Discrete sampling | Differentiable RANSAC (DSAC, NG-RANSAC) | ✅ Available |
| Essential Matrix | SVD decomposition | Differentiable SVD (PyTorch) | ✅ Available |
| Bundle Adjustment | Gauss-Newton | Differentiable BA (Theseus, gradSLAM) | ✅ Available |
| Triangulation | Linear least squares | Differentiable triangulation | ✅ Available |
GPU Parallelizability Analysis
| Operation | GPU-Parallel? | Why/Why Not | YLFF Status |
|---|---|---|---|
| Feature extraction | ✅ YES | Per-image, embarrassingly parallel | ✅ Implemented |
| Dense feature maps (DINOv2) | ✅ YES | Forward pass, batched | ✅ (DA3 uses this) |
| Feature matching (exhaustive) | ⚠️ Partial | O(N²) pairs, but each pair is parallel | ✅ (LightGlue) |
| Attention (cross-view) | ✅ YES | Matrix ops, highly optimized | ✅ (DA3 uses this) |
| RANSAC | ❌ NO | Sequential hypothesis testing | ❌ (Not used) |
| Essential matrix (8-point) | ⚠️ Partial | Linear algebra, but per-pair | ❌ (Not used) |
| Bundle Adjustment | ❌ NO | Iterative optimization, sparse solve | ✅ (COLMAP, CPU) |
| Cost volume (MVS) | ✅ YES | 3D convolutions, embarrassingly parallel | ❌ (Not used) |
| Dense depth (learned) | ✅ YES | Forward pass | ✅ (DA3) |
| Differentiable rendering | ✅ YES | Per-pixel, parallel | ❌ (Not used) |
GPU-Parallelizable BA Components
| Component | Standard | GPU Version | Status |
|---|---|---|---|
| Jacobian | Per-observation loop | Batched einsum | 🔄 Future |
| Hessian | Accumulate blocks | Scatter-add | 🔄 Future |
| Linear solve | Cholesky (sequential) | PCG (parallel) | 🔄 Future |
| Robust loss | Per-residual | Vectorized | 🔄 Future |
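The batched-einsum idea in the Jacobian/Hessian rows can be sketched in NumPy. The shapes are illustrative (2D reprojection residuals against a single 6-DoF camera); a real BA problem has separate per-camera and per-point blocks, but the loop-to-einsum transformation is the same and maps directly onto GPU kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_params = 1000, 6                       # observations, pose parameters
J = rng.standard_normal((n_obs, 2, n_params))   # per-observation Jacobians
r = rng.standard_normal((n_obs, 2))             # per-observation residuals

# Batched normal-equation blocks: H = sum_i J_i^T J_i, g = sum_i J_i^T r_i
H = np.einsum('nij,nik->jk', J, J)
g = np.einsum('nij,ni->j', J, r)

# Equivalent per-observation loop (slow, sequential on CPU)
H_loop = sum(J[i].T @ J[i] for i in range(n_obs))
assert np.allclose(H, H_loop)
```

The Gauss-Newton update then only needs a linear solve of `H x = -g`, which is where the Cholesky-vs-PCG row of the table comes in.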
BA Software Landscape
| Library | Language | GPU? | Differentiable? | Notes |
|---|---|---|---|---|
| Ceres Solver | C++ | ❌ | ❌ | Google, most mature, COLMAP uses this |
| g2o | C++ | ❌ | ❌ | Graph-based, ORB-SLAM uses this |
| GTSAM | C++ | ❌ | ❌ | Factor graphs, iSAM2 incremental |
| Theseus | Python/C++ | ✅ | ✅ | Meta, differentiable optimization |
| gradSLAM | Python | ✅ | ✅ | Differentiable SLAM components |
| PyPose | Python | ✅ | ✅ | SE(3) operations, batched |
| lietorch | Python | ✅ | ✅ | Lie group operations for SLAM |
YLFF Current Choice: COLMAP (Ceres Solver) - non-differentiable, CPU-based, but proven and reliable.
Future Direction: Migrate to Theseus or gradSLAM for differentiable, GPU-accelerated BA.
BA Methods Comparison
| Method | Formula | Pros | Cons | Used By |
|---|---|---|---|---|
| Gauss-Newton (GN) | Δx = -(JᵀJ)⁻¹ Jᵀr | Quadratic convergence near minimum | Can diverge if far from minimum | Many |
| Levenberg-Marquardt (LM) | Δx = -(JᵀJ + λI)⁻¹ Jᵀr | Interpolates GN and GD, more stable | Need to tune λ schedule | COLMAP |
| Dogleg (Powell) | Trust region method | More robust than LM | More complex implementation | Some |
| Preconditioned Conjugate Gradient (PCG) | Iterative solver | Scales to very large problems, GPU-friendly | Slower convergence, needs good preconditioner | Large-scale |
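The GN and LM update formulas above fit in one function. A toy NumPy sketch: for a linear residual r(x) = Jx - b, a single step with λ = 0 (pure Gauss-Newton) lands exactly on the least-squares solution, while a large λ shrinks toward a small gradient-descent-like step.

```python
import numpy as np

def lm_step(J: np.ndarray, r: np.ndarray, lam: float) -> np.ndarray:
    """One Levenberg-Marquardt update: dx = -(J^T J + lam*I)^(-1) J^T r.
    lam = 0 recovers a Gauss-Newton step."""
    A = J.T @ J + lam * np.eye(J.shape[1])
    return -np.linalg.solve(A, J.T @ r)

# Toy linear least-squares problem r(x) = J @ x - b, starting at x = 0
rng = np.random.default_rng(3)
J = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
dx = lm_step(J, -b, 0.0)   # residual at x=0 is -b; one GN step solves it
x_star, *_ = np.linalg.lstsq(J, b, rcond=None)
```

In real BA the residuals are nonlinear reprojection errors, so LM iterates this step while adapting λ, which is exactly the λ-schedule tuning noted in the Cons column.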
Computational Complexity
| Operation | Complexity | GPU-Parallel? | YLFF Status |
|---|---|---|---|
| Jacobian computation | O(K) per obs, O(N×M×k̄) total | ✅ YES | ✅ (COLMAP) |
| Hessian blocks | O(K) per obs | ✅ YES | ✅ (COLMAP) |
| Schur complement formation | O(N² × k̄) | ⚠️ Partial | ✅ (COLMAP) |
| Reduced system solve | O(N³) dense, O(N × k̄²) sparse | ❌ Hard | ✅ (COLMAP, CPU) |
| Point back-substitution | O(M × k̄²) | ✅ YES | ✅ (COLMAP) |
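The Schur complement trick behind these rows eliminates the many point parameters so that only a small camera system needs a direct solve. A dense NumPy sketch on a random SPD system: real BA exploits that the point block Hpp is block-diagonal (independent 3x3 blocks per point), so its inverse is cheap; here it is inverted densely purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
nc, npts = 6, 30                                # camera params, point params
A = rng.standard_normal((nc + npts, nc + npts))
H = A @ A.T + (nc + npts) * np.eye(nc + npts)   # SPD Hessian stand-in
b = rng.standard_normal(nc + npts)

Hcc, Hcp = H[:nc, :nc], H[:nc, nc:]
Hpc, Hpp = H[nc:, :nc], H[nc:, nc:]
bc, bp = b[:nc], b[nc:]

# Eliminate points, solve the small reduced camera system, back-substitute
Hpp_inv = np.linalg.inv(Hpp)                    # block-diagonal in real BA
S = Hcc - Hcp @ Hpp_inv @ Hpc                   # Schur complement
xc = np.linalg.solve(S, bc - Hcp @ Hpp_inv @ bp)
xp = Hpp_inv @ (bp - Hpc @ xc)

# Matches the full solve exactly
assert np.allclose(np.concatenate([xc, xp]), np.linalg.solve(H, b))
```

The reduced system S is N_cameras-sized, which is why the O(N³) dense solve (or sparse/PCG alternatives) dominates large-scale BA.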
Future Research Directions
1. Differentiable BA Integration
Goal: Replace COLMAP BA with differentiable BA for end-to-end training.
Approach:
- Integrate Theseus or gradSLAM
- GPU-accelerated BA using PCG
- End-to-end training with BA in the loop
Benefits:
- Gradient flow through BA
- GPU acceleration
- End-to-end optimization
Challenges:
- Convergence guarantees
- Local minima
- Computational cost
2. Research-Oriented Pipeline
Vision: Fully GPU-parallel, differentiable pipeline.
```python
# Forward pass (fully GPU-parallel)
features = backbone(images)                            # [N, H, W, C]
poses_pred, pose_uncertainty = pose_head(features)     # [N, 4, 4], [N, 6, 6]
depths_pred, depth_uncertainty = depth_head(features)  # [N, H, W], [N, H, W]
correspondences = match_head(features)                 # [N, N, K, 2] soft matches

# Differentiable geometric losses (still GPU-parallel)
epipolar_loss = epipolar_error(correspondences, poses_pred)
reproj_loss = reprojection_error(depths_pred, poses_pred, correspondences)
lidar_loss = l1_loss(depths_pred, lidar_sparse, weights=lidar_confidence)

# Optional: differentiable BA refinement (GPU, but iterative)
poses_refined = differentiable_ba(poses_pred, correspondences, depths_pred)

# Total loss with uncertainty weighting
loss = (epipolar_loss / pose_uncertainty.det()
        + reproj_loss / depth_uncertainty
        + lidar_loss)
```
Architecture:
Images → Backbone (DINOv2/ViT) → Dense Features
                    ↓
     Cross-View Attention (all pairs)
                    ↓
     ┌──────────────┼──────────────┐
     ↓              ↓              ↓
 Pose Head      Depth Head    Correspondence Head
 (per-frame)    (per-pixel)   (soft matches)
     ↓              ↓              ↓
 Poses [N,4,4]  Depths [N,H,W] Matches [N,N,K,K]
                    ↓
      Differentiable Triangulation
                    ↓
     Differentiable Reprojection Loss
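The Differentiable Triangulation node reduces to the classic DLT (direct linear transform) solve, which is differentiable because it is built from an SVD. A NumPy sketch for a single point seen by two cameras; the same operations exist in autodiff frameworks, so gradients can flow through the triangulated point.

```python
import numpy as np

def triangulate_dlt(P1: np.ndarray, P2: np.ndarray,
                    x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """Linear triangulation of one 3D point from two 3x4 projection
    matrices and its pixel observations (x, y) in each view."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # null vector of A = homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]
```

For two calibrated cameras separated by a unit baseline along x, a point at depth 5 projects to (0, 0) and (-0.2, 0), and the solve recovers (0, 0, 5).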
3. Uncertainty Calibration
Goal: Learn properly calibrated uncertainty.
Approach:
- Calibration loss on held-out validation set
- Uncertainty that means something (not just confidence scores)
- End-to-end uncertainty learning
Research Questions:
- How to calibrate pose uncertainty?
- How to calibrate depth uncertainty?
- Can uncertainty predict BA failure?
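One minimal answer to "how to calibrate" is variance rescaling on a held-out set: pick a scalar s that minimizes the Gaussian negative log-likelihood of held-out residuals under the model's predicted sigmas. Setting the derivative of the NLL to zero gives the closed form s² = mean(r²/σ²). This sketch is one of many calibration schemes (temperature scaling, conformal prediction, etc.), not YLFF's implementation.

```python
import numpy as np

def calibrate_scale(residuals: np.ndarray, sigmas: np.ndarray) -> float:
    """Closed-form variance rescaling: find s minimizing the Gaussian NLL
    of residuals under N(0, (s*sigma_i)^2) on a held-out set."""
    z2 = (np.asarray(residuals) / np.asarray(sigmas)) ** 2
    return float(np.sqrt(z2.mean()))
```

If s comes out near 1 the confidence head is already calibrated; s > 1 means the model is overconfident (predicted sigmas too small), which is the failure mode the section above flags for DA3/VGGT confidence heads.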
4. Geometric Constraint Enforcement
Goal: Architecture that enforces geometric constraints.
Current Gap: DA3 and VGGT don't enforce epipolar geometry.
Research Direction:
- Epipolar constraint as differentiable loss
- Reprojection consistency as loss
- Architecture that naturally satisfies constraints
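The epipolar constraint loss above is commonly realized as the first-order (Sampson) approximation of epipolar distance. A NumPy sketch assuming a known fundamental or essential matrix F and homogeneous correspondences; in the end-to-end setting F would instead be composed differentiably from the predicted poses.

```python
import numpy as np

def sampson_error(F: np.ndarray, x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """First-order epipolar error for N correspondences x1 <-> x2,
    given as Nx3 homogeneous coordinates."""
    Fx1 = x1 @ F.T                    # epipolar lines in image 2
    Ftx2 = x2 @ F                     # epipolar lines in image 1
    num = np.einsum('ni,ni->n', x2, Fx1) ** 2   # (x2^T F x1)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den
```

The error is zero for points exactly on their epipolar lines and grows smoothly as they drift off, so its mean can be minimized by gradient descent as a training loss.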
5. Multi-Modal Fusion
Goal: Learn from camera + LiDAR + IMU.
Challenge: ARKit doesn't provide raw IMU data.
Research Direction:
- LiDAR depth as supervision signal
- IMU integration (if raw data available)
- Learned sensor fusion
6. Self-Improving Pipeline
Vision: Pipeline that gets better over time.
1. Collect data (ARKit + LiDAR + images)
   ↓
2. Run DA3 (fast inference)
   ↓
3. Run BA validation (slow, offline)
   ↓
4. Filter/label samples:
   - DA3 agrees with BA → high-quality training sample
   - DA3 disagrees → use BA as pseudo-label, retrain DA3
   - Both fail → reject, but save for failure analysis
   ↓
5. Retrain DA3 on curated data
   ↓
6. Repeat → DA3 gets better over time
YLFF Status: ✅ Implemented for fine-tuning, 🔄 Being validated for pre-training.
Implementation Roadmap
Phase 1: Foundation (✅ Complete)
- BA validation pipeline (COLMAP + SuperPoint + LightGlue)
- ARKit data processing
- Fine-tuning on failure cases
- Pre-training on ARKit sequences
- Real-time visualization
- Model selection and recommendations
Phase 2: Optimization (🔄 In Progress)
- Feature caching
- Smart pair selection
- GPU-accelerated feature extraction
- Parallel BA processing
- Distributed training
Phase 3: Differentiability (🔄 Planned)
- Integrate Theseus or gradSLAM
- Differentiable BA in training loop
- End-to-end gradient flow
- GPU-accelerated BA
Phase 4: Advanced Features (🔄 Planned)
- Uncertainty calibration
- Geometric constraint enforcement
- Multi-modal fusion (LiDAR + IMU)
- Self-improving pipeline automation
Phase 5: Research Contributions (🔄 Future)
- Novel architecture with geometric constraints
- Calibrated uncertainty learning
- Differentiable BA improvements
- Learned sensor fusion
Key Takeaways
BA as Oracle: BA provides robust supervision for training visual geometry models.
ARKit as Data Source: Real-world captures provide diverse, natural training data.
Differentiability is Coming: Modern research is making traditionally non-differentiable steps differentiable.
GPU Parallelization: Most components can be parallelized, but BA remains a bottleneck.
YLFF is the Foundation: Current implementation provides the infrastructure for future research.
Research Opportunities: Uncertainty calibration, geometric constraints, differentiable BA, multi-modal fusion.
References
- DA3: Depth Anything 3 - Unified depth-ray representation
- VGGT: Visual Geometry Grounded Transformer
- COLMAP: Structure-from-Motion and Multi-View Stereo
- Theseus: Differentiable optimization library (Meta)
- gradSLAM: Differentiable SLAM components
- SuperPoint/SuperGlue: Learned feature extraction and matching
- LightGlue: Fast learned feature matching
Last Updated: 2024
Project: You Learn From Failure (YLFF)