Research Focus: BA-Supervised Learning for Visual Geometry
Table of Contents
- Executive Summary
- Current Implementation (YLFF)
- Research Questions
- Component Analysis
- Data Quality Hierarchy
- Differentiability & GPU Parallelization
- Future Research Directions
- Implementation Roadmap
Executive Summary
This document outlines the research focus for You Learn From Failure (YLFF), a framework for improving visual geometry models using Bundle Adjustment (BA) as an oracle teacher. The core hypothesis is that BA provides a robust, geometrically consistent supervision signal that can be used for both fine-tuning and large-scale pre-training.
Key Insights
- BA as Oracle: Bundle Adjustment provides geometrically consistent poses and depths that can serve as high-quality supervision
- ARKit as Data Source: Real-world ARKit captures provide diverse, natural motion data for training
- Differentiable Components: Modern research is making traditionally non-differentiable steps (matching, RANSAC, BA) differentiable
- GPU Parallelization: Most components can be parallelized, but BA remains a bottleneck
Research Philosophy
Don't treat ARKit as ground truth - it's derived. Focus on what's differentiable and GPU-parallel. Use geometric constraints as losses, not hard solves. Learn uncertainty, but calibrate it properly. LiDAR sparse depth is your best "ground truth" for depth.
Current Implementation (YLFF)
Overview
YLFF is a complete framework for BA-supervised learning, currently implemented with:
- ✅ BA Validation Pipeline: COLMAP-based validation using SuperPoint + LightGlue
- ✅ ARKit Integration: Full support for ARKit video and metadata processing
- ✅ Fine-Tuning: Train on failure cases using BA poses as pseudo-labels
- ✅ Pre-Training: Large-scale training on ARKit sequences with BA supervision
- ✅ Real-time Visualization: GUI for monitoring validation progress
- ✅ Model Selection: Automatic selection of best DA3 model (DA3NESTED-GIANT-LARGE for BA workflows)
Architecture
YLFF Pipeline

Data Collection (Device)
  ARKit Capture → Images + LiDAR + VIO Poses + Metadata
  (passive collection, all quality levels)
        ↓
BA Validation (Offline)
  1. Run DA3 → Poses_DA3, Depths_DA3
  2. Extract SuperPoint features
  3. Match with LightGlue
  4. Run COLMAP BA → Poses_BA
  5. Compare: error = ||Poses_DA3 - Poses_BA||
  6. Categorize:
     - Accept (error < 2°): Model good, skip
     - Reject-Learnable (2° < error < 30°): TRAIN
     - Reject-Outlier (error > 30°): Discard
        ↓
Training (GPU)
  Fine-Tuning: Train on rejected-learnable samples
  Pre-Training: Train on all ARKit sequences
  Loss: pose_loss(model(images), Poses_BA)
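The compare-and-categorize steps above can be sketched in a few lines. This is an illustrative sketch, not the actual YLFF code: it interprets `||Poses_DA3 - Poses_BA||` as the geodesic rotation distance between pose rotations, which is one reasonable choice of metric, and the 2°/30° thresholds are the ones stated in the diagram.

```python
import numpy as np

# Thresholds from the validation pipeline (2° accept, 30° outlier).
ACCEPT_DEG, OUTLIER_DEG = 2.0, 30.0

def rotation_error_deg(R_pred: np.ndarray, R_ba: np.ndarray) -> float:
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    R_rel = R_pred.T @ R_ba
    # trace(R) = 1 + 2*cos(theta); clip for numerical safety
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def categorize(error_deg: float) -> str:
    if error_deg < ACCEPT_DEG:
        return "accept"            # model agrees with BA, skip
    if error_deg < OUTLIER_DEG:
        return "reject-learnable"  # use BA pose as pseudo-label, train
    return "reject-outlier"        # likely catastrophic failure, discard
```

For example, a prediction that is rotated 10° about the z-axis relative to the BA pose lands in the reject-learnable bucket and becomes a training sample.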
Current Capabilities
1. BA Validation (ylff validate)
- Sequence Validation: Validate any image sequence with BA
- ARKit Validation: Specialized pipeline for ARKit data with ground truth comparison
- Real-time GUI: Monitor validation progress with live visualization
- Feature Caching: Optimized feature extraction with caching
- Smart Pairing: Reduce matching pairs for faster processing
2. Dataset Building (ylff dataset build)
- Automatic Curation: Process sequences and categorize by BA agreement
- Training Set Generation: Build training sets from rejected-learnable samples
- Quality Filtering: Filter by BA quality, reprojection error, etc.
3. Fine-Tuning (ylff train start)
- BA-Supervised Training: Train on failure cases using BA poses as labels
- Weighted Loss: Weight samples by error magnitude
- Checkpointing: Save model checkpoints during training
4. Pre-Training (ylff train pretrain)
- ARKit-Scale Training: Process hundreds of ARKit sequences
- BA as Teacher: Use BA poses and depths as supervision
- Optional Depth Supervision: Use BA depth maps or LiDAR depth
- Quality Filtering: Filter sequences by BA quality
5. Evaluation (ylff eval)
- BA Agreement Rate: Measure % of samples with error < threshold
- Pose Error Metrics: Rotation and translation errors
- Comparison Tools: Compare model predictions vs BA vs ARKit
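The BA agreement rate reduces to a one-line computation over per-sample errors. A minimal sketch (the function name and the 2° default are illustrative, not the actual `ylff eval` API):

```python
import numpy as np

def ba_agreement_rate(errors_deg, threshold_deg: float = 2.0) -> float:
    """Fraction of samples whose pose error vs BA is below the threshold."""
    errors = np.asarray(errors_deg, dtype=float)
    return float(np.mean(errors < threshold_deg))
```

With errors of [1.0, 3.0, 0.5, 10.0] degrees and a 2° threshold, the agreement rate is 0.5.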
Implementation Status
| Component | Status | Notes |
|---|---|---|
| BA Validation | ✅ Complete | COLMAP + SuperPoint + LightGlue |
| ARKit Processing | ✅ Complete | Video + metadata extraction |
| Fine-Tuning | ✅ Complete | BA-supervised training loop |
| Pre-Training | ✅ Complete | ARKit-scale training pipeline |
| Feature Caching | ✅ Complete | H5-based caching for speed |
| Smart Pairing | ✅ Complete | Sequential/spatial pairing modes |
| GUI Visualization | ✅ Complete | Real-time Tkinter GUI |
| Model Selection | ✅ Complete | Auto-select best DA3 model |
| Differentiable BA | ❌ Not Implemented | Future research direction |
| Uncertainty Calibration | ❌ Not Implemented | Future research direction |
Key Design Decisions
COLMAP BA (Not Differentiable): Using traditional COLMAP BA for validation. Differentiable BA is a future research direction.
ARKit as Data Source: Leveraging real-world ARKit captures for diverse training data.
BA as Oracle: Treating BA as the ground truth teacher, not ARKit VIO poses.
Failure-Focused Fine-Tuning: Training on cases where model fails, not all data.
Scale-Aware Pre-Training: Pre-training on all ARKit sequences for large-scale learning.
Research Questions
1. Can BA Serve as an Oracle Teacher?
Question: Is BA robust enough to provide reliable supervision signals for training?
Hypothesis: Yes - BA provides geometrically consistent poses that are more reliable than VIO alone.
Validation:
- ✅ Fine-tuning on BA-supervised data improves model accuracy
- 🔄 Pre-training on ARKit sequences validates scalability
- ❓ Long-term: Does the BA-supervised model generalize better?
Status: In Progress - Fine-tuning works, pre-training being validated.
2. Can Differentiable BA Match Traditional BA Quality?
Question: Can differentiable BA (Theseus, gradSLAM) achieve the same quality as COLMAP?
Hypothesis: Potentially, but convergence and local minima are challenges.
Research Direction:
- Differentiable BA libraries: Theseus, gradSLAM, PyPose
- End-to-end training with BA in the loop
- GPU-accelerated BA using PCG (Preconditioned Conjugate Gradient)
Status: Future Work - Currently using COLMAP (non-differentiable).
3. Can Uncertainty Be Properly Calibrated End-to-End?
Question: Can we learn uncertainty that is both useful and calibrated?
Current State: DA3/VGGT confidence heads are trained but not calibrated.
Research Opportunity:
- Proper calibration requires held-out validation
- Uncertainty that means something (not just confidence scores)
- End-to-end uncertainty learning with calibration loss
Status: Future Work - Not yet implemented.
4. What's the Right Inductive Bias?
Question: What architecture best enforces geometric constraints?
Current Approaches:
- VGGT: Point maps (direct 3D)
- DA3: Depth + rays (decomposed)
- Neither model enforces epipolar geometry
Research Opportunity: Architecture that enforces geometric constraints as part of the model, not just losses.
Status: Future Work - Current models don't enforce epipolar constraints.
5. Can Learned Sensor Fusion Beat ARKit's VIO?
Question: Can learned approaches outperform Apple's years of VIO engineering?
Challenge: Requires raw IMU data (ARKit gives poses, not raw IMU).
Potential: Learned approaches could generalize better across devices/scenarios.
Status: Future Work - Requires raw IMU data access.
Component Analysis
Component Comparison Matrix
| Component | ARKit | COLMAP+hloc | DA3 | VGGT | YLFF |
|---|---|---|---|---|---|
| Feature Extraction | Proprietary (ORB-like?) | SIFT or SuperPoint | DINOv2 (implicit) | DINOv2 (implicit) | SuperPoint |
| Feature Matching | KLT tracking + IMU | Exhaustive or SuperGlue | Cross-view attention | Cross-view attention | LightGlue |
| Relative Pose | VIO (EKF/factor graph) | Essential matrix + RANSAC | Learned (ray head) | Learned (camera head) | BA-refined |
| Global Optimization | NONE (drift accumulates) | Bundle Adjustment ✅ | NONE | NONE | COLMAP BA |
| Dense Depth | LiDAR (sparse) | MVS (slow) | Learned (fast) | Learned (fast) | DA3 + BA depth |
| Metric Scale | IMU ✅ | NONE (up to scale) | Trained with f_c=300 | Normalized | BA (metric) |
| Uncertainty | trackingState (discrete) | Covariance from BA | Confidence head | Confidence head | BA quality metrics |
What's "Raw" vs "Derived" in ARKit
| Data | Raw or Derived? | Source |
|---|---|---|
| Camera pixels | Raw | CMOS sensor readout |
| IMU acceleration | Raw | Accelerometer (100Hz+) |
| IMU angular velocity | Raw | Gyroscope (100Hz+) |
| LiDAR ToF returns | Raw | Time-of-flight pulses |
| Camera intrinsics | Calibrated (stored) | Factory calibration |
| --- | --- | --- |
| Poses (transform) | DERIVED | VIO algorithm (EKF/factor graph) |
| Tracking state | DERIVED | VIO internal heuristics |
| Feature points | DERIVED | ARKit's detector (ORB-like?) |
| Plane anchors | DERIVED | RANSAC plane fitting |
| Depth confidence | DERIVED | ARKit's confidence model |
| World mapping status | DERIVED | SLAM internal state |
Key Insight: ARKit poses are derived, not ground truth. BA provides a more robust signal.
Data Quality Hierarchy
Quality Pyramid
Data Quality Pyramid (strongest signal at the top):

1. LiDAR depth (high confidence): strongest signal (metric, direct ToF)
2. BA poses (refined, geometrically consistent): YLFF uses this as the teacher
3. ARKit poses (normal tracking, worldMappingStatus=extending/mapped): good signal (use for training)
4. ARKit poses (limited tracking, high featurePointCount): weak signal (noisy but still useful)
5. ARKit poses (limited tracking, excessiveMotion or insufficientFeatures, low feature count): negative signal (learn to detect and reject)
Data Collection Strategy
Phase 1: Collect Everything (Passive)
- Capture all ARKit data regardless of quality
- Store raw video, metadata, LiDAR depth
- No filtering at capture time
Phase 2: Automated Quality Filters
- Reject trackingState=notAvailable
- Reject worldMappingStatus=notAvailable
- Reject featurePointCount < threshold
- Reject LiDAR coverage < threshold
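The Phase 2 filters amount to a simple per-frame predicate. A sketch under assumptions: the field names mirror ARKit's API (`trackingState`, `worldMappingStatus`), but the flat-dict record layout, the `lidarCoverage` key, and the threshold defaults are hypothetical.

```python
def passes_quality_filters(frame: dict,
                           min_feature_points: int = 50,
                           min_lidar_coverage: float = 0.3) -> bool:
    """Automated Phase 2 quality gate for one captured frame."""
    if frame["trackingState"] == "notAvailable":
        return False
    if frame["worldMappingStatus"] == "notAvailable":
        return False
    if frame["featurePointCount"] < min_feature_points:
        return False
    if frame["lidarCoverage"] < min_lidar_coverage:
        return False
    return True
```

Frames rejected here are not discarded outright; per Phase 4 they can still feed the failure-detection subset.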
Phase 3: Human Review of Edge Cases
- Borderline tracking quality
- Unusual scenes (reflective, texture-poor)
- Sequences where VIO drifted then recovered
Phase 4: Build Labeled Dataset
- High-quality subset for supervised training
- Noisy subset for robust/uncertainty training
- Rejected subset for failure detection
Differentiability & GPU Parallelization
Making Traditional Pipeline Differentiable
| Step | Traditional | Differentiable Version | Status |
|---|---|---|---|
| Matching | argmax over scores | Soft matching (Sinkhorn in SuperGlue) | ✅ Available |
| RANSAC | Discrete sampling | Differentiable RANSAC (DSAC, NG-RANSAC) | ✅ Available |
| Essential Matrix | SVD decomposition | Differentiable SVD (PyTorch) | ✅ Available |
| Bundle Adjustment | Gauss-Newton | Differentiable BA (Theseus, gradSLAM) | ✅ Available |
| Triangulation | Linear least squares | Differentiable triangulation | ✅ Available |
GPU Parallelizability Analysis
| Operation | GPU-Parallel? | Why/Why Not | YLFF Status |
|---|---|---|---|
| Feature extraction | ✅ YES | Per-image, embarrassingly parallel | ✅ Implemented |
| Dense feature maps (DINOv2) | ✅ YES | Forward pass, batched | ✅ (DA3 uses this) |
| Feature matching (exhaustive) | ⚠️ Partial | O(N²) pairs, but each pair is parallel | ✅ (LightGlue) |
| Attention (cross-view) | ✅ YES | Matrix ops, highly optimized | ✅ (DA3 uses this) |
| RANSAC | ❌ NO | Sequential hypothesis testing | ❌ (Not used) |
| Essential matrix (8-point) | ⚠️ Partial | Linear algebra, but per-pair | ❌ (Not used) |
| Bundle Adjustment | ❌ NO | Iterative optimization, sparse solve | ✅ (COLMAP, CPU) |
| Cost volume (MVS) | ✅ YES | 3D convolutions, embarrassingly parallel | ❌ (Not used) |
| Dense depth (learned) | ✅ YES | Forward pass | ✅ (DA3) |
| Differentiable rendering | ✅ YES | Per-pixel, parallel | ❌ (Not used) |
GPU-Parallelizable BA Components
| Component | Standard | GPU Version | Status |
|---|---|---|---|
| Jacobian | Per-observation loop | Batched einsum | 🔄 Future |
| Hessian | Accumulate blocks | Scatter-add | 🔄 Future |
| Linear solve | Cholesky (sequential) | PCG (parallel) | 🔄 Future |
| Robust loss | Per-residual | Vectorized | 🔄 Future |
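The batched-einsum idea in the Jacobian/Hessian rows can be sketched in NumPy. The shapes are illustrative (2D reprojection residuals against a single 6-DoF camera); a real BA problem has separate per-camera and per-point blocks, but the loop-to-einsum transformation is the same and maps directly onto GPU kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_params = 1000, 6                       # observations, pose parameters
J = rng.standard_normal((n_obs, 2, n_params))   # per-observation Jacobians
r = rng.standard_normal((n_obs, 2))             # per-observation residuals

# Batched normal-equation blocks: H = sum_i J_i^T J_i, g = sum_i J_i^T r_i
H = np.einsum('nij,nik->jk', J, J)
g = np.einsum('nij,ni->j', J, r)

# Equivalent per-observation loop (slow, sequential on CPU)
H_loop = sum(J[i].T @ J[i] for i in range(n_obs))
assert np.allclose(H, H_loop)
```

The Gauss-Newton update then only needs a linear solve of `H x = -g`, which is where the Cholesky-vs-PCG row of the table comes in.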
BA Software Landscape
| Library | Language | GPU? | Differentiable? | Notes |
|---|---|---|---|---|
| Ceres Solver | C++ | ❌ | ❌ | Google, most mature, COLMAP uses this |
| g2o | C++ | ❌ | ❌ | Graph-based, ORB-SLAM uses this |
| GTSAM | C++ | ❌ | ❌ | Factor graphs, iSAM2 incremental |
| Theseus | Python/C++ | ✅ | ✅ | Meta, differentiable optimization |
| gradSLAM | Python | ✅ | ✅ | Differentiable SLAM components |
| PyPose | Python | ✅ | ✅ | SE(3) operations, batched |
| lietorch | Python | ✅ | ✅ | Lie group operations for SLAM |
YLFF Current Choice: COLMAP (Ceres Solver) - non-differentiable, CPU-based, but proven and reliable.
Future Direction: Migrate to Theseus or gradSLAM for differentiable, GPU-accelerated BA.
BA Methods Comparison
| Method | Formula | Pros | Cons | Used By |
|---|---|---|---|---|
| Gauss-Newton (GN) | Δx = -(JᵀJ)⁻¹ Jᵀr | Quadratic convergence near minimum | Can diverge if far from minimum | Many |
| Levenberg-Marquardt (LM) | Δx = -(JᵀJ + λI)⁻¹ Jᵀr | Interpolates GN and GD, more stable | Need to tune λ schedule | COLMAP |
| Dogleg (Powell) | Trust region method | More robust than LM | More complex implementation | Some |
| Preconditioned Conjugate Gradient (PCG) | Iterative solver | Scales to very large problems, GPU-friendly | Slower convergence, needs good preconditioner | Large-scale |
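The GN and LM update formulas above fit in one function. A toy NumPy sketch: for a linear residual r(x) = Jx - b, a single step with λ = 0 (pure Gauss-Newton) lands exactly on the least-squares solution, while a large λ shrinks toward a small gradient-descent-like step.

```python
import numpy as np

def lm_step(J: np.ndarray, r: np.ndarray, lam: float) -> np.ndarray:
    """One Levenberg-Marquardt update: dx = -(J^T J + lam*I)^(-1) J^T r.
    lam = 0 recovers a Gauss-Newton step."""
    A = J.T @ J + lam * np.eye(J.shape[1])
    return -np.linalg.solve(A, J.T @ r)

# Toy linear least-squares problem r(x) = J @ x - b, starting at x = 0
rng = np.random.default_rng(3)
J = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
dx = lm_step(J, -b, 0.0)   # residual at x=0 is -b; one GN step solves it
x_star, *_ = np.linalg.lstsq(J, b, rcond=None)
```

In real BA the residuals are nonlinear reprojection errors, so LM iterates this step while adapting λ, which is exactly the λ-schedule tuning noted in the Cons column.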
Computational Complexity
| Operation | Complexity | GPU-Parallel? | YLFF Status |
|---|---|---|---|
| Jacobian computation | O(K) per obs, O(N×M×k̄) total | ✅ YES | ✅ (COLMAP) |
| Hessian blocks | O(K) per obs | ✅ YES | ✅ (COLMAP) |
| Schur complement formation | O(N² × k̄) | ⚠️ Partial | ✅ (COLMAP) |
| Reduced system solve | O(N³) dense, O(N × k̄²) sparse | ❌ Hard | ✅ (COLMAP, CPU) |
| Point back-substitution | O(M × k̄²) | ✅ YES | ✅ (COLMAP) |
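The Schur complement trick behind these rows eliminates the many point parameters so that only a small camera system needs a direct solve. A dense NumPy sketch on a random SPD system: real BA exploits that the point block Hpp is block-diagonal (independent 3x3 blocks per point), so its inverse is cheap; here it is inverted densely purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
nc, npts = 6, 30                                # camera params, point params
A = rng.standard_normal((nc + npts, nc + npts))
H = A @ A.T + (nc + npts) * np.eye(nc + npts)   # SPD Hessian stand-in
b = rng.standard_normal(nc + npts)

Hcc, Hcp = H[:nc, :nc], H[:nc, nc:]
Hpc, Hpp = H[nc:, :nc], H[nc:, nc:]
bc, bp = b[:nc], b[nc:]

# Eliminate points, solve the small reduced camera system, back-substitute
Hpp_inv = np.linalg.inv(Hpp)                    # block-diagonal in real BA
S = Hcc - Hcp @ Hpp_inv @ Hpc                   # Schur complement
xc = np.linalg.solve(S, bc - Hcp @ Hpp_inv @ bp)
xp = Hpp_inv @ (bp - Hpc @ xc)

# Matches the full solve exactly
assert np.allclose(np.concatenate([xc, xp]), np.linalg.solve(H, b))
```

The reduced system S is N_cameras-sized, which is why the O(N³) dense solve (or sparse/PCG alternatives) dominates large-scale BA.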
Future Research Directions
1. Differentiable BA Integration
Goal: Replace COLMAP BA with differentiable BA for end-to-end training.
Approach:
- Integrate Theseus or gradSLAM
- GPU-accelerated BA using PCG
- End-to-end training with BA in the loop
Benefits:
- Gradient flow through BA
- GPU acceleration
- End-to-end optimization
Challenges:
- Convergence guarantees
- Local minima
- Computational cost
2. Research-Oriented Pipeline
Vision: Fully GPU-parallel, differentiable pipeline.
```python
# Forward pass (fully GPU-parallel)
features = backbone(images)                            # [N, H, W, C]
poses_pred, pose_uncertainty = pose_head(features)     # [N, 4, 4], [N, 6, 6]
depths_pred, depth_uncertainty = depth_head(features)  # [N, H, W], [N, H, W]
correspondences = match_head(features)                 # [N, N, K, 2] soft matches

# Differentiable geometric losses (still GPU-parallel)
epipolar_loss = epipolar_error(correspondences, poses_pred)
reproj_loss = reprojection_error(depths_pred, poses_pred, correspondences)
lidar_loss = l1_loss(depths_pred, lidar_sparse, weights=lidar_confidence)

# Optional: differentiable BA refinement (GPU, but iterative)
poses_refined = differentiable_ba(poses_pred, correspondences, depths_pred)

# Total loss with uncertainty weighting
loss = (epipolar_loss / pose_uncertainty.det()
        + reproj_loss / depth_uncertainty
        + lidar_loss)
```
Architecture:
Images → Backbone (DINOv2/ViT) → Dense Features
                    ↓
     Cross-View Attention (all pairs)
                    ↓
     ┌──────────────┼──────────────┐
     ↓              ↓              ↓
 Pose Head      Depth Head    Correspondence Head
 (per-frame)    (per-pixel)   (soft matches)
     ↓              ↓              ↓
 Poses [N,4,4]  Depths [N,H,W] Matches [N,N,K,K]
                    ↓
      Differentiable Triangulation
                    ↓
     Differentiable Reprojection Loss
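The Differentiable Triangulation node reduces to the classic DLT (direct linear transform) solve, which is differentiable because it is built from an SVD. A NumPy sketch for a single point seen by two cameras; the same operations exist in autodiff frameworks, so gradients can flow through the triangulated point.

```python
import numpy as np

def triangulate_dlt(P1: np.ndarray, P2: np.ndarray,
                    x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """Linear triangulation of one 3D point from two 3x4 projection
    matrices and its pixel observations (x, y) in each view."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # null vector of A = homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]
```

For two calibrated cameras separated by a unit baseline along x, a point at depth 5 projects to (0, 0) and (-0.2, 0), and the solve recovers (0, 0, 5).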
3. Uncertainty Calibration
Goal: Learn properly calibrated uncertainty.
Approach:
- Calibration loss on held-out validation set
- Uncertainty that means something (not just confidence scores)
- End-to-end uncertainty learning
Research Questions:
- How to calibrate pose uncertainty?
- How to calibrate depth uncertainty?
- Can uncertainty predict BA failure?
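One minimal answer to "how to calibrate" is variance rescaling on a held-out set: pick a scalar s that minimizes the Gaussian negative log-likelihood of held-out residuals under the model's predicted sigmas. Setting the derivative of the NLL to zero gives the closed form s² = mean(r²/σ²). This sketch is one of many calibration schemes (temperature scaling, conformal prediction, etc.), not YLFF's implementation.

```python
import numpy as np

def calibrate_scale(residuals: np.ndarray, sigmas: np.ndarray) -> float:
    """Closed-form variance rescaling: find s minimizing the Gaussian NLL
    of residuals under N(0, (s*sigma_i)^2) on a held-out set."""
    z2 = (np.asarray(residuals) / np.asarray(sigmas)) ** 2
    return float(np.sqrt(z2.mean()))
```

If s comes out near 1 the confidence head is already calibrated; s > 1 means the model is overconfident (predicted sigmas too small), which is the failure mode the section above flags for DA3/VGGT confidence heads.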
4. Geometric Constraint Enforcement
Goal: Architecture that enforces geometric constraints.
Current Gap: DA3 and VGGT don't enforce epipolar geometry.
Research Direction:
- Epipolar constraint as differentiable loss
- Reprojection consistency as loss
- Architecture that naturally satisfies constraints
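The epipolar constraint loss above is commonly realized as the first-order (Sampson) approximation of epipolar distance. A NumPy sketch assuming a known fundamental or essential matrix F and homogeneous correspondences; in the end-to-end setting F would instead be composed differentiably from the predicted poses.

```python
import numpy as np

def sampson_error(F: np.ndarray, x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """First-order epipolar error for N correspondences x1 <-> x2,
    given as Nx3 homogeneous coordinates."""
    Fx1 = x1 @ F.T                    # epipolar lines in image 2
    Ftx2 = x2 @ F                     # epipolar lines in image 1
    num = np.einsum('ni,ni->n', x2, Fx1) ** 2   # (x2^T F x1)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den
```

The error is zero for points exactly on their epipolar lines and grows smoothly as they drift off, so its mean can be minimized by gradient descent as a training loss.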
5. Multi-Modal Fusion
Goal: Learn from camera + LiDAR + IMU.
Challenge: ARKit doesn't provide raw IMU data.
Research Direction:
- LiDAR depth as supervision signal
- IMU integration (if raw data available)
- Learned sensor fusion
6. Self-Improving Pipeline
Vision: Pipeline that gets better over time.
1. Collect data (ARKit + LiDAR + images)
   ↓
2. Run DA3 (fast inference)
   ↓
3. Run BA validation (slow, offline)
   ↓
4. Filter/label samples:
   - DA3 agrees with BA → high-quality training sample
   - DA3 disagrees → use BA as pseudo-label, retrain DA3
   - Both fail → reject, but save for failure analysis
   ↓
5. Retrain DA3 on curated data
   ↓
6. Repeat → DA3 gets better over time
YLFF Status: ✅ Implemented for fine-tuning, 🔄 Being validated for pre-training.
Implementation Roadmap
Phase 1: Foundation (✅ Complete)
- BA validation pipeline (COLMAP + SuperPoint + LightGlue)
- ARKit data processing
- Fine-tuning on failure cases
- Pre-training on ARKit sequences
- Real-time visualization
- Model selection and recommendations
Phase 2: Optimization (🔄 In Progress)
- Feature caching
- Smart pair selection
- GPU-accelerated feature extraction
- Parallel BA processing
- Distributed training
Phase 3: Differentiability (🔄 Planned)
- Integrate Theseus or gradSLAM
- Differentiable BA in training loop
- End-to-end gradient flow
- GPU-accelerated BA
Phase 4: Advanced Features (🔄 Planned)
- Uncertainty calibration
- Geometric constraint enforcement
- Multi-modal fusion (LiDAR + IMU)
- Self-improving pipeline automation
Phase 5: Research Contributions (🔄 Future)
- Novel architecture with geometric constraints
- Calibrated uncertainty learning
- Differentiable BA improvements
- Learned sensor fusion
Key Takeaways
BA as Oracle: BA provides robust supervision for training visual geometry models.
ARKit as Data Source: Real-world captures provide diverse, natural training data.
Differentiability is Coming: Modern research is making traditionally non-differentiable steps differentiable.
GPU Parallelization: Most components can be parallelized, but BA remains a bottleneck.
YLFF is the Foundation: Current implementation provides the infrastructure for future research.
Research Opportunities: Uncertainty calibration, geometric constraints, differentiable BA, multi-modal fusion.
References
- DA3: Depth Anything 3 - Unified depth-ray representation
- VGGT: Visual Geometry Grounded Transformer
- COLMAP: Structure-from-Motion and Multi-View Stereo
- Theseus: Differentiable optimization library (Meta)
- gradSLAM: Differentiable SLAM components
- SuperPoint/SuperGlue: Learned feature extraction and matching
- LightGlue: Fast learned feature matching
Last Updated: 2024
Project: You Learn From Failure (YLFF)