---
title: YLFF Training
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---
# You Learn From Failure (YLFF)
**Geometric Consistency First: Training Visual Geometry Models with BA Supervision**
## Overview
YLFF is a unified framework for training geometrically accurate depth estimation models using Bundle Adjustment (BA) and LiDAR as oracle teachers. Unlike traditional approaches that prioritize perceptual quality, YLFF treats **geometric consistency as a first-order goal**.
### Core Philosophy
**Geometric Accuracy > Perceptual Quality**
- Multi-view geometric consistency is the **primary objective** (not just regularization)
- Absolute scale accuracy is **critical** for metric depth estimation
- Multi-view pose consistency is **essential** for 3D reconstruction
- Teacher-student learning provides **stability** during training
## End-to-End Pipeline
The complete YLFF pipeline from data collection to trained model:
```mermaid
flowchart TD
    Start([Start: Data Collection]) --> Upload[Upload ARKit Sequences]
    Upload --> Extract[Extract ARKit Data<br/>Poses, LiDAR, Intrinsics]
    Extract --> Preprocess{Pre-Processing Phase<br/>Offline, Expensive}
    Preprocess --> DA3Infer[Run DA3 Inference<br/>Initial Predictions]
    DA3Infer --> QualityCheck{ARKit Quality<br/>Check}
    QualityCheck -->|High Quality<br/>≥ 0.8| UseARKit[Use ARKit Poses<br/>Skip BA]
    QualityCheck -->|Low Quality<br/>< 0.8| RunBA[Run BA Validation<br/>Refine Poses]
    UseARKit --> OracleUncertainty[Compute Oracle Uncertainty<br/>Confidence Maps]
    RunBA --> OracleUncertainty
    OracleUncertainty --> SelectTargets[Select Oracle Targets<br/>BA or ARKit Poses]
    SelectTargets --> Cache[Save to Cache<br/>oracle_targets.npz<br/>uncertainty_results.npz]
    Cache --> TrainingPhase{Training Phase<br/>Online, Fast}
    TrainingPhase --> LoadCache[Load Pre-Computed<br/>Oracle Results]
    LoadCache --> LoadModel[Load/Resume Model<br/>Student + Teacher]
    LoadModel --> TrainingLoop[Training Loop]
    TrainingLoop --> Forward[Forward Pass<br/>Student Model Inference]
    Forward --> ComputeLoss[Compute Geometric Losses<br/>Multi-view: 3.0<br/>Absolute Scale: 2.5<br/>Pose: 2.0<br/>Gradient: 1.0<br/>Teacher: 0.5]
    ComputeLoss --> Backward[Backward Pass<br/>Gradient Computation]
    Backward --> ClipGrad[Gradient Clipping<br/>Max Norm: 1.0]
    ClipGrad --> Update[Update Weights<br/>AdamW Optimizer]
    Update --> UpdateTeacher[Update Teacher Model<br/>EMA Decay: 0.999]
    UpdateTeacher --> Scheduler[Update Learning Rate<br/>Cosine Annealing]
    Scheduler --> Checkpoint{Checkpoint<br/>Interval?}
    Checkpoint -->|Every N Steps| SaveCheckpoint[Save Checkpoint<br/>Periodic + Best + Latest]
    Checkpoint -->|Continue| LogMetrics[Log Metrics<br/>W&B / Console]
    SaveCheckpoint --> LogMetrics
    LogMetrics --> EpochComplete{Epoch<br/>Complete?}
    EpochComplete -->|No| TrainingLoop
    EpochComplete -->|Yes| MoreEpochs{More<br/>Epochs?}
    MoreEpochs -->|Yes| TrainingLoop
    MoreEpochs -->|No| SaveFinal[Save Final Checkpoint<br/>Final Model State]
    SaveFinal --> Evaluate[Evaluate Model<br/>BA Agreement]
    Evaluate --> Results[Training Results<br/>Metrics & Checkpoints]
    Results --> Resume{Resume<br/>Training?}
    Resume -->|Yes| LoadCheckpoint[Load Checkpoint<br/>latest_checkpoint.pt]
    LoadCheckpoint --> LoadModel
    Resume -->|No| End([End: Trained Model])

    style Preprocess fill:#e1f5ff
    style TrainingPhase fill:#fff4e1
    style ComputeLoss fill:#ffe1f5
    style SaveCheckpoint fill:#e1ffe1
    style Evaluate fill:#f5e1ff
```
### Pipeline Stages
#### 1. Data Collection & Upload
- **Input**: ARKit sequences (video + metadata.json)
- **Extract**: Poses, LiDAR depth, camera intrinsics
- **Output**: Structured ARKit data
#### 2. Pre-Processing Phase (Offline)
- **DA3 Inference**: Initial depth/pose predictions (GPU)
- **Quality Check**: Evaluate ARKit tracking quality
- **BA Validation**: Run only if ARKit quality < threshold (CPU, expensive)
- **Oracle Uncertainty**: Compute confidence maps from multiple sources
- **Cache Results**: Save oracle targets and uncertainty to disk
- **Time**: ~10-20 min per sequence (one-time cost)
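The cached artifacts named in the diagram (`oracle_targets.npz`, `uncertainty_results.npz`) can be read back with plain NumPy. This is a minimal sketch of what loading one sequence's cache looks like; the per-sequence directory layout and the array keys shown (`depth`, `depth_confidence`) are illustrative assumptions, not the documented schema.

```python
# Hedged sketch: load one sequence's pre-computed oracle cache.
# File names come from the pipeline diagram; array keys are assumptions.
from pathlib import Path
import numpy as np

def load_oracle_cache(cache_dir: Path, sequence_id: str) -> dict:
    """Merge oracle targets and uncertainty maps for one sequence."""
    seq_dir = Path(cache_dir) / sequence_id
    targets = np.load(seq_dir / "oracle_targets.npz")
    uncertainty = np.load(seq_dir / "uncertainty_results.npz")
    # Materialize both archives into a single dict of arrays
    return {**dict(targets), **dict(uncertainty)}
```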
#### 3. Training Phase (Online)
- **Load Cache**: Fast disk I/O of pre-computed results
- **Model Loading**: Load or resume from checkpoint (student + teacher)
- **Training Loop**:
- Forward pass through student model
- Compute geometric losses (primary objective)
- Backward pass with gradient clipping
- Update weights (AdamW optimizer)
- Update teacher model (EMA)
- Update learning rate (cosine scheduler)
- **Checkpointing**: Save periodic, best, and latest checkpoints
- **Logging**: Metrics to W&B and console
- **Time**: ~1-3 sec per sequence (100-1000x faster than BA)
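The inner loop steps above (forward, geometric loss, backward, clipping at max-norm 1.0, AdamW, cosine annealing) can be sketched in PyTorch. The tiny `Linear` model and L1 loss are stand-ins for the real student model and geometric losses; the teacher EMA update and logging are omitted from this sketch.

```python
# Sketch of one training step, assuming stand-in model and loss.
import torch

student = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

def train_step(batch, target):
    pred = student(batch)                                  # forward pass
    loss = torch.nn.functional.l1_loss(pred, target)       # placeholder for geometric losses
    optimizer.zero_grad()
    loss.backward()                                        # backward pass
    torch.nn.utils.clip_grad_norm_(student.parameters(), max_norm=1.0)
    optimizer.step()                                       # AdamW update
    scheduler.step()                                       # cosine annealing
    return loss.item()
```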
#### 4. Evaluation & Resumption
- **Evaluation**: Test model agreement with BA
- **Resume**: Load checkpoint to continue training
- **Final Model**: Best checkpoint saved for deployment
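Resumption hinges on the checkpoint carrying both models plus optimizer state. A hedged sketch of that round-trip follows; the exact checkpoint schema used by `ylff` is not documented here, so the dict keys (`student`, `teacher`, `optimizer`, `epoch`) are assumptions.

```python
# Hedged sketch of saving/resuming a student+teacher checkpoint.
import torch

def save_checkpoint(path, student, teacher, optimizer, epoch):
    torch.save({
        "student": student.state_dict(),
        "teacher": teacher.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
    }, path)

def load_checkpoint(path, student, teacher, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    student.load_state_dict(ckpt["student"])
    teacher.load_state_dict(ckpt["teacher"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"]  # resume from the saved epoch
```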
## Key Features
### 🎯 Unified Training Approach
- **Single Training Service**: `ylff/services/ylff_training.py` consolidates all training methods
- **DINOv2 Backbone**: Teacher-student paradigm with EMA teacher for stable training
- **DA3 Techniques**: Depth-ray representation, multi-resolution training
- **Geometric Losses**: Multi-view consistency, absolute scale, pose accuracy as primary objectives
### 📊 Two-Phase Pipeline
1. **Pre-Processing Phase** (offline, expensive)
- Compute BA validation and oracle uncertainty
- Cache results for fast training iteration
- Can be parallelized across sequences
2. **Training Phase** (online, fast)
- Load pre-computed oracle results
- Train with geometric losses as primary objective
- 100-1000x faster than computing BA during training
### 🔧 Core Components
- **BA Validation**: Validate model predictions using COLMAP Bundle Adjustment
- **ARKit Integration**: Process ARKit data with ground truth poses and LiDAR depth
- **Oracle Uncertainty**: Continuous confidence weighting (not binary rejection)
- **Geometric Losses**: Multi-view consistency, absolute scale, pose reprojection error
- **Unified Training**: Single training service with geometric consistency first
## Installation
### Basic Installation
```bash
# Clone repository
git clone <repository-url>
cd ylff
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install package
pip install -e .
# Install optional dependencies
pip install -e ".[gui]" # For GUI visualization
```
### BA Pipeline Setup
For BA validation, you need additional dependencies:
```bash
# Install BA pipeline dependencies
bash scripts/bin/setup_ba_pipeline.sh
# Or manually:
pip install pycolmap
# Install hloc from source (see docs/SETUP.md)
# Install LightGlue from source (see docs/SETUP.md)
```
See `docs/SETUP.md` for detailed installation instructions.
## Quick Start
### 1. Pre-Process ARKit Sequences
```bash
# Pre-process ARKit sequences (offline, can run overnight)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed \
    --model-name depth-anything/DA3-LARGE \
    --num-workers 8 \
    --prefer-arkit-poses
```
This computes BA and oracle uncertainty for all sequences and caches results.
### 2. Train with Unified Service
```bash
# Train using pre-computed results (fast iteration)
ylff train unified cache/preprocessed \
    --model-name depth-anything/DA3-LARGE \
    --epochs 200 \
    --lr 2e-4 \
    --batch-size 32 \
    --checkpoint-dir checkpoints \
    --use-wandb
```
Or use the Python API:
```python
from pathlib import Path

from ylff.services.ylff_training import train_ylff
from ylff.services.preprocessed_dataset import PreprocessedARKitDataset

# Load preprocessed dataset
dataset = PreprocessedARKitDataset(
    cache_dir="cache/preprocessed",
    arkit_sequences_dir="data/arkit_sequences",
    load_images=True,
)

# Train with unified service (da3_model is the loaded DA3 model)
metrics = train_ylff(
    model=da3_model,
    dataset=dataset,
    epochs=200,
    lr=2e-4,
    batch_size=32,
    loss_weights={
        'geometric_consistency': 3.0,  # PRIMARY GOAL
        'absolute_scale': 2.5,         # CRITICAL
        'pose_geometric': 2.0,         # ESSENTIAL
    },
    use_wandb=True,
    checkpoint_dir=Path("checkpoints"),
)
```
### 3. Validate Sequences
```bash
# Validate a sequence of images
ylff validate sequence path/to/images \
    --model-name depth-anything/DA3-LARGE \
    --accept-threshold 2.0 \
    --reject-threshold 30.0 \
    --output results.json
```
### 4. Evaluate Model
```bash
# Evaluate model agreement with BA
ylff eval ba-agreement path/to/test/sequences \
    --model-name depth-anything/DA3-LARGE \
    --checkpoint checkpoints/best_model.pt \
    --threshold 2.0
```
## Training Approach
### Unified Training Service
YLFF uses a **single, unified training service** (`ylff/services/ylff_training.py`) that:
1. **Uses DINOv2's teacher-student paradigm** as the backbone
- EMA teacher provides stable targets
- Layer-wise learning rate decay
- Cosine scheduler with warmup
2. **Incorporates DA3 techniques**
- Depth-ray representation (if available)
- Multi-resolution training support
- Scale normalization
3. **Treats geometric consistency as first-order goal**
- Multi-view geometric consistency: **weight 3.0** (PRIMARY)
- Absolute scale loss: **weight 2.5** (CRITICAL)
- Pose geometric loss: **weight 2.0** (ESSENTIAL)
- Gradient loss: **weight 1.0** (DA3 technique)
- Teacher-student consistency: **weight 0.5** (STABILITY)
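Combined, these weights form a single weighted-sum objective. A minimal sketch of how the component losses above fold into the total (the real service also applies oracle confidence weighting):

```python
# Weighted-sum objective, using the weights documented above.
LOSS_WEIGHTS = {
    "geometric_consistency": 3.0,  # PRIMARY
    "absolute_scale": 2.5,         # CRITICAL
    "pose_geometric": 2.0,         # ESSENTIAL
    "gradient_loss": 1.0,
    "teacher_consistency": 0.5,
}

def total_loss(components: dict) -> float:
    """Weighted sum of per-component loss values; absent components contribute 0."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in components.items())
```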
### Experiment Tracking & Ablations
YLFF integrates **Weights & Biases (W&B)** for comprehensive experiment tracking and ablation studies:
**Logged Configuration** (per run):
- Training hyperparameters: `epochs`, `lr`, `batch_size`, `ema_decay`
- Loss weights: All component weights (geometric_consistency, absolute_scale, pose_geometric, gradient_loss, teacher_consistency)
- Model configuration: Task type, device, precision (FP16/BF16)
**Logged Metrics** (per step):
- **Loss Components**: All individual loss terms tracked separately
- `total_loss`: Overall training loss
- `geometric_consistency`: Multi-view consistency loss
- `absolute_scale`: Absolute depth scale loss
- `pose_geometric`: Pose reprojection error loss
- `gradient_loss`: Depth gradient loss
- `teacher_consistency`: Teacher-student consistency loss
- **Training State**: `step`, `epoch`, `lr` (learning rate over time)
**Ablation Study Support**:
- **Compare runs**: Filter by hyperparameters (loss weights, learning rate, etc.)
- **Track component contributions**: See how each loss component evolves
- **Hyperparameter sweeps**: Use W&B sweeps to systematically explore configurations
- **Reproducibility**: All hyperparameters logged in config for exact reproduction
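The per-step payload described above is just a flat dict of scalars; in training it would be handed to `wandb.log(payload, step=step)`. A small sketch of assembling it (the helper name `metrics_payload` is illustrative, not part of the ylff API):

```python
# Assemble the per-step W&B metrics payload described above.
def metrics_payload(step, epoch, lr, loss_components, total):
    """Flat dict of scalars: training state + total + one entry per loss term."""
    payload = {"step": step, "epoch": epoch, "lr": lr, "total_loss": total}
    payload.update(loss_components)
    return payload
```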
**Example Ablation Workflow**:
```bash
# Run 1: Baseline (default geometric-first weights)
ylff train unified cache/preprocessed \
    --epochs 200 \
    --use-wandb \
    --wandb-project ylff-ablations \
    --wandb-name baseline-geometric-first

# Run 2: Ablation: Lower geometric consistency weight
ylff train unified cache/preprocessed \
    --epochs 200 \
    --use-wandb \
    --wandb-project ylff-ablations \
    --wandb-name ablation-lower-geo-weight \
    --loss-weight-geometric-consistency 1.0  # vs default 3.0

# Run 3: Ablation: No teacher-student consistency
ylff train unified cache/preprocessed \
    --epochs 200 \
    --use-wandb \
    --wandb-project ylff-ablations \
    --wandb-name ablation-no-teacher \
    --loss-weight-teacher-consistency 0.0  # Disable teacher loss

# Compare in W&B dashboard:
# - Filter by project: "ylff-ablations"
# - Compare loss curves across runs
# - Analyze which loss components matter most
```
**W&B Dashboard Features**:
- **Parallel coordinates plot**: Visualize hyperparameter relationships
- **Loss curves**: Compare training dynamics across ablations
- **Component analysis**: See contribution of each loss term
- **Best run identification**: Automatically identify best configurations
### Suggested Ablation Studies
Based on YLFF's architecture, here are key ablation experiments to validate our design choices:
#### 1. Loss Weight Ablations (Geometric Consistency First)
**Question**: How critical is treating geometric consistency as a first-order goal?
```python
from ylff.services.ylff_training import train_ylff
from ylff.services.preprocessed_dataset import PreprocessedARKitDataset

# `model` and `dataset` are the DA3 model and preprocessed dataset
# from the Quick Start example.

# Baseline: Geometric-first (default)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
    loss_weights={
        'geometric_consistency': 3.0,  # PRIMARY GOAL
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.5,
    },
)

# Ablation 1: Equal weights (traditional approach)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
    loss_weights={
        'geometric_consistency': 1.0,  # Equal weight
        'absolute_scale': 1.0,
        'pose_geometric': 1.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.5,
    },
)

# Ablation 2: Perceptual-first (reverse priority)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
    loss_weights={
        'geometric_consistency': 0.5,  # Lower priority
        'absolute_scale': 0.5,
        'pose_geometric': 0.5,
        'gradient_loss': 3.0,  # Emphasize smoothness
        'teacher_consistency': 0.5,
    },
)

# Ablation 3: Remove geometric consistency entirely
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
    loss_weights={
        'geometric_consistency': 0.0,  # Disabled
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.5,
    },
)
```
**Metrics to Compare**:
- Final geometric consistency loss
- BA agreement (reprojection error)
- Absolute scale accuracy (vs LiDAR)
- Multi-view reconstruction quality
#### 2. Teacher-Student Ablation
**Question**: Does EMA teacher provide training stability and better convergence?
```python
from ylff.services.ylff_training import train_ylff

# `model` and `dataset` as in the Quick Start example.

# Baseline: With EMA teacher (default ema_decay=0.999)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    ema_decay=0.999,
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 1: No teacher-student (ema_decay=0.0)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    ema_decay=0.0,  # No EMA updates
    loss_weights={
        'geometric_consistency': 3.0,
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.0,  # Disable teacher loss
    },
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 2: Faster teacher updates (ema_decay=0.99)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    ema_decay=0.99,  # Faster updates
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 3: Slower teacher updates (ema_decay=0.9999)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    ema_decay=0.9999,  # Slower updates
    use_wandb=True,
    wandb_project="ylff-ablations",
)
```
**Metrics to Compare**:
- Training stability (loss variance)
- Convergence speed
- Final model quality
- Teacher-student consistency loss
#### 3. Oracle Source Ablation (BA vs ARKit)
**Question**: How much does BA refinement improve over ARKit poses?
```bash
# Baseline: Use BA when ARKit quality < 0.8 (default)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed-ba \
    --prefer-arkit-poses --min-arkit-quality 0.8
ylff train unified cache/preprocessed-ba \
    --use-wandb --wandb-project ylff-ablations

# Ablation 1: Always use ARKit (no BA, faster preprocessing)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed-arkit-only \
    --prefer-arkit-poses --min-arkit-quality 0.0
ylff train unified cache/preprocessed-arkit-only \
    --use-wandb --wandb-project ylff-ablations

# Ablation 2: Always use BA (expensive but highest quality)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed-ba-always \
    --prefer-arkit-poses --min-arkit-quality 1.0  # Never use ARKit
ylff train unified cache/preprocessed-ba-always \
    --use-wandb --wandb-project ylff-ablations
```
**Metrics to Compare**:
- Pose accuracy (reprojection error)
- Training data quality (confidence scores)
- Final model performance
- Preprocessing time cost
#### 4. Uncertainty Weighting Ablation
**Question**: Does confidence-weighted loss improve training vs uniform weighting?
```bash
# Baseline: With uncertainty weighting (default)
# Uses depth_confidence and pose_confidence from preprocessing
# Ablation: Uniform weighting (ignore uncertainty)
# Modify preprocessing to set all confidence = 1.0
# Or modify loss computation to ignore confidence maps
```
**Metrics to Compare**:
- Loss on high-confidence vs low-confidence regions
- Model performance on uncertain scenes
- Training stability
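Concretely, "uncertainty weighting" means scaling per-pixel residuals by the oracle confidence map before averaging, so pixels the oracle is unsure about contribute less; uniform weighting sets every confidence to 1.0. A NumPy sketch of the weighted variant (the real implementation lives in `ylff/utils/oracle_losses.py`; this is an illustrative stand-in):

```python
# Confidence-weighted L1 depth loss: uncertain oracle pixels are down-weighted.
import numpy as np

def confidence_weighted_l1(pred, target, confidence, eps=1e-8):
    """L1 residual weighted by per-pixel oracle confidence in [0, 1]."""
    residual = np.abs(pred - target)
    return float((confidence * residual).sum() / (confidence.sum() + eps))
```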
#### 5. Multi-View Consistency Ablation
**Question**: How many views are needed for effective geometric consistency?
```python
from ylff.services.ylff_training import train_ylff

# Baseline: Variable views (2-18, default from dataset)
train_ylff(
    model=model,
    dataset=dataset,  # Uses all available views
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 1: Single view only (disable geometric consistency)
train_ylff(
    model=model,
    dataset=single_view_dataset,  # Modified dataset with 1 view
    epochs=200,
    loss_weights={
        'geometric_consistency': 0.0,  # Disabled (needs 2+ views)
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.5,
    },
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablations 2-4: Fixed N views
# Modify the dataset to sample exactly N views per sequence
# Compare: 2 views, 5 views, 10 views, 18 views
```
**Metrics to Compare**:
- Geometric consistency loss
- Multi-view reconstruction accuracy
- Training efficiency (more views = slower)
#### 6. DA3 Techniques Ablation
**Question**: Which DA3 techniques contribute most?
```python
from ylff.services.ylff_training import train_ylff

# Baseline: All DA3 techniques enabled
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 1: No gradient loss (DA3 edge preservation)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    loss_weights={
        'geometric_consistency': 3.0,
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 0.0,  # Disabled
        'teacher_consistency': 0.5,
    },
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 2: No depth-ray representation
# Use a model that outputs separate depth + poses instead of depth-ray
# (Requires a different model architecture)

# Ablation 3: Fixed resolution (no multi-resolution training)
# Modify the dataset to use a fixed resolution instead of variable
```
**Metrics to Compare**:
- Depth edge quality (gradient loss ablation)
- Training efficiency (multi-resolution ablation)
- Model generalization
#### 7. Preprocessing Phase Ablation
**Question**: How much does the two-phase pipeline improve training efficiency?
```bash
# Baseline: With preprocessing (fast training)
ylff preprocess arkit data/arkit_sequences --output-cache cache/preprocessed
ylff train unified cache/preprocessed \
    --use-wandb --wandb-project ylff-ablations \
    --wandb-name baseline-with-preprocessing

# Ablation: Live BA during training (slow but no preprocessing)
# This would require modifying training to compute BA on-the-fly
# Compare: Training time per epoch, total training time
```
**Metrics to Compare**:
- Training time per epoch
- Total training time
- Model quality (should be similar, preprocessing is just optimization)
#### 8. Loss Component Contribution Analysis
**Question**: Which loss component contributes most to final model quality?
Run systematic sweeps using W&B sweeps or a Python script:
```yaml
# sweep_config.yaml
program: train_ablation_sweep.py
method: grid
parameters:
  loss_weight_geometric_consistency:
    values: [0.0, 1.0, 2.0, 3.0, 4.0]
  loss_weight_absolute_scale:
    values: [0.0, 1.0, 2.0, 2.5, 3.0]
  loss_weight_pose_geometric:
    values: [0.0, 1.0, 2.0, 3.0]
  loss_weight_gradient_loss:
    values: [0.0, 0.5, 1.0, 1.5]
  loss_weight_teacher_consistency:
    values: [0.0, 0.25, 0.5, 0.75, 1.0]
```
```python
# train_ablation_sweep.py
import wandb
from ylff.services.ylff_training import train_ylff

wandb.init()
config = wandb.config

# `model` and `dataset` are constructed as in the Quick Start example
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    loss_weights={
        'geometric_consistency': config.loss_weight_geometric_consistency,
        'absolute_scale': config.loss_weight_absolute_scale,
        'pose_geometric': config.loss_weight_pose_geometric,
        'gradient_loss': config.loss_weight_gradient_loss,
        'teacher_consistency': config.loss_weight_teacher_consistency,
    },
    use_wandb=True,
    wandb_project="ylff-ablations",
)
```
Launch with `wandb sweep sweep_config.yaml`, then run the agent command it prints.
**Analysis**:
- Use W&B parallel coordinates plot to find optimal weight combinations
- Identify which components are essential vs optional
- Find Pareto frontier (best quality for given training time)
#### Recommended Ablation Order
1. **Start with Loss Weight Ablations** (#1) - Most fundamental to our approach
2. **Teacher-Student Ablation** (#2) - Validates DINOv2 adaptation
3. **Oracle Source Ablation** (#3) - Validates preprocessing strategy
4. **Component Contribution** (#8) - Systematic analysis
5. **DA3 Techniques** (#6) - Validates DA3 integration
6. **Multi-View Consistency** (#5) - Optimizes training efficiency
7. **Uncertainty Weighting** (#4) - Fine-tuning
8. **Preprocessing Phase** (#7) - Efficiency validation
Each ablation should be run with:
- Same random seed (for reproducibility)
- Same dataset split
- Same number of epochs
- W&B tracking enabled for easy comparison
## Training Datasets
Depth Anything 3 (DA3) was trained exclusively on **public academic datasets**. The following table documents all datasets used in DA3 training, their sources, and availability status for YLFF:
| Dataset | # Scenes | Data Type | Source / URL | YLFF Status | Notes |
| ------------------------------------ | -------- | --------- | ----------------------------------------------------------------------------------------------- | ---------------- | ------------------------------ |
| **Synthetic Datasets** |
| AriaDigitalTwin | 237 | Synthetic | [Aria Digital Twin](https://github.com/facebookresearch/AriaDigitalTwin) | ❌ Not Available | Meta's AR dataset |
| AriaSyntheticENV | 99,950 | Synthetic | [Aria Synthetic](https://github.com/facebookresearch/AriaDigitalTwin) | ❌ Not Available | Large-scale synthetic AR |
| HyperSim | 344 | Synthetic | [HyperSim](https://github.com/apple/ml-hypersim) | ❌ Not Available | Apple's photorealistic dataset |
| MegaSynth | 6,049 | Synthetic | Unknown | ❓ To Verify | Synthetic multi-view |
| MvsSynth | 121 | Synthetic | Unknown | ❓ To Verify | Multi-view stereo synthetic |
| Objaverse | 505,557 | Synthetic | [Objaverse](https://objaverse.allenai.org/) | ❓ To Verify | Large-scale 3D objects |
| Omniobject | 5,885 | Synthetic | [OmniObject3D](https://omniobject3d.github.io/) | ❓ To Verify | Object-centric dataset |
| OmniWorld | 1,039 | Synthetic | [OmniWorld](https://arxiv.org/abs/2509.12201) | ❓ To Verify | Multi-domain dataset |
| PointOdyssey | 44 | Synthetic | [PointOdyssey](https://pointodyssey.com/) | ❓ To Verify | Long-term point tracking |
| ReplicaVMAP | 17 | Synthetic | [Replica](https://github.com/facebookresearch/Replica-Dataset) | ❓ To Verify | Indoor scene dataset |
| ScenenetRGBD | 16,866 | Synthetic | [SceneNet RGB-D](https://robotvault.bitbucket.io/scenenet-rgbd.html) | ❓ To Verify | Indoor RGB-D scenes |
| TartanAir | 355 | Synthetic | [TartanAir](https://theairlab.org/tartanair-dataset/) | ❓ To Verify | Large-scale simulation |
| Trellis | 557,408 | Synthetic | Unknown | ❓ To Verify | Large-scale synthetic |
| vKitti2 | 50 | Synthetic | [vKITTI2](https://europe.naverlabs.com/research/computer-vision/proxy-virtual-worlds-vkitti-2/) | ❓ To Verify | Virtual KITTI |
| **Real-World Datasets (LiDAR)** |
| ARKitScenes | 4,388 | LiDAR | [ARKitScenes](https://github.com/apple/ARKitScenes) | ✅ **Available** | **Primary dataset for YLFF** |
| ScanNet++ | 230 | LiDAR | [ScanNet++](https://github.com/ScanNet/ScanNetPlusPlus) | ❓ To Verify | High-fidelity indoor |
| WildRGBD | 23,050 | LiDAR | [WildRGBD](https://wildrgbd.github.io/) | ❓ To Verify | Large-scale RGB-D |
| **Real-World Datasets (COLMAP/SfM)** |
| BlendedMVS | 503 | 3D Recon | [BlendedMVS](https://github.com/YoYo000/BlendedMVS) | ❓ To Verify | Multi-view stereo |
| Co3dv2 | 30,616 | COLMAP | [Common Objects in 3D](https://github.com/facebookresearch/co3d) | ❓ To Verify | Object-centric |
| DL3DV | 6,379 | COLMAP | [DL3DV-10K](https://github.com/OpenGVLab/DL3DV) | ❓ To Verify | Large-scale 3D vision |
| MapFree | 921 | COLMAP | [Map-free Visual Relocalization](https://github.com/nianticlabs/map-free-reloc) | ❓ To Verify | Visual relocalization |
| MegaDepth | 268 | COLMAP | [MegaDepth](https://www.cs.cornell.edu/projects/megadepth/) | ❓ To Verify | Internet photos |
**Legend:**
- ✅ **Available**: Dataset is accessible and can be used for YLFF training
- ❌ **Not Available**: Dataset is not accessible (proprietary, requires special access, etc.)
- ❓ **To Verify**: Dataset availability needs to be confirmed
### Dataset Statistics
**Total Training Data** (approximate; the synthetic count excludes the three datasets marked ❌ Not Available):
- **Synthetic**: ~1,093,000 scenes (majority from Objaverse and Trellis)
- **Real-World LiDAR**: ~27,668 scenes (ARKitScenes, ScanNet++, WildRGBD)
- **Real-World COLMAP**: ~38,687 scenes (BlendedMVS, Co3dv2, DL3DV, MapFree, MegaDepth)
- **Total**: ~1,159,355 scenes
**Data Type Distribution:**
- **Synthetic**: 94.3% (provides high-quality dense depth)
- **LiDAR**: 2.4% (provides metric accuracy)
- **COLMAP/SfM**: 3.3% (provides multi-view geometry)
### YLFF Dataset Strategy
YLFF currently focuses on **ARKitScenes** as the primary training dataset because:
1. ✅ **Available**: Publicly accessible dataset
2. ✅ **High Quality**: LiDAR depth provides metric accuracy
3. ✅ **Real-World**: Captures real indoor scenes with natural variations
4. ✅ **Rich Metadata**: Includes poses, intrinsics, and LiDAR depth
5. ✅ **Large Scale**: 4,388 scenes provide substantial training data
**Future Dataset Integration:**
- Priority: ScanNet++, WildRGBD (LiDAR datasets for metric accuracy)
- Secondary: DL3DV, Co3dv2 (COLMAP datasets for multi-view geometry)
- Synthetic: Consider for teacher model training (if accessible)
### Dataset Access Notes
- **ARKitScenes**: Download from [official repository](https://github.com/apple/ARKitScenes)
- **ScanNet++**: Requires registration and approval
- **COLMAP datasets**: Most are publicly available but may require preprocessing
- **Synthetic datasets**: Many require special access or are proprietary
For detailed dataset preparation and preprocessing instructions, see `docs/DATASET_PREPARATION.md` (to be created).
### Loss Components
The training uses geometric losses as the primary objective:
1. **Multi-View Geometric Consistency** (weight: 3.0)
- Enforces that the same 3D point projects correctly across views
- Uses back-projection + projection across multiple views
- **This is treated as a first-order objective, not regularization**
2. **Absolute Scale Loss** (weight: 2.5)
- Direct supervision from LiDAR/BA depth
- Enforces correct absolute depth values in meters
- Critical for metric accuracy
3. **Pose Geometric Loss** (weight: 2.0)
- Reprojection error using predicted poses
- Enforces geometric consistency between poses and depth
- Multi-view pose consistency is paramount
4. **Gradient Loss** (weight: 1.0)
- Preserves sharp depth boundaries
- Ensures smoothness in planar regions
- DA3 technique for better depth quality
5. **Teacher-Student Consistency** (weight: 0.5)
- L1 loss between student and teacher predictions
- Encourages stable training
- Prevents student from diverging
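The multi-view consistency check in (1) can be sketched explicitly: back-project pixels from view A with its depth, transform them into view B with the relative pose, project with the intrinsics, and compare the projected depth against view B's depth at the landing pixels. This NumPy sketch uses nearest-pixel lookup and is illustrative only, not the `geometric_losses.py` implementation:

```python
# Sketch of a two-view geometric consistency loss (back-project + project).
import numpy as np

def multiview_consistency_l1(depth_a, depth_b, K, R, t):
    """Mean |projected depth - observed depth| for pixels landing inside view B."""
    h, w = depth_a.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])  # homogeneous pixels
    rays = np.linalg.inv(K) @ pix                           # unit-depth rays in A
    pts_a = rays * depth_a.ravel()                          # 3D points in A's frame
    pts_b = R @ pts_a + t[:, None]                          # transform into B's frame
    proj = K @ pts_b                                        # project into B
    ub = np.round(proj[0] / proj[2]).astype(int)
    vb = np.round(proj[1] / proj[2]).astype(int)
    z = proj[2]
    ok = (ub >= 0) & (ub < w) & (vb >= 0) & (vb < h) & (z > 0)
    return float(np.abs(z[ok] - depth_b[vb[ok], ub[ok]]).mean())
```

With identical views (identity rotation, zero translation, same depth) the loss is exactly zero, which is the consistency the training objective enforces.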
## Project Structure
```
ylff/
├── ylff/ # Main package
│ ├── services/ # Business logic
│ │ ├── ylff_training.py # ⭐ Unified training service
│ │ ├── preprocessing.py # Offline preprocessing (BA, uncertainty)
│ │ ├── preprocessed_dataset.py # Dataset for pre-computed results
│ │ ├── ba_validator.py # BA validation pipeline
│ │ ├── arkit_processor.py # ARKit data processing
│ │ ├── evaluate.py # Evaluation metrics
│ │ └── ... # Other services
│ │
│ ├── utils/ # Utilities
│ │ ├── geometric_losses.py # Geometric loss functions
│ │ ├── oracle_uncertainty.py # Oracle uncertainty propagation
│ │ ├── oracle_losses.py # Oracle-weighted losses
│ │ └── ... # Other utilities
│ │
│ ├── routers/ # FastAPI route handlers
│ ├── models/ # Pydantic API models
│ └── cli.py # Command-line interface
│
├── configs/ # Configuration files
│ ├── dinov2_train_config.yaml # Training configuration
│ └── ba_config.yaml # BA pipeline configuration
│
├── docs/ # Documentation
│ ├── UNIFIED_TRAINING.md # Unified training guide
│ ├── TRAINING_PIPELINE_ARCHITECTURE.md
│ └── ... # Other documentation
│
└── research_docs/ # Research documentation
└── MODEL_ARCH.md # Model architecture details
```
## CLI Commands
### Preprocessing
- `ylff preprocess arkit <path>` - Pre-process ARKit sequences (offline)
### Training
- `ylff train unified <path>` - Train using unified training service
### Validation
- `ylff validate sequence <path>` - Validate a single sequence
- `ylff validate arkit <path> [--gui]` - Validate ARKit data (with optional GUI)
### Evaluation
- `ylff eval ba-agreement <path>` - Evaluate model agreement with BA
### Visualization
- `ylff visualize <path>` - Generate static visualizations
## Complete Workflow
### Step 1: Pre-Process All Sequences
```bash
# Pre-process all ARKit sequences (one-time, can run overnight)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed \
    --model-name depth-anything/DA3-LARGE \
    --num-workers 8 \
    --prefer-arkit-poses \
    --use-lidar
```
This:
- Extracts ARKit data (poses, LiDAR depth) - FREE
- Runs DA3 inference (GPU, batchable)
- Runs BA only for sequences with poor ARKit tracking
- Computes oracle uncertainty
- Saves everything to cache
### Step 2: Train with Unified Service
```bash
# Train using pre-computed results (fast iteration)
ylff train unified cache/preprocessed \
    --model-name depth-anything/DA3-LARGE \
    --epochs 200 \
    --lr 2e-4 \
    --batch-size 32 \
    --checkpoint-dir checkpoints \
    --use-wandb \
    --wandb-project ylff-training
```
This:
- Loads pre-computed oracle results (fast, disk I/O)
- Runs DA3 inference (current model, GPU)
- Computes geometric losses (primary objective)
- Updates model weights with teacher-student learning
### Step 3: Evaluate
```bash
# Evaluate fine-tuned model
ylff eval ba-agreement data/test \
    --checkpoint checkpoints/best_model.pt
```
## Configuration
Configuration files are in `configs/`:
- `dinov2_train_config.yaml` - Unified training configuration
- Optimizer settings (DINOv2 style)
- Loss weights (geometric consistency first)
- Teacher-student settings
- Multi-resolution and multi-view training
- `ba_config.yaml` - BA pipeline settings
## Documentation
- **Unified Training**: `docs/UNIFIED_TRAINING.md` - Complete guide to unified training
- **Training Pipeline**: `docs/TRAINING_PIPELINE_ARCHITECTURE.md` - Two-phase pipeline architecture
- **Model Architecture**: `research_docs/MODEL_ARCH.md` - Detailed architecture and training approach
- **API Documentation**: `docs/API.md` - API reference
- **ARKit Integration**: `docs/ARKIT_INTEGRATION.md` - ARKit data processing
## Key Design Decisions
### Why Geometric Consistency First?
Traditional depth estimation models prioritize perceptual quality (how realistic the depth looks) over geometric accuracy (how accurate the absolute scale and multi-view consistency are). YLFF reverses this priority:
- **Geometric consistency** ensures that the same 3D point projects correctly across views
- **Absolute scale** ensures metric accuracy (depth in meters, not just relative)
- **Pose consistency** ensures that predicted poses align with depth predictions
This approach is essential for applications requiring accurate 3D reconstruction, SLAM, and metric depth estimation.
### Why Two-Phase Pipeline?
BA computation is expensive (5-15 minutes per sequence) and far too slow to run inside the training loop. The two-phase pipeline:
1. **Pre-processing** (offline): Compute BA once, cache results
2. **Training** (online): Load cached results, train fast
This enables 100-1000x faster training iteration while still using BA as supervision.
### Why Teacher-Student Learning?
DINOv2's teacher-student paradigm provides:
- **Stability**: EMA teacher prevents training instability
- **Better convergence**: Teacher provides stable targets
- **Scalability**: Works well with large-scale training
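The EMA teacher update behind this stability is one line per parameter: the teacher drifts slowly toward the student, with decay 0.999 giving an effective averaging window of roughly 1/(1 - decay) ≈ 1000 steps. A plain-float sketch of the update rule:

```python
# EMA teacher update: teacher <- decay * teacher + (1 - decay) * student.
def ema_update(teacher_params, student_params, decay=0.999):
    """In-place exponential moving average over named parameters."""
    for name, s in student_params.items():
        teacher_params[name] = decay * teacher_params[name] + (1.0 - decay) * s
```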
## Development
### Running Tests
```bash
# Basic smoke test
python scripts/tests/smoke_test_basic.py
# GUI test
python scripts/tests/test_gui_simple.py
```
### Code Quality
```bash
# Format code
black ylff/ scripts/
# Sort imports
isort ylff/ scripts/
# Type checking
mypy ylff/
```
## Dependencies
### Core Dependencies
- PyTorch >= 2.0
- NumPy < 2.0
- OpenCV
- pycolmap >= 0.4.0
- Typer (for CLI)
### Optional Dependencies
- **GUI**: Plotly (for interactive 3D plots)
- **BA Pipeline**: hloc, LightGlue (installed from source)
- **Training**: Weights & Biases (for experiment tracking)
See `pyproject.toml` for complete dependency list.
## License
Apache-2.0
## Citation
If you use YLFF in your research, please cite:
```bibtex
@software{ylff2024,
  title={You Learn From Failure: Geometric Consistency First Training for Visual Geometry},
  author={YLFF Contributors},
  year={2024},
  url={https://github.com/your-org/ylff}
}
```
## References
- **DINOv2**: https://github.com/facebookresearch/dinov2
- **DA3 Paper**: Depth Anything 3 (arXiv:2511.10647)
- **Unified Training**: `ylff/services/ylff_training.py`
- **Model Architecture**: `research_docs/MODEL_ARCH.md`