---
title: YLFF Training
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---
# You Learn From Failure (YLFF)
**Geometric Consistency First: Training Visual Geometry Models with BA Supervision**
## Overview
YLFF is a unified framework for training geometrically accurate depth estimation models using Bundle Adjustment (BA) and LiDAR as oracle teachers. Unlike traditional approaches that prioritize perceptual quality, YLFF treats **geometric consistency as a first-order goal**.
### Core Philosophy
**Geometric Accuracy > Perceptual Quality**
- Multi-view geometric consistency is the **primary objective** (not just regularization)
- Absolute scale accuracy is **critical** for metric depth estimation
- Multi-view pose consistency is **essential** for 3D reconstruction
- Teacher-student learning provides **stability** during training
## End-to-End Pipeline
The complete YLFF pipeline from data collection to trained model:
```mermaid
flowchart TD
Start([Start: Data Collection]) --> Upload[Upload ARKit Sequences]
Upload --> Extract[Extract ARKit Data<br/>Poses, LiDAR, Intrinsics]
Extract --> Preprocess{Pre-Processing Phase<br/>Offline, Expensive}
Preprocess --> DA3Infer[Run DA3 Inference<br/>Initial Predictions]
DA3Infer --> QualityCheck{ARKit Quality<br/>Check}
QualityCheck -->|High Quality<br/>≥ 0.8| UseARKit[Use ARKit Poses<br/>Skip BA]
QualityCheck -->|Low Quality<br/>< 0.8| RunBA[Run BA Validation<br/>Refine Poses]
UseARKit --> OracleUncertainty[Compute Oracle Uncertainty<br/>Confidence Maps]
RunBA --> OracleUncertainty
OracleUncertainty --> SelectTargets[Select Oracle Targets<br/>BA or ARKit Poses]
SelectTargets --> Cache[Save to Cache<br/>oracle_targets.npz<br/>uncertainty_results.npz]
Cache --> TrainingPhase{Training Phase<br/>Online, Fast}
TrainingPhase --> LoadCache[Load Pre-Computed<br/>Oracle Results]
LoadCache --> LoadModel[Load/Resume Model<br/>Student + Teacher]
LoadModel --> TrainingLoop[Training Loop]
TrainingLoop --> Forward[Forward Pass<br/>Student Model Inference]
Forward --> ComputeLoss[Compute Geometric Losses<br/>Multi-view: 3.0<br/>Absolute Scale: 2.5<br/>Pose: 2.0<br/>Gradient: 1.0<br/>Teacher: 0.5]
ComputeLoss --> Backward[Backward Pass<br/>Gradient Computation]
Backward --> ClipGrad[Gradient Clipping<br/>Max Norm: 1.0]
ClipGrad --> Update[Update Weights<br/>AdamW Optimizer]
Update --> UpdateTeacher[Update Teacher Model<br/>EMA Decay: 0.999]
UpdateTeacher --> Scheduler[Update Learning Rate<br/>Cosine Annealing]
Scheduler --> Checkpoint{Checkpoint<br/>Interval?}
Checkpoint -->|Every N Steps| SaveCheckpoint[Save Checkpoint<br/>Periodic + Best + Latest]
Checkpoint -->|Continue| LogMetrics[Log Metrics<br/>W&B / Console]
SaveCheckpoint --> LogMetrics
LogMetrics --> EpochComplete{Epoch<br/>Complete?}
EpochComplete -->|No| TrainingLoop
EpochComplete -->|Yes| MoreEpochs{More<br/>Epochs?}
MoreEpochs -->|Yes| TrainingLoop
MoreEpochs -->|No| SaveFinal[Save Final Checkpoint<br/>Final Model State]
SaveFinal --> Evaluate[Evaluate Model<br/>BA Agreement]
Evaluate --> Results[Training Results<br/>Metrics & Checkpoints]
Results --> Resume{Resume<br/>Training?}
Resume -->|Yes| LoadCheckpoint[Load Checkpoint<br/>latest_checkpoint.pt]
LoadCheckpoint --> LoadModel
Resume -->|No| End([End: Trained Model])
style Preprocess fill:#e1f5ff
style TrainingPhase fill:#fff4e1
style ComputeLoss fill:#ffe1f5
style SaveCheckpoint fill:#e1ffe1
style Evaluate fill:#f5e1ff
```
### Pipeline Stages
#### 1. Data Collection & Upload
- **Input**: ARKit sequences (video + metadata.json)
- **Extract**: Poses, LiDAR depth, camera intrinsics
- **Output**: Structured ARKit data
#### 2. Pre-Processing Phase (Offline)
- **DA3 Inference**: Initial depth/pose predictions (GPU)
- **Quality Check**: Evaluate ARKit tracking quality
- **BA Validation**: Run only if ARKit quality < threshold (CPU, expensive)
- **Oracle Uncertainty**: Compute confidence maps from multiple sources
- **Cache Results**: Save oracle targets and uncertainty to disk
- **Time**: ~10-20 min per sequence (one-time cost)
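The gate between ARKit poses and BA refinement is the crux of this phase. Here is a minimal sketch of the decision, using a hypothetical `Sequence` record and an injected `run_ba` callable; the authoritative logic lives in `ylff/services/preprocessing.py`:
```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Sequence:
    """Hypothetical stand-in for a preprocessed ARKit sequence."""
    arkit_poses: List[list]    # 4x4 camera-to-world matrices
    tracking_quality: float    # ARKit tracking quality in [0, 1]

ARKIT_QUALITY_THRESHOLD = 0.8  # matches the pipeline's default gate

def select_oracle_poses(
    seq: Sequence, run_ba: Callable[[Sequence], List[list]]
) -> Tuple[List[list], str]:
    """Return oracle poses and their source: 'arkit' (free) or 'ba' (expensive)."""
    if seq.tracking_quality >= ARKIT_QUALITY_THRESHOLD:
        return seq.arkit_poses, "arkit"  # high-quality tracking: skip BA
    return run_ba(seq), "ba"             # poor tracking: refine with BA
```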
#### 3. Training Phase (Online)
- **Load Cache**: Fast disk I/O of pre-computed results
- **Model Loading**: Load or resume from checkpoint (student + teacher)
- **Training Loop**:
  - Forward pass through student model
  - Compute geometric losses (primary objective)
  - Backward pass with gradient clipping
  - Update weights (AdamW optimizer)
  - Update teacher model (EMA)
  - Update learning rate (cosine scheduler)
- **Checkpointing**: Save periodic, best, and latest checkpoints
- **Logging**: Metrics to W&B and console
- **Time**: ~1-3 sec per sequence (100-1000x faster than BA)
#### 4. Evaluation & Resumption
- **Evaluation**: Test model agreement with BA
- **Resume**: Load checkpoint to continue training
- **Final Model**: Best checkpoint saved for deployment
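A minimal sketch of the save/resume cycle; the checkpoint keys (`model`, `teacher`, `optimizer`, `epoch`) are assumptions for illustration, not the confirmed on-disk format:
```python
import torch
import torch.nn as nn

# Tiny stand-ins for the student network and its EMA teacher.
model = nn.Linear(8, 8)
teacher = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# Save: periodic, best, and latest checkpoints can all share this structure.
torch.save(
    {"model": model.state_dict(), "teacher": teacher.state_dict(),
     "optimizer": optimizer.state_dict(), "epoch": 42},
    "latest_checkpoint.pt",
)

# Resume: restore all three states and continue where training stopped.
ckpt = torch.load("latest_checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
teacher.load_state_dict(ckpt["teacher"])
optimizer.load_state_dict(ckpt["optimizer"])
start_epoch = ckpt["epoch"] + 1
```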
## Key Features
### Unified Training Approach
- **Single Training Service**: `ylff/services/ylff_training.py` consolidates all training methods
- **DINOv2 Backbone**: Teacher-student paradigm with EMA teacher for stable training
- **DA3 Techniques**: Depth-ray representation, multi-resolution training
- **Geometric Losses**: Multi-view consistency, absolute scale, pose accuracy as primary objectives
### Two-Phase Pipeline
1. **Pre-Processing Phase** (offline, expensive)
   - Compute BA validation and oracle uncertainty
   - Cache results for fast training iteration
   - Can be parallelized across sequences
2. **Training Phase** (online, fast)
   - Load pre-computed oracle results
   - Train with geometric losses as the primary objective
   - 100-1000x faster than computing BA during training
### Core Components
- **BA Validation**: Validate model predictions using COLMAP Bundle Adjustment
- **ARKit Integration**: Process ARKit data with ground truth poses and LiDAR depth
- **Oracle Uncertainty**: Continuous confidence weighting (not binary rejection)
- **Geometric Losses**: Multi-view consistency, absolute scale, pose reprojection error
- **Unified Training**: Single training service with geometric consistency first
## Installation
### Basic Installation
```bash
# Clone repository
git clone <repository-url>
cd ylff
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install package
pip install -e .
# Install optional dependencies
pip install -e ".[gui]" # For GUI visualization
```
### BA Pipeline Setup
For BA validation, you need additional dependencies:
```bash
# Install BA pipeline dependencies
bash scripts/bin/setup_ba_pipeline.sh
# Or manually:
pip install pycolmap
# Install hloc from source (see docs/SETUP.md)
# Install LightGlue from source (see docs/SETUP.md)
```
See `docs/SETUP.md` for detailed installation instructions.
## Quick Start
### 1. Pre-Process ARKit Sequences
```bash
# Pre-process ARKit sequences (offline, can run overnight)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed \
    --model-name depth-anything/DA3-LARGE \
    --num-workers 8 \
    --prefer-arkit-poses
```
This computes BA and oracle uncertainty for all sequences and caches results.
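To sanity-check what was cached, the per-sequence `.npz` files can be opened directly. A sketch, where the cache directory layout is an assumption:
```python
import numpy as np

# Hypothetical per-sequence cache directory produced by `ylff preprocess arkit`.
cache_dir = "cache/preprocessed/sequence_0001"

oracle = np.load(f"{cache_dir}/oracle_targets.npz")
uncertainty = np.load(f"{cache_dir}/uncertainty_results.npz")

# Inspect what was stored (e.g. oracle poses/depth, confidence maps).
print("oracle arrays:", list(oracle.keys()))
print("uncertainty arrays:", list(uncertainty.keys()))
```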
### 2. Train with Unified Service
```bash
# Train using pre-computed results (fast iteration)
ylff train unified cache/preprocessed \
    --model-name depth-anything/DA3-LARGE \
    --epochs 200 \
    --lr 2e-4 \
    --batch-size 32 \
    --checkpoint-dir checkpoints \
    --use-wandb
```
Or use the Python API:
```python
from pathlib import Path

from ylff.services.ylff_training import train_ylff
from ylff.services.preprocessed_dataset import PreprocessedARKitDataset

# Load preprocessed dataset
dataset = PreprocessedARKitDataset(
    cache_dir="cache/preprocessed",
    arkit_sequences_dir="data/arkit_sequences",
    load_images=True,
)

# Train with unified service; da3_model is a DA3 model instance loaded beforehand
metrics = train_ylff(
    model=da3_model,
    dataset=dataset,
    epochs=200,
    lr=2e-4,
    batch_size=32,
    loss_weights={
        'geometric_consistency': 3.0,  # PRIMARY GOAL
        'absolute_scale': 2.5,         # CRITICAL
        'pose_geometric': 2.0,         # ESSENTIAL
    },
    use_wandb=True,
    checkpoint_dir=Path("checkpoints"),
)
```
### 3. Validate Sequences
```bash
# Validate a sequence of images
ylff validate sequence path/to/images \
    --model-name depth-anything/DA3-LARGE \
    --accept-threshold 2.0 \
    --reject-threshold 30.0 \
    --output results.json
```
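The two thresholds partition sequences into three verdicts. A sketch of the decision rule (the authoritative version lives in the validation service):
```python
def classify(reproj_error_px: float, accept: float = 2.0, reject: float = 30.0) -> str:
    """Map a mean reprojection error (pixels) to a validation verdict."""
    if reproj_error_px <= accept:
        return "accept"      # model agrees with BA
    if reproj_error_px >= reject:
        return "reject"      # model disagrees badly with BA
    return "uncertain"       # gray zone between the two thresholds

print(classify(1.2), classify(10.0), classify(45.0))  # accept uncertain reject
```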
### 4. Evaluate Model
```bash
# Evaluate model agreement with BA
ylff eval ba-agreement path/to/test/sequences \
    --model-name depth-anything/DA3-LARGE \
    --checkpoint checkpoints/best_model.pt \
    --threshold 2.0
```
## Training Approach
### Unified Training Service
YLFF uses a **single, unified training service** (`ylff/services/ylff_training.py`) that:
1. **Uses DINOv2's teacher-student paradigm** as the backbone
   - EMA teacher provides stable targets
   - Layer-wise learning rate decay
   - Cosine scheduler with warmup
2. **Incorporates DA3 techniques**
   - Depth-ray representation (if available)
   - Multi-resolution training support
   - Scale normalization
3. **Treats geometric consistency as a first-order goal**
   - Multi-view geometric consistency: **weight 3.0** (PRIMARY)
   - Absolute scale loss: **weight 2.5** (CRITICAL)
   - Pose geometric loss: **weight 2.0** (ESSENTIAL)
   - Gradient loss: **weight 1.0** (DA3 technique)
   - Teacher-student consistency: **weight 0.5** (STABILITY)
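A sketch of what the DINOv2-style optimizer setup amounts to: layer-wise learning-rate decay over a stand-in backbone plus a cosine schedule with linear warmup. The layer indexing, decay factor, and step counts are illustrative, not YLFF's exact configuration:
```python
import math
import torch
import torch.nn as nn

backbone = nn.Sequential(*[nn.Linear(16, 16) for _ in range(4)])  # stand-in
base_lr, layer_decay = 2e-4, 0.9

# Layer-wise LR decay: layers closer to the input get smaller learning rates.
param_groups = [
    {"params": layer.parameters(),
     "lr": base_lr * layer_decay ** (len(backbone) - 1 - i)}
    for i, layer in enumerate(backbone)
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.05)

# Cosine schedule with linear warmup.
warmup_steps, total_steps = 1_000, 100_000

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps                       # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))    # anneal to zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```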
### Experiment Tracking & Ablations
YLFF integrates **Weights & Biases (W&B)** for comprehensive experiment tracking and ablation studies:
**Logged Configuration** (per run):
- Training hyperparameters: `epochs`, `lr`, `batch_size`, `ema_decay`
- Loss weights: All component weights (`geometric_consistency`, `absolute_scale`, `pose_geometric`, `gradient_loss`, `teacher_consistency`)
- Model configuration: Task type, device, precision (FP16/BF16)

**Logged Metrics** (per step):
- **Loss Components**: All individual loss terms tracked separately
  - `total_loss`: Overall training loss
  - `geometric_consistency`: Multi-view consistency loss
  - `absolute_scale`: Absolute depth scale loss
  - `pose_geometric`: Pose reprojection error loss
  - `gradient_loss`: Depth gradient loss
  - `teacher_consistency`: Teacher-student consistency loss
- **Training State**: `step`, `epoch`, `lr` (learning rate over time)
**Ablation Study Support**:
- **Compare runs**: Filter by hyperparameters (loss weights, learning rate, etc.)
- **Track component contributions**: See how each loss component evolves
- **Hyperparameter sweeps**: Use W&B sweeps to systematically explore configurations
- **Reproducibility**: All hyperparameters logged in config for exact reproduction
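In code, the logging described above reduces to a config dict at init and one `wandb.log` call per step. A sketch with placeholder metric values:
```python
import wandb

run = wandb.init(
    project="ylff-training",
    config={
        "epochs": 200, "lr": 2e-4, "batch_size": 32, "ema_decay": 0.999,
        "loss_weights": {
            "geometric_consistency": 3.0, "absolute_scale": 2.5,
            "pose_geometric": 2.0, "gradient_loss": 1.0,
            "teacher_consistency": 0.5,
        },
    },
)

# Per-step logging of every loss component (values here are placeholders).
wandb.log(
    {
        "total_loss": 1.23,
        "geometric_consistency": 0.45,
        "absolute_scale": 0.30,
        "pose_geometric": 0.25,
        "gradient_loss": 0.15,
        "teacher_consistency": 0.08,
        "epoch": 3,
        "lr": 1.8e-4,
    },
    step=1200,
)
run.finish()
```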
**Example Ablation Workflow**:
```bash
# Run 1: Baseline (default geometric-first weights)
ylff train unified cache/preprocessed \
    --epochs 200 \
    --use-wandb \
    --wandb-project ylff-ablations \
    --wandb-name baseline-geometric-first

# Run 2: Ablation: lower geometric consistency weight
ylff train unified cache/preprocessed \
    --epochs 200 \
    --use-wandb \
    --wandb-project ylff-ablations \
    --wandb-name ablation-lower-geo-weight \
    --loss-weight-geometric-consistency 1.0  # vs default 3.0

# Run 3: Ablation: no teacher-student consistency
ylff train unified cache/preprocessed \
    --epochs 200 \
    --use-wandb \
    --wandb-project ylff-ablations \
    --wandb-name ablation-no-teacher \
    --loss-weight-teacher-consistency 0.0  # Disable teacher loss

# Compare in W&B dashboard:
# - Filter by project: "ylff-ablations"
# - Compare loss curves across runs
# - Analyze which loss components matter most
```
**W&B Dashboard Features**:
- **Parallel coordinates plot**: Visualize hyperparameter relationships
- **Loss curves**: Compare training dynamics across ablations
- **Component analysis**: See contribution of each loss term
- **Best run identification**: Automatically identify best configurations
### Suggested Ablation Studies
Based on YLFF's architecture, here are key ablation experiments to validate our design choices:
#### 1. Loss Weight Ablations (Geometric Consistency First)
**Question**: How critical is treating geometric consistency as a first-order goal?
```python
from ylff.services.ylff_training import train_ylff
from ylff.services.preprocessed_dataset import PreprocessedARKitDataset

# model and dataset are prepared as in the Quick Start example above

# Baseline: Geometric-first (default)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
    loss_weights={
        'geometric_consistency': 3.0,  # PRIMARY GOAL
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.5,
    },
)

# Ablation 1: Equal weights (traditional approach)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
    loss_weights={
        'geometric_consistency': 1.0,  # Equal weight
        'absolute_scale': 1.0,
        'pose_geometric': 1.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.5,
    },
)

# Ablation 2: Perceptual-first (reverse priority)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
    loss_weights={
        'geometric_consistency': 0.5,  # Lower priority
        'absolute_scale': 0.5,
        'pose_geometric': 0.5,
        'gradient_loss': 3.0,  # Emphasize smoothness
        'teacher_consistency': 0.5,
    },
)

# Ablation 3: Remove geometric consistency entirely
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
    loss_weights={
        'geometric_consistency': 0.0,  # Disabled
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.5,
    },
)
```
**Metrics to Compare**:
- Final geometric consistency loss
- BA agreement (reprojection error)
- Absolute scale accuracy (vs LiDAR)
- Multi-view reconstruction quality
#### 2. Teacher-Student Ablation
**Question**: Does EMA teacher provide training stability and better convergence?
```python
# model and dataset are prepared as in the Quick Start example above

# Baseline: With EMA teacher (default ema_decay=0.999)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    ema_decay=0.999,
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 1: No teacher-student (ema_decay=0.0)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    ema_decay=0.0,  # No EMA updates
    loss_weights={
        'geometric_consistency': 3.0,
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.0,  # Disable teacher loss
    },
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 2: Faster teacher updates (ema_decay=0.99)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    ema_decay=0.99,  # Faster updates
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 3: Slower teacher updates (ema_decay=0.9999)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    ema_decay=0.9999,  # Slower updates
    use_wandb=True,
    wandb_project="ylff-ablations",
)
```
**Metrics to Compare**:
- Training stability (loss variance)
- Convergence speed
- Final model quality
- Teacher-student consistency loss
#### 3. Oracle Source Ablation (BA vs ARKit)
**Question**: How much does BA refinement improve over ARKit poses?
```bash
# Baseline: Use BA when ARKit quality < 0.8 (default)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed-ba \
    --prefer-arkit-poses --min-arkit-quality 0.8
ylff train unified cache/preprocessed-ba \
    --use-wandb --wandb-project ylff-ablations

# Ablation 1: Always use ARKit (no BA, faster preprocessing)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed-arkit-only \
    --prefer-arkit-poses --min-arkit-quality 0.0
ylff train unified cache/preprocessed-arkit-only \
    --use-wandb --wandb-project ylff-ablations

# Ablation 2: Always use BA (expensive but highest quality)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed-ba-always \
    --prefer-arkit-poses --min-arkit-quality 1.0  # Never use ARKit
ylff train unified cache/preprocessed-ba-always \
    --use-wandb --wandb-project ylff-ablations
```
**Metrics to Compare**:
- Pose accuracy (reprojection error)
- Training data quality (confidence scores)
- Final model performance
- Preprocessing time cost
#### 4. Uncertainty Weighting Ablation
**Question**: Does confidence-weighted loss improve training vs uniform weighting?
```bash
# Baseline: With uncertainty weighting (default)
# Uses depth_confidence and pose_confidence from preprocessing
# Ablation: Uniform weighting (ignore uncertainty)
# Modify preprocessing to set all confidence = 1.0
# Or modify loss computation to ignore confidence maps
```
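A sketch of the two variants being compared, assuming per-pixel confidence maps in `[0, 1]` from the cached oracle uncertainty; the real weighting lives in `ylff/utils/oracle_losses.py`:
```python
import torch

def depth_loss(pred: torch.Tensor, target: torch.Tensor,
               conf: torch.Tensor, use_confidence: bool = True) -> torch.Tensor:
    """L1 depth loss, optionally weighted by oracle confidence."""
    err = (pred - target).abs()
    if not use_confidence:
        return err.mean()                                    # ablation: uniform
    return (conf * err).sum() / conf.sum().clamp(min=1e-6)   # confidence-weighted

pred, target, conf = torch.rand(2, 1, 8, 8), torch.rand(2, 1, 8, 8), torch.rand(2, 1, 8, 8)
print(depth_loss(pred, target, conf))
print(depth_loss(pred, target, conf, use_confidence=False))
```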
**Metrics to Compare**:
- Loss on high-confidence vs low-confidence regions
- Model performance on uncertain scenes
- Training stability
#### 5. Multi-View Consistency Ablation
**Question**: How many views are needed for effective geometric consistency?
```python
# model, dataset, and single_view_dataset are prepared beforehand

# Baseline: Variable views (2-18, default from dataset)
train_ylff(
    model=model,
    dataset=dataset,  # Uses all available views
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 1: Single view only (disable geometric consistency)
train_ylff(
    model=model,
    dataset=single_view_dataset,  # Modified dataset with 1 view
    epochs=200,
    loss_weights={
        'geometric_consistency': 0.0,  # Disabled (needs 2+ views)
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.5,
    },
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablations 2-4: Fixed N views
# Modify the dataset to sample exactly N views per sequence.
# Compare: 2 views, 5 views, 10 views, 18 views.
```
**Metrics to Compare**:
- Geometric consistency loss
- Multi-view reconstruction accuracy
- Training efficiency (more views = slower)
#### 6. DA3 Techniques Ablation
**Question**: Which DA3 techniques contribute most?
```python
# model and dataset are prepared as in the Quick Start example above

# Baseline: All DA3 techniques enabled
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 1: No gradient loss (DA3 edge preservation)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    loss_weights={
        'geometric_consistency': 3.0,
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 0.0,  # Disabled
        'teacher_consistency': 0.5,
    },
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 2: No depth-ray representation
# Use a model that outputs separate depth + poses instead of depth-ray
# (requires a different model architecture).

# Ablation 3: Fixed resolution (no multi-resolution training)
# Modify the dataset to use a fixed resolution instead of variable.
```
**Metrics to Compare**:
- Depth edge quality (gradient loss ablation)
- Training efficiency (multi-resolution ablation)
- Model generalization
#### 7. Preprocessing Phase Ablation
**Question**: How much does the two-phase pipeline improve training efficiency?
```bash
# Baseline: With preprocessing (fast training)
ylff preprocess arkit data/arkit_sequences --output-cache cache/preprocessed
ylff train unified cache/preprocessed \
    --use-wandb --wandb-project ylff-ablations \
    --wandb-name baseline-with-preprocessing

# Ablation: Live BA during training (slow but no preprocessing)
# This would require modifying training to compute BA on-the-fly.
# Compare: training time per epoch, total training time.
```
**Metrics to Compare**:
- Training time per epoch
- Total training time
- Model quality (should be similar, preprocessing is just optimization)
#### 8. Loss Component Contribution Analysis
**Question**: Which loss component contributes most to final model quality?
Run systematic sweeps using W&B sweeps or Python script:
```yaml
# sweep_config.yaml
program: train_ablation_sweep.py
method: grid
parameters:
  loss_weight_geometric_consistency:
    values: [0.0, 1.0, 2.0, 3.0, 4.0]
  loss_weight_absolute_scale:
    values: [0.0, 1.0, 2.0, 2.5, 3.0]
  loss_weight_pose_geometric:
    values: [0.0, 1.0, 2.0, 3.0]
  loss_weight_gradient_loss:
    values: [0.0, 0.5, 1.0, 1.5]
  loss_weight_teacher_consistency:
    values: [0.0, 0.25, 0.5, 0.75, 1.0]
```

```python
# train_ablation_sweep.py
import wandb
from ylff.services.ylff_training import train_ylff

# model and dataset are prepared as in the Quick Start example above
wandb.init()
config = wandb.config
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    loss_weights={
        'geometric_consistency': config.loss_weight_geometric_consistency,
        'absolute_scale': config.loss_weight_absolute_scale,
        'pose_geometric': config.loss_weight_pose_geometric,
        'gradient_loss': config.loss_weight_gradient_loss,
        'teacher_consistency': config.loss_weight_teacher_consistency,
    },
    use_wandb=True,
    wandb_project="ylff-ablations",
)
```

Launch with `wandb sweep sweep_config.yaml`, then start one or more agents with `wandb agent <sweep-id>`.
**Analysis**:
- Use W&B parallel coordinates plot to find optimal weight combinations
- Identify which components are essential vs optional
- Find Pareto frontier (best quality for given training time)
#### Recommended Ablation Order
1. **Start with Loss Weight Ablations** (#1) - Most fundamental to our approach
2. **Teacher-Student Ablation** (#2) - Validates DINOv2 adaptation
3. **Oracle Source Ablation** (#3) - Validates preprocessing strategy
4. **Component Contribution** (#8) - Systematic analysis
5. **DA3 Techniques** (#6) - Validates DA3 integration
6. **Multi-View Consistency** (#5) - Optimizes training efficiency
7. **Uncertainty Weighting** (#4) - Fine-tuning
8. **Preprocessing Phase** (#7) - Efficiency validation
Each ablation should be run with:
- Same random seed (for reproducibility)
- Same dataset split
- Same number of epochs
- W&B tracking enabled for easy comparison
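A small seeding helper in this spirit (a sketch; YLFF may ship its own utility):
```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Fix all RNG sources so runs differ only in the factor under study."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)           # no-op on CPU-only machines
    torch.backends.cudnn.deterministic = True  # trade speed for determinism
    torch.backends.cudnn.benchmark = False
```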
## Training Datasets
Depth Anything 3 (DA3) was trained exclusively on **public academic datasets**. The following table documents all datasets used in DA3 training, their sources, and availability status for YLFF:
| Dataset | # Scenes | Data Type | Source / URL | YLFF Status | Notes |
| --- | --- | --- | --- | --- | --- |
| **Synthetic Datasets** | | | | | |
| AriaDigitalTwin | 237 | Synthetic | [Aria Digital Twin](https://github.com/facebookresearch/AriaDigitalTwin) | ❌ Not Available | Meta's AR dataset |
| AriaSyntheticENV | 99,950 | Synthetic | [Aria Synthetic](https://github.com/facebookresearch/AriaDigitalTwin) | ❌ Not Available | Large-scale synthetic AR |
| HyperSim | 344 | Synthetic | [HyperSim](https://github.com/apple/ml-hypersim) | ❌ Not Available | Apple's photorealistic dataset |
| MegaSynth | 6,049 | Synthetic | Unknown | ⚠️ To Verify | Synthetic multi-view |
| MvsSynth | 121 | Synthetic | Unknown | ⚠️ To Verify | Multi-view stereo synthetic |
| Objaverse | 505,557 | Synthetic | [Objaverse](https://objaverse.allenai.org/) | ⚠️ To Verify | Large-scale 3D objects |
| Omniobject | 5,885 | Synthetic | [OmniObject3D](https://omniobject3d.github.io/) | ⚠️ To Verify | Object-centric dataset |
| OmniWorld | 1,039 | Synthetic | [OmniWorld](https://arxiv.org/abs/2509.12201) | ⚠️ To Verify | Multi-domain dataset |
| PointOdyssey | 44 | Synthetic | [PointOdyssey](https://pointodyssey.com/) | ⚠️ To Verify | Long-term point tracking |
| ReplicaVMAP | 17 | Synthetic | [Replica](https://github.com/facebookresearch/Replica-Dataset) | ⚠️ To Verify | Indoor scene dataset |
| ScenenetRGBD | 16,866 | Synthetic | [SceneNet RGB-D](https://robotvault.bitbucket.io/scenenet-rgbd.html) | ⚠️ To Verify | Indoor RGB-D scenes |
| TartanAir | 355 | Synthetic | [TartanAir](https://theairlab.org/tartanair-dataset/) | ⚠️ To Verify | Large-scale simulation |
| Trellis | 557,408 | Synthetic | Unknown | ⚠️ To Verify | Large-scale synthetic |
| vKitti2 | 50 | Synthetic | [vKITTI2](https://europe.naverlabs.com/research/computer-vision/proxy-virtual-worlds-vkitti-2/) | ⚠️ To Verify | Virtual KITTI |
| **Real-World Datasets (LiDAR)** | | | | | |
| ARKitScenes | 4,388 | LiDAR | [ARKitScenes](https://github.com/apple/ARKitScenes) | ✅ **Available** | **Primary dataset for YLFF** |
| ScanNet++ | 230 | LiDAR | [ScanNet++](https://github.com/ScanNet/ScanNetPlusPlus) | ⚠️ To Verify | High-fidelity indoor |
| WildRGBD | 23,050 | LiDAR | [WildRGBD](https://wildrgbd.github.io/) | ⚠️ To Verify | Large-scale RGB-D |
| **Real-World Datasets (COLMAP/SfM)** | | | | | |
| BlendedMVS | 503 | 3D Recon | [BlendedMVS](https://github.com/YoYo000/BlendedMVS) | ⚠️ To Verify | Multi-view stereo |
| Co3dv2 | 30,616 | COLMAP | [Common Objects in 3D](https://github.com/facebookresearch/co3d) | ⚠️ To Verify | Object-centric |
| DL3DV | 6,379 | COLMAP | [DL3DV-10K](https://github.com/OpenGVLab/DL3DV) | ⚠️ To Verify | Large-scale 3D vision |
| MapFree | 921 | COLMAP | [Map-free Visual Relocalization](https://github.com/nianticlabs/map-free-reloc) | ⚠️ To Verify | Visual relocalization |
| MegaDepth | 268 | COLMAP | [MegaDepth](https://www.cs.cornell.edu/projects/megadepth/) | ⚠️ To Verify | Internet photos |

**Legend:**
- ✅ **Available**: Dataset is accessible and can be used for YLFF training
- ❌ **Not Available**: Dataset is not accessible (proprietary, requires special access, etc.)
- ⚠️ **To Verify**: Dataset availability needs to be confirmed
### Dataset Statistics
**Total Training Data:**
- **Synthetic**: ~1,093,000 scenes (majority from Objaverse and Trellis)
- **Real-World LiDAR**: ~27,668 scenes (ARKitScenes, ScanNet++, WildRGBD)
- **Real-World COLMAP**: ~38,687 scenes (BlendedMVS, Co3dv2, DL3DV, MapFree, MegaDepth)
- **Total**: ~1,159,355 scenes
**Data Type Distribution:**
- **Synthetic**: 94.3% (provides high-quality dense depth)
- **LiDAR**: 2.4% (provides metric accuracy)
- **COLMAP/SfM**: 3.3% (provides multi-view geometry)
### YLFF Dataset Strategy
YLFF currently focuses on **ARKitScenes** as the primary training dataset because:
1. ✅ **Available**: Publicly accessible dataset
2. ✅ **High Quality**: LiDAR depth provides metric accuracy
3. ✅ **Real-World**: Captures real indoor scenes with natural variations
4. ✅ **Rich Metadata**: Includes poses, intrinsics, and LiDAR depth
5. ✅ **Large Scale**: 4,388 scenes provide substantial training data
**Future Dataset Integration:**
- Priority: ScanNet++, WildRGBD (LiDAR datasets for metric accuracy)
- Secondary: DL3DV, Co3dv2 (COLMAP datasets for multi-view geometry)
- Synthetic: Consider for teacher model training (if accessible)
### Dataset Access Notes
- **ARKitScenes**: Download from [official repository](https://github.com/apple/ARKitScenes)
- **ScanNet++**: Requires registration and approval
- **COLMAP datasets**: Most are publicly available but may require preprocessing
- **Synthetic datasets**: Many require special access or are proprietary
For detailed dataset preparation and preprocessing instructions, see `docs/DATASET_PREPARATION.md` (to be created).
### Loss Components
The training uses geometric losses as the primary objective:
1. **Multi-View Geometric Consistency** (weight: 3.0)
   - Enforces that the same 3D point projects correctly across views
   - Uses back-projection + projection across multiple views
   - **This is treated as a first-order objective, not regularization**
2. **Absolute Scale Loss** (weight: 2.5)
   - Direct supervision from LiDAR/BA depth
   - Enforces correct absolute depth values in meters
   - Critical for metric accuracy
3. **Pose Geometric Loss** (weight: 2.0)
   - Reprojection error using predicted poses
   - Enforces geometric consistency between poses and depth
   - Multi-view pose consistency is paramount
4. **Gradient Loss** (weight: 1.0)
   - Preserves sharp depth boundaries
   - Ensures smoothness in planar regions
   - DA3 technique for better depth quality
5. **Teacher-Student Consistency** (weight: 0.5)
   - L1 loss between student and teacher predictions
   - Encourages stable training
   - Prevents the student from diverging
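To make the primary loss concrete, here is a deliberately simplified sketch of two-view consistency: back-project view i's pixels with its predicted depth, transform them into view j, re-project, and compare against view j's predicted depth at those pixels. Shapes, nearest-neighbor sampling, and the function signature are illustrative; the actual implementation lives in `ylff/utils/geometric_losses.py`:
```python
import torch

def two_view_consistency(depth_i: torch.Tensor, depth_j: torch.Tensor,
                         K: torch.Tensor, T_ji: torch.Tensor) -> torch.Tensor:
    """depth_*: (H, W) metric depth; K: (3, 3) intrinsics; T_ji: (4, 4) pose i->j."""
    H, W = depth_i.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3)

    # Back-project to 3D in view i's frame, then transform into view j's frame.
    pts_i = (pix @ torch.linalg.inv(K).T) * depth_i.unsqueeze(-1)
    pts_h = torch.cat([pts_i, torch.ones(H, W, 1)], dim=-1)        # homogeneous
    pts_j = (pts_h @ T_ji.T)[..., :3]

    # Project into view j; nearest-neighbor lookup of its predicted depth.
    proj = pts_j @ K.T
    z = proj[..., 2].clamp(min=1e-6)
    uj = (proj[..., 0] / z).round().long().clamp(0, W - 1)
    vj = (proj[..., 1] / z).round().long().clamp(0, H - 1)

    # The same 3D point must have the same depth seen from view j.
    return (depth_j[vj, uj] - z).abs().mean()
```
A full version would also mask occluded and out-of-view pixels; the sketch omits that for brevity.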
## Project Structure
```
ylff/
├── ylff/                             # Main package
│   ├── services/                     # Business logic
│   │   ├── ylff_training.py          # ⭐ Unified training service
│   │   ├── preprocessing.py          # Offline preprocessing (BA, uncertainty)
│   │   ├── preprocessed_dataset.py   # Dataset for pre-computed results
│   │   ├── ba_validator.py           # BA validation pipeline
│   │   ├── arkit_processor.py        # ARKit data processing
│   │   ├── evaluate.py               # Evaluation metrics
│   │   └── ...                       # Other services
│   │
│   ├── utils/                        # Utilities
│   │   ├── geometric_losses.py       # Geometric loss functions
│   │   ├── oracle_uncertainty.py     # Oracle uncertainty propagation
│   │   ├── oracle_losses.py          # Oracle-weighted losses
│   │   └── ...                       # Other utilities
│   │
│   ├── routers/                      # FastAPI route handlers
│   ├── models/                       # Pydantic API models
│   └── cli.py                        # Command-line interface
│
├── configs/                          # Configuration files
│   ├── dinov2_train_config.yaml      # Training configuration
│   └── ba_config.yaml                # BA pipeline configuration
│
├── docs/                             # Documentation
│   ├── UNIFIED_TRAINING.md           # Unified training guide
│   ├── TRAINING_PIPELINE_ARCHITECTURE.md
│   └── ...                           # Other documentation
│
└── research_docs/                    # Research documentation
    └── MODEL_ARCH.md                 # Model architecture details
```
## CLI Commands
### Preprocessing
- `ylff preprocess arkit <dir>` - Pre-process ARKit sequences (offline)
### Training
- `ylff train unified <cache_dir>` - Train using unified training service
### Validation
- `ylff validate sequence <dir>` - Validate a single sequence
- `ylff validate arkit <dir> [--gui]` - Validate ARKit data (with optional GUI)
### Evaluation
- `ylff eval ba-agreement <dir>` - Evaluate model agreement with BA
### Visualization
- `ylff visualize <results_dir>` - Generate static visualizations
## Complete Workflow
### Step 1: Pre-Process All Sequences
```bash
# Pre-process all ARKit sequences (one-time, can run overnight)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed \
    --model-name depth-anything/DA3-LARGE \
    --num-workers 8 \
    --prefer-arkit-poses \
    --use-lidar
```
This:
- Extracts ARKit data (poses, LiDAR depth) - FREE
- Runs DA3 inference (GPU, batchable)
- Runs BA only for sequences with poor ARKit tracking
- Computes oracle uncertainty
- Saves everything to cache
### Step 2: Train with Unified Service
```bash
# Train using pre-computed results (fast iteration)
ylff train unified cache/preprocessed \
    --model-name depth-anything/DA3-LARGE \
    --epochs 200 \
    --lr 2e-4 \
    --batch-size 32 \
    --checkpoint-dir checkpoints \
    --use-wandb \
    --wandb-project ylff-training
```
This:
- Loads pre-computed oracle results (fast, disk I/O)
- Runs DA3 inference (current model, GPU)
- Computes geometric losses (primary objective)
- Updates model weights with teacher-student learning
### Step 3: Evaluate
```bash
# Evaluate fine-tuned model
ylff eval ba-agreement data/test \
    --checkpoint checkpoints/best_model.pt
```
## Configuration
Configuration files are in `configs/`:
- `dinov2_train_config.yaml` - Unified training configuration
  - Optimizer settings (DINOv2 style)
  - Loss weights (geometric consistency first)
  - Teacher-student settings
  - Multi-resolution and multi-view training
- `ba_config.yaml` - BA pipeline settings
## Documentation
- **Unified Training**: `docs/UNIFIED_TRAINING.md` - Complete guide to unified training
- **Training Pipeline**: `docs/TRAINING_PIPELINE_ARCHITECTURE.md` - Two-phase pipeline architecture
- **Model Architecture**: `research_docs/MODEL_ARCH.md` - Detailed architecture and training approach
- **API Documentation**: `docs/API.md` - API reference
- **ARKit Integration**: `docs/ARKIT_INTEGRATION.md` - ARKit data processing
## Key Design Decisions
### Why Geometric Consistency First?
Traditional depth estimation models prioritize perceptual quality (how plausible the depth map looks) over geometric accuracy (whether absolute scale and multi-view consistency actually hold). YLFF reverses this priority:
- **Geometric consistency** ensures that the same 3D point projects correctly across views
- **Absolute scale** ensures metric accuracy (depth in meters, not just relative)
- **Pose consistency** ensures that predicted poses align with depth predictions
This approach is essential for applications requiring accurate 3D reconstruction, SLAM, and metric depth estimation.
### Why Two-Phase Pipeline?
BA computation is expensive (5-15 minutes per sequence), far too slow to run inside the training loop. The two-phase pipeline:
1. **Pre-processing** (offline): Compute BA once, cache results
2. **Training** (online): Load cached results, train fast
This enables 100-1000x faster training iteration while still using BA as supervision.
### Why Teacher-Student Learning?
DINOv2's teacher-student paradigm provides:
- **Stability**: EMA teacher prevents training instability
- **Better convergence**: Teacher provides stable targets
- **Scalability**: Works well with large-scale training
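The EMA update itself is one line per parameter; a minimal sketch with the default decay of 0.999:
```python
import torch
import torch.nn as nn

@torch.no_grad()
def update_teacher(teacher: nn.Module, student: nn.Module,
                   decay: float = 0.999) -> None:
    """teacher <- decay * teacher + (1 - decay) * student, parameter-wise."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)

student, teacher = nn.Linear(4, 4), nn.Linear(4, 4)
update_teacher(teacher, student)  # called once per optimizer step
```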
## Development
### Running Tests
```bash
# Basic smoke test
python scripts/tests/smoke_test_basic.py
# GUI test
python scripts/tests/test_gui_simple.py
```
### Code Quality
```bash
# Format code
black ylff/ scripts/
# Sort imports
isort ylff/ scripts/
# Type checking
mypy ylff/
```
## Dependencies
### Core Dependencies
- PyTorch >= 2.0
- NumPy < 2.0
- OpenCV
- pycolmap >= 0.4.0
- Typer (for CLI)
### Optional Dependencies
- **GUI**: Plotly (for interactive 3D plots)
- **BA Pipeline**: hloc, LightGlue (installed from source)
- **Training**: Weights & Biases (for experiment tracking)
See `pyproject.toml` for complete dependency list.
## License
Apache-2.0
## Citation
If you use YLFF in your research, please cite:
```bibtex
@software{ylff2024,
  title={You Learn From Failure: Geometric Consistency First Training for Visual Geometry},
  author={YLFF Contributors},
  year={2024},
  url={https://github.com/your-org/ylff}
}
```
## References
- **DINOv2**: https://github.com/facebookresearch/dinov2
- **DA3 Paper**: Depth Anything 3 (arXiv:2511.10647)
- **Unified Training**: `ylff/services/ylff_training.py`
- **Model Architecture**: `research_docs/MODEL_ARCH.md`