---
title: YLFF Training
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---

# You Learn From Failure (YLFF)

**Geometric Consistency First: Training Visual Geometry Models with BA Supervision**

## Overview

YLFF is a unified framework for training geometrically accurate depth estimation models using Bundle Adjustment (BA) and LiDAR as oracle teachers. Unlike traditional approaches that prioritize perceptual quality, YLFF treats **geometric consistency as a first-order goal**.

### Core Philosophy

**Geometric Accuracy > Perceptual Quality**

- Multi-view geometric consistency is the **primary objective** (not just regularization)
- Absolute scale accuracy is **critical** for metric depth estimation
- Multi-view pose consistency is **essential** for 3D reconstruction
- Teacher-student learning provides **stability** during training

## End-to-End Pipeline

The complete YLFF pipeline from data collection to trained model:

```mermaid
flowchart TD
    Start([Start: Data Collection]) --> Upload[Upload ARKit Sequences]
    Upload --> Extract[Extract ARKit Data<br/>Poses, LiDAR, Intrinsics]
    Extract --> Preprocess{Pre-Processing Phase<br/>Offline, Expensive}
    Preprocess --> DA3Infer[Run DA3 Inference<br/>Initial Predictions]
    DA3Infer --> QualityCheck{ARKit Quality<br/>Check}
    QualityCheck -->|High Quality<br/>≥ 0.8| UseARKit[Use ARKit Poses<br/>Skip BA]
    QualityCheck -->|Low Quality<br/>< 0.8| RunBA[Run BA Validation<br/>Refine Poses]
    UseARKit --> OracleUncertainty[Compute Oracle Uncertainty<br/>Confidence Maps]
    RunBA --> OracleUncertainty
    OracleUncertainty --> SelectTargets[Select Oracle Targets<br/>BA or ARKit Poses]
    SelectTargets --> Cache[Save to Cache<br/>oracle_targets.npz<br/>uncertainty_results.npz]
    Cache --> TrainingPhase{Training Phase<br/>Online, Fast}
    TrainingPhase --> LoadCache[Load Pre-Computed<br/>Oracle Results]
    LoadCache --> LoadModel[Load/Resume Model<br/>Student + Teacher]
    LoadModel --> TrainingLoop[Training Loop]
    TrainingLoop --> Forward[Forward Pass<br/>Student Model Inference]
    Forward --> ComputeLoss[Compute Geometric Losses<br/>Multi-view: 3.0<br/>Absolute Scale: 2.5<br/>Pose: 2.0<br/>Gradient: 1.0<br/>Teacher: 0.5]
    ComputeLoss --> Backward[Backward Pass<br/>Gradient Computation]
    Backward --> ClipGrad[Gradient Clipping<br/>Max Norm: 1.0]
    ClipGrad --> Update[Update Weights<br/>AdamW Optimizer]
    Update --> UpdateTeacher[Update Teacher Model<br/>EMA Decay: 0.999]
    UpdateTeacher --> Scheduler[Update Learning Rate<br/>Cosine Annealing]
    Scheduler --> Checkpoint{Checkpoint<br/>Interval?}
    Checkpoint -->|Every N Steps| SaveCheckpoint[Save Checkpoint<br/>Periodic + Best + Latest]
    Checkpoint -->|Continue| LogMetrics[Log Metrics<br/>W&B / Console]
    SaveCheckpoint --> LogMetrics
    LogMetrics --> EpochComplete{Epoch<br/>Complete?}
    EpochComplete -->|No| TrainingLoop
    EpochComplete -->|Yes| MoreEpochs{More<br/>Epochs?}
    MoreEpochs -->|Yes| TrainingLoop
    MoreEpochs -->|No| SaveFinal[Save Final Checkpoint<br/>Final Model State]
    SaveFinal --> Evaluate[Evaluate Model<br/>BA Agreement]
    Evaluate --> Results[Training Results<br/>Metrics & Checkpoints]
    Results --> Resume{Resume<br/>Training?}
    Resume -->|Yes| LoadCheckpoint[Load Checkpoint<br/>latest_checkpoint.pt]
    LoadCheckpoint --> LoadModel
    Resume -->|No| End([End: Trained Model])

    style Preprocess fill:#e1f5ff
    style TrainingPhase fill:#fff4e1
    style ComputeLoss fill:#ffe1f5
    style SaveCheckpoint fill:#e1ffe1
    style Evaluate fill:#f5e1ff
```

### Pipeline Stages

#### 1. Data Collection & Upload

- **Input**: ARKit sequences (video + metadata.json)
- **Extract**: Poses, LiDAR depth, camera intrinsics
- **Output**: Structured ARKit data

#### 2. Pre-Processing Phase (Offline)

- **DA3 Inference**: Initial depth/pose predictions (GPU)
- **Quality Check**: Evaluate ARKit tracking quality
- **BA Validation**: Run only if ARKit quality < threshold (CPU, expensive)
- **Oracle Uncertainty**: Compute confidence maps from multiple sources
- **Cache Results**: Save oracle targets and uncertainty to disk
- **Time**: ~10-20 min per sequence (one-time cost)

#### 3. Training Phase (Online)

- **Load Cache**: Fast disk I/O of pre-computed results
- **Model Loading**: Load or resume from checkpoint (student + teacher)
- **Training Loop**:
  - Forward pass through student model
  - Compute geometric losses (primary objective)
  - Backward pass with gradient clipping
  - Update weights (AdamW optimizer)
  - Update teacher model (EMA)
  - Update learning rate (cosine scheduler)
- **Checkpointing**: Save periodic, best, and latest checkpoints
- **Logging**: Metrics to W&B and console
- **Time**: ~1-3 sec per sequence (100-1000x faster than BA)

#### 4. Evaluation & Resumption

- **Evaluation**: Test model agreement with BA
- **Resume**: Load checkpoint to continue training
- **Final Model**: Best checkpoint saved for deployment

## Key Features

### 🎯 Unified Training Approach

- **Single Training Service**: `ylff/services/ylff_training.py` consolidates all training methods
- **DINOv2 Backbone**: Teacher-student paradigm with EMA teacher for stable training
- **DA3 Techniques**: Depth-ray representation, multi-resolution training
- **Geometric Losses**: Multi-view consistency, absolute scale, pose accuracy as primary objectives

### 📊 Two-Phase Pipeline

1. **Pre-Processing Phase** (offline, expensive)
   - Compute BA validation and oracle uncertainty
   - Cache results for fast training iteration
   - Can be parallelized across sequences
2. **Training Phase** (online, fast)
   - Load pre-computed oracle results
   - Train with geometric losses as primary objective
   - 100-1000x faster than computing BA during training

### 🔧 Core Components

- **BA Validation**: Validate model predictions using COLMAP Bundle Adjustment
- **ARKit Integration**: Process ARKit data with ground truth poses and LiDAR depth
- **Oracle Uncertainty**: Continuous confidence weighting (not binary rejection)
- **Geometric Losses**: Multi-view consistency, absolute scale, pose reprojection error
- **Unified Training**: Single training service with geometric consistency first

## Installation

### Basic Installation

```bash
# Clone repository
git clone
cd ylff

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install package
pip install -e .

# Install optional dependencies
pip install -e ".[gui]"  # For GUI visualization
```

### BA Pipeline Setup

For BA validation, you need additional dependencies:

```bash
# Install BA pipeline dependencies
bash scripts/bin/setup_ba_pipeline.sh

# Or manually:
pip install pycolmap
# Install hloc from source (see docs/SETUP.md)
# Install LightGlue from source (see docs/SETUP.md)
```

See `docs/SETUP.md` for detailed installation instructions.

## Quick Start

### 1. Pre-Process ARKit Sequences

```bash
# Pre-process ARKit sequences (offline, can run overnight)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed \
    --model-name depth-anything/DA3-LARGE \
    --num-workers 8 \
    --prefer-arkit-poses
```

This computes BA and oracle uncertainty for all sequences and caches results.

### 2. Train with Unified Service

```bash
# Train using pre-computed results (fast iteration)
ylff train unified cache/preprocessed \
    --model-name depth-anything/DA3-LARGE \
    --epochs 200 \
    --lr 2e-4 \
    --batch-size 32 \
    --checkpoint-dir checkpoints \
    --use-wandb
```

Or use the Python API:

```python
from pathlib import Path

from ylff.services.ylff_training import train_ylff
from ylff.services.preprocessed_dataset import PreprocessedARKitDataset

# Load preprocessed dataset
dataset = PreprocessedARKitDataset(
    cache_dir="cache/preprocessed",
    arkit_sequences_dir="data/arkit_sequences",
    load_images=True,
)

# Train with unified service
# NOTE: `da3_model` must already be loaded; model loading is not shown here.
metrics = train_ylff(
    model=da3_model,
    dataset=dataset,
    epochs=200,
    lr=2e-4,
    batch_size=32,
    loss_weights={
        'geometric_consistency': 3.0,  # PRIMARY GOAL
        'absolute_scale': 2.5,         # CRITICAL
        'pose_geometric': 2.0,         # ESSENTIAL
    },
    use_wandb=True,
    checkpoint_dir=Path("checkpoints"),
)
```

### 3. Validate Sequences

```bash
# Validate a sequence of images
ylff validate sequence path/to/images \
    --model-name depth-anything/DA3-LARGE \
    --accept-threshold 2.0 \
    --reject-threshold 30.0 \
    --output results.json
```

### 4. Evaluate Model

```bash
# Evaluate model agreement with BA
ylff eval ba-agreement path/to/test/sequences \
    --model-name depth-anything/DA3-LARGE \
    --checkpoint checkpoints/best_model.pt \
    --threshold 2.0
```

## Training Approach

### Unified Training Service

YLFF uses a **single, unified training service** (`ylff/services/ylff_training.py`) that:

1. **Uses DINOv2's teacher-student paradigm** as the backbone
   - EMA teacher provides stable targets
   - Layer-wise learning rate decay
   - Cosine scheduler with warmup
2. **Incorporates DA3 techniques**
   - Depth-ray representation (if available)
   - Multi-resolution training support
   - Scale normalization
3. **Treats geometric consistency as first-order goal**
   - Multi-view geometric consistency: **weight 3.0** (PRIMARY)
   - Absolute scale loss: **weight 2.5** (CRITICAL)
   - Pose geometric loss: **weight 2.0** (ESSENTIAL)
   - Gradient loss: **weight 1.0** (DA3 technique)
   - Teacher-student consistency: **weight 0.5** (STABILITY)

### Experiment Tracking & Ablations

YLFF integrates **Weights & Biases (W&B)** for comprehensive experiment tracking and ablation studies:

**Logged Configuration** (per run):

- Training hyperparameters: `epochs`, `lr`, `batch_size`, `ema_decay`
- Loss weights: All component weights (geometric_consistency, absolute_scale, pose_geometric, gradient_loss, teacher_consistency)
- Model configuration: Task type, device, precision (FP16/BF16)

**Logged Metrics** (per step):

- **Loss Components**: All individual loss terms tracked separately
  - `total_loss`: Overall training loss
  - `geometric_consistency`: Multi-view consistency loss
  - `absolute_scale`: Absolute depth scale loss
  - `pose_geometric`: Pose reprojection error loss
  - `gradient_loss`: Depth gradient loss
  - `teacher_consistency`: Teacher-student consistency loss
- **Training State**: `step`, `epoch`, `lr` (learning rate over time)

**Ablation Study Support**:

- **Compare runs**: Filter by hyperparameters (loss weights, learning rate, etc.)
- **Track component contributions**: See how each loss component evolves
- **Hyperparameter sweeps**: Use W&B sweeps to systematically explore configurations
- **Reproducibility**: All hyperparameters logged in config for exact reproduction

**Example Ablation Workflow**:

```bash
# Run 1: Baseline (default geometric-first weights)
ylff train unified cache/preprocessed \
    --epochs 200 \
    --use-wandb \
    --wandb-project ylff-ablations \
    --wandb-name baseline-geometric-first

# Run 2: Ablation: Lower geometric consistency weight
ylff train unified cache/preprocessed \
    --epochs 200 \
    --use-wandb \
    --wandb-project ylff-ablations \
    --wandb-name ablation-lower-geo-weight \
    --loss-weight-geometric-consistency 1.0  # vs default 3.0

# Run 3: Ablation: No teacher-student consistency
ylff train unified cache/preprocessed \
    --epochs 200 \
    --use-wandb \
    --wandb-project ylff-ablations \
    --wandb-name ablation-no-teacher \
    --loss-weight-teacher-consistency 0.0  # Disable teacher loss

# Compare in W&B dashboard:
# - Filter by project: "ylff-ablations"
# - Compare loss curves across runs
# - Analyze which loss components matter most
```

**W&B Dashboard Features**:

- **Parallel coordinates plot**: Visualize hyperparameter relationships
- **Loss curves**: Compare training dynamics across ablations
- **Component analysis**: See contribution of each loss term
- **Best run identification**: Automatically identify best configurations

### Suggested Ablation Studies

Based on YLFF's architecture, here are key ablation experiments to validate our design choices:

#### 1. Loss Weight Ablations (Geometric Consistency First)

**Question**: How critical is treating geometric consistency as a first-order goal?
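
For orientation, the objective these ablations vary can be read as a weighted sum of the named loss components. The sketch below assumes exactly that and nothing more: `combine_losses` is a hypothetical helper, not a function from `ylff/services/ylff_training.py`, and the per-component values are made up for illustration.

```python
from typing import Dict

# Default geometric-first weights, as listed in this README.
DEFAULT_LOSS_WEIGHTS: Dict[str, float] = {
    "geometric_consistency": 3.0,
    "absolute_scale": 2.5,
    "pose_geometric": 2.0,
    "gradient_loss": 1.0,
    "teacher_consistency": 0.5,
}


def combine_losses(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Hypothetical helper: weighted sum of loss components.

    A weight of 0.0 (or a missing weight) removes that component from the total.
    """
    return sum(weights.get(name, 0.0) * value for name, value in components.items())


# Illustrative (made-up) per-component loss values for a single training step.
example = {
    "geometric_consistency": 0.2,
    "absolute_scale": 0.1,
    "pose_geometric": 0.05,
    "gradient_loss": 0.3,
    "teacher_consistency": 0.4,
}

total = combine_losses(example, DEFAULT_LOSS_WEIGHTS)

# Ablation-style variant: disable geometric consistency by zeroing its weight.
no_geo_weights = dict(DEFAULT_LOSS_WEIGHTS, geometric_consistency=0.0)
total_no_geo = combine_losses(example, no_geo_weights)

print(f"default weights: {total:.3f}, without geometric consistency: {total_no_geo:.3f}")
```

Under this reading, setting a weight to 0.0 is equivalent to removing the term, which is why the ablations below express "disabled" components as zero weights.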
```python
from ylff.services.ylff_training import train_ylff
from ylff.services.preprocessed_dataset import PreprocessedARKitDataset

# Baseline: Geometric-first (default)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
    loss_weights={
        'geometric_consistency': 3.0,  # PRIMARY GOAL
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.5,
    },
)

# Ablation 1: Equal weights (traditional approach)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
    loss_weights={
        'geometric_consistency': 1.0,  # Equal weight
        'absolute_scale': 1.0,
        'pose_geometric': 1.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.5,
    },
)

# Ablation 2: Perceptual-first (reverse priority)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
    loss_weights={
        'geometric_consistency': 0.5,  # Lower priority
        'absolute_scale': 0.5,
        'pose_geometric': 0.5,
        'gradient_loss': 3.0,  # Emphasize smoothness
        'teacher_consistency': 0.5,
    },
)

# Ablation 3: Remove geometric consistency entirely
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
    loss_weights={
        'geometric_consistency': 0.0,  # Disabled
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.5,
    },
)
```

**Metrics to Compare**:

- Final geometric consistency loss
- BA agreement (reprojection error)
- Absolute scale accuracy (vs LiDAR)
- Multi-view reconstruction quality

#### 2. Teacher-Student Ablation

**Question**: Does EMA teacher provide training stability and better convergence?
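
As background for the `ema_decay` values varied below, the standard exponential-moving-average update is `teacher = decay * teacher + (1 - decay) * student`. The sketch shows that rule on plain floats; it is a generic illustration, not the update code from `ylff_training.py`, and `ema_update` and `ema_horizon_steps` are hypothetical names.

```python
from typing import Dict


def ema_update(teacher: Dict[str, float], student: Dict[str, float], decay: float) -> Dict[str, float]:
    """Standard EMA rule applied per parameter: decay * teacher + (1 - decay) * student."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name] for name in teacher}


def ema_horizon_steps(decay: float) -> float:
    """Rough number of recent steps the teacher averages over: 1 / (1 - decay)."""
    return 1.0 / (1.0 - decay)


teacher = {"w": 1.0}
student = {"w": 0.0}

for decay in (0.99, 0.999, 0.9999):
    updated = ema_update(teacher, student, decay)
    print(f"decay={decay}: teacher w after one step = {updated['w']}, "
          f"horizon ~ {round(ema_horizon_steps(decay))} steps")
```

Under this standard formula, `decay=0.0` makes the teacher an exact copy of the student after every step instead of freezing it. Whether `train_ylff` treats `ema_decay=0.0` as "teacher disabled" is not confirmed here, which is one reason the no-teacher ablation below also sets the `teacher_consistency` weight to 0.0.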
```python
# Baseline: With EMA teacher (default ema_decay=0.999)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    ema_decay=0.999,
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 1: No teacher-student (ema_decay=0.0)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    ema_decay=0.0,  # No EMA updates
    loss_weights={
        'geometric_consistency': 3.0,
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.0,  # Disable teacher loss
    },
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 2: Faster teacher updates (ema_decay=0.99)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    ema_decay=0.99,  # Faster updates
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 3: Slower teacher updates (ema_decay=0.9999)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    ema_decay=0.9999,  # Slower updates
    use_wandb=True,
    wandb_project="ylff-ablations",
)
```

**Metrics to Compare**:

- Training stability (loss variance)
- Convergence speed
- Final model quality
- Teacher-student consistency loss

#### 3. Oracle Source Ablation (BA vs ARKit)

**Question**: How much does BA refinement improve over ARKit poses?
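
The branch this ablation manipulates is the quality check from the pipeline diagram: ARKit poses are used when the quality score is at least 0.8, and BA runs otherwise. A minimal sketch of that decision follows, assuming a scalar quality score in [0, 1]; `select_pose_source` is a hypothetical function, and its parameters only mirror the `--prefer-arkit-poses` and `--min-arkit-quality` flags by name.

```python
def select_pose_source(
    arkit_quality: float,
    min_arkit_quality: float = 0.8,
    prefer_arkit_poses: bool = True,
) -> str:
    """Hypothetical decision rule: return "arkit" or "ba" for one sequence.

    Assumes ARKit poses are accepted when the score is >= the threshold,
    matching the ">= 0.8 / < 0.8" split in the pipeline diagram.
    """
    if prefer_arkit_poses and arkit_quality >= min_arkit_quality:
        return "arkit"
    return "ba"


for quality in (0.95, 0.80, 0.50):
    print(quality, "->", select_pose_source(quality))
```

With a `>=` comparison, a threshold of 0.0 always selects ARKit, while a threshold of 1.0 still selects ARKit for a sequence that scores exactly 1.0. If the "always BA" ablation must never fall back to ARKit, it is worth checking how the actual preprocessing code compares against the threshold.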
```bash
# Baseline: Use BA when ARKit quality < 0.8 (default)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed-ba \
    --prefer-arkit-poses --min-arkit-quality 0.8
ylff train unified cache/preprocessed-ba \
    --use-wandb --wandb-project ylff-ablations

# Ablation 1: Always use ARKit (no BA, faster preprocessing)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed-arkit-only \
    --prefer-arkit-poses --min-arkit-quality 0.0
ylff train unified cache/preprocessed-arkit-only \
    --use-wandb --wandb-project ylff-ablations

# Ablation 2: Always use BA (expensive but highest quality)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed-ba-always \
    --prefer-arkit-poses --min-arkit-quality 1.0  # Never use ARKit
ylff train unified cache/preprocessed-ba-always \
    --use-wandb --wandb-project ylff-ablations
```

**Metrics to Compare**:

- Pose accuracy (reprojection error)
- Training data quality (confidence scores)
- Final model performance
- Preprocessing time cost

#### 4. Uncertainty Weighting Ablation

**Question**: Does confidence-weighted loss improve training vs uniform weighting?

```bash
# Baseline: With uncertainty weighting (default)
# Uses depth_confidence and pose_confidence from preprocessing

# Ablation: Uniform weighting (ignore uncertainty)
# Modify preprocessing to set all confidence = 1.0
# Or modify loss computation to ignore confidence maps
```

**Metrics to Compare**:

- Loss on high-confidence vs low-confidence regions
- Model performance on uncertain scenes
- Training stability

#### 5. Multi-View Consistency Ablation

**Question**: How many views are needed for effective geometric consistency?
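
To make "geometric consistency" concrete for a single pair of views: back-project view A's depth to 3D, move the points into view B's camera frame, project them into B, and compare the resulting depth with B's own depth map. The NumPy sketch below illustrates that idea only. It is not the loss in `ylff/utils/geometric_losses.py`; `cross_view_depth_error` is a hypothetical function that assumes shared pinhole intrinsics `K`, a 4x4 rigid transform from A's frame to B's frame, nearest-pixel lookup, and no occlusion handling.

```python
import numpy as np


def cross_view_depth_error(
    depth_a: np.ndarray,      # (H, W) depth map of view A
    depth_b: np.ndarray,      # (H, W) depth map of view B
    K: np.ndarray,            # (3, 3) pinhole intrinsics, assumed shared by both views
    T_b_from_a: np.ndarray,   # (4, 4) rigid transform from A's camera frame to B's
) -> float:
    """Mean absolute depth disagreement after warping view A into view B."""
    h, w = depth_a.shape
    v, u = np.mgrid[0:h, 0:w]
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project every pixel of A to a 3D point in A's camera frame.
    rays = pixels @ np.linalg.inv(K).T
    points_a = rays * depth_a.reshape(-1, 1)

    # Move the points into B's camera frame.
    points_b = points_a @ T_b_from_a[:3, :3].T + T_b_from_a[:3, 3]
    z_b = points_b[:, 2]

    # Keep points in front of B and project them to pixel coordinates.
    front = z_b > 1e-6
    projected = points_b[front] @ K.T
    u_b = np.round(projected[:, 0] / projected[:, 2]).astype(int)
    v_b = np.round(projected[:, 1] / projected[:, 2]).astype(int)

    # Discard projections that fall outside B's image.
    inside = (u_b >= 0) & (u_b < w) & (v_b >= 0) & (v_b < h)
    if not inside.any():
        return float("nan")

    observed = depth_b[v_b[inside], u_b[inside]]
    return float(np.mean(np.abs(observed - z_b[front][inside])))


# Toy check: a fronto-parallel plane at 2 m seen from two identical cameras.
K = np.array([[50.0, 0.0, 4.0], [0.0, 50.0, 3.0], [0.0, 0.0, 1.0]])
plane = np.full((6, 8), 2.0)
consistent = cross_view_depth_error(plane, plane, K, np.eye(4))
inconsistent = cross_view_depth_error(plane, plane * 1.1, K, np.eye(4))
print(f"consistent pair: {consistent:.3f}, 10% scale mismatch: {inconsistent:.3f}")
```

A per-pair error like this needs at least two views, which is why the single-view ablation below sets the `geometric_consistency` weight to 0.0. With N views the number of pairs grows quadratically, which is one plausible source of the "more views = slower" trade-off.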
```python
# Baseline: Variable views (2-18, default from dataset)
train_ylff(
    model=model,
    dataset=dataset,  # Uses all available views
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 1: Single view only (disable geometric consistency)
train_ylff(
    model=model,
    dataset=single_view_dataset,  # Modified dataset with 1 view
    epochs=200,
    loss_weights={
        'geometric_consistency': 0.0,  # Disabled (needs 2+ views)
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 1.0,
        'teacher_consistency': 0.5,
    },
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 2-4: Fixed N views
# Modify dataset to sample exactly N views per sequence
# Compare: 2 views, 5 views, 10 views, 18 views
```

**Metrics to Compare**:

- Geometric consistency loss
- Multi-view reconstruction accuracy
- Training efficiency (more views = slower)

#### 6. DA3 Techniques Ablation

**Question**: Which DA3 techniques contribute most?

```python
# Baseline: All DA3 techniques enabled
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 1: No gradient loss (DA3 edge preservation)
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    loss_weights={
        'geometric_consistency': 3.0,
        'absolute_scale': 2.5,
        'pose_geometric': 2.0,
        'gradient_loss': 0.0,  # Disabled
        'teacher_consistency': 0.5,
    },
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Ablation 2: No depth-ray representation
# Use model that outputs separate depth + poses instead of depth-ray
# (Requires different model architecture)

# Ablation 3: Fixed resolution (no multi-resolution training)
# Modify dataset to use fixed resolution instead of variable
```

**Metrics to Compare**:

- Depth edge quality (gradient loss ablation)
- Training efficiency (multi-resolution ablation)
- Model generalization

#### 7. Preprocessing Phase Ablation

**Question**: How much does the two-phase pipeline improve training efficiency?
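
The efficiency gain in question comes from a compute-once, load-many pattern: the expensive oracle computation runs once per sequence and later epochs only read it back from disk. A minimal sketch of that pattern with NumPy `.npz` files follows. The file name `oracle_targets.npz` comes from the pipeline diagram, but the per-sequence directory layout, the `poses` key, and the helper `load_or_compute_targets` are assumptions made for illustration, not the YLFF cache format.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Tuple

import numpy as np


def load_or_compute_targets(
    cache_dir: Path,
    sequence_id: str,
    compute_fn: Callable[[str], Dict[str, np.ndarray]],
) -> Tuple[Dict[str, np.ndarray], bool]:
    """Return (targets, cache_hit). Runs compute_fn only when no cache file exists."""
    # Assumed layout: <cache_dir>/<sequence_id>/oracle_targets.npz
    path = Path(cache_dir) / sequence_id / "oracle_targets.npz"
    if path.exists():
        with np.load(path) as data:
            return {key: data[key] for key in data.files}, True

    targets = compute_fn(sequence_id)  # the expensive step, e.g. BA
    path.parent.mkdir(parents=True, exist_ok=True)
    np.savez(path, **targets)
    return targets, False


calls = []


def fake_ba(sequence_id: str) -> Dict[str, np.ndarray]:
    """Stand-in for the expensive oracle computation; returns three identity poses."""
    calls.append(sequence_id)
    return {"poses": np.tile(np.eye(4), (3, 1, 1))}


with tempfile.TemporaryDirectory() as tmp:
    first, hit_first = load_or_compute_targets(Path(tmp), "seq_000", fake_ba)
    second, hit_second = load_or_compute_targets(Path(tmp), "seq_000", fake_ba)

print(f"first call cache hit: {hit_first}, second call cache hit: {hit_second}, "
      f"expensive calls: {len(calls)}")
```

For a live-BA comparison, the same function with caching bypassed would pay the `compute_fn` cost on every access, which is the cost this ablation is meant to quantify.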
```bash
# Baseline: With preprocessing (fast training)
ylff preprocess arkit data/arkit_sequences --output-cache cache/preprocessed
ylff train unified cache/preprocessed \
    --use-wandb --wandb-project ylff-ablations \
    --wandb-name baseline-with-preprocessing

# Ablation: Live BA during training (slow but no preprocessing)
# This would require modifying training to compute BA on-the-fly
# Compare: Training time per epoch, total training time
```

**Metrics to Compare**:

- Training time per epoch
- Total training time
- Model quality (should be similar, preprocessing is just optimization)

#### 8. Loss Component Contribution Analysis

**Question**: Which loss component contributes most to final model quality?

Run systematic sweeps using W&B sweeps or Python script:

```yaml
# sweep_config.yaml
program: train_ablation_sweep.py
method: grid
parameters:
  loss_weight_geometric_consistency:
    values: [0.0, 1.0, 2.0, 3.0, 4.0]
  loss_weight_absolute_scale:
    values: [0.0, 1.0, 2.0, 2.5, 3.0]
  loss_weight_pose_geometric:
    values: [0.0, 1.0, 2.0, 3.0]
  loss_weight_gradient_loss:
    values: [0.0, 0.5, 1.0, 1.5]
  loss_weight_teacher_consistency:
    values: [0.0, 0.25, 0.5, 0.75, 1.0]
```

```python
# train_ablation_sweep.py
import wandb
from ylff.services.ylff_training import train_ylff

wandb.init()
config = wandb.config

# NOTE: `model` and `dataset` must be constructed first (see Quick Start).
train_ylff(
    model=model,
    dataset=dataset,
    epochs=200,
    loss_weights={
        'geometric_consistency': config.loss_weight_geometric_consistency,
        'absolute_scale': config.loss_weight_absolute_scale,
        'pose_geometric': config.loss_weight_pose_geometric,
        'gradient_loss': config.loss_weight_gradient_loss,
        'teacher_consistency': config.loss_weight_teacher_consistency,
    },
    use_wandb=True,
    wandb_project="ylff-ablations",
)

# Run: wandb sweep sweep_config.yaml
```

**Analysis**:

- Use W&B parallel coordinates plot to find optimal weight combinations
- Identify which components are essential vs optional
- Find Pareto frontier (best quality for given training time)

#### Recommended Ablation Order

1. **Start with Loss Weight Ablations** (#1) - Most fundamental to our approach
2. **Teacher-Student Ablation** (#2) - Validates DINOv2 adaptation
3. **Oracle Source Ablation** (#3) - Validates preprocessing strategy
4. **Component Contribution** (#8) - Systematic analysis
5. **DA3 Techniques** (#6) - Validates DA3 integration
6. **Multi-View Consistency** (#5) - Optimizes training efficiency
7. **Uncertainty Weighting** (#4) - Fine-tuning
8. **Preprocessing Phase** (#7) - Efficiency validation

Each ablation should be run with:

- Same random seed (for reproducibility)
- Same dataset split
- Same number of epochs
- W&B tracking enabled for easy comparison

## Training Datasets

Depth Anything 3 (DA3) was trained exclusively on **public academic datasets**. The following table documents all datasets used in DA3 training, their sources, and availability status for YLFF:

| Dataset | # Scenes | Data Type | Source / URL | YLFF Status | Notes |
| ------- | -------- | --------- | ------------ | ----------- | ----- |
| **Synthetic Datasets** | | | | | |
| AriaDigitalTwin | 237 | Synthetic | [Aria Digital Twin](https://github.com/facebookresearch/AriaDigitalTwin) | ❌ Not Available | Meta's AR dataset |
| AriaSyntheticENV | 99,950 | Synthetic | [Aria Synthetic](https://github.com/facebookresearch/AriaDigitalTwin) | ❌ Not Available | Large-scale synthetic AR |
| HyperSim | 344 | Synthetic | [HyperSim](https://github.com/apple/ml-hypersim) | ❌ Not Available | Apple's photorealistic dataset |
| MegaSynth | 6,049 | Synthetic | Unknown | ❓ To Verify | Synthetic multi-view |
| MvsSynth | 121 | Synthetic | Unknown | ❓ To Verify | Multi-view stereo synthetic |
| Objaverse | 505,557 | Synthetic | [Objaverse](https://objaverse.allenai.org/) | ❓ To Verify | Large-scale 3D objects |
| Omniobject | 5,885 | Synthetic | [OmniObject3D](https://omniobject3d.github.io/) | ❓ To Verify | Object-centric dataset |
| OmniWorld | 1,039 | Synthetic | [OmniWorld](https://arxiv.org/abs/2509.12201) | ❓ To Verify | Multi-domain dataset |
| PointOdyssey | 44 | Synthetic | [PointOdyssey](https://pointodyssey.com/) | ❓ To Verify | Long-term point tracking |
| ReplicaVMAP | 17 | Synthetic | [Replica](https://github.com/facebookresearch/Replica-Dataset) | ❓ To Verify | Indoor scene dataset |
| ScenenetRGBD | 16,866 | Synthetic | [SceneNet RGB-D](https://robotvault.bitbucket.io/scenenet-rgbd.html) | ❓ To Verify | Indoor RGB-D scenes |
| TartanAir | 355 | Synthetic | [TartanAir](https://theairlab.org/tartanair-dataset/) | ❓ To Verify | Large-scale simulation |
| Trellis | 557,408 | Synthetic | Unknown | ❓ To Verify | Large-scale synthetic |
| vKitti2 | 50 | Synthetic | [vKITTI2](https://europe.naverlabs.com/research/computer-vision/proxy-virtual-worlds-vkitti-2/) | ❓ To Verify | Virtual KITTI |
| **Real-World Datasets (LiDAR)** | | | | | |
| ARKitScenes | 4,388 | LiDAR | [ARKitScenes](https://github.com/apple/ARKitScenes) | ✅ **Available** | **Primary dataset for YLFF** |
| ScanNet++ | 230 | LiDAR | [ScanNet++](https://github.com/ScanNet/ScanNetPlusPlus) | ❓ To Verify | High-fidelity indoor |
| WildRGBD | 23,050 | LiDAR | [WildRGBD](https://wildrgbd.github.io/) | ❓ To Verify | Large-scale RGB-D |
| **Real-World Datasets (COLMAP/SfM)** | | | | | |
| BlendedMVS | 503 | 3D Recon | [BlendedMVS](https://github.com/YoYo000/BlendedMVS) | ❓ To Verify | Multi-view stereo |
| Co3dv2 | 30,616 | COLMAP | [Common Objects in 3D](https://github.com/facebookresearch/co3d) | ❓ To Verify | Object-centric |
| DL3DV | 6,379 | COLMAP | [DL3DV-10K](https://github.com/OpenGVLab/DL3DV) | ❓ To Verify | Large-scale 3D vision |
| MapFree | 921 | COLMAP | [Map-free Visual Relocalization](https://github.com/nianticlabs/map-free-reloc) | ❓ To Verify | Visual relocalization |
| MegaDepth | 268 | COLMAP | [MegaDepth](https://www.cs.cornell.edu/projects/megadepth/) | ❓ To Verify | Internet photos |

**Legend:**

- ✅ **Available**: Dataset is accessible and can be used for YLFF training
- ❌ **Not Available**: Dataset is not accessible (proprietary, requires special access, etc.)
- ❓ **To Verify**: Dataset availability needs to be confirmed

### Dataset Statistics

**Total Training Data** (summed from the scene counts in the table above):

- **Synthetic**: ~1,193,922 scenes (majority from Objaverse and Trellis)
- **Real-World LiDAR**: ~27,668 scenes (ARKitScenes, ScanNet++, WildRGBD)
- **Real-World COLMAP**: ~38,687 scenes (BlendedMVS, Co3dv2, DL3DV, MapFree, MegaDepth)
- **Total**: ~1,260,277 scenes

**Data Type Distribution:**

- **Synthetic**: 94.7% (provides high-quality dense depth)
- **LiDAR**: 2.2% (provides metric accuracy)
- **COLMAP/SfM**: 3.1% (provides multi-view geometry)

### YLFF Dataset Strategy

YLFF currently focuses on **ARKitScenes** as the primary training dataset because:

1. ✅ **Available**: Publicly accessible dataset
2. ✅ **High Quality**: LiDAR depth provides metric accuracy
3. ✅ **Real-World**: Captures real indoor scenes with natural variations
4. ✅ **Rich Metadata**: Includes poses, intrinsics, and LiDAR depth
5. ✅ **Large Scale**: 4,388 scenes provide substantial training data

**Future Dataset Integration:**

- Priority: ScanNet++, WildRGBD (LiDAR datasets for metric accuracy)
- Secondary: DL3DV, Co3dv2 (COLMAP datasets for multi-view geometry)
- Synthetic: Consider for teacher model training (if accessible)

### Dataset Access Notes

- **ARKitScenes**: Download from [official repository](https://github.com/apple/ARKitScenes)
- **ScanNet++**: Requires registration and approval
- **COLMAP datasets**: Most are publicly available but may require preprocessing
- **Synthetic datasets**: Many require special access or are proprietary

For detailed dataset preparation and preprocessing instructions, see `docs/DATASET_PREPARATION.md` (to be created).
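
The per-category totals and percentage shares under Dataset Statistics can be re-derived directly from the scene counts in the dataset table. The short script below does only that arithmetic; the counts are copied from the table, the category grouping follows the table's section rows, and nothing here is part of the YLFF codebase.

```python
# Scene counts copied from the Training Datasets table, grouped by its section rows.
SCENES = {
    "Synthetic": {
        "AriaDigitalTwin": 237, "AriaSyntheticENV": 99_950, "HyperSim": 344,
        "MegaSynth": 6_049, "MvsSynth": 121, "Objaverse": 505_557,
        "Omniobject": 5_885, "OmniWorld": 1_039, "PointOdyssey": 44,
        "ReplicaVMAP": 17, "ScenenetRGBD": 16_866, "TartanAir": 355,
        "Trellis": 557_408, "vKitti2": 50,
    },
    "LiDAR": {"ARKitScenes": 4_388, "ScanNet++": 230, "WildRGBD": 23_050},
    "COLMAP/SfM": {
        "BlendedMVS": 503, "Co3dv2": 30_616, "DL3DV": 6_379,
        "MapFree": 921, "MegaDepth": 268,
    },
}

# Sum each category, then express it as a share of the grand total.
totals = {category: sum(counts.values()) for category, counts in SCENES.items()}
grand_total = sum(totals.values())
shares = {category: 100.0 * count / grand_total for category, count in totals.items()}

for category in SCENES:
    print(f"{category}: {totals[category]:,} scenes ({shares[category]:.1f}%)")
print(f"Total: {grand_total:,} scenes")
```

Keeping the counts in one place like this makes it easy to update the statistics when a dataset's availability status changes or a new dataset is added.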
### Loss Components

The training uses geometric losses as the primary objective:

1. **Multi-View Geometric Consistency** (weight: 3.0)
   - Enforces that the same 3D point projects correctly across views
   - Uses back-projection + projection across multiple views
   - **This is treated as a first-order objective, not regularization**
2. **Absolute Scale Loss** (weight: 2.5)
   - Direct supervision from LiDAR/BA depth
   - Enforces correct absolute depth values in meters
   - Critical for metric accuracy
3. **Pose Geometric Loss** (weight: 2.0)
   - Reprojection error using predicted poses
   - Enforces geometric consistency between poses and depth
   - Multi-view pose consistency is paramount
4. **Gradient Loss** (weight: 1.0)
   - Preserves sharp depth boundaries
   - Ensures smoothness in planar regions
   - DA3 technique for better depth quality
5. **Teacher-Student Consistency** (weight: 0.5)
   - L1 loss between student and teacher predictions
   - Encourages stable training
   - Prevents student from diverging

## Project Structure

```
ylff/
├── ylff/                            # Main package
│   ├── services/                    # Business logic
│   │   ├── ylff_training.py         # ⭐ Unified training service
│   │   ├── preprocessing.py         # Offline preprocessing (BA, uncertainty)
│   │   ├── preprocessed_dataset.py  # Dataset for pre-computed results
│   │   ├── ba_validator.py          # BA validation pipeline
│   │   ├── arkit_processor.py       # ARKit data processing
│   │   ├── evaluate.py              # Evaluation metrics
│   │   └── ...                      # Other services
│   │
│   ├── utils/                       # Utilities
│   │   ├── geometric_losses.py      # Geometric loss functions
│   │   ├── oracle_uncertainty.py    # Oracle uncertainty propagation
│   │   ├── oracle_losses.py         # Oracle-weighted losses
│   │   └── ...                      # Other utilities
│   │
│   ├── routers/                     # FastAPI route handlers
│   ├── models/                      # Pydantic API models
│   └── cli.py                       # Command-line interface
│
├── configs/                         # Configuration files
│   ├── dinov2_train_config.yaml     # Training configuration
│   └── ba_config.yaml               # BA pipeline configuration
│
├── docs/                            # Documentation
│   ├── UNIFIED_TRAINING.md          # Unified training guide
│   ├── TRAINING_PIPELINE_ARCHITECTURE.md
│   └── ...                          # Other documentation
│
└── research_docs/                   # Research documentation
    └── MODEL_ARCH.md                # Model architecture details
```

## CLI Commands

### Preprocessing

- `ylff preprocess arkit` - Pre-process ARKit sequences (offline)

### Training

- `ylff train unified` - Train using unified training service

### Validation

- `ylff validate sequence` - Validate a single sequence
- `ylff validate arkit [--gui]` - Validate ARKit data (with optional GUI)

### Evaluation

- `ylff eval ba-agreement` - Evaluate model agreement with BA

### Visualization

- `ylff visualize` - Generate static visualizations

## Complete Workflow

### Step 1: Pre-Process All Sequences

```bash
# Pre-process all ARKit sequences (one-time, can run overnight)
ylff preprocess arkit data/arkit_sequences \
    --output-cache cache/preprocessed \
    --model-name depth-anything/DA3-LARGE \
    --num-workers 8 \
    --prefer-arkit-poses \
    --use-lidar
```

This:

- Extracts ARKit data (poses, LiDAR depth) - FREE
- Runs DA3 inference (GPU, batchable)
- Runs BA only for sequences with poor ARKit tracking
- Computes oracle uncertainty
- Saves everything to cache

### Step 2: Train with Unified Service

```bash
# Train using pre-computed results (fast iteration)
ylff train unified cache/preprocessed \
    --model-name depth-anything/DA3-LARGE \
    --epochs 200 \
    --lr 2e-4 \
    --batch-size 32 \
    --checkpoint-dir checkpoints \
    --use-wandb \
    --wandb-project ylff-training
```

This:

- Loads pre-computed oracle results (fast, disk I/O)
- Runs DA3 inference (current model, GPU)
- Computes geometric losses (primary objective)
- Updates model weights with teacher-student learning

### Step 3: Evaluate

```bash
# Evaluate fine-tuned model
ylff eval ba-agreement data/test \
    --checkpoint checkpoints/best_model.pt
```

## Configuration

Configuration files are in `configs/`:

- `dinov2_train_config.yaml` - Unified training configuration
  - Optimizer settings (DINOv2 style)
  - Loss weights (geometric consistency first)
  - Teacher-student settings
  - Multi-resolution and multi-view training
- `ba_config.yaml` - BA pipeline settings

## Documentation

- **Unified Training**: `docs/UNIFIED_TRAINING.md` - Complete guide to unified training
- **Training Pipeline**: `docs/TRAINING_PIPELINE_ARCHITECTURE.md` - Two-phase pipeline architecture
- **Model Architecture**: `research_docs/MODEL_ARCH.md` - Detailed architecture and training approach
- **API Documentation**: `docs/API.md` - API reference
- **ARKit Integration**: `docs/ARKIT_INTEGRATION.md` - ARKit data processing

## Key Design Decisions

### Why Geometric Consistency First?

Traditional depth estimation models prioritize perceptual quality (how realistic the depth looks) over geometric accuracy (how accurate the absolute scale and multi-view consistency are). YLFF reverses this priority:

- **Geometric consistency** ensures that the same 3D point projects correctly across views
- **Absolute scale** ensures metric accuracy (depth in meters, not just relative)
- **Pose consistency** ensures that predicted poses align with depth predictions

This approach is essential for applications requiring accurate 3D reconstruction, SLAM, and metric depth estimation.

### Why Two-Phase Pipeline?

BA computation is expensive (5-15 minutes per sequence) and cannot run during training. The two-phase pipeline:

1. **Pre-processing** (offline): Compute BA once, cache results
2. **Training** (online): Load cached results, train fast

This enables 100-1000x faster training iteration while still using BA as supervision.

### Why Teacher-Student Learning?
DINOv2's teacher-student paradigm provides:

- **Stability**: EMA teacher prevents training instability
- **Better convergence**: Teacher provides stable targets
- **Scalability**: Works well with large-scale training

## Development

### Running Tests

```bash
# Basic smoke test
python scripts/tests/smoke_test_basic.py

# GUI test
python scripts/tests/test_gui_simple.py
```

### Code Quality

```bash
# Format code
black ylff/ scripts/

# Sort imports
isort ylff/ scripts/

# Type checking
mypy ylff/
```

## Dependencies

### Core Dependencies

- PyTorch >= 2.0
- NumPy < 2.0
- OpenCV
- pycolmap >= 0.4.0
- Typer (for CLI)

### Optional Dependencies

- **GUI**: Plotly (for interactive 3D plots)
- **BA Pipeline**: hloc, LightGlue (installed from source)
- **Training**: Weights & Biases (for experiment tracking)

See `pyproject.toml` for complete dependency list.

## License

Apache-2.0

## Citation

If you use YLFF in your research, please cite:

```bibtex
@software{ylff2024,
  title={You Learn From Failure: Geometric Consistency First Training for Visual Geometry},
  author={YLFF Contributors},
  year={2024},
  url={https://github.com/your-org/ylff}
}
```

## References

- **DINOv2**: https://github.com/facebookresearch/dinov2
- **DA3 Paper**: Depth Anything 3 (arXiv:2511.10647)
- **Unified Training**: `ylff/services/ylff_training.py`
- **Model Architecture**: `research_docs/MODEL_ARCH.md`