| | --- |
| | title: YLFF Training |
| | emoji: 🚀 |
| | colorFrom: blue |
| | colorTo: purple |
| | sdk: docker |
| | app_port: 7860 |
| | --- |
| | |
| | # You Learn From Failure (YLFF) |
| |
|
| | **Geometric Consistency First: Training Visual Geometry Models with BA Supervision** |
| |
|
| | ## Overview |
| |
|
| | YLFF is a unified framework for training geometrically accurate depth estimation models using Bundle Adjustment (BA) and LiDAR as oracle teachers. Unlike traditional approaches that prioritize perceptual quality, YLFF treats **geometric consistency as a first-order goal**. |
| |
|
| | ### Core Philosophy |
| |
|
| | **Geometric Accuracy > Perceptual Quality** |
| |
|
| | - Multi-view geometric consistency is the **primary objective** (not just regularization) |
| | - Absolute scale accuracy is **critical** for metric depth estimation |
| | - Multi-view pose consistency is **essential** for 3D reconstruction |
| | - Teacher-student learning provides **stability** during training |
| |
|
| | ## End-to-End Pipeline |
| |
|
| | The complete YLFF pipeline from data collection to trained model: |
| |
|
| | ```mermaid |
| | flowchart TD |
| | Start([Start: Data Collection]) --> Upload[Upload ARKit Sequences] |
| | Upload --> Extract[Extract ARKit Data<br/>Poses, LiDAR, Intrinsics] |
| | |
| | Extract --> Preprocess{Pre-Processing Phase<br/>Offline, Expensive} |
| | |
| | Preprocess --> DA3Infer[Run DA3 Inference<br/>Initial Predictions] |
| | DA3Infer --> QualityCheck{ARKit Quality<br/>Check} |
| | |
| | QualityCheck -->|High Quality<br/>≥ 0.8| UseARKit[Use ARKit Poses<br/>Skip BA] |
| | QualityCheck -->|Low Quality<br/>< 0.8| RunBA[Run BA Validation<br/>Refine Poses] |
| | |
| | UseARKit --> OracleUncertainty[Compute Oracle Uncertainty<br/>Confidence Maps] |
| | RunBA --> OracleUncertainty |
| | |
| | OracleUncertainty --> SelectTargets[Select Oracle Targets<br/>BA or ARKit Poses] |
| | SelectTargets --> Cache[Save to Cache<br/>oracle_targets.npz<br/>uncertainty_results.npz] |
| | |
| | Cache --> TrainingPhase{Training Phase<br/>Online, Fast} |
| | |
| | TrainingPhase --> LoadCache[Load Pre-Computed<br/>Oracle Results] |
| | LoadCache --> LoadModel[Load/Resume Model<br/>Student + Teacher] |
| | |
| | LoadModel --> TrainingLoop[Training Loop] |
| | |
| | TrainingLoop --> Forward[Forward Pass<br/>Student Model Inference] |
| | Forward --> ComputeLoss[Compute Geometric Losses<br/>Multi-view: 3.0<br/>Absolute Scale: 2.5<br/>Pose: 2.0<br/>Gradient: 1.0<br/>Teacher: 0.5] |
| | |
| | ComputeLoss --> Backward[Backward Pass<br/>Gradient Computation] |
| | Backward --> ClipGrad[Gradient Clipping<br/>Max Norm: 1.0] |
| | ClipGrad --> Update[Update Weights<br/>AdamW Optimizer] |
| | |
| | Update --> UpdateTeacher[Update Teacher Model<br/>EMA Decay: 0.999] |
| | UpdateTeacher --> Scheduler[Update Learning Rate<br/>Cosine Annealing] |
| | |
| | Scheduler --> Checkpoint{Checkpoint<br/>Interval?} |
| | |
| | Checkpoint -->|Every N Steps| SaveCheckpoint[Save Checkpoint<br/>Periodic + Best + Latest] |
| | Checkpoint -->|Continue| LogMetrics[Log Metrics<br/>W&B / Console] |
| | |
| | SaveCheckpoint --> LogMetrics |
| | LogMetrics --> EpochComplete{Epoch<br/>Complete?} |
| | |
| | EpochComplete -->|No| TrainingLoop |
| | EpochComplete -->|Yes| MoreEpochs{More<br/>Epochs?} |
| | |
| | MoreEpochs -->|Yes| TrainingLoop |
| | MoreEpochs -->|No| SaveFinal[Save Final Checkpoint<br/>Final Model State] |
| | |
| | SaveFinal --> Evaluate[Evaluate Model<br/>BA Agreement] |
| | Evaluate --> Results[Training Results<br/>Metrics & Checkpoints] |
| | |
| | Results --> Resume{Resume<br/>Training?} |
| | Resume -->|Yes| LoadCheckpoint[Load Checkpoint<br/>latest_checkpoint.pt] |
| | LoadCheckpoint --> LoadModel |
| | Resume -->|No| End([End: Trained Model]) |
| | |
| | style Preprocess fill:#e1f5ff |
| | style TrainingPhase fill:#fff4e1 |
| | style ComputeLoss fill:#ffe1f5 |
| | style SaveCheckpoint fill:#e1ffe1 |
| | style Evaluate fill:#f5e1ff |
| | ``` |
| |
|
| | ### Pipeline Stages |
| |
|
| | #### 1. Data Collection & Upload |
| |
|
| | - **Input**: ARKit sequences (video + metadata.json) |
| | - **Extract**: Poses, LiDAR depth, camera intrinsics |
| | - **Output**: Structured ARKit data |
| |
|
| | #### 2. Pre-Processing Phase (Offline) |
| |
|
| | - **DA3 Inference**: Initial depth/pose predictions (GPU) |
| | - **Quality Check**: Evaluate ARKit tracking quality |
| | - **BA Validation**: Run only if ARKit quality < threshold (CPU, expensive) |
| | - **Oracle Uncertainty**: Compute confidence maps from multiple sources |
| | - **Cache Results**: Save oracle targets and uncertainty to disk |
| | - **Time**: ~10-20 min per sequence (one-time cost) |
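
The quality gate in this phase can be sketched as follows. This is a minimal illustration, assuming a scalar ARKit quality score in [0, 1] and the 0.8 threshold shown in the pipeline diagram; `select_oracle_source` and its argument names are hypothetical and not part of the YLFF API.

```python
def select_oracle_source(arkit_quality: float, min_arkit_quality: float = 0.8) -> str:
    """Return which pose source would serve as the oracle target.

    Hypothetical helper: mirrors the diagram's rule that sequences at or
    above the threshold keep their ARKit poses and skip BA.
    """
    if not 0.0 <= arkit_quality <= 1.0:
        raise ValueError("arkit_quality is assumed to lie in [0, 1]")
    return "arkit" if arkit_quality >= min_arkit_quality else "ba"


if __name__ == "__main__":
    for quality in (0.95, 0.80, 0.79):
        print(quality, "->", select_oracle_source(quality))
```

Because the comparison is inclusive, a threshold of 0.0 sends every sequence down the ARKit path, while a threshold of 1.0 still keeps ARKit poses for a sequence that scores exactly 1.0.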
| |
|
| | #### 3. Training Phase (Online) |
| |
|
| | - **Load Cache**: Fast disk I/O of pre-computed results |
| | - **Model Loading**: Load or resume from checkpoint (student + teacher) |
| | - **Training Loop**: |
| |   - Forward pass through student model |
| |   - Compute geometric losses (primary objective) |
| |   - Backward pass with gradient clipping |
| |   - Update weights (AdamW optimizer) |
| |   - Update teacher model (EMA) |
| |   - Update learning rate (cosine scheduler) |
| | - **Checkpointing**: Save periodic, best, and latest checkpoints |
| | - **Logging**: Metrics to W&B and console |
| | - **Time**: ~1-3 sec per sequence (100-1000x faster than BA) |
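
The gradient clipping step in the loop above (max norm 1.0 in the pipeline diagram) can be illustrated in plain Python. This is a sketch of global-norm clipping on a flat list of gradient values, not YLFF's code; in PyTorch the same idea is usually applied with `torch.nn.utils.clip_grad_norm_`, and whether YLFF calls that function is an assumption.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[float], max_norm: float = 1.0) -> List[float]:
    """Rescale gradients so their combined L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the budget: leave untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
    print(clipped, math.hypot(*clipped))
```

Clipping rescales all gradients by the same factor, so the update direction is preserved and only its length is capped.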
| |
|
| | #### 4. Evaluation & Resumption |
| |
|
| | - **Evaluation**: Test model agreement with BA |
| | - **Resume**: Load checkpoint to continue training |
| | - **Final Model**: Best checkpoint saved for deployment |
| |
|
| | ## Key Features |
| |
|
| | ### 🎯 Unified Training Approach |
| |
|
| | - **Single Training Service**: `ylff/services/ylff_training.py` consolidates all training methods |
| | - **DINOv2 Backbone**: Teacher-student paradigm with EMA teacher for stable training |
| | - **DA3 Techniques**: Depth-ray representation, multi-resolution training |
| | - **Geometric Losses**: Multi-view consistency, absolute scale, pose accuracy as primary objectives |
| |
|
| | ### 📊 Two-Phase Pipeline |
| |
|
| | 1. **Pre-Processing Phase** (offline, expensive) |
| |
|
| | - Compute BA validation and oracle uncertainty |
| | - Cache results for fast training iteration |
| | - Can be parallelized across sequences |
| |
|
| | 2. **Training Phase** (online, fast) |
| | - Load pre-computed oracle results |
| | - Train with geometric losses as primary objective |
| | - 100-1000x faster than computing BA during training |
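
The cache-then-train split can be sketched with the standard library alone. The example below is an illustration under stated assumptions: `preprocess_sequence` is a hypothetical stand-in for the expensive per-sequence work, and JSON files stand in for the `.npz` cache files named in the pipeline diagram.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def preprocess_sequence(seq_id: str, cache_dir: Path) -> Path:
    """Hypothetical stand-in for the expensive step (BA, uncertainty)."""
    out = cache_dir / f"{seq_id}.json"
    if out.exists():
        return out  # already cached: skip the expensive work
    result = {"sequence": seq_id, "status": "done"}  # placeholder payload
    out.write_text(json.dumps(result))
    return out


def preprocess_all(seq_ids, cache_dir: Path, num_workers: int = 8):
    """Process sequences in parallel; results come back in input order."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(lambda s: preprocess_sequence(s, cache_dir), seq_ids))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = preprocess_all(["seq_a", "seq_b", "seq_c"], Path(tmp) / "cache")
        print([p.name for p in paths])
```

Threads keep the sketch short; for CPU-bound work such as BA, a process pool (`concurrent.futures.ProcessPoolExecutor`) would usually be the better fit.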
| |
|
| | ### 🔧 Core Components |
| |
|
| | - **BA Validation**: Validate model predictions using COLMAP Bundle Adjustment |
| | - **ARKit Integration**: Process ARKit data with ground truth poses and LiDAR depth |
| | - **Oracle Uncertainty**: Continuous confidence weighting (not binary rejection) |
| | - **Geometric Losses**: Multi-view consistency, absolute scale, pose reprojection error |
| | - **Unified Training**: Single training service with geometric consistency first |
| |
|
| | ## Installation |
| |
|
| | ### Basic Installation |
| |
|
| | ```bash |
| | # Clone repository |
| | git clone <repository-url> |
| | cd ylff |
| | |
| | # Create virtual environment |
| | python -m venv .venv |
| | source .venv/bin/activate # On Windows: .venv\Scripts\activate |
| | |
| | # Install package |
| | pip install -e . |
| | |
| | # Install optional dependencies |
| | pip install -e ".[gui]" # For GUI visualization |
| | ``` |
| |
|
| | ### BA Pipeline Setup |
| |
|
| | For BA validation, you need additional dependencies: |
| |
|
| | ```bash |
| | # Install BA pipeline dependencies |
| | bash scripts/bin/setup_ba_pipeline.sh |
| | |
| | # Or manually: |
| | pip install pycolmap |
| | # Install hloc from source (see docs/SETUP.md) |
| | # Install LightGlue from source (see docs/SETUP.md) |
| | ``` |
| |
|
| | See `docs/SETUP.md` for detailed installation instructions. |
| |
|
| | ## Quick Start |
| |
|
| | ### 1. Pre-Process ARKit Sequences |
| |
|
| | ```bash |
| | # Pre-process ARKit sequences (offline, can run overnight) |
| | ylff preprocess arkit data/arkit_sequences \ |
| | --output-cache cache/preprocessed \ |
| | --model-name depth-anything/DA3-LARGE \ |
| | --num-workers 8 \ |
| | --prefer-arkit-poses |
| | ``` |
| |
|
| | This runs BA for sequences whose ARKit quality falls below the threshold, computes oracle uncertainty for every sequence, and caches the results. |
| |
|
| | ### 2. Train with Unified Service |
| |
|
| | ```bash |
| | # Train using pre-computed results (fast iteration) |
| | ylff train unified cache/preprocessed \ |
| | --model-name depth-anything/DA3-LARGE \ |
| | --epochs 200 \ |
| | --lr 2e-4 \ |
| | --batch-size 32 \ |
| | --checkpoint-dir checkpoints \ |
| | --use-wandb |
| | ``` |
| |
|
| | Or use the Python API: |
| |
|
| | ```python |
| | from pathlib import Path |
| | |
| | from ylff.services.ylff_training import train_ylff |
| | from ylff.services.preprocessed_dataset import PreprocessedARKitDataset |
| | |
| | # Load preprocessed dataset |
| | dataset = PreprocessedARKitDataset( |
| | cache_dir="cache/preprocessed", |
| | arkit_sequences_dir="data/arkit_sequences", |
| | load_images=True, |
| | ) |
| | |
| | # Train with unified service |
| | metrics = train_ylff( |
| | model=da3_model, # a DA3 model loaded beforehand (not shown) |
| | dataset=dataset, |
| | epochs=200, |
| | lr=2e-4, |
| | batch_size=32, |
| | loss_weights={ |
| | 'geometric_consistency': 3.0, # PRIMARY GOAL |
| | 'absolute_scale': 2.5, # CRITICAL |
| | 'pose_geometric': 2.0, # ESSENTIAL |
| | }, |
| | use_wandb=True, |
| | checkpoint_dir=Path("checkpoints"), |
| | ) |
| | ``` |
| |
|
| | ### 3. Validate Sequences |
| |
|
| | ```bash |
| | # Validate a sequence of images |
| | ylff validate sequence path/to/images \ |
| | --model-name depth-anything/DA3-LARGE \ |
| | --accept-threshold 2.0 \ |
| | --reject-threshold 30.0 \ |
| | --output results.json |
| | ``` |
| |
|
| | ### 4. Evaluate Model |
| |
|
| | ```bash |
| | # Evaluate model agreement with BA |
| | ylff eval ba-agreement path/to/test/sequences \ |
| | --model-name depth-anything/DA3-LARGE \ |
| | --checkpoint checkpoints/best_model.pt \ |
| | --threshold 2.0 |
| | ``` |
| |
|
| | ## Training Approach |
| |
|
| | ### Unified Training Service |
| |
|
| | YLFF uses a **single, unified training service** (`ylff/services/ylff_training.py`) that: |
| |
|
| | 1. **Uses DINOv2's teacher-student paradigm** as the backbone |
| |
|
| | - EMA teacher provides stable targets |
| | - Layer-wise learning rate decay |
| | - Cosine scheduler with warmup |
| |
|
| | 2. **Incorporates DA3 techniques** |
| |
|
| | - Depth-ray representation (if available) |
| | - Multi-resolution training support |
| | - Scale normalization |
| |
|
| | 3. **Treats geometric consistency as first-order goal** |
| | - Multi-view geometric consistency: **weight 3.0** (PRIMARY) |
| | - Absolute scale loss: **weight 2.5** (CRITICAL) |
| | - Pose geometric loss: **weight 2.0** (ESSENTIAL) |
| | - Gradient loss: **weight 1.0** (DA3 technique) |
| | - Teacher-student consistency: **weight 0.5** (STABILITY) |
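
The weighting above amounts to a weighted sum of the five loss terms. The sketch below uses the documented weights; `total_loss` is a hypothetical helper, and the per-component values in the example are made up for illustration.

```python
LOSS_WEIGHTS = {
    "geometric_consistency": 3.0,
    "absolute_scale": 2.5,
    "pose_geometric": 2.0,
    "gradient_loss": 1.0,
    "teacher_consistency": 0.5,
}


def total_loss(components, weights=LOSS_WEIGHTS):
    """Weighted sum of per-component losses (hypothetical helper)."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing loss component(s): {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


if __name__ == "__main__":
    # Illustrative values only: every component equal to 1.0.
    example = {name: 1.0 for name in LOSS_WEIGHTS}
    print(total_loss(example))
    print(LOSS_WEIGHTS["geometric_consistency"] / sum(LOSS_WEIGHTS.values()))
```

With equal raw losses, the multi-view term contributes one third of the total (3.0 out of 9.0) and the teacher term about 6%, which is what "first-order goal" means in practice.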
| |
|
| | ### Experiment Tracking & Ablations |
| |
|
| | YLFF integrates **Weights & Biases (W&B)** for comprehensive experiment tracking and ablation studies: |
| |
|
| | **Logged Configuration** (per run): |
| |
|
| | - Training hyperparameters: `epochs`, `lr`, `batch_size`, `ema_decay` |
| | - Loss weights: All component weights (geometric_consistency, absolute_scale, pose_geometric, gradient_loss, teacher_consistency) |
| | - Model configuration: Task type, device, precision (FP16/BF16) |
| | |
| | **Logged Metrics** (per step): |
| | |
| | - **Loss Components**: All individual loss terms tracked separately |
| |   - `total_loss`: Overall training loss |
| |   - `geometric_consistency`: Multi-view consistency loss |
| |   - `absolute_scale`: Absolute depth scale loss |
| |   - `pose_geometric`: Pose reprojection error loss |
| |   - `gradient_loss`: Depth gradient loss |
| |   - `teacher_consistency`: Teacher-student consistency loss |
| | - **Training State**: `step`, `epoch`, `lr` (learning rate over time) |
| |
|
| | **Ablation Study Support**: |
| |
|
| | - **Compare runs**: Filter by hyperparameters (loss weights, learning rate, etc.) |
| | - **Track component contributions**: See how each loss component evolves |
| | - **Hyperparameter sweeps**: Use W&B sweeps to systematically explore configurations |
| | - **Reproducibility**: All hyperparameters logged in config for exact reproduction |
| |
|
| | **Example Ablation Workflow**: |
| |
|
| | ```bash |
| | # Run 1: Baseline (default geometric-first weights) |
| | ylff train unified cache/preprocessed \ |
| | --epochs 200 \ |
| | --use-wandb \ |
| | --wandb-project ylff-ablations \ |
| | --wandb-name baseline-geometric-first |
| | |
| | # Run 2: Ablation: Lower geometric consistency weight |
| | ylff train unified cache/preprocessed \ |
| | --epochs 200 \ |
| | --use-wandb \ |
| | --wandb-project ylff-ablations \ |
| | --wandb-name ablation-lower-geo-weight \ |
| | --loss-weight-geometric-consistency 1.0 # vs default 3.0 |
| | |
| | # Run 3: Ablation: No teacher-student consistency |
| | ylff train unified cache/preprocessed \ |
| | --epochs 200 \ |
| | --use-wandb \ |
| | --wandb-project ylff-ablations \ |
| | --wandb-name ablation-no-teacher \ |
| | --loss-weight-teacher-consistency 0.0 # Disable teacher loss |
| | |
| | # Compare in W&B dashboard: |
| | # - Filter by project: "ylff-ablations" |
| | # - Compare loss curves across runs |
| | # - Analyze which loss components matter most |
| | ``` |
| |
|
| | **W&B Dashboard Features**: |
| |
|
| | - **Parallel coordinates plot**: Visualize hyperparameter relationships |
| | - **Loss curves**: Compare training dynamics across ablations |
| | - **Component analysis**: See contribution of each loss term |
| | - **Best run identification**: Automatically identify best configurations |
| |
|
| | ### Suggested Ablation Studies |
| |
|
| | Based on YLFF's architecture, here are key ablation experiments to validate our design choices: |
| |
|
| | #### 1. Loss Weight Ablations (Geometric Consistency First) |
| |
|
| | **Question**: How critical is treating geometric consistency as a first-order goal? |
| |
|
| | ```python |
| | from ylff.services.ylff_training import train_ylff |
| | from ylff.services.preprocessed_dataset import PreprocessedARKitDataset |
| | |
| | # Baseline: Geometric-first (default) |
| | train_ylff( |
| | model=model, |
| | dataset=dataset, |
| | epochs=200, |
| | use_wandb=True, |
| | wandb_project="ylff-ablations", |
| | loss_weights={ |
| | 'geometric_consistency': 3.0, # PRIMARY GOAL |
| | 'absolute_scale': 2.5, |
| | 'pose_geometric': 2.0, |
| | 'gradient_loss': 1.0, |
| | 'teacher_consistency': 0.5, |
| | }, |
| | ) |
| | |
| | # Ablation 1: Equal weights (traditional approach) |
| | train_ylff( |
| | model=model, |
| | dataset=dataset, |
| | epochs=200, |
| | use_wandb=True, |
| | wandb_project="ylff-ablations", |
| | loss_weights={ |
| | 'geometric_consistency': 1.0, # Equal weight |
| | 'absolute_scale': 1.0, |
| | 'pose_geometric': 1.0, |
| | 'gradient_loss': 1.0, |
| | 'teacher_consistency': 0.5, |
| | }, |
| | ) |
| | |
| | # Ablation 2: Perceptual-first (reverse priority) |
| | train_ylff( |
| | model=model, |
| | dataset=dataset, |
| | epochs=200, |
| | use_wandb=True, |
| | wandb_project="ylff-ablations", |
| | loss_weights={ |
| | 'geometric_consistency': 0.5, # Lower priority |
| | 'absolute_scale': 0.5, |
| | 'pose_geometric': 0.5, |
| | 'gradient_loss': 3.0, # Emphasize smoothness |
| | 'teacher_consistency': 0.5, |
| | }, |
| | ) |
| | |
| | # Ablation 3: Remove geometric consistency entirely |
| | train_ylff( |
| | model=model, |
| | dataset=dataset, |
| | epochs=200, |
| | use_wandb=True, |
| | wandb_project="ylff-ablations", |
| | loss_weights={ |
| | 'geometric_consistency': 0.0, # Disabled |
| | 'absolute_scale': 2.5, |
| | 'pose_geometric': 2.0, |
| | 'gradient_loss': 1.0, |
| | 'teacher_consistency': 0.5, |
| | }, |
| | ) |
| | ``` |
| |
|
| | **Metrics to Compare**: |
| |
|
| | - Final geometric consistency loss |
| | - BA agreement (reprojection error) |
| | - Absolute scale accuracy (vs LiDAR) |
| | - Multi-view reconstruction quality |
| |
|
| | #### 2. Teacher-Student Ablation |
| |
|
| | **Question**: Does the EMA teacher improve training stability and convergence? |
| |
|
| | ```python |
| | # Baseline: With EMA teacher (default ema_decay=0.999) |
| | train_ylff( |
| | model=model, |
| | dataset=dataset, |
| | epochs=200, |
| | ema_decay=0.999, |
| | use_wandb=True, |
| | wandb_project="ylff-ablations", |
| | ) |
| | |
| | # Ablation 1: No teacher-student (ema_decay=0.0) |
| | train_ylff( |
| | model=model, |
| | dataset=dataset, |
| | epochs=200, |
| | ema_decay=0.0, # No EMA updates |
| | loss_weights={ |
| | 'geometric_consistency': 3.0, |
| | 'absolute_scale': 2.5, |
| | 'pose_geometric': 2.0, |
| | 'gradient_loss': 1.0, |
| | 'teacher_consistency': 0.0, # Disable teacher loss |
| | }, |
| | use_wandb=True, |
| | wandb_project="ylff-ablations", |
| | ) |
| | |
| | # Ablation 2: Faster teacher updates (ema_decay=0.99) |
| | train_ylff( |
| | model=model, |
| | dataset=dataset, |
| | epochs=200, |
| | ema_decay=0.99, # Faster updates |
| | use_wandb=True, |
| | wandb_project="ylff-ablations", |
| | ) |
| | |
| | # Ablation 3: Slower teacher updates (ema_decay=0.9999) |
| | train_ylff( |
| | model=model, |
| | dataset=dataset, |
| | epochs=200, |
| | ema_decay=0.9999, # Slower updates |
| | use_wandb=True, |
| | wandb_project="ylff-ablations", |
| | ) |
| | ``` |
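
To make the decay values above concrete, here is a minimal sketch of the standard EMA update on a dictionary of scalar parameters. It assumes the usual rule `teacher = decay * teacher + (1 - decay) * student`; YLFF's implementation may differ, for example by special-casing `ema_decay=0.0`.

```python
def ema_update(teacher, student, decay):
    """One EMA step: blend the student's parameters into the teacher."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}


def ema_horizon(decay):
    """Approximate number of steps the average spans: 1 / (1 - decay)."""
    return 1.0 / (1.0 - decay)


if __name__ == "__main__":
    teacher, student = {"w": 0.0}, {"w": 1.0}
    for decay in (0.0, 0.99, 0.999, 0.9999):
        print(decay, ema_update(teacher, student, decay)["w"])
    for decay in (0.99, 0.999, 0.9999):
        print(decay, round(ema_horizon(decay)))
```

Under this rule the three non-zero settings average over roughly 100, 1,000, and 10,000 steps respectively, while a decay of 0.0 makes the teacher a copy of the student after every step rather than freezing it.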
| |
|
| | **Metrics to Compare**: |
| |
|
| | - Training stability (loss variance) |
| | - Convergence speed |
| | - Final model quality |
| | - Teacher-student consistency loss |
| |
|
| | #### 3. Oracle Source Ablation (BA vs ARKit) |
| |
|
| | **Question**: How much does BA refinement improve over ARKit poses? |
| |
|
| | ```bash |
| | # Baseline: Use BA when ARKit quality < 0.8 (default) |
| | ylff preprocess arkit data/arkit_sequences \ |
| | --output-cache cache/preprocessed-ba \ |
| | --prefer-arkit-poses --min-arkit-quality 0.8 |
| | |
| | ylff train unified cache/preprocessed-ba \ |
| | --use-wandb --wandb-project ylff-ablations |
| | |
| | # Ablation 1: Always use ARKit (no BA, faster preprocessing) |
| | ylff preprocess arkit data/arkit_sequences \ |
| | --output-cache cache/preprocessed-arkit-only \ |
| | --prefer-arkit-poses --min-arkit-quality 0.0 |
| | |
| | ylff train unified cache/preprocessed-arkit-only \ |
| | --use-wandb --wandb-project ylff-ablations |
| | |
| | # Ablation 2: Always use BA (expensive but highest quality) |
| | ylff preprocess arkit data/arkit_sequences \ |
| | --output-cache cache/preprocessed-ba-always \ |
| | --prefer-arkit-poses --min-arkit-quality 1.0 # ARKit poses kept only at a perfect quality score |
| | |
| | ylff train unified cache/preprocessed-ba-always \ |
| | --use-wandb --wandb-project ylff-ablations |
| | ``` |
| |
|
| | **Metrics to Compare**: |
| |
|
| | - Pose accuracy (reprojection error) |
| | - Training data quality (confidence scores) |
| | - Final model performance |
| | - Preprocessing time cost |
| |
|
| | #### 4. Uncertainty Weighting Ablation |
| |
|
| | **Question**: Does confidence-weighted loss improve training vs uniform weighting? |
| |
|
| | ```bash |
| | # Baseline: With uncertainty weighting (default) |
| | # Uses depth_confidence and pose_confidence from preprocessing |
| | |
| | # Ablation: Uniform weighting (ignore uncertainty) |
| | # Modify preprocessing to set all confidence = 1.0 |
| | # Or modify loss computation to ignore confidence maps |
| | ``` |
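
The difference between the two arms of this ablation can be sketched as a confidence-weighted L1 loss with a uniform fallback. This is an illustration only: `weighted_l1` is a hypothetical helper, and normalizing by the sum of confidences is one possible choice, not necessarily the one YLFF makes.

```python
def weighted_l1(pred, target, confidence=None):
    """Mean absolute error, optionally weighted by per-element confidence.

    Passing confidence=None reproduces the uniform-weighting ablation.
    """
    if confidence is None:
        confidence = [1.0] * len(pred)
    if not (len(pred) == len(target) == len(confidence)):
        raise ValueError("pred, target and confidence must have equal length")
    total_weight = sum(confidence)
    if total_weight == 0.0:
        return 0.0  # nothing is trusted, so nothing contributes
    weighted = sum(c * abs(p - t) for p, t, c in zip(pred, target, confidence))
    return weighted / total_weight


if __name__ == "__main__":
    pred = [1.0, 2.0, 5.0]
    target = [1.0, 2.5, 2.0]       # last target is an unreliable oracle value
    confidence = [1.0, 1.0, 0.1]   # low confidence on the last element
    print("uniform :", weighted_l1(pred, target))
    print("weighted:", weighted_l1(pred, target, confidence))
```

The low-confidence element still contributes to the weighted loss, only less, which is the sense in which the weighting is continuous rather than a binary rejection.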
| |
|
| | **Metrics to Compare**: |
| |
|
| | - Loss on high-confidence vs low-confidence regions |
| | - Model performance on uncertain scenes |
| | - Training stability |
| |
|
| | #### 5. Multi-View Consistency Ablation |
| |
|
| | **Question**: How many views are needed for effective geometric consistency? |
| |
|
| | ```python |
| | # Baseline: Variable views (2-18, default from dataset) |
| | train_ylff( |
| | model=model, |
| | dataset=dataset, # Uses all available views |
| | epochs=200, |
| | use_wandb=True, |
| | wandb_project="ylff-ablations", |
| | ) |
| | |
| | # Ablation 1: Single view only (disable geometric consistency) |
| | train_ylff( |
| | model=model, |
| | dataset=single_view_dataset, # Modified dataset with 1 view |
| | epochs=200, |
| | loss_weights={ |
| | 'geometric_consistency': 0.0, # Disabled (needs 2+ views) |
| | 'absolute_scale': 2.5, |
| | 'pose_geometric': 2.0, |
| | 'gradient_loss': 1.0, |
| | 'teacher_consistency': 0.5, |
| | }, |
| | use_wandb=True, |
| | wandb_project="ylff-ablations", |
| | ) |
| | |
| | # Ablations 2-5: Fixed N views |
| | # Modify dataset to sample exactly N views per sequence |
| | # Compare: 2 views, 5 views, 10 views, 18 views |
| | ``` |
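
For the fixed-N ablations, one simple way to pick exactly N views is to take evenly spaced frame indices. The helper below is a hypothetical sketch; how YLFF's dataset class actually samples views is not specified here.

```python
import math


def sample_views(num_available: int, num_views: int):
    """Return num_views evenly spaced frame indices in [0, num_available)."""
    if not 1 <= num_views <= num_available:
        raise ValueError("num_views must be between 1 and num_available")
    if num_views == 1:
        return [0]
    step = (num_available - 1) / (num_views - 1)
    # floor(x + 0.5) rounds half up, so indices are strictly increasing.
    return [math.floor(i * step + 0.5) for i in range(num_views)]


if __name__ == "__main__":
    for n in (2, 5, 10, 18):
        print(n, sample_views(18, n))
```

Even spacing keeps the first and last frames, so the baseline between the outermost views stays the same across the 2, 5, 10, and 18 view settings.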
| |
|
| | **Metrics to Compare**: |
| |
|
| | - Geometric consistency loss |
| | - Multi-view reconstruction accuracy |
| | - Training efficiency (more views = slower) |
| |
|
| | #### 6. DA3 Techniques Ablation |
| |
|
| | **Question**: Which DA3 techniques contribute most? |
| |
|
| | ```python |
| | # Baseline: All DA3 techniques enabled |
| | train_ylff( |
| | model=model, |
| | dataset=dataset, |
| | epochs=200, |
| | use_wandb=True, |
| | wandb_project="ylff-ablations", |
| | ) |
| | |
| | # Ablation 1: No gradient loss (DA3 edge preservation) |
| | train_ylff( |
| | model=model, |
| | dataset=dataset, |
| | epochs=200, |
| | loss_weights={ |
| | 'geometric_consistency': 3.0, |
| | 'absolute_scale': 2.5, |
| | 'pose_geometric': 2.0, |
| | 'gradient_loss': 0.0, # Disabled |
| | 'teacher_consistency': 0.5, |
| | }, |
| | use_wandb=True, |
| | wandb_project="ylff-ablations", |
| | ) |
| | |
| | # Ablation 2: No depth-ray representation |
| | # Use model that outputs separate depth + poses instead of depth-ray |
| | # (Requires different model architecture) |
| | |
| | # Ablation 3: Fixed resolution (no multi-resolution training) |
| | # Modify dataset to use fixed resolution instead of variable |
| | ``` |
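
As a rough picture of what the gradient loss in Ablation 1 measures, the sketch below compares finite-difference depth gradients between a prediction and a target on small 2D lists. The exact formulation used by DA3 or YLFF (for example multi-scale variants) may differ; both function names are hypothetical.

```python
def depth_gradients(depth):
    """Horizontal and vertical finite differences of a 2D depth map."""
    h, w = len(depth), len(depth[0])
    dx = [[depth[y][x + 1] - depth[y][x] for x in range(w - 1)] for y in range(h)]
    dy = [[depth[y + 1][x] - depth[y][x] for x in range(w)] for y in range(h - 1)]
    return dx, dy


def gradient_loss(pred, target):
    """Mean absolute difference between predicted and target gradients."""
    pdx, pdy = depth_gradients(pred)
    tdx, tdy = depth_gradients(target)
    diffs = [abs(p - t) for pr, tr in zip(pdx, tdx) for p, t in zip(pr, tr)]
    diffs += [abs(p - t) for pr, tr in zip(pdy, tdy) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    sharp = [[1.0, 1.0, 3.0, 3.0], [1.0, 1.0, 3.0, 3.0]]      # crisp depth edge
    blurred = [[1.0, 1.5, 2.5, 3.0], [1.0, 1.5, 2.5, 3.0]]    # smeared edge
    shifted = [[v + 1.0 for v in row] for row in sharp]        # constant offset
    print("blurred:", gradient_loss(blurred, sharp))
    print("shifted:", gradient_loss(shifted, sharp))
```

The blurred edge is penalized, while a constant depth offset is not, which illustrates why a gradient term complements the absolute scale loss instead of replacing it.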
| |
|
| | **Metrics to Compare**: |
| |
|
| | - Depth edge quality (gradient loss ablation) |
| | - Training efficiency (multi-resolution ablation) |
| | - Model generalization |
| |
|
| | #### 7. Preprocessing Phase Ablation |
| |
|
| | **Question**: How much does the two-phase pipeline improve training efficiency? |
| |
|
| | ```bash |
| | # Baseline: With preprocessing (fast training) |
| | ylff preprocess arkit data/arkit_sequences --output-cache cache/preprocessed |
| | ylff train unified cache/preprocessed \ |
| | --use-wandb --wandb-project ylff-ablations \ |
| | --wandb-name baseline-with-preprocessing |
| | |
| | # Ablation: Live BA during training (slow but no preprocessing) |
| | # This would require modifying training to compute BA on-the-fly |
| | # Compare: Training time per epoch, total training time |
| | ``` |
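
The expected gap can be estimated from the per-sequence timings quoted in the pipeline stages (roughly 10-20 minutes for preprocessing versus 1-3 seconds when loading from cache). The arithmetic below is a back-of-the-envelope sketch using those figures, not a measurement.

```python
def speedup_range(slow_seconds, fast_seconds):
    """Return (smallest, largest) speedup for (min, max) timings in seconds."""
    slow_min, slow_max = slow_seconds
    fast_min, fast_max = fast_seconds
    return slow_min / fast_max, slow_max / fast_min


BA_SECONDS = (10 * 60, 20 * 60)   # ~10-20 min per sequence
CACHED_SECONDS = (1, 3)           # ~1-3 sec per sequence

low, high = speedup_range(BA_SECONDS, CACHED_SECONDS)

if __name__ == "__main__":
    print(f"speedup between {low:.0f}x and {high:.0f}x")
```

The result, roughly 200x to 1200x, is of the same order of magnitude as the 100-1000x figure quoted for the two-phase pipeline.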
| |
|
| | **Metrics to Compare**: |
| |
|
| | - Training time per epoch |
| | - Total training time |
| | - Model quality (should be similar; preprocessing only moves the BA cost offline) |
| |
|
| | #### 8. Loss Component Contribution Analysis |
| |
|
| | **Question**: Which loss component contributes most to final model quality? |
| |
|
| | Run systematic sweeps using W&B sweeps or a Python script: |
| |
|
| | ```yaml |
| | # sweep_config.yaml |
| | program: train_ablation_sweep.py |
| | method: grid |
| | parameters: |
| |   loss_weight_geometric_consistency: |
| |     values: [0.0, 1.0, 2.0, 3.0, 4.0] |
| |   loss_weight_absolute_scale: |
| |     values: [0.0, 1.0, 2.0, 2.5, 3.0] |
| |   loss_weight_pose_geometric: |
| |     values: [0.0, 1.0, 2.0, 3.0] |
| |   loss_weight_gradient_loss: |
| |     values: [0.0, 0.5, 1.0, 1.5] |
| |   loss_weight_teacher_consistency: |
| |     values: [0.0, 0.25, 0.5, 0.75, 1.0] |
| | ``` |
| |
|
| | ```python |
| | # train_ablation_sweep.py |
| | import wandb |
| | from ylff.services.ylff_training import train_ylff |
| | |
| | wandb.init() |
| | config = wandb.config |
| | |
| | train_ylff( |
| |     model=model, |
| |     dataset=dataset, |
| |     epochs=200, |
| |     loss_weights={ |
| |         'geometric_consistency': config.loss_weight_geometric_consistency, |
| |         'absolute_scale': config.loss_weight_absolute_scale, |
| |         'pose_geometric': config.loss_weight_pose_geometric, |
| |         'gradient_loss': config.loss_weight_gradient_loss, |
| |         'teacher_consistency': config.loss_weight_teacher_consistency, |
| |     }, |
| |     use_wandb=True, |
| |     wandb_project="ylff-ablations", |
| | ) |
| | |
| | # Run: wandb sweep sweep_config.yaml |
| | ``` |
| |
|
| | **Analysis**: |
| |
|
| | - Use W&B parallel coordinates plot to find optimal weight combinations |
| | - Identify which components are essential vs optional |
| | - Find Pareto frontier (best quality for given training time) |
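
Before launching the sweep, it is worth counting the runs it implies. The sketch below counts the full grid for the value lists in the sweep configuration and, as a cheaper alternative, enumerates a one-factor-at-a-time design around the default weights. `one_factor_at_a_time` is a hypothetical helper, not a W&B or YLFF function.

```python
from math import prod

GRID = {
    "loss_weight_geometric_consistency": [0.0, 1.0, 2.0, 3.0, 4.0],
    "loss_weight_absolute_scale": [0.0, 1.0, 2.0, 2.5, 3.0],
    "loss_weight_pose_geometric": [0.0, 1.0, 2.0, 3.0],
    "loss_weight_gradient_loss": [0.0, 0.5, 1.0, 1.5],
    "loss_weight_teacher_consistency": [0.0, 0.25, 0.5, 0.75, 1.0],
}
DEFAULTS = {
    "loss_weight_geometric_consistency": 3.0,
    "loss_weight_absolute_scale": 2.5,
    "loss_weight_pose_geometric": 2.0,
    "loss_weight_gradient_loss": 1.0,
    "loss_weight_teacher_consistency": 0.5,
}


def one_factor_at_a_time(grid, defaults):
    """Baseline config plus one config per non-default value of each weight."""
    configs = [dict(defaults)]
    for name, values in grid.items():
        for value in values:
            if value != defaults[name]:
                configs.append({**defaults, name: value})
    return configs


full_grid_runs = prod(len(values) for values in GRID.values())
ofat_configs = one_factor_at_a_time(GRID, DEFAULTS)

if __name__ == "__main__":
    print("full grid runs:", full_grid_runs)
    print("one-factor-at-a-time runs:", len(ofat_configs))
```

The full grid is 2,000 runs of 200 epochs each, while the one-factor design needs 19. W&B sweeps also offer `random` and `bayes` search methods when a full grid is out of budget.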
| |
|
| | #### Recommended Ablation Order |
| |
|
| | 1. **Start with Loss Weight Ablations** (#1) - Most fundamental to our approach |
| | 2. **Teacher-Student Ablation** (#2) - Validates DINOv2 adaptation |
| | 3. **Oracle Source Ablation** (#3) - Validates preprocessing strategy |
| | 4. **Component Contribution** (#8) - Systematic analysis |
| | 5. **DA3 Techniques** (#6) - Validates DA3 integration |
| | 6. **Multi-View Consistency** (#5) - Optimizes training efficiency |
| | 7. **Uncertainty Weighting** (#4) - Fine-tuning |
| | 8. **Preprocessing Phase** (#7) - Efficiency validation |
| |
|
| | Each ablation should be run with: |
| |
|
| | - Same random seed (for reproducibility) |
| | - Same dataset split |
| | - Same number of epochs |
| | - W&B tracking enabled for easy comparison |
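
Two of these conditions, a fixed seed and a fixed dataset split, can be sketched with the standard library. Both helpers are hypothetical; a real training script would also seed NumPy and PyTorch (for example with `numpy.random.seed` and `torch.manual_seed`), which is omitted here to keep the example dependency-free.

```python
import hashlib
import random


def assign_split(sequence_id: str, val_fraction: float = 0.1) -> str:
    """Deterministically map a sequence ID to 'train' or 'val'.

    Hashing the ID keeps the split stable across runs and machines,
    independent of file ordering or random state.
    """
    digest = hashlib.sha256(sequence_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "val" if bucket < val_fraction else "train"


def seed_everything(seed: int) -> None:
    """Seed Python's RNG (NumPy/PyTorch seeding intentionally omitted)."""
    random.seed(seed)


if __name__ == "__main__":
    seed_everything(42)
    for seq in ("seq_001", "seq_002", "seq_003"):
        print(seq, assign_split(seq))
```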
| |
|
| | ## Training Datasets |
| |
|
| | Depth Anything 3 (DA3) was trained exclusively on **public academic datasets**. The following table documents all datasets used in DA3 training, their sources, and availability status for YLFF: |
| |
|
| | | Dataset | # Scenes | Data Type | Source / URL | YLFF Status | Notes | |
| | | ------------------------------------ | -------- | --------- | ----------------------------------------------------------------------------------------------- | ---------------- | ------------------------------ | |
| | | **Synthetic Datasets** | |
| | | AriaDigitalTwin | 237 | Synthetic | [Aria Digital Twin](https://github.com/facebookresearch/AriaDigitalTwin) | ❌ Not Available | Meta's AR dataset | |
| | | AriaSyntheticENV | 99,950 | Synthetic | [Aria Synthetic](https://github.com/facebookresearch/AriaDigitalTwin) | ❌ Not Available | Large-scale synthetic AR | |
| | | HyperSim | 344 | Synthetic | [HyperSim](https://github.com/apple/ml-hypersim) | ❌ Not Available | Apple's photorealistic dataset | |
| | | MegaSynth | 6,049 | Synthetic | Unknown | ❓ To Verify | Synthetic multi-view | |
| | | MvsSynth | 121 | Synthetic | Unknown | ❓ To Verify | Multi-view stereo synthetic | |
| | | Objaverse | 505,557 | Synthetic | [Objaverse](https://objaverse.allenai.org/) | ❓ To Verify | Large-scale 3D objects | |
| | | Omniobject | 5,885 | Synthetic | [OmniObject3D](https://omniobject3d.github.io/) | ❓ To Verify | Object-centric dataset | |
| | | OmniWorld | 1,039 | Synthetic | [OmniWorld](https://arxiv.org/abs/2509.12201) | ❓ To Verify | Multi-domain dataset | |
| | | PointOdyssey | 44 | Synthetic | [PointOdyssey](https://pointodyssey.com/) | ❓ To Verify | Long-term point tracking | |
| | | ReplicaVMAP | 17 | Synthetic | [Replica](https://github.com/facebookresearch/Replica-Dataset) | ❓ To Verify | Indoor scene dataset | |
| | | ScenenetRGBD | 16,866 | Synthetic | [SceneNet RGB-D](https://robotvault.bitbucket.io/scenenet-rgbd.html) | ❓ To Verify | Indoor RGB-D scenes | |
| | | TartanAir | 355 | Synthetic | [TartanAir](https://theairlab.org/tartanair-dataset/) | ❓ To Verify | Large-scale simulation | |
| | | Trellis | 557,408 | Synthetic | Unknown | ❓ To Verify | Large-scale synthetic | |
| | | vKitti2 | 50 | Synthetic | [vKITTI2](https://europe.naverlabs.com/research/computer-vision/proxy-virtual-worlds-vkitti-2/) | ❓ To Verify | Virtual KITTI | |
| | | **Real-World Datasets (LiDAR)** | |
| | | ARKitScenes | 4,388 | LiDAR | [ARKitScenes](https://github.com/apple/ARKitScenes) | ✅ **Available** | **Primary dataset for YLFF** | |
| | | ScanNet++ | 230 | LiDAR | [ScanNet++](https://github.com/ScanNet/ScanNetPlusPlus) | ❓ To Verify | High-fidelity indoor | |
| | | WildRGBD | 23,050 | LiDAR | [WildRGBD](https://wildrgbd.github.io/) | ❓ To Verify | Large-scale RGB-D | |
| | | **Real-World Datasets (COLMAP/SfM)** | |
| | | BlendedMVS | 503 | 3D Recon | [BlendedMVS](https://github.com/YoYo000/BlendedMVS) | ❓ To Verify | Multi-view stereo | |
| | | Co3dv2 | 30,616 | COLMAP | [Common Objects in 3D](https://github.com/facebookresearch/co3d) | ❓ To Verify | Object-centric | |
| | | DL3DV | 6,379 | COLMAP | [DL3DV-10K](https://github.com/OpenGVLab/DL3DV) | ❓ To Verify | Large-scale 3D vision | |
| | | MapFree | 921 | COLMAP | [Map-free Visual Relocalization](https://github.com/nianticlabs/map-free-reloc) | ❓ To Verify | Visual relocalization | |
| | | MegaDepth | 268 | COLMAP | [MegaDepth](https://www.cs.cornell.edu/projects/megadepth/) | ❓ To Verify | Internet photos | |
| |
|
| | **Legend:** |
| |
|
| | - ✅ **Available**: Dataset is accessible and can be used for YLFF training |
| | - ❌ **Not Available**: Dataset is not accessible (proprietary, requires special access, etc.) |
| | - ❓ **To Verify**: Dataset availability needs to be confirmed |
| |
|
| | ### Dataset Statistics |
| |
|
| | **Total Training Data:** |
| |
|
| | - **Synthetic**: ~1,193,900 scenes (majority from Objaverse and Trellis) |
| | - **Real-World LiDAR**: 27,668 scenes (ARKitScenes, ScanNet++, WildRGBD) |
| | - **Real-World COLMAP**: 38,687 scenes (BlendedMVS, Co3dv2, DL3DV, MapFree, MegaDepth) |
| | - **Total**: ~1,260,300 scenes |
| |
|
| | **Data Type Distribution:** |
| |
|
| | - **Synthetic**: 94.7% (provides high-quality dense depth) |
| | - **LiDAR**: 2.2% (provides metric accuracy) |
| | - **COLMAP/SfM**: 3.1% (provides multi-view geometry) |
| |
|
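
The category totals and shares can be recomputed directly from the scene counts in the dataset table. The snippet below copies those counts verbatim and is only as accurate as the table itself.

```python
SCENES = {
    "synthetic": [237, 99950, 344, 6049, 121, 505557, 5885, 1039,
                  44, 17, 16866, 355, 557408, 50],
    "lidar": [4388, 230, 23050],
    "colmap": [503, 30616, 6379, 921, 268],
}

totals = {name: sum(counts) for name, counts in SCENES.items()}
grand_total = sum(totals.values())
shares = {name: round(100 * t / grand_total, 1) for name, t in totals.items()}

if __name__ == "__main__":
    for name in SCENES:
        print(f"{name:10s} {totals[name]:>9,d}  {shares[name]:5.1f}%")
    print(f"{'total':10s} {grand_total:>9,d}")
```

Summing the table rows gives 1,193,922 synthetic, 27,668 LiDAR, and 38,687 COLMAP scenes, for 1,260,277 in total.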
| | ### YLFF Dataset Strategy |
| |
|
| | YLFF currently focuses on **ARKitScenes** as the primary training dataset because: |
| |
|
| | 1. ✅ **Available**: Publicly accessible dataset |
| | 2. ✅ **High Quality**: LiDAR depth provides metric accuracy |
| | 3. ✅ **Real-World**: Captures real indoor scenes with natural variations |
| | 4. ✅ **Rich Metadata**: Includes poses, intrinsics, and LiDAR depth |
| | 5. ✅ **Large Scale**: 4,388 scenes provide substantial training data |
| |
|
| | **Future Dataset Integration:** |
| |
|
| | - Priority: ScanNet++, WildRGBD (LiDAR datasets for metric accuracy) |
| | - Secondary: DL3DV, Co3dv2 (COLMAP datasets for multi-view geometry) |
| | - Synthetic: Consider for teacher model training (if accessible) |
| |
|
| | ### Dataset Access Notes |
| |
|
| | - **ARKitScenes**: Download from [official repository](https://github.com/apple/ARKitScenes) |
| | - **ScanNet++**: Requires registration and approval |
| | - **COLMAP datasets**: Most are publicly available but may require preprocessing |
| | - **Synthetic datasets**: Many require special access or are proprietary |
| |
|
| | For detailed dataset preparation and preprocessing instructions, see `docs/DATASET_PREPARATION.md` (to be created). |
| |
|
| | ### Loss Components |
| |
|
| | The training uses geometric losses as the primary objective: |
| |
|
| | 1. **Multi-View Geometric Consistency** (weight: 3.0) |
| |
|
| | - Enforces that the same 3D point projects correctly across views |
| | - Uses back-projection + projection across multiple views |
| | - **This is treated as a first-order objective, not regularization** |
| |
|
| | 2. **Absolute Scale Loss** (weight: 2.5) |
| |
|
| | - Direct supervision from LiDAR/BA depth |
| | - Enforces correct absolute depth values in meters |
| | - Critical for metric accuracy |
| |
|
| | 3. **Pose Geometric Loss** (weight: 2.0) |
| |
|
| | - Reprojection error using predicted poses |
| | - Enforces geometric consistency between poses and depth |
| | - Multi-view pose consistency is paramount |
| |
|
| | 4. **Gradient Loss** (weight: 1.0) |
| |
|
| | - Preserves sharp depth boundaries |
| | - Ensures smoothness in planar regions |
| | - Technique adopted from DA3 to improve depth quality |
| |
|
| | 5. **Teacher-Student Consistency** (weight: 0.5) |
| | - L1 loss between student and teacher predictions |
| | - Encourages stable training |
| | - Prevents student from diverging |
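A minimal sketch of how the five weights above could combine into one training objective. It assumes the components are summed linearly, which the source does not state explicitly; the dictionary keys and the `total_loss` helper are illustrative names, not identifiers from `ylff/utils/geometric_losses.py`.

```python
from typing import Mapping

# Weights as listed above; the key names are illustrative, not the project's identifiers.
LOSS_WEIGHTS = {
    "multi_view_consistency": 3.0,
    "absolute_scale": 2.5,
    "pose_geometric": 2.0,
    "gradient": 1.0,
    "teacher_student": 0.5,
}


def total_loss(components: Mapping[str, float]) -> float:
    """Return the weighted sum of the per-component loss values.

    Assumes a linear combination. Raises if a component is missing so that
    a silently dropped term cannot go unnoticed.
    """
    missing = sorted(set(LOSS_WEIGHTS) - set(components))
    if missing:
        raise KeyError(f"missing loss components: {missing}")
    return sum(weight * components[name] for name, weight in LOSS_WEIGHTS.items())


if __name__ == "__main__":
    # With equal raw losses, the multi-view term contributes 6x the teacher-student term.
    raw = {name: 1.0 for name in LOSS_WEIGHTS}
    print(total_loss(raw))  # 9.0
```

Under this reading, the three geometric terms account for 7.5 of the 9.0 total weight, which matches the stated "geometric consistency first" priority.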
| |
|
| | ## Project Structure |
| |
|
| | ``` |
| | ylff/ |
| | ├── ylff/ # Main package |
| | │ ├── services/ # Business logic |
| | │ │ ├── ylff_training.py # ⭐ Unified training service |
| | │ │ ├── preprocessing.py # Offline preprocessing (BA, uncertainty) |
| | │ │ ├── preprocessed_dataset.py # Dataset for pre-computed results |
| | │ │ ├── ba_validator.py # BA validation pipeline |
| | │ │ ├── arkit_processor.py # ARKit data processing |
| | │ │ ├── evaluate.py # Evaluation metrics |
| | │ │ └── ... # Other services |
| | │ │ |
| | │ ├── utils/ # Utilities |
| | │ │ ├── geometric_losses.py # Geometric loss functions |
| | │ │ ├── oracle_uncertainty.py # Oracle uncertainty propagation |
| | │ │ ├── oracle_losses.py # Oracle-weighted losses |
| | │ │ └── ... # Other utilities |
| | │ │ |
| | │ ├── routers/ # FastAPI route handlers |
| | │ ├── models/ # Pydantic API models |
| | │ └── cli.py # Command-line interface |
| | │ |
| | ├── configs/ # Configuration files |
| | │ ├── dinov2_train_config.yaml # Training configuration |
| | │ └── ba_config.yaml # BA pipeline configuration |
| | │ |
| | ├── docs/ # Documentation |
| | │ ├── UNIFIED_TRAINING.md # Unified training guide |
| | │ ├── TRAINING_PIPELINE_ARCHITECTURE.md |
| | │ └── ... # Other documentation |
| | │ |
| | └── research_docs/ # Research documentation |
| | └── MODEL_ARCH.md # Model architecture details |
| | ``` |
| |
|
| | ## CLI Commands |
| |
|
| | ### Preprocessing |
| |
|
| | - `ylff preprocess arkit <dir>` - Pre-process ARKit sequences (offline) |
| |
|
| | ### Training |
| |
|
| | - `ylff train unified <cache_dir>` - Train using unified training service |
| |
|
| | ### Validation |
| |
|
| | - `ylff validate sequence <dir>` - Validate a single sequence |
| | - `ylff validate arkit <dir> [--gui]` - Validate ARKit data (with optional GUI) |
| |
|
| | ### Evaluation |
| |
|
| | - `ylff eval ba-agreement <dir>` - Evaluate model agreement with BA |
| |
|
| | ### Visualization |
| |
|
| | - `ylff visualize <results_dir>` - Generate static visualizations |
| |
|
| | ## Complete Workflow |
| |
|
| | ### Step 1: Pre-Process All Sequences |
| |
|
| | ```bash |
| | # Pre-process all ARKit sequences (one-time, can run overnight) |
| | ylff preprocess arkit data/arkit_sequences \ |
| | --output-cache cache/preprocessed \ |
| | --model-name depth-anything/DA3-LARGE \ |
| | --num-workers 8 \ |
| | --prefer-arkit-poses \ |
| | --use-lidar |
| | ``` |
| |
|
| | This step: |
| |
|
| | - Extracts ARKit data (poses, LiDAR depth) at negligible cost |
| | - Runs DA3 inference (GPU, batchable) |
| | - Runs BA only for sequences with poor ARKit tracking |
| | - Computes oracle uncertainty |
| | - Saves everything to cache |
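The "BA only for sequences with poor ARKit tracking" rule can be sketched as a simple gate. The 0.8 threshold comes from the pipeline diagram; the `ArkitSequence` type, the `arkit_quality` score, and the way the gate interacts with `--prefer-arkit-poses` are assumptions made for illustration, not the behavior of `ylff/services/preprocessing.py`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

ARKIT_QUALITY_THRESHOLD = 0.8  # threshold shown in the pipeline diagram


@dataclass(frozen=True)
class ArkitSequence:
    name: str
    arkit_quality: float  # hypothetical tracking-quality score in [0, 1]


def choose_pose_source(seq: ArkitSequence, prefer_arkit_poses: bool = True) -> str:
    """Return "arkit" when tracking is good enough to skip BA, else "ba"."""
    if prefer_arkit_poses and seq.arkit_quality >= ARKIT_QUALITY_THRESHOLD:
        return "arkit"
    return "ba"


def plan_preprocessing(sequences: Iterable[ArkitSequence]) -> Dict[str, List[str]]:
    """Split sequences into those that skip BA and those that need it."""
    plan: Dict[str, List[str]] = {"arkit": [], "ba": []}
    for seq in sequences:
        plan[choose_pose_source(seq)].append(seq.name)
    return plan


if __name__ == "__main__":
    sequences = [
        ArkitSequence("kitchen", 0.93),
        ArkitSequence("hallway", 0.80),
        ArkitSequence("stairs", 0.41),
    ]
    print(plan_preprocessing(sequences))
```

In this sketch only `stairs` would be sent to BA, so the expensive step runs on one of three sequences.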
| |
|
| | ### Step 2: Train with Unified Service |
| |
|
| | ```bash |
| | # Train using pre-computed results (fast iteration) |
| | ylff train unified cache/preprocessed \ |
| | --model-name depth-anything/DA3-LARGE \ |
| | --epochs 200 \ |
| | --lr 2e-4 \ |
| | --batch-size 32 \ |
| | --checkpoint-dir checkpoints \ |
| | --use-wandb \ |
| | --wandb-project ylff-training |
| | ``` |
| |
|
| | This step: |
| |
|
| | - Loads pre-computed oracle results (fast, disk I/O) |
| | - Runs DA3 inference (current model, GPU) |
| | - Computes geometric losses (primary objective) |
| | - Updates model weights with teacher-student learning |
| |
|
| | ### Step 3: Evaluate |
| |
|
| | ```bash |
| | # Evaluate fine-tuned model |
| | ylff eval ba-agreement data/test \ |
| | --checkpoint checkpoints/best_model.pt |
| | ``` |
| |
|
| | ## Configuration |
| |
|
| | Configuration files are in `configs/`: |
| |
|
| | - `dinov2_train_config.yaml` - Unified training configuration |
| |
|
| | - Optimizer settings (DINOv2 style) |
| | - Loss weights (geometric consistency first) |
| | - Teacher-student settings |
| | - Multi-resolution and multi-view training |
| |
|
| | - `ba_config.yaml` - BA pipeline settings |
| |
|
| | ## Documentation |
| |
|
| | - **Unified Training**: `docs/UNIFIED_TRAINING.md` - Complete guide to unified training |
| | - **Training Pipeline**: `docs/TRAINING_PIPELINE_ARCHITECTURE.md` - Two-phase pipeline architecture |
| | - **Model Architecture**: `research_docs/MODEL_ARCH.md` - Detailed architecture and training approach |
| | - **API Documentation**: `docs/API.md` - API reference |
| | - **ARKit Integration**: `docs/ARKIT_INTEGRATION.md` - ARKit data processing |
| |
|
| | ## Key Design Decisions |
| |
|
| | ### Why Geometric Consistency First? |
| |
|
| | Traditional depth estimation models prioritize perceptual quality (how realistic the depth looks) over geometric accuracy (how accurate the absolute scale and multi-view consistency are). YLFF reverses this priority: |
| |
|
| | - **Geometric consistency** ensures that the same 3D point projects correctly across views |
| | - **Absolute scale** ensures metric accuracy (depth in meters, not just relative) |
| | - **Pose consistency** ensures that predicted poses align with depth predictions |
| |
|
| | This approach is essential for applications requiring accurate 3D reconstruction, SLAM, and metric depth estimation. |
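To make "the same 3D point projects correctly across views" concrete, here is a small pinhole-camera sketch: a pixel in view A is back-projected with its depth, moved into view B with the relative pose, projected, and compared with the pixel actually observed in B. All function names, the intrinsics, and the pose are illustrative assumptions; the project's own losses may be formulated differently.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # fx, fy, cx, cy


def back_project(u: float, v: float, depth: float, k: Intrinsics) -> Vec3:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    fx, fy, cx, cy = k
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(rotation: Sequence[Sequence[float]], translation: Vec3, p: Vec3) -> Vec3:
    """Apply the rigid transform R @ p + t."""
    x, y, z = (
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (x, y, z)


def project(p: Vec3, k: Intrinsics) -> Tuple[float, float]:
    """Project a 3D point in the camera frame to pixel coordinates."""
    fx, fy, cx, cy = k
    if p[2] <= 0.0:
        raise ValueError("point is behind the camera")
    return (fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy)


def reprojection_error(pixel_a, depth_a, pixel_b, rotation_ab, translation_ab, k) -> float:
    """Pixel distance between the observed match in B and the reprojection from A."""
    point_b = transform(rotation_ab, translation_ab, back_project(*pixel_a, depth_a, k))
    u, v = project(point_b, k)
    return math.hypot(u - pixel_b[0], v - pixel_b[1])


if __name__ == "__main__":
    k = (500.0, 500.0, 320.0, 240.0)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    t_ab = (-0.1, 0.0, 0.0)  # 10 cm sideways baseline between the two views
    pixel_a, pixel_b = (320.0, 240.0), (295.0, 240.0)  # match for a point 2 m away
    print(reprojection_error(pixel_a, 2.0, pixel_b, identity, t_ab, k))  # close to 0
    print(reprojection_error(pixel_a, 4.0, pixel_b, identity, t_ab, k))  # about 12.5 px
```

The second call shows why absolute scale matters: a depth that is wrong by a factor of two still looks plausible in a single view, but it lands roughly 12.5 pixels away from the true match in the second view.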
| |
|
| | ### Why Two-Phase Pipeline? |
| |
|
| | BA is expensive (5-15 minutes per sequence), which makes it far too slow to run inside the training loop. The two-phase pipeline separates the work: |
| |
|
| | 1. **Pre-processing** (offline): Compute BA once, cache results |
| | 2. **Training** (online): Load cached results, train fast |
| |
|
| | This makes each training iteration 100-1000x faster while still using BA as supervision. |
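The compute-once, load-many pattern behind the two phases can be sketched as follows. JSON stands in for the `.npz` cache files purely to keep the example dependency-free; `load_or_compute` and `expensive_ba` are hypothetical names, not functions from the YLFF codebase.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def load_or_compute(cache_file: Path, compute: Callable[[], dict]) -> Tuple[dict, bool]:
    """Return (result, was_cached); run `compute` only when no cache entry exists."""
    if cache_file.exists():
        return json.loads(cache_file.read_text()), True
    result = compute()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result))
    return result, False


if __name__ == "__main__":
    calls = []

    def expensive_ba() -> dict:
        # Stand-in for a BA run that would take minutes per sequence.
        calls.append(1)
        return {"pose_source": "ba", "num_frames": 120}

    with tempfile.TemporaryDirectory() as tmp:
        cache_file = Path(tmp) / "seq_0001" / "oracle_targets.json"
        _, hit_first = load_or_compute(cache_file, expensive_ba)
        _, hit_second = load_or_compute(cache_file, expensive_ba)
        print(hit_first, hit_second, len(calls))  # False True 1
```

Every later read, including every training epoch, only pays for disk I/O, which is where the speedup over running BA online comes from.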
| |
|
| | ### Why Teacher-Student Learning? |
| |
|
| | DINOv2's teacher-student paradigm provides: |
| |
|
| | - **Stability**: The EMA (exponential moving average) teacher reduces training instability |
| | - **Better convergence**: Teacher provides stable targets |
| | - **Scalability**: Works well with large-scale training |
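As a rough illustration of the EMA teacher, the sketch below updates teacher weights as a moving average of student weights. Plain lists of floats stand in for model parameters, and the momentum values are illustrative defaults rather than settings taken from `dinov2_train_config.yaml`.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float], momentum: float = 0.996) -> List[float]:
    """Return new teacher weights: momentum * teacher + (1 - momentum) * student."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    teacher = [0.0]
    student = [1.0]  # the student has jumped abruptly to a new value
    for step in range(1, 11):
        teacher = ema_update(teacher, student, momentum=0.9)
        print(step, round(teacher[0], 4))
```

The teacher approaches the student's new value gradually instead of jumping with it, so the targets it provides change slowly even when individual student updates are noisy.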
| |
|
| | ## Development |
| |
|
| | ### Running Tests |
| |
|
| | ```bash |
| | # Basic smoke test |
| | python scripts/tests/smoke_test_basic.py |
| | |
| | # GUI test |
| | python scripts/tests/test_gui_simple.py |
| | ``` |
| |
|
| | ### Code Quality |
| |
|
| | ```bash |
| | # Format code |
| | black ylff/ scripts/ |
| | |
| | # Sort imports |
| | isort ylff/ scripts/ |
| | |
| | # Type checking |
| | mypy ylff/ |
| | ``` |
| |
|
| | ## Dependencies |
| |
|
| | ### Core Dependencies |
| |
|
| | - PyTorch >= 2.0 |
| | - NumPy < 2.0 |
| | - OpenCV |
| | - pycolmap >= 0.4.0 |
| | - Typer (for CLI) |
| |
|
| | ### Optional Dependencies |
| |
|
| | - **GUI**: Plotly (for interactive 3D plots) |
| | - **BA Pipeline**: hloc, LightGlue (installed from source) |
| | - **Training**: Weights & Biases (for experiment tracking) |
| |
|
| | See `pyproject.toml` for complete dependency list. |
| |
|
| | ## License |
| |
|
| | Apache-2.0 |
| |
|
| | ## Citation |
| |
|
| | If you use YLFF in your research, please cite: |
| |
|
| | ```bibtex |
| | @software{ylff2024, |
| | title={You Learn From Failure: Geometric Consistency First Training for Visual Geometry}, |
| | author={YLFF Contributors}, |
| | year={2024}, |
| | url={https://github.com/your-org/ylff} |
| | } |
| | ``` |
| |
|
| | ## References |
| |
|
| | - **DINOv2**: https://github.com/facebookresearch/dinov2 |
| | - **DA3 Paper**: Depth Anything 3 (arXiv:2511.10647) |
| | - **Unified Training**: `ylff/services/ylff_training.py` |
| | - **Model Architecture**: `research_docs/MODEL_ARCH.md` |
| |
|